2023年8月9日 16:15:27go评论169阅读模式

英文:

Replace/Merge operations in vectors using CUDA Thrust

问题

我有两个操作可以使用CUDA Thrust来操作设备向量中的元素。哪些方法可以更高效地实现这两个功能？

批量替换向量中的部分值为另一个向量的值。示例如下：
```
arr1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
arr2 = [1, 1, 1, 2, 2, 2]
// 批量替换 {4, 5, 6} 和 {10, 11, 12}：
arr1 = [1, 2, 3, 1, 1, 1, 7, 8, 9, 2, 2, 2]
```
在我的情况下，我总是有 size(arr1) / size(arr2) = 2。

我们从索引 1 * batch 和 3 * batch 开始替换 arr1 中的值。

在大多数情况下，我需要替换奇数批次的项目。也需要处理一般批次的情况。
交替合并两个向量的索引。

在R语言的问题中提出了同样的问题。
```
arr1 = [1, 2, 3, 4, 5, 6]
arr2 = [1, 1, 1, 2, 2, 2]
// 合并 arr1 和 arr2：
arr3 = [1, 2, 3, 1, 1, 1, 4, 5, 6, 2, 2, 2]
```
replace_copy_if 可能适用，但我不知道如何与 fancy iterators 结合使用。此外，一些博客显示 replace_copy_if 比 copy_if 更慢。

英文:

I have two operations for manipulating elements in device vectors using CUDA Thrust. Which methods can implement these two functions more efficiently?

Replace part of values of a vector in batch with the values from another vector. Example is shown below:
```
arr1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
arr2 = [1, 1, 1, 2, 2, 2]
// After replacing {4, 5, 6} and {10, 11, 12} in batch = 3:
arr1 = [1, 2, 3, 1, 1, 1, 7, 8, 9, 2, 2, 2]
```
In my case, I always have size(arr1) / size(arr2) = 2.

We replace the values in arr1 from the index 1 * batch and 3 * batch.

In most cases, I need to replace items in odd batches. The case for general batches is also needed.
Merge two vectors alternating indexes.

A same question is raised in How to merge 2 vectors alternating indexes?, but is for R language.
```
arr1 = [1, 2, 3, 4, 5, 6]
arr2 = [1, 1, 1, 2, 2, 2]
//After merging arr1 and arr2:
arr3 = [1, 2, 3, 1, 1, 1, 4, 5, 6, 2, 2, 2]
```
replace_copy_if may work, but I don't know how to combine it with fancy iterators. Additionally, some blogs show that replace_copy_if is slower than copy_if.

答案1

得分: 3

这个操作将arr2中的值散布到arr1中，所以我们可以使用thrust::scatter。可以使用thrust::transform_iterator从thrust::counting_iterator（类似于std::ranges::views::iota）计算出值被散布到的索引。对于一般情况下，批次的indices作为另一个输入的thrust::device_vector给出，你可以使用以下代码：

const auto indices_ptr = indices.data();
const auto scatter_indices_it = thrust::make_transform_iterator(
    thrust::make_counting_iterator(0),
    [batch, indices_ptr]
    __host__ __device__ (int idx) {
        const int batch_id = idx / batch;
        const int elem_id = idx % batch;
        return indices_ptr[batch_id] * batch + elem_id;
    });

对于只想散布到奇数批次的特殊情况，你应该使用以下代码：

const auto scatter_indices_it = thrust::make_transform_iterator(
    thrust::make_counting_iterator(0),
    [batch]
    __host__ __device__ (int idx) {
        const int batch_id = idx / batch;
        const int elem_id = idx % batch;
        return (batch_id * 2 + 1) * batch + elem_id;
    });

散布操作本身很简单：

thrust::scatter(
    arr2.cbegin(), arr2.cend(),
    scatter_indices_it,
    arr1.begin());

这样做肯定有多种可能的方法。我首先想到的是使用thrust::merge_by_key将两个向量合并，其中键使用与上述散布索引类似的方案生成：

const auto merge_keys1_it = thrust::make_transform_iterator(
    thrust::make_counting_iterator(0),
    [batch]
    __host__ __device__ (int idx) {
        const int batch_id = idx / batch;
        return batch_id * 2;
    });
const auto merge_keys2_it = thrust::make_transform_iterator(
    thrust::make_counting_iterator(0),
    [batch]
    __host__ __device__ (int idx) {
        const int batch_id = idx / batch;
        return batch_id * 2 + 1;
    });
thrust::merge_by_key(
    merge_keys1_it, merge_keys1_it + arr1.size(),
    merge_keys2_it, merge_keys2_it + arr2.size(),
    arr1.cbegin(),
    arr2.cbegin(),
    thrust::make_discard_iterator(),
    arr3.begin());

这个方法可以工作，并且相对优雅，但由于合并算法的复杂性和操作的规则性，可能不是性能上的理想选择。一个（可能）更高效的方法是创建一个复杂的迭代器，它负责整个交错操作：

const auto arr1_ptr = arr1.data();
const auto arr2_ptr = arr2.data();
const auto batch_interleave_it = thrust::make_transform_iterator(
    thrust::make_counting_iterator(0),
    [batch, arr1_ptr, arr2_ptr]
    __host__ __device__ (int idx) -&gt; int {
        const int batch_id = idx / batch;
        const int batch_el = idx % batch;
        const int in_idx = (batch_id / 2) * batch + batch_el;
        const bool even_batch = (batch_id % 2 == 0);
        return even_batch ? arr1_ptr[in_idx] : arr2_ptr[in_idx];
    });

现在可以将这个迭代器提供给管道中的下一个算法进行内核融合，或者使用以begin和end迭代器为参数的构造函数初始化一个新的向量。如果未来多次使用交错的arr3（读取），最好将其放入一个新的向量中，而不是重用迭代器，因为如果batch不是warp大小（32）的倍数，迭代器不允许连续的全局内存访问。

完整的源代码：

由于使用了设备lambda，nvcc需要-extended-lambda选项。自CUDA 12以来，由于某种原因，还需要-rdc=true选项。

#include <iostream>
#include <iterator>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/scatter.h>
#include <thrust/merge.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
template <typename T>
void print(const thrust::device_vector<T> &vec) {
    thrust::host_vector<int> h_vec(vec);
    std::ostream_iterator<int> out_it(std::cout, ", ");
    thrust::copy(h_vec.cbegin(), h_vec.cend(),
                 out_it);
    std::cout << '\n';
}
void part1() {
    constexpr int h_arr1[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    constexpr int h_arr2[] = {1, 1, 1, 2, 2, 2};
    constexpr int batch = 3;
    thrust::device_vector<int> arr1(std::cbegin(h_arr1), std::cend(h_arr1));
    thrust::device_vector<int> arr2(std::cbegin(h_arr2), std::cend(h_arr2));
    
    assert(arr1.size() == 2 * arr2.size());
#if 1
    // specialized version where arr2 is scattered to "uneven"
    // batches
    const auto scatter_indices_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            const int elem_id = idx % batch;
            return (batch_id * 2 + 1) * batch + elem_id;
        });
#else
    // more general version where arr2 is scattered to batches given
    // in indices
    constexpr int h_indices[] = {1, 3};
    thrust::device_vector<int> indices(
        std::cbegin(h_indices), std::cend(h_indices));
    const auto indices_ptr = indices.data();
    const auto scatter_indices_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch, indices_ptr]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            const int elem_id = idx % batch;
            return indices_ptr[batch_id] * batch + elem_id;
        });
#endif
    
    thrust::scatter(
        arr2.cbegin(), arr2.cend(),
        scatter_indices_it,
        arr1.begin());
    print(arr1);
}
void part2() {
    constexpr int h_arr1[] = {1, 2, 3, 4, 5, 6};
    constexpr int h_arr2[] = {1, 1, 1, 2, 2, 2};
    constexpr int batch = 3;
    thrust::device_vector<int> arr1(std::cbegin(h_arr1), std::cend(h_arr1));
    thrust::device_vector<int> arr2(std::cbegin(h_arr2), std::cend(h_arr2));
    const auto out_size = arr1.size() + arr2.size();
    assert(arr1.size() == arr2.size());
#if 1
    const auto arr1_ptr = arr1.data();
    const auto arr2_ptr = arr2.data();
    const auto batch_interleave_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch, arr1_ptr, arr2_ptr]
        __host__ __device__ (int idx) -> int {
            const int batch_id = idx / batch;
            const int batch_el = idx % batch;
            const int in_idx = (batch_id / 2) * batch + batch_el;
            const bool even_batch = (batch_id % 2 == 0);
            return even_batch ? arr1_ptr[in_idx] : arr2_ptr[in_idx];
        });
    // only put into vector if really needed, better to feed the
    // iterator directly to the next operation for kernel fusion
    thrust::device_vector<int> arr3(
        batch_interleave_it, batch_interleave_it + out_size);
#else
    thrust::device_vector<int> arr3(out_size);
    const auto merge_keys1_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            return batch_id * 2;
        });
    const auto merge_keys2_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            return batch_id * 2 + 1;
        });
    
    thrust::merge_by_key(
        merge_keys1_it, merge_keys1_it + arr1.size(),
        merge_keys2_it, merge_keys2_it + arr2.size(),
        arr1.cbegin(),
        arr2.cbegin(),
        thrust::make_discard_iterator(),
        arr3.begin());
#endif
    print(arr3);
}
int main(void) {
    part1();
    part2();
}

英文:

This operation scatters the values in arr2 into arr1, so let's use thrust::scatter. The indices to which the values are scattered can be calculated with a thrust::transform_iterator from a thrust::counting_iterator (like std::ranges::views::iota). For the general case where the batch indices are given as another input thrust::device_vector, you can use

const auto indices_ptr = indices.data();
const auto scatter_indices_it = thrust::make_transform_iterator(
    thrust::make_counting_iterator(0),
    [batch, indices_ptr]
    __host__ __device__ (int idx) {
        const int batch_id = idx / batch;
        const int elem_id = idx % batch;
        return indices_ptr[batch_id] * batch + elem_id;
    });

while in the specific case where you just want to scatter to the odd batches, you should rather use

const auto scatter_indices_it = thrust::make_transform_iterator(
    thrust::make_counting_iterator(0),
    [batch]
    __host__ __device__ (int idx) {
        const int batch_id = idx / batch;
        const int elem_id = idx % batch;
        return (batch_id * 2 + 1) * batch + elem_id;
    });

The scatter operation itself is easy:

thrust::scatter(
    arr2.cbegin(), arr2.cend(),
    scatter_indices_it,
    arr1.begin());

There are certainly multiple possible ways of doing this. The one that came to mind first for me was merging the two vectors with thrust::merge_by_key, where the keys are generated using a similar scheme as above for the scatter indices:

const auto merge_keys1_it = thrust::make_transform_iterator(
    thrust::make_counting_iterator(0),
    [batch]
    __host__ __device__ (int idx) {
        const int batch_id = idx / batch;
        return batch_id * 2;
    });
const auto merge_keys2_it = thrust::make_transform_iterator(
    thrust::make_counting_iterator(0),
    [batch]
    __host__ __device__ (int idx) {
        const int batch_id = idx / batch;
        return batch_id * 2 + 1;
    });
thrust::merge_by_key(
    merge_keys1_it, merge_keys1_it + arr1.size(),
    merge_keys2_it, merge_keys2_it + arr2.size(),
    arr1.cbegin(),
    arr2.cbegin(),
    thrust::make_discard_iterator(),
    arr3.begin());

This works and is relatively elegant, but probably not ideal for performance due to the complexity of the merge algorithm and the regularity of the operation. A (probably) more performant ansatz would be creating a fancy iterator that takes care of the whole interleaving operation:

const auto arr1_ptr = arr1.data();
const auto arr2_ptr = arr2.data();
const auto batch_interleave_it = thrust::make_transform_iterator(
    thrust::make_counting_iterator(0),
    [batch, arr1_ptr, arr2_ptr]
    __host__ __device__ (int idx) -&gt; int {
        const int batch_id = idx / batch;
        const int batch_el = idx % batch;
        const int in_idx = (batch_id / 2) * batch + batch_el;
        const bool even_batch = (batch_id % 2 == 0);
        return even_batch ? arr1_ptr[in_idx] : arr2_ptr[in_idx];
    });

This iterator can now be fed to the next algorithm in your pipeline for kernel fusion or used to initialize a new vector using the constructor which takes an begin and an end iterator. If the interleaved arr3 is used (read from) multiple times in future it, you should probably put it into a new vector instead of reusing the iterator as the iterator doesn't allow for coalesced global memory access if batch isn't a multiple of warp size (32).

Complete source code:

Due to the use of device lambdas, nvcc needs -extended-lambda. Since CUDA 12 it also needs -rdc=true for some reason.

#include &lt;iostream&gt;
#include &lt;iterator&gt;
#include &lt;thrust/host_vector.h&gt;
#include &lt;thrust/device_vector.h&gt;
#include &lt;thrust/scatter.h&gt;
#include &lt;thrust/merge.h&gt;
#include &lt;thrust/iterator/transform_iterator.h&gt;
#include &lt;thrust/iterator/counting_iterator.h&gt;
#include &lt;thrust/iterator/discard_iterator.h&gt;
template &lt;typename T&gt;
void print(const thrust::device_vector&lt;T&gt; &amp;vec) {
    thrust::host_vector&lt;int&gt; h_vec(vec);
    std::ostream_iterator&lt;int&gt; out_it(std::cout, &quot;, &quot;);
    thrust::copy(h_vec.cbegin(), h_vec.cend(),
                 out_it);
    std::cout &lt;&lt; &#39;\n&#39;;
}
void part1() {
    constexpr int h_arr1[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    constexpr int h_arr2[] = {1, 1, 1, 2, 2, 2};
    constexpr int batch = 3;
    thrust::device_vector&lt;int&gt; arr1(std::cbegin(h_arr1), std::cend(h_arr1));
    thrust::device_vector&lt;int&gt; arr2(std::cbegin(h_arr2), std::cend(h_arr2));
    
    assert(arr1.size() == 2 * arr2.size());
#if 1
    // specialized version where arr2 is scattered to &quot;uneven&quot;
    // batches
    const auto scatter_indices_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            const int elem_id = idx % batch;
            return (batch_id * 2 + 1) * batch + elem_id;
        });
#else
    // more general version where arr2 is scattered to batches given
    // in indices
    constexpr int h_indices[] = {1, 3};
    thrust::device_vector&lt;int&gt; indices(
        std::cbegin(h_indices), std::cend(h_indices));
    const auto indices_ptr = indices.data();
    const auto scatter_indices_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch, indices_ptr]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            const int elem_id = idx % batch;
            return indices_ptr[batch_id] * batch + elem_id;
        });
#endif
    
    thrust::scatter(
        arr2.cbegin(), arr2.cend(),
        scatter_indices_it,
        arr1.begin());
    print(arr1);
}
void part2() {
    constexpr int h_arr1[] = {1, 2, 3, 4, 5, 6};
    constexpr int h_arr2[] = {1, 1, 1, 2, 2, 2};
    constexpr int batch = 3;
    thrust::device_vector&lt;int&gt; arr1(std::cbegin(h_arr1), std::cend(h_arr1));
    thrust::device_vector&lt;int&gt; arr2(std::cbegin(h_arr2), std::cend(h_arr2));
    const auto out_size = arr1.size() + arr2.size();
    assert(arr1.size() == arr2.size());
#if 1
    const auto arr1_ptr = arr1.data();
    const auto arr2_ptr = arr2.data();
    const auto batch_interleave_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch, arr1_ptr, arr2_ptr]
        __host__ __device__ (int idx) -&gt; int {
            const int batch_id = idx / batch;
            const int batch_el = idx % batch;
            const int in_idx = (batch_id / 2) * batch + batch_el;
            const bool even_batch = (batch_id % 2 == 0);
            return even_batch ? arr1_ptr[in_idx] : arr2_ptr[in_idx];
        });
    // only put into vector if really needed, better to feed the
    // iterator directly to the next operation for kernel fusion
    thrust::device_vector&lt;int&gt; arr3(
        batch_interleave_it, batch_interleave_it + out_size);
#else
    thrust::device_vector&lt;int&gt; arr3(out_size);
    const auto merge_keys1_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            return batch_id * 2;
        });
    const auto merge_keys2_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            return batch_id * 2 + 1;
        });
    
    thrust::merge_by_key(
        merge_keys1_it, merge_keys1_it + arr1.size(),
        merge_keys2_it, merge_keys2_it + arr2.size(),
        arr1.cbegin(),
        arr2.cbegin(),
        thrust::make_discard_iterator(),
        arr3.begin());
#endif
    print(arr3);
}
int main(void) {
    part1();
    part2();
}

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

使用CUDA Thrust进行向量的替换/合并操作

问题

答案1

完整的源代码：

Complete source code:

如何使用模板和/或使用结构体？

如何在单线程系统上停止阻塞任务？

构造函数或代码有问题？

Could golang implement this interview question of getting array summary without for/while if/else?

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。