2023年8月9日 16:15:27go评论151阅读模式

英文:

Replace/Merge operations in vectors using CUDA Thrust

问题

使用CUDA Thrust来操作设备向量中的元素有两种操作。哪种方法可以更高效地实现这两个功能？
1. 批量替换一个向量的部分值为另一个向量的值。示例如下：
```
arr1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
arr2 = [1, 1, 1, 2, 2, 2]
// 批量替换 {4, 5, 6} 和 {10, 11, 12}，批量大小为 3：
arr1 = [1, 2, 3, 1, 1, 1, 7, 8, 9, 2, 2, 2]
```
在我的情况下，通常有 size(arr1) / size(arr2) = 2。

我们从索引 1 * batch 和 3 * batch 开始替换 arr1 中的值。

在大多数情况下，我需要替换奇数批次的项目，但也需要处理一般批次的情况。
合并两个向量的交替索引。

在R语言中如何合并2个向量的交替索引？中提出了同样的问题，但是是针对R语言的。
```
arr1 = [1, 2, 3, 4, 5, 6]
arr2 = [1, 1, 1, 2, 2, 2]
// 合并 arr1 和 arr2 后：
arr3 = [1, 2, 3, 1, 1, 1, 4, 5, 6, 2, 2, 2]
```
replace_copy_if 可能可以工作，但我不知道如何与高级迭代器结合使用。此外，一些博客显示 replace_copy_if 比 copy_if 更慢。

英文:

I have two operations for manipulating elements in device vectors using CUDA Thrust. Which methods can implement these two functions more efficiently?

Replace part of values of a vector in batch with the values from another vector. Example is shown below:
```
arr1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
arr2 = [1, 1, 1, 2, 2, 2]
// After replacing {4, 5, 6} and {10, 11, 12} in batch = 3:
arr1 = [1, 2, 3, 1, 1, 1, 7, 8, 9, 2, 2, 2]
```
In my case, I always have size(arr1) / size(arr2) = 2.

We replace the values in arr1 from the index 1 * batch and 3 * batch.

In most cases, I need to replace items in odd batches. The case for general batches is also needed.
Merge two vectors alternating indexes.

A same question is raised in How to merge 2 vectors alternating indexes?, but is for R language.
```
arr1 = [1, 2, 3, 4, 5, 6]
arr2 = [1, 1, 1, 2, 2, 2]
//After merging arr1 and arr2:
arr3 = [1, 2, 3, 1, 1, 1, 4, 5, 6, 2, 2, 2]
```
replace_copy_if may work, but I don't know how to combine it with fancy iterators. Additionally, some blogs show that replace_copy_if is slower than copy_if.

答案1

得分: 3

此操作将arr2中的值分散到arr1中，因此让我们使用thrust::scatter。可以使用thrust::transform_iterator从thrust::counting_iterator（类似于std::ranges::views::iota）计算要将值分散到的索引。对于一般情况，其中批处理indices作为另一个输入的情况thrust::device_vector给出，您可以使用

const auto indices_ptr = indices.data();
const auto scatter_indices_it = thrust::make_transform_iterator(
    thrust::make_counting_iterator(0),
    [batch, indices_ptr]
    __host__ __device__ (int idx) {
        const int batch_id = idx / batch;
        const int elem_id = idx % batch;
        return indices_ptr[batch_id] * batch + elem_id;
    });

而在只想要将值分散到奇数批次的特定情况下，您应该使用

const auto scatter_indices_it = thrust::make_transform_iterator(
    thrust::make_counting_iterator(0),
    [batch]
    __host__ __device__ (int idx) {
        const int batch_id = idx / batch;
        const int elem_id = idx % batch;
        return (batch_id * 2 + 1) * batch + elem_id;
    });

scatter操作本身很简单：

thrust::scatter(
    arr2.cbegin(), arr2.cend(),
    scatter_indices_it,
    arr1.begin());

这肯定有多种可能的方法。我首先想到的方法是使用thrust::merge_by_key将这两个向量合并，其中键是使用与上述分散索引类似的方案生成的：

const auto merge_keys1_it = thrust::make_transform_iterator(
    thrust::make_counting_iterator(0),
    [batch]
    __host__ __device__ (int idx) {
        const int batch_id = idx / batch;
        return batch_id * 2;
    });
const auto merge_keys2_it = thrust::make_transform_iterator(
    thrust::make_counting_iterator(0),
    [batch]
    __host__ __device__ (int idx) {
        const int batch_id = idx / batch;
        return batch_id * 2 + 1;
    });
thrust::merge_by_key(
    merge_keys1_it, merge_keys1_it + arr1.size(),
    merge_keys2_it, merge_keys2_it + arr2.size(),
    arr1.cbegin(),
    arr2.cbegin(),
    thrust::make_discard_iterator(),
    arr3.begin());

这个方法有效且相对优雅，但由于合并算法的复杂性和操作的规则性，性能可能不理想。一个（可能）性能更好的方法是创建一个复杂的迭代器，它负责整个交错操作：

const auto arr1_ptr = arr1.data();
const auto arr2_ptr = arr2.data();
const auto batch_interleave_it = thrust::make_transform_iterator(
    thrust::make_counting_iterator(0),
    [batch, arr1_ptr, arr2_ptr]
    __host__ __device__ (int idx) -> int {
        const int batch_id = idx / batch;
        const int batch_el = idx % batch;
        const int in_idx = (batch_id / 2) * batch + batch_el;
        const bool even_batch = (batch_id % 2 == 0);
        return even_batch ? arr1_ptr[in_idx] : arr2_ptr[in_idx];
    });

现在可以将此迭代器提供给管道中的下一个算法以进行内核融合，或者用于使用接受begin和end迭代器的构造函数初始化新向量。如果将交错的arr3在将来多次使用（读取），则应该将其放入新的向量中，而不是重复使用迭代器，因为如果batch不是warp大小的倍数（32），迭代器不允许合并的全局内存访问。

完整的源代码：

由于使用了设备lambda函数，nvcc需要-extended-lambda。自CUDA 12以来，由于某种原因还需要-rdc=true。

#include <iostream>
#include <iterator>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/scatter.h>
#include <thrust/merge.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
template <typename T>
void print(const thrust::device_vector<T> &vec) {
    thrust::host_vector<int> h_vec(vec);
    std::ostream_iterator<int> out_it(std::cout, ", ");
    thrust::copy(h_vec.cbegin(), h_vec.cend(),
                 out_it);
    std::cout << '\n';
}
void part1() {
    constexpr int h_arr1[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    constexpr int h_arr2[] = {1, 1, 1, 2, 2, 2};
    constexpr int batch = 3;
    thrust::device_vector<int> arr1(std::cbegin(h_arr1), std::cend(h_arr1));
    thrust::device_vector<int> arr2(std::cbegin(h_arr2), std::cend(h_arr2));
    
    assert(arr1.size() == 2 * arr2.size());
#if 1
    // specialized version where arr2 is scattered to "uneven"
    // batches
    const auto scatter_indices_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            const int elem_id = idx % batch;
            return (batch_id * 2 + 1) * batch + elem_id;
        });
#else
    // more general version where arr2 is scattered to batches given
    // in indices
    constexpr int h_indices[] = {1, 3};
    thrust::device_vector<int> indices(
        std::cbegin(h_indices), std::cend(h_indices));
    const auto indices_ptr = indices.data();
    const auto scatter_indices_it = thrust::make_transform_iterator(
        thrust::make_count
<details>
<summary>英文:</summary>
1. This operation scatters the values in `arr2` into `arr1`, so let&#39;s use `thrust::scatter`. The indices to which the values are scattered can be calculated with a `thrust::transform_iterator` from a `thrust::counting_iterator` (like `std::ranges::views::iota`). For the general case where the batch `indices` are given as another input `thrust::device_vector`, you can use
   ```c++
   const auto indices_ptr = indices.data();
   const auto scatter_indices_it = thrust::make_transform_iterator(
       thrust::make_counting_iterator(0),
       [batch, indices_ptr]
       __host__ __device__ (int idx) {
           const int batch_id = idx / batch;
           const int elem_id = idx % batch;
           return indices_ptr[batch_id] * batch + elem_id;
       });

while in the specific case where you just want to scatter to the odd batches, you should rather use

const auto scatter_indices_it = thrust::make_transform_iterator(
    thrust::make_counting_iterator(0),
    [batch]
    __host__ __device__ (int idx) {
        const int batch_id = idx / batch;
        const int elem_id = idx % batch;
        return (batch_id * 2 + 1) * batch + elem_id;
    });

The scatter operation itself is easy:

thrust::scatter(
    arr2.cbegin(), arr2.cend(),
    scatter_indices_it,
    arr1.begin());

There are certainly multiple possible ways of doing this. The one that came to mind first for me was merging the two vectors with thrust::merge_by_key, where the keys are generated using a similar scheme as above for the scatter indices:

const auto merge_keys1_it = thrust::make_transform_iterator(
    thrust::make_counting_iterator(0),
    [batch]
    __host__ __device__ (int idx) {
        const int batch_id = idx / batch;
        return batch_id * 2;
    });
const auto merge_keys2_it = thrust::make_transform_iterator(
    thrust::make_counting_iterator(0),
    [batch]
    __host__ __device__ (int idx) {
        const int batch_id = idx / batch;
        return batch_id * 2 + 1;
    });
thrust::merge_by_key(
    merge_keys1_it, merge_keys1_it + arr1.size(),
    merge_keys2_it, merge_keys2_it + arr2.size(),
    arr1.cbegin(),
    arr2.cbegin(),
    thrust::make_discard_iterator(),
    arr3.begin());

This works and is relatively elegant, but probably not ideal for performance due to the complexity of the merge algorithm and the regularity of the operation. A (probably) more performant ansatz would be creating a fancy iterator that takes care of the whole interleaving operation:

const auto arr1_ptr = arr1.data();
const auto arr2_ptr = arr2.data();
const auto batch_interleave_it = thrust::make_transform_iterator(
    thrust::make_counting_iterator(0),
    [batch, arr1_ptr, arr2_ptr]
    __host__ __device__ (int idx) -&gt; int {
        const int batch_id = idx / batch;
        const int batch_el = idx % batch;
        const int in_idx = (batch_id / 2) * batch + batch_el;
        const bool even_batch = (batch_id % 2 == 0);
        return even_batch ? arr1_ptr[in_idx] : arr2_ptr[in_idx];
    });

This iterator can now be fed to the next algorithm in your pipeline for kernel fusion or used to initialize a new vector using the constructor which takes an begin and an end iterator. If the interleaved arr3 is used (read from) multiple times in future it, you should probably put it into a new vector instead of reusing the iterator as the iterator doesn't allow for coalesced global memory access if batch isn't a multiple of warp size (32).

Complete source code:

Due to the use of device lambdas, nvcc needs -extended-lambda. Since CUDA 12 it also needs -rdc=true for some reason.

#include &lt;iostream&gt;
#include &lt;iterator&gt;
#include &lt;thrust/host_vector.h&gt;
#include &lt;thrust/device_vector.h&gt;
#include &lt;thrust/scatter.h&gt;
#include &lt;thrust/merge.h&gt;
#include &lt;thrust/iterator/transform_iterator.h&gt;
#include &lt;thrust/iterator/counting_iterator.h&gt;
#include &lt;thrust/iterator/discard_iterator.h&gt;
template &lt;typename T&gt;
void print(const thrust::device_vector&lt;T&gt; &amp;vec) {
    thrust::host_vector&lt;int&gt; h_vec(vec);
    std::ostream_iterator&lt;int&gt; out_it(std::cout, &quot;, &quot;);
    thrust::copy(h_vec.cbegin(), h_vec.cend(),
                 out_it);
    std::cout &lt;&lt; &#39;\n&#39;;
}
void part1() {
    constexpr int h_arr1[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    constexpr int h_arr2[] = {1, 1, 1, 2, 2, 2};
    constexpr int batch = 3;
    thrust::device_vector&lt;int&gt; arr1(std::cbegin(h_arr1), std::cend(h_arr1));
    thrust::device_vector&lt;int&gt; arr2(std::cbegin(h_arr2), std::cend(h_arr2));
    
    assert(arr1.size() == 2 * arr2.size());
#if 1
    // specialized version where arr2 is scattered to &quot;uneven&quot;
    // batches
    const auto scatter_indices_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            const int elem_id = idx % batch;
            return (batch_id * 2 + 1) * batch + elem_id;
        });
#else
    // more general version where arr2 is scattered to batches given
    // in indices
    constexpr int h_indices[] = {1, 3};
    thrust::device_vector&lt;int&gt; indices(
        std::cbegin(h_indices), std::cend(h_indices));
    const auto indices_ptr = indices.data();
    const auto scatter_indices_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch, indices_ptr]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            const int elem_id = idx % batch;
            return indices_ptr[batch_id] * batch + elem_id;
        });
#endif
    
    thrust::scatter(
        arr2.cbegin(), arr2.cend(),
        scatter_indices_it,
        arr1.begin());
    print(arr1);
}
void part2() {
    constexpr int h_arr1[] = {1, 2, 3, 4, 5, 6};
    constexpr int h_arr2[] = {1, 1, 1, 2, 2, 2};
    constexpr int batch = 3;
    thrust::device_vector&lt;int&gt; arr1(std::cbegin(h_arr1), std::cend(h_arr1));
    thrust::device_vector&lt;int&gt; arr2(std::cbegin(h_arr2), std::cend(h_arr2));
    const auto out_size = arr1.size() + arr2.size();
    assert(arr1.size() == arr2.size());
#if 1
    const auto arr1_ptr = arr1.data();
    const auto arr2_ptr = arr2.data();
    const auto batch_interleave_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch, arr1_ptr, arr2_ptr]
        __host__ __device__ (int idx) -&gt; int {
            const int batch_id = idx / batch;
            const int batch_el = idx % batch;
            const int in_idx = (batch_id / 2) * batch + batch_el;
            const bool even_batch = (batch_id % 2 == 0);
            return even_batch ? arr1_ptr[in_idx] : arr2_ptr[in_idx];
        });
    // only put into vector if really needed, better to feed the
    // iterator directly to the next operation for kernel fusion
    thrust::device_vector&lt;int&gt; arr3(
        batch_interleave_it, batch_interleave_it + out_size);
#else
    thrust::device_vector&lt;int&gt; arr3(out_size);
    const auto merge_keys1_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            return batch_id * 2;
        });
    const auto merge_keys2_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            return batch_id * 2 + 1;
        });
    
    thrust::merge_by_key(
        merge_keys1_it, merge_keys1_it + arr1.size(),
        merge_keys2_it, merge_keys2_it + arr2.size(),
        arr1.cbegin(),
        arr2.cbegin(),
        thrust::make_discard_iterator(),
        arr3.begin());
#endif
    print(arr3);
}
int main(void) {
    part1();
    part2();
}

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

使用CUDA Thrust进行矢量的替换/合并操作

问题

答案1

完整的源代码：

Complete source code:

如何减缓标签上打印进度值的显示速度？

Problems in implementing adaptive thresholding using CUDA

Compact Language Detector 2的detect方法是线程安全的吗？

ImGui::SetNextWindowSizeConstraints() 不按预期工作

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。