Replace/Merge operations in vectors using CUDA Thrust

Question


I have two operations for manipulating elements in device vectors using CUDA Thrust. Which methods can implement these two operations most efficiently?

1. Replace part of the values of a vector batch-wise with the values from another vector. An example is shown below:

   ```
   arr1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
   arr2 = [1, 1, 1, 2, 2, 2]
   // After replacing {4, 5, 6} and {10, 11, 12} with batch size 3:
   arr1 = [1, 2, 3, 1, 1, 1, 7, 8, 9, 2, 2, 2]
   ```

   In my case, I always have `size(arr1) / size(arr2) = 2`.

   We replace the values in `arr1` starting at the indices `1 * batch` and `3 * batch`.

   In most cases, I need to replace items in odd-numbered batches, but the general case with arbitrary batch indices is also needed.

2. Merge two vectors in alternating batches.

   The same question is raised in How to merge 2 vectors alternating indexes?, but for the R language.

   ```
   arr1 = [1, 2, 3, 4, 5, 6]
   arr2 = [1, 1, 1, 2, 2, 2]
   // After merging arr1 and arr2:
   arr3 = [1, 2, 3, 1, 1, 1, 4, 5, 6, 2, 2, 2]
   ```

   `replace_copy_if` may work, but I don't know how to combine it with fancy iterators. Additionally, some blogs show that `replace_copy_if` is slower than `copy_if`.

Answer 1

Score: 3


1. This operation scatters the values in `arr2` into `arr1`, so let's use `thrust::scatter`. The indices to which the values are scattered can be calculated with a `thrust::transform_iterator` from a `thrust::counting_iterator` (like `std::ranges::views::iota`). For the general case where the batch `indices` are given as another input `thrust::device_vector`, you can use

   ```c++
   const auto indices_ptr = indices.data();

   const auto scatter_indices_it = thrust::make_transform_iterator(
       thrust::make_counting_iterator(0),
       [batch, indices_ptr]
       __host__ __device__ (int idx) {
           const int batch_id = idx / batch;
           const int elem_id = idx % batch;
           return indices_ptr[batch_id] * batch + elem_id;
       });
   ```

   while in the specific case where you just want to scatter to the odd batches, you should rather use

   ```c++
   const auto scatter_indices_it = thrust::make_transform_iterator(
       thrust::make_counting_iterator(0),
       [batch]
       __host__ __device__ (int idx) {
           const int batch_id = idx / batch;
           const int elem_id = idx % batch;
           return (batch_id * 2 + 1) * batch + elem_id;
       });
   ```

   The scatter operation itself is easy:

   ```c++
   thrust::scatter(
       arr2.cbegin(), arr2.cend(),
       scatter_indices_it,
       arr1.begin());
   ```
2. There are certainly multiple possible ways of doing this. The one that came to mind first for me was merging the two vectors with `thrust::merge_by_key`, where the keys are generated using a similar scheme as above for the scatter indices: for `batch = 3`, the keys for `arr1` are `0, 0, 0, 2, 2, 2, ...` and the keys for `arr2` are `1, 1, 1, 3, 3, 3, ...`, so the merge interleaves the two inputs batch by batch:

   ```c++
   const auto merge_keys1_it = thrust::make_transform_iterator(
       thrust::make_counting_iterator(0),
       [batch]
       __host__ __device__ (int idx) {
           const int batch_id = idx / batch;
           return batch_id * 2;
       });
   const auto merge_keys2_it = thrust::make_transform_iterator(
       thrust::make_counting_iterator(0),
       [batch]
       __host__ __device__ (int idx) {
           const int batch_id = idx / batch;
           return batch_id * 2 + 1;
       });
   thrust::merge_by_key(
       merge_keys1_it, merge_keys1_it + arr1.size(),
       merge_keys2_it, merge_keys2_it + arr2.size(),
       arr1.cbegin(),
       arr2.cbegin(),
       thrust::make_discard_iterator(),
       arr3.begin());
   ```

   This works and is relatively elegant, but probably not ideal for performance due to the complexity of the merge algorithm and the regularity of the operation. A (probably) more performant approach is to create a fancy iterator that takes care of the whole interleaving operation:

   ```c++
   const auto arr1_ptr = arr1.data();
   const auto arr2_ptr = arr2.data();

   const auto batch_interleave_it = thrust::make_transform_iterator(
       thrust::make_counting_iterator(0),
       [batch, arr1_ptr, arr2_ptr]
       __host__ __device__ (int idx) -> int {
           const int batch_id = idx / batch;
           const int batch_el = idx % batch;
           const int in_idx = (batch_id / 2) * batch + batch_el;
           const bool even_batch = (batch_id % 2 == 0);
           return even_batch ? arr1_ptr[in_idx] : arr2_ptr[in_idx];
       });
   ```

   This iterator can now be fed to the next algorithm in your pipeline for kernel fusion, or used to initialize a new vector using the constructor that takes a begin and an end iterator. If the interleaved `arr3` will be used (read from) multiple times in the future, you should probably put it into a new vector instead of reusing the iterator, because the iterator doesn't allow for coalesced global memory access if `batch` isn't a multiple of the warp size (32). A minimal fusion sketch follows below.
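   As a minimal sketch of such fusion, assuming (hypothetically) that the next pipeline step is a plain sum over the interleaved values, `thrust::reduce` with an explicit `thrust::device` execution policy can consume the iterator directly:

   ```c++
   // Hypothetical fused consumer (not part of the original answer):
   // sum the interleaved sequence without ever materializing arr3.
   // Needs <thrust/reduce.h> and <thrust/execution_policy.h>.
   // out_size == arr1.size() + arr2.size()
   const int interleaved_sum = thrust::reduce(
       thrust::device,  // run on the device, where the captured pointers are valid
       batch_interleave_it,
       batch_interleave_it + out_size,
       0);              // initial value of the sum
   ```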

Complete source code:

Due to the use of device lambdas, nvcc needs `-extended-lambda`. Since CUDA 12, it also needs `-rdc=true` for some reason.
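
For example, a possible compile invocation (the source and output file names here are arbitrary assumptions):

```
nvcc -extended-lambda -rdc=true -o interleave interleave.cu
```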

```c++
#include <cassert>
#include <iostream>
#include <iterator>

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

#include <thrust/scatter.h>
#include <thrust/merge.h>

#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>

template <typename T>
void print(const thrust::device_vector<T> &vec) {
    thrust::host_vector<T> h_vec(vec);
    std::ostream_iterator<T> out_it(std::cout, ", ");
    thrust::copy(h_vec.cbegin(), h_vec.cend(),
                 out_it);
    std::cout << '\n';
}

void part1() {
    constexpr int h_arr1[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    constexpr int h_arr2[] = {1, 1, 1, 2, 2, 2};
    constexpr int batch = 3;

    thrust::device_vector<int> arr1(std::cbegin(h_arr1), std::cend(h_arr1));
    thrust::device_vector<int> arr2(std::cbegin(h_arr2), std::cend(h_arr2));

    assert(arr1.size() == 2 * arr2.size());

#if 1
    // specialized version where arr2 is scattered to "uneven"
    // batches
    const auto scatter_indices_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            const int elem_id = idx % batch;
            return (batch_id * 2 + 1) * batch + elem_id;
        });
#else
    // more general version where arr2 is scattered to batches given
    // in indices
    constexpr int h_indices[] = {1, 3};
    thrust::device_vector<int> indices(
        std::cbegin(h_indices), std::cend(h_indices));
    const auto indices_ptr = indices.data();

    const auto scatter_indices_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch, indices_ptr]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            const int elem_id = idx % batch;
            return indices_ptr[batch_id] * batch + elem_id;
        });
#endif

    thrust::scatter(
        arr2.cbegin(), arr2.cend(),
        scatter_indices_it,
        arr1.begin());

    print(arr1);
}

void part2() {
    constexpr int h_arr1[] = {1, 2, 3, 4, 5, 6};
    constexpr int h_arr2[] = {1, 1, 1, 2, 2, 2};
    constexpr int batch = 3;

    thrust::device_vector<int> arr1(std::cbegin(h_arr1), std::cend(h_arr1));
    thrust::device_vector<int> arr2(std::cbegin(h_arr2), std::cend(h_arr2));
    const auto out_size = arr1.size() + arr2.size();

    assert(arr1.size() == arr2.size());

#if 1
    const auto arr1_ptr = arr1.data();
    const auto arr2_ptr = arr2.data();

    const auto batch_interleave_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch, arr1_ptr, arr2_ptr]
        __host__ __device__ (int idx) -> int {
            const int batch_id = idx / batch;
            const int batch_el = idx % batch;
            const int in_idx = (batch_id / 2) * batch + batch_el;
            const bool even_batch = (batch_id % 2 == 0);
            return even_batch ? arr1_ptr[in_idx] : arr2_ptr[in_idx];
        });

    // only put into vector if really needed, better to feed the
    // iterator directly to the next operation for kernel fusion
    thrust::device_vector<int> arr3(
        batch_interleave_it, batch_interleave_it + out_size);
#else
    thrust::device_vector<int> arr3(out_size);

    const auto merge_keys1_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            return batch_id * 2;
        });
    const auto merge_keys2_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            return batch_id * 2 + 1;
        });

    thrust::merge_by_key(
        merge_keys1_it, merge_keys1_it + arr1.size(),
        merge_keys2_it, merge_keys2_it + arr2.size(),
        arr1.cbegin(),
        arr2.cbegin(),
        thrust::make_discard_iterator(),
        arr3.begin());
#endif

    print(arr3);
}

int main(void) {
    part1();
    part2();
}
```
