使用CUDA Thrust进行向量的替换/合并操作

huangapple go评论98阅读模式
英文:

Replace/Merge operations in vectors using CUDA Thrust

问题

我有两个操作可以使用CUDA Thrust来操作设备向量中的元素。哪些方法可以更高效地实现这两个功能?

  1. 批量替换向量中的部分值为另一个向量的值。示例如下:

    arr1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
    arr2 = [1, 1, 1, 2, 2, 2]
    // 批量替换 {4, 5, 6} 和 {10, 11, 12}:
    arr1 = [1, 2, 3, 1, 1, 1, 7, 8, 9, 2, 2, 2]
    

    在我的情况下,我总是有 size(arr1) / size(arr2) = 2

    我们从索引 1 * batch3 * batch 开始替换 arr1 中的值。

    在大多数情况下,我需要替换奇数批次的项目。也需要处理一般批次的情况。

  2. 交替合并两个向量的索引。

    R语言的问题中提出了同样的问题。

    arr1 = [1, 2, 3, 4, 5, 6]
    arr2 = [1, 1, 1, 2, 2, 2]
    // 合并 arr1 和 arr2:
    arr3 = [1, 2, 3, 1, 1, 1, 4, 5, 6, 2, 2, 2]
    

    replace_copy_if 可能适用,但我不知道如何与 fancy iterators 结合使用。此外,一些博客显示 replace_copy_ifcopy_if 更慢。

英文:

I have two operations for manipulating elements in device vectors using CUDA Thrust. Which methods can implement these two functions more efficiently?

  1. Replace part of values of a vector in batch with the values from another vector. Example is shown below:

    arr1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
    arr2 = [1, 1, 1, 2, 2, 2]
    // After replacing {4, 5, 6} and {10, 11, 12} in batch = 3:
    arr1 = [1, 2, 3, 1, 1, 1, 7, 8, 9, 2, 2, 2]
    

    In my case, I always have size(arr1) / size(arr2) = 2.

    We replace the values in arr1 from the index 1 * batch and 3 * batch.

    In most cases, I need to replace items in odd batches. The case for general batches is also needed.

  2. Merge two vectors alternating indexes.

    A same question is raised in How to merge 2 vectors alternating indexes?, but is for R language.

    arr1 = [1, 2, 3, 4, 5, 6]
    arr2 = [1, 1, 1, 2, 2, 2]
    //After merging arr1 and arr2:
    arr3 = [1, 2, 3, 1, 1, 1, 4, 5, 6, 2, 2, 2]
    

    replace_copy_if may work, but I don't know how to combine it with fancy iterators. Additionally, some blogs show that replace_copy_if is slower than copy_if.

答案1

得分: 3

  1. 这个操作将arr2中的值散布到arr1中,所以我们可以使用thrust::scatter。可以使用thrust::transform_iteratorthrust::counting_iterator(类似于std::ranges::views::iota)计算出值被散布到的索引。对于一般情况下,批次的indices作为另一个输入的thrust::device_vector给出,你可以使用以下代码:

    const auto indices_ptr = indices.data();
    
    const auto scatter_indices_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch, indices_ptr]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            const int elem_id = idx % batch;
            return indices_ptr[batch_id] * batch + elem_id;
        });
    

    对于只想散布到奇数批次的特殊情况,你应该使用以下代码:

    const auto scatter_indices_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            const int elem_id = idx % batch;
            return (batch_id * 2 + 1) * batch + elem_id;
        });
    

    散布操作本身很简单:

    thrust::scatter(
        arr2.cbegin(), arr2.cend(),
        scatter_indices_it,
        arr1.begin());
    
  2. 这样做肯定有多种可能的方法。我首先想到的是使用thrust::merge_by_key将两个向量合并,其中键使用与上述散布索引类似的方案生成:

    const auto merge_keys1_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            return batch_id * 2;
        });
    const auto merge_keys2_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            return batch_id * 2 + 1;
        });
    thrust::merge_by_key(
        merge_keys1_it, merge_keys1_it + arr1.size(),
        merge_keys2_it, merge_keys2_it + arr2.size(),
        arr1.cbegin(),
        arr2.cbegin(),
        thrust::make_discard_iterator(),
        arr3.begin());
    

    这个方法可以工作,并且相对优雅,但由于合并算法的复杂性和操作的规则性,可能不是性能上的理想选择。一个(可能)更高效的方法是创建一个复杂的迭代器,它负责整个交错操作:

    const auto arr1_ptr = arr1.data();
    const auto arr2_ptr = arr2.data();
    
    const auto batch_interleave_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch, arr1_ptr, arr2_ptr]
        __host__ __device__ (int idx) -> int {
            const int batch_id = idx / batch;
            const int batch_el = idx % batch;
            const int in_idx = (batch_id / 2) * batch + batch_el;
            const bool even_batch = (batch_id % 2 == 0);
            return even_batch ? arr1_ptr[in_idx] : arr2_ptr[in_idx];
        });
    

    现在可以将这个迭代器提供给管道中的下一个算法进行内核融合,或者使用以beginend迭代器为参数的构造函数初始化一个新的向量。如果未来多次使用交错的arr3(读取),最好将其放入一个新的向量中,而不是重用迭代器,因为如果batch不是warp大小(32)的倍数,迭代器不允许连续的全局内存访问。

完整的源代码:

由于使用了设备lambda,nvcc需要-extended-lambda选项。自CUDA 12以来,由于某种原因,还需要-rdc=true选项。

#include <iostream>
#include <iterator>

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

#include <thrust/scatter.h>
#include <thrust/merge.h>

#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>

template <typename T>
void print(const thrust::device_vector<T> &vec) {
    thrust::host_vector<int> h_vec(vec);
    std::ostream_iterator<int> out_it(std::cout, ", ");
    thrust::copy(h_vec.cbegin(), h_vec.cend(),
                 out_it);
    std::cout << '\n';
}

void part1() {
    constexpr int h_arr1[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    constexpr int h_arr2[] = {1, 1, 1, 2, 2, 2};
    constexpr int batch = 3;

    thrust::device_vector<int> arr1(std::cbegin(h_arr1), std::cend(h_arr1));
    thrust::device_vector<int> arr2(std::cbegin(h_arr2), std::cend(h_arr2));
    
    assert(arr1.size() == 2 * arr2.size());

#if 1
    // specialized version where arr2 is scattered to "uneven"
    // batches
    const auto scatter_indices_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            const int elem_id = idx % batch;
            return (batch_id * 2 + 1) * batch + elem_id;
        });
#else
    // more general version where arr2 is scattered to batches given
    // in indices
    constexpr int h_indices[] = {1, 3};
    thrust::device_vector<int> indices(
        std::cbegin(h_indices), std::cend(h_indices));
    const auto indices_ptr = indices.data();

    const auto scatter_indices_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch, indices_ptr]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            const int elem_id = idx % batch;
            return indices_ptr[batch_id] * batch + elem_id;
        });
#endif
    
    thrust::scatter(
        arr2.cbegin(), arr2.cend(),
        scatter_indices_it,
        arr1.begin());

    print(arr1);
}

void part2() {
    constexpr int h_arr1[] = {1, 2, 3, 4, 5, 6};
    constexpr int h_arr2[] = {1, 1, 1, 2, 2, 2};
    constexpr int batch = 3;

    thrust::device_vector<int> arr1(std::cbegin(h_arr1), std::cend(h_arr1));
    thrust::device_vector<int> arr2(std::cbegin(h_arr2), std::cend(h_arr2));
    const auto out_size = arr1.size() + arr2.size();

    assert(arr1.size() == arr2.size());

#if 1
    const auto arr1_ptr = arr1.data();
    const auto arr2_ptr = arr2.data();

    const auto batch_interleave_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch, arr1_ptr, arr2_ptr]
        __host__ __device__ (int idx) -> int {
            const int batch_id = idx / batch;
            const int batch_el = idx % batch;
            const int in_idx = (batch_id / 2) * batch + batch_el;
            const bool even_batch = (batch_id % 2 == 0);
            return even_batch ? arr1_ptr[in_idx] : arr2_ptr[in_idx];
        });

    // only put into vector if really needed, better to feed the
    // iterator directly to the next operation for kernel fusion
    thrust::device_vector<int> arr3(
        batch_interleave_it, batch_interleave_it + out_size);
#else
    thrust::device_vector<int> arr3(out_size);

    const auto merge_keys1_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            return batch_id * 2;
        });
    const auto merge_keys2_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            return batch_id * 2 + 1;
        });
    
    thrust::merge_by_key(
        merge_keys1_it, merge_keys1_it + arr1.size(),
        merge_keys2_it, merge_keys2_it + arr2.size(),
        arr1.cbegin(),
        arr2.cbegin(),
        thrust::make_discard_iterator(),
        arr3.begin());
#endif

    print(arr3);
}

int main(void) {
    part1();
    part2();
}
英文:
  1. This operation scatters the values in arr2 into arr1, so let's use thrust::scatter. The indices to which the values are scattered can be calculated with a thrust::transform_iterator from a thrust::counting_iterator (like std::ranges::views::iota). For the general case where the batch indices are given as another input thrust::device_vector, you can use

    const auto indices_ptr = indices.data();
    
    const auto scatter_indices_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch, indices_ptr]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            const int elem_id = idx % batch;
            return indices_ptr[batch_id] * batch + elem_id;
        });
    

    while in the specific case where you just want to scatter to the odd batches, you should rather use

    const auto scatter_indices_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            const int elem_id = idx % batch;
            return (batch_id * 2 + 1) * batch + elem_id;
        });
    

    The scatter operation itself is easy:

    thrust::scatter(
        arr2.cbegin(), arr2.cend(),
        scatter_indices_it,
        arr1.begin());
    
  2. There are certainly multiple possible ways of doing this. The one that came to mind first for me was merging the two vectors with thrust::merge_by_key, where the keys are generated using a similar scheme as above for the scatter indices:

    const auto merge_keys1_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            return batch_id * 2;
        });
    const auto merge_keys2_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            return batch_id * 2 + 1;
        });
    thrust::merge_by_key(
        merge_keys1_it, merge_keys1_it + arr1.size(),
        merge_keys2_it, merge_keys2_it + arr2.size(),
        arr1.cbegin(),
        arr2.cbegin(),
        thrust::make_discard_iterator(),
        arr3.begin());
    

    This works and is relatively elegant, but probably not ideal for performance due to the complexity of the merge algorithm and the regularity of the operation. A (probably) more performant ansatz would be creating a fancy iterator that takes care of the whole interleaving operation:

    const auto arr1_ptr = arr1.data();
    const auto arr2_ptr = arr2.data();
    
    const auto batch_interleave_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch, arr1_ptr, arr2_ptr]
        __host__ __device__ (int idx) -&gt; int {
            const int batch_id = idx / batch;
            const int batch_el = idx % batch;
            const int in_idx = (batch_id / 2) * batch + batch_el;
            const bool even_batch = (batch_id % 2 == 0);
            return even_batch ? arr1_ptr[in_idx] : arr2_ptr[in_idx];
        });
    

    This iterator can now be fed to the next algorithm in your pipeline for kernel fusion or used to initialize a new vector using the constructor which takes an begin and an end iterator. If the interleaved arr3 is used (read from) multiple times in future it, you should probably put it into a new vector instead of reusing the iterator as the iterator doesn't allow for coalesced global memory access if batch isn't a multiple of warp size (32).

Complete source code:

Due to the use of device lambdas, nvcc needs -extended-lambda. Since CUDA 12 it also needs -rdc=true for some reason.

#include &lt;iostream&gt;
#include &lt;iterator&gt;

#include &lt;thrust/host_vector.h&gt;
#include &lt;thrust/device_vector.h&gt;

#include &lt;thrust/scatter.h&gt;
#include &lt;thrust/merge.h&gt;

#include &lt;thrust/iterator/transform_iterator.h&gt;
#include &lt;thrust/iterator/counting_iterator.h&gt;
#include &lt;thrust/iterator/discard_iterator.h&gt;

template &lt;typename T&gt;
void print(const thrust::device_vector&lt;T&gt; &amp;vec) {
    thrust::host_vector&lt;int&gt; h_vec(vec);
    std::ostream_iterator&lt;int&gt; out_it(std::cout, &quot;, &quot;);
    thrust::copy(h_vec.cbegin(), h_vec.cend(),
                 out_it);
    std::cout &lt;&lt; &#39;\n&#39;;
}

void part1() {
    constexpr int h_arr1[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    constexpr int h_arr2[] = {1, 1, 1, 2, 2, 2};
    constexpr int batch = 3;

    thrust::device_vector&lt;int&gt; arr1(std::cbegin(h_arr1), std::cend(h_arr1));
    thrust::device_vector&lt;int&gt; arr2(std::cbegin(h_arr2), std::cend(h_arr2));
    
    assert(arr1.size() == 2 * arr2.size());

#if 1
    // specialized version where arr2 is scattered to &quot;uneven&quot;
    // batches
    const auto scatter_indices_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            const int elem_id = idx % batch;
            return (batch_id * 2 + 1) * batch + elem_id;
        });
#else
    // more general version where arr2 is scattered to batches given
    // in indices
    constexpr int h_indices[] = {1, 3};
    thrust::device_vector&lt;int&gt; indices(
        std::cbegin(h_indices), std::cend(h_indices));
    const auto indices_ptr = indices.data();

    const auto scatter_indices_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch, indices_ptr]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            const int elem_id = idx % batch;
            return indices_ptr[batch_id] * batch + elem_id;
        });
#endif
    
    thrust::scatter(
        arr2.cbegin(), arr2.cend(),
        scatter_indices_it,
        arr1.begin());

    print(arr1);
}

void part2() {
    constexpr int h_arr1[] = {1, 2, 3, 4, 5, 6};
    constexpr int h_arr2[] = {1, 1, 1, 2, 2, 2};
    constexpr int batch = 3;

    thrust::device_vector&lt;int&gt; arr1(std::cbegin(h_arr1), std::cend(h_arr1));
    thrust::device_vector&lt;int&gt; arr2(std::cbegin(h_arr2), std::cend(h_arr2));
    const auto out_size = arr1.size() + arr2.size();

    assert(arr1.size() == arr2.size());

#if 1
    const auto arr1_ptr = arr1.data();
    const auto arr2_ptr = arr2.data();

    const auto batch_interleave_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch, arr1_ptr, arr2_ptr]
        __host__ __device__ (int idx) -&gt; int {
            const int batch_id = idx / batch;
            const int batch_el = idx % batch;
            const int in_idx = (batch_id / 2) * batch + batch_el;
            const bool even_batch = (batch_id % 2 == 0);
            return even_batch ? arr1_ptr[in_idx] : arr2_ptr[in_idx];
        });

    // only put into vector if really needed, better to feed the
    // iterator directly to the next operation for kernel fusion
    thrust::device_vector&lt;int&gt; arr3(
        batch_interleave_it, batch_interleave_it + out_size);
#else
    thrust::device_vector&lt;int&gt; arr3(out_size);

    const auto merge_keys1_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            return batch_id * 2;
        });
    const auto merge_keys2_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            return batch_id * 2 + 1;
        });
    
    thrust::merge_by_key(
        merge_keys1_it, merge_keys1_it + arr1.size(),
        merge_keys2_it, merge_keys2_it + arr2.size(),
        arr1.cbegin(),
        arr2.cbegin(),
        thrust::make_discard_iterator(),
        arr3.begin());
#endif

    print(arr3);
}

int main(void) {
    part1();
    part2();
}

huangapple
  • 本文由 发表于 2023年8月9日 16:15:27
  • 转载请务必保留本文链接:https://go.coder-hub.com/76865810.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定