Replace/Merge operations in vectors using CUDA Thrust
Question
I have two operations for manipulating elements in device vectors using CUDA Thrust. Which methods can implement these two operations more efficiently?

- Replace part of the values of a vector in batches with the values from another vector. An example is shown below:

  arr1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
  arr2 = [1, 1, 1, 2, 2, 2]
  // After replacing {4, 5, 6} and {10, 11, 12} with batch = 3:
  arr1 = [1, 2, 3, 1, 1, 1, 7, 8, 9, 2, 2, 2]

  In my case, I always have size(arr1) / size(arr2) = 2. We replace the values in arr1 starting from the indices 1 * batch and 3 * batch. In most cases, I need to replace items in odd batches; the case of general batch indices is also needed.

- Merge two vectors, alternating batches of indices. The same question is raised in "How to merge 2 vectors alternating indexes?", but for the R language.

  arr1 = [1, 2, 3, 4, 5, 6]
  arr2 = [1, 1, 1, 2, 2, 2]
  // After merging arr1 and arr2:
  arr3 = [1, 2, 3, 1, 1, 1, 4, 5, 6, 2, 2, 2]

replace_copy_if may work, but I don't know how to combine it with fancy iterators. Additionally, some blogs show that replace_copy_if is slower than copy_if.
Answer 1
Score: 3
- This operation scatters the values in arr2 into arr1, so let's use thrust::scatter. The indices to which the values are scattered can be calculated with a thrust::transform_iterator from a thrust::counting_iterator (like std::ranges::views::iota). For the general case where the batch indices are given as another input thrust::device_vector, you can use

  const auto indices_ptr = indices.data();
  const auto scatter_indices_it = thrust::make_transform_iterator(
      thrust::make_counting_iterator(0),
      [batch, indices_ptr]
      __host__ __device__ (int idx) {
          const int batch_id = idx / batch;
          const int elem_id = idx % batch;
          return indices_ptr[batch_id] * batch + elem_id;
      });

  while in the specific case where you just want to scatter to the odd batches, you should rather use

  const auto scatter_indices_it = thrust::make_transform_iterator(
      thrust::make_counting_iterator(0),
      [batch]
      __host__ __device__ (int idx) {
          const int batch_id = idx / batch;
          const int elem_id = idx % batch;
          return (batch_id * 2 + 1) * batch + elem_id;
      });

  The scatter operation itself is easy:

  thrust::scatter(
      arr2.cbegin(), arr2.cend(),
      scatter_indices_it,
      arr1.begin());
- There are certainly multiple possible ways of doing this. The one that came to mind first for me was merging the two vectors with thrust::merge_by_key, where the keys are generated using a similar scheme as above for the scatter indices:

  const auto merge_keys1_it = thrust::make_transform_iterator(
      thrust::make_counting_iterator(0),
      [batch]
      __host__ __device__ (int idx) {
          const int batch_id = idx / batch;
          return batch_id * 2;
      });
  const auto merge_keys2_it = thrust::make_transform_iterator(
      thrust::make_counting_iterator(0),
      [batch]
      __host__ __device__ (int idx) {
          const int batch_id = idx / batch;
          return batch_id * 2 + 1;
      });
  thrust::merge_by_key(
      merge_keys1_it, merge_keys1_it + arr1.size(),
      merge_keys2_it, merge_keys2_it + arr2.size(),
      arr1.cbegin(),
      arr2.cbegin(),
      thrust::make_discard_iterator(),
      arr3.begin());

  This works and is relatively elegant, but probably not ideal for performance due to the complexity of the merge algorithm and the regularity of the operation. A (probably) more performant ansatz would be creating a fancy iterator that takes care of the whole interleaving operation:

  const auto arr1_ptr = arr1.data();
  const auto arr2_ptr = arr2.data();
  const auto batch_interleave_it = thrust::make_transform_iterator(
      thrust::make_counting_iterator(0),
      [batch, arr1_ptr, arr2_ptr]
      __host__ __device__ (int idx) -> int {
          const int batch_id = idx / batch;
          const int batch_el = idx % batch;
          const int in_idx = (batch_id / 2) * batch + batch_el;
          const bool even_batch = (batch_id % 2 == 0);
          return even_batch ? arr1_ptr[in_idx] : arr2_ptr[in_idx];
      });
  This iterator can now be fed to the next algorithm in your pipeline for kernel fusion, or used to initialize a new vector using the constructor that takes a begin and an end iterator. If the interleaved arr3 is used (read from) multiple times in the future, you should probably put it into a new vector instead of reusing the iterator, as the iterator doesn't allow for coalesced global memory access if batch isn't a multiple of the warp size (32).
Complete source code:

Due to the use of device lambdas, nvcc needs -extended-lambda. Since CUDA 12 it also needs -rdc=true for some reason.
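For illustration, a minimal compile invocation with those flags could look like the following; the file name example.cu is just an assumed placeholder:

nvcc -extended-lambda -rdc=true example.cu -o example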
#include <cassert>
#include <iostream>
#include <iterator>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/scatter.h>
#include <thrust/merge.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
// Copy a device vector to the host and print its elements to stdout.
template <typename T>
void print(const thrust::device_vector<T> &vec) {
    thrust::host_vector<T> h_vec(vec);
    std::ostream_iterator<T> out_it(std::cout, ", ");
    thrust::copy(h_vec.cbegin(), h_vec.cend(), out_it);
    std::cout << '\n';
}
void part1() {
    constexpr int h_arr1[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    constexpr int h_arr2[] = {1, 1, 1, 2, 2, 2};
    constexpr int batch = 3;
    thrust::device_vector<int> arr1(std::cbegin(h_arr1), std::cend(h_arr1));
    thrust::device_vector<int> arr2(std::cbegin(h_arr2), std::cend(h_arr2));
    assert(arr1.size() == 2 * arr2.size());

#if 1
    // specialized version where arr2 is scattered to "uneven"
    // batches
    const auto scatter_indices_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            const int elem_id = idx % batch;
            return (batch_id * 2 + 1) * batch + elem_id;
        });
#else
    // more general version where arr2 is scattered to batches given
    // in indices
    constexpr int h_indices[] = {1, 3};
    thrust::device_vector<int> indices(
        std::cbegin(h_indices), std::cend(h_indices));
    const auto indices_ptr = indices.data();
    const auto scatter_indices_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch, indices_ptr]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            const int elem_id = idx % batch;
            return indices_ptr[batch_id] * batch + elem_id;
        });
#endif

    thrust::scatter(
        arr2.cbegin(), arr2.cend(),
        scatter_indices_it,
        arr1.begin());

    print(arr1);
}
void part2() {
    constexpr int h_arr1[] = {1, 2, 3, 4, 5, 6};
    constexpr int h_arr2[] = {1, 1, 1, 2, 2, 2};
    constexpr int batch = 3;
    thrust::device_vector<int> arr1(std::cbegin(h_arr1), std::cend(h_arr1));
    thrust::device_vector<int> arr2(std::cbegin(h_arr2), std::cend(h_arr2));
    const auto out_size = arr1.size() + arr2.size();
    assert(arr1.size() == arr2.size());

#if 1
    const auto arr1_ptr = arr1.data();
    const auto arr2_ptr = arr2.data();
    const auto batch_interleave_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch, arr1_ptr, arr2_ptr]
        __host__ __device__ (int idx) -> int {
            const int batch_id = idx / batch;
            const int batch_el = idx % batch;
            const int in_idx = (batch_id / 2) * batch + batch_el;
            const bool even_batch = (batch_id % 2 == 0);
            return even_batch ? arr1_ptr[in_idx] : arr2_ptr[in_idx];
        });

    // only put into vector if really needed, better to feed the
    // iterator directly to the next operation for kernel fusion
    thrust::device_vector<int> arr3(
        batch_interleave_it, batch_interleave_it + out_size);
#else
    thrust::device_vector<int> arr3(out_size);

    const auto merge_keys1_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            return batch_id * 2;
        });
    const auto merge_keys2_it = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0),
        [batch]
        __host__ __device__ (int idx) {
            const int batch_id = idx / batch;
            return batch_id * 2 + 1;
        });

    thrust::merge_by_key(
        merge_keys1_it, merge_keys1_it + arr1.size(),
        merge_keys2_it, merge_keys2_it + arr2.size(),
        arr1.cbegin(),
        arr2.cbegin(),
        thrust::make_discard_iterator(),
        arr3.begin());
#endif

    print(arr3);
}
int main(void) {
    part1();
    part2();
}
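With the #if 1 branches active, both parts should print the results shown in the examples from the question (the trailing separator comes from the print helper):

1, 2, 3, 1, 1, 1, 7, 8, 9, 2, 2, 2,
1, 2, 3, 1, 1, 1, 4, 5, 6, 2, 2, 2,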