英文:
How can I do segmented reduction using CUDA thrust?
问题
I want to store partial reduction results in an array.
Say I have data[8] = {10,20,30,40,50,60,70,80}
.
And if I divide the data
with the chunk_size
of 2
, the chunks will be {10,20}
, {30,40}
, ... , {70,80}
.
If I target the summation, the reduction in total will be 360
but I want to get an array of partial_sums = {30,70,110,150}
which is storing the partial sum of each block.
So far, what I have in mind is to construct an iterator strided_iterator
, that will access 0, 2, ... th index of data[8] = {10,20,30,40,50,60,70,80}
and something like
thrust::reduce(stride_iterator, stride_iterator + 2,
partial_sums.begin(),
thrust::plus<int>());
giving the desired result, but have no idea how could this be done efficiently.
For strided access, thrust/examples/strided_range.cu
has a solution but this seems to be not applicable to store segmented reductions.
Of course I can brutally do it with a loop like this,
for (int i = 0; i<4; i++) {
partial_sums[i] = thrust::reduce(data+2*i, data+2*i+2, 0, thrust::plus<int>());
}
But this kind of practice is what CUDA thrust is trying to avoid as much as possible, right? Somehow I should be able to put it all in a single Thrust call.
英文:
I want to store partial reduction results in an array.
Say I have data[8] = {10,20,30,40,50,60,70,80}
.
And if I divide the data
with the chunk_size
of 2
, the chunks will be {10,20}
, {30,40}
, ... , {70,80}
.
If I target the summation, the reduction in total will be 360
but I want to get an array of partial_sums = {30,70,110,150}
which is storing the partial sum of each block.
So far, what I have in mind is to construct an iterator strided_iterator
, that will access 0, 2, ... th index of data[8] = {10,20,30,40,50,60,70,80}
and something like
thrust::reduce(stride_iterator, stride_iterator + 2,
partial_sums.begin(),
thrust::plus<int>());
giving the desired result, but have no idea how could this be done efficiently.
For strided access, thrust/examples/strided_range.cu
has a solution but this seems to be not applicable to store segmented reductions.
Of course I can brutally do it with a loop like this,
for (int i = 0; i<4; i++) {
partial_sums[i] = thrust::reduce(data+2*i, data+2*i+2, 0, thrust::plus<int>());
}
But this kind of practice is what CUDA thrust is trying to avoid as much as possible, right? Somehow I should be able to put it all in a single Thrust call.
答案1
得分: 1
以下是您提供的代码的翻译部分:
根据 https://stackoverflow.com/questions/42260493/reduce-multiple-blocks-of-equal-length-that-are-arranged-in-a-big-vector-using-c 中有用的回答,到目前为止,我想到的如下所示。
实际上,我想在每个块中获取最小或最大值。
using namespace thrust::placeholders;
int main(int argc, char **argv) {
int N = atoi(argv[1]);
int K = atoi(argv[2]);
std::cout << "N " << N << " K " << K << std::endl;
typedef int mytype;
thrust::device_vector<mytype> data(N*K);
thrust::device_vector<mytype> sums(N);
thrust::sequence(data.begin(),data.end());
// 方法1
thrust::reduce_by_key(thrust::device,
thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1/K),
thrust::make_transform_iterator(thrust::counting_iterator<int>(N*K),_1/K),
data.begin(),
thrust::discard_iterator<int>(),
sums.begin(),
thrust::equal_to<int>(),
thrust::minimum<mytype>());
// 方法2(不好的方法)
for (int i=0; i<N; i++) {
int res = thrust::reduce(data.begin()+K*i, data.begin()+K*i+K,std::numeric_limits<mytype>::max(),thrust::minimum<mytype>());
}
// 仅打印前10个结果
thrust::copy_n(sums.begin(),10,std::ostream_iterator<mytype>(std::cout, ","));
std::cout << std::endl;
return 0;
}
下面是通过sdkTimer测量的估计运行时间显示的结果。可以看到,使用reduce_by_key
的方法1比使用for循环的方法2要快得多。
[sangjun@newmaster01 05_thrust]$ ./exe 1000 100
N 1000 K 100
- Elapsed time: 0.00008 sec
- Elapsed time: 0.02266 sec
0,100,200,300,400,500,600,700,800,900,
[sangjun@newmaster01 05_thrust]$ ./exe 1000 256
N 1000 K 256
- Elapsed time: 0.00008 sec
- Elapsed time: 0.02084 sec
0,256,512,768,1024,1280,1536,1792,2048,2304,
[sangjun@newmaster01 05_thrust]$ ./exe 100000 100
N 100000 K 100
- Elapsed time: 0.00016 sec
- Elapsed time: 1.98978 sec
0,100,200,300,400,500,600,700,800,900,
[sangjun@newmaster01 05_thrust]$ ./exe 100000 256
N 100000 K 256
- Elapsed time: 0.00027 sec
- Elapsed time: 1.92896 sec
0,256,512,768,1024,1280,1536,1792,2048,2304,
希望这有助于您理解代码的翻译部分。
英文:
Based on the useful answer in
https://stackoverflow.com/questions/42260493/reduce-multiple-blocks-of-equal-length-that-are-arranged-in-a-big-vector-using-c, what I come up with so far is like follows.
In fact I wanted to get min or max values in each chunk.
using namespace thrust::placeholders;
int main(int argc, char **argv) {
int N = atoi(argv[1]);
int K = atoi(argv[2]);
std::cout << "N " << N << " K " << K << std::endl;
typedef int mytype;
thrust::device_vector<mytype> data(N*K);
thrust::device_vector<mytype> sums(N);
thrust::sequence(data.begin(),data.end());
// method 1
thrust::reduce_by_key(thrust::device,
thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1/K),
thrust::make_transform_iterator(thrust::counting_iterator<int>(N*K),_1/K),
data.begin(),
thrust::discard_iterator<int>(),
sums.begin(),
thrust::equal_to<int>(),
thrust::minimum<mytype>());
// method 2 (bad)
for (int i=0; i<N; i++) {
int res = thrust::reduce(data.begin()+K*i, data.begin()+K*i+K,std::numeric_limits<mytype>::max(),thrust::minimum<mytype>());
}
// just print the first 10 results
thrust::copy_n(sums.begin(),10,std::ostream_iterator<mytype>(std::cout, ","));
std::cout << std::endl;
return 0;
}
The results are shown below with the estimated runtime measured via sdkTimer. As can be seen, method 1 with reduce_by_key
is much~ faster than the second one with for loop.
[sangjun@newmaster01 05_thrust]$ ./exe 1000 100
N 1000 K 100
- Elapsed time: 0.00008 sec
- Elapsed time: 0.02266 sec
0,100,200,300,400,500,600,700,800,900,
[sangjun@newmaster01 05_thrust]$ ./exe 1000 256
N 1000 K 256
- Elapsed time: 0.00008 sec
- Elapsed time: 0.02084 sec
0,256,512,768,1024,1280,1536,1792,2048,2304,
[sangjun@newmaster01 05_thrust]$ ./exe 100000 100
N 100000 K 100
- Elapsed time: 0.00016 sec
- Elapsed time: 1.98978 sec
0,100,200,300,400,500,600,700,800,900,
[sangjun@newmaster01 05_thrust]$ ./exe 100000 256
N 100000 K 256
- Elapsed time: 0.00027 sec
- Elapsed time: 1.92896 sec
0,256,512,768,1024,1280,1536,1792,2048,2304,
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论