2023年5月14日 20:35:26go评论76阅读模式

英文:

How can I do segmented reduction using CUDA thrust?

问题

I want to store partial reduction results in an array.

Say I have data[8] = {10,20,30,40,50,60,70,80}.
And if I divide the data with the chunk_size of 2, the chunks will be {10,20}, {30,40}, ... , {70,80}.

If I target the summation, the reduction in total will be 360 but I want to get an array of partial_sums = {30,70,110,150} which is storing the partial sum of each block.

So far, what I have in mind is to construct an iterator strided_iterator, that will access 0, 2, ... th index of data[8] = {10,20,30,40,50,60,70,80} and something like

thrust::reduce(stride_iterator, stride_iterator + 2,
               partial_sums.begin(),
               thrust::plus<int>());

giving the desired result, but have no idea how could this be done efficiently.

For strided access, thrust/examples/strided_range.cu has a solution but this seems to be not applicable to store segmented reductions.

Of course I can brutally do it with a loop like this,

for (int i = 0; i<4; i++) {
  partial_sums[i] = thrust::reduce(data+2*i, data+2*i+2, 0, thrust::plus<int>());
}

But this kind of practice is what CUDA thrust is trying to avoid as much as possible, right? Somehow I should be able to put it all in a single Thrust call.

英文:

I want to store partial reduction results in an array.

Say I have data[8] = {10,20,30,40,50,60,70,80}.
And if I divide the data with the chunk_size of 2, the chunks will be {10,20}, {30,40}, ... , {70,80}.

If I target the summation, the reduction in total will be 360 but I want to get an array of partial_sums = {30,70,110,150} which is storing the partial sum of each block.

So far, what I have in mind is to construct an iterator strided_iterator, that will access 0, 2, ... th index of data[8] = {10,20,30,40,50,60,70,80} and something like

thrust::reduce(stride_iterator, stride_iterator + 2,
               partial_sums.begin(),
               thrust::plus&lt;int&gt;());

giving the desired result, but have no idea how could this be done efficiently.

For strided access, thrust/examples/strided_range.cu has a solution but this seems to be not applicable to store segmented reductions.

Of course I can brutally do it with a loop like this,

for (int i = 0; i&lt;4; i++) {
  partial_sums[i] = thrust::reduce(data+2*i, data+2*i+2, 0, thrust::plus&lt;int&gt;());
}

But this kind of practice is what CUDA thrust is trying to avoid as much as possible, right? Somehow I should be able to put it all in a single Thrust call.

答案1

得分: 1

以下是您提供的代码的翻译部分：

根据 https://stackoverflow.com/questions/42260493/reduce-multiple-blocks-of-equal-length-that-are-arranged-in-a-big-vector-using-c 中有用的回答，到目前为止，我想到的如下所示。

实际上，我想在每个块中获取最小或最大值。

using namespace thrust::placeholders;
int main(int argc, char **argv) {

  int N = atoi(argv[1]);
  int K = atoi(argv[2]);

  std::cout << "N " << N << " K " << K << std::endl;

  typedef int mytype;

  thrust::device_vector<mytype> data(N*K);
  thrust::device_vector<mytype> sums(N);

  thrust::sequence(data.begin(),data.end());

  // 方法1
  thrust::reduce_by_key(thrust::device,
                        
  thrust::make_transform_iterator(thrust::counting_iterator<int>(0),  _1/K),
                        
  thrust::make_transform_iterator(thrust::counting_iterator<int>(N*K),_1/K),
                        data.begin(),
                        thrust::discard_iterator<int>(),
                        sums.begin(),
                        thrust::equal_to<int>(),
                        thrust::minimum<mytype>());

  // 方法2（不好的方法）
  for (int i=0; i<N; i++) {
    int res = thrust::reduce(data.begin()+K*i, data.begin()+K*i+K,std::numeric_limits<mytype>::max(),thrust::minimum<mytype>());
  }

  // 仅打印前10个结果
  thrust::copy_n(sums.begin(),10,std::ostream_iterator<mytype>(std::cout, ","));
  std::cout << std::endl;

  return 0;
}

下面是通过sdkTimer测量的估计运行时间显示的结果。可以看到，使用reduce_by_key的方法1比使用for循环的方法2要快得多。

[sangjun@newmaster01 05_thrust]$ ./exe 1000 100
N 1000 K 100
 - Elapsed time: 0.00008 sec 
 - Elapsed time: 0.02266 sec 
0,100,200,300,400,500,600,700,800,900,

[sangjun@newmaster01 05_thrust]$ ./exe 1000 256
N 1000 K 256
 - Elapsed time: 0.00008 sec 
 - Elapsed time: 0.02084 sec 
0,256,512,768,1024,1280,1536,1792,2048,2304,

[sangjun@newmaster01 05_thrust]$ ./exe 100000 100
N 100000 K 100
 - Elapsed time: 0.00016 sec 
 - Elapsed time: 1.98978 sec 
0,100,200,300,400,500,600,700,800,900,

[sangjun@newmaster01 05_thrust]$ ./exe 100000 256
N 100000 K 256
 - Elapsed time: 0.00027 sec 
 - Elapsed time: 1.92896 sec 
0,256,512,768,1024,1280,1536,1792,2048,2304,

希望这有助于您理解代码的翻译部分。

英文:

Based on the useful answer in
https://stackoverflow.com/questions/42260493/reduce-multiple-blocks-of-equal-length-that-are-arranged-in-a-big-vector-using-c, what I come up with so far is like follows.

In fact I wanted to get min or max values in each chunk.

using namespace thrust::placeholders;
int main(int argc, char **argv) {

  int N = atoi(argv[1]);
  int K = atoi(argv[2]);

  std::cout &lt;&lt; &quot;N &quot; &lt;&lt; N &lt;&lt; &quot; K &quot; &lt;&lt; K &lt;&lt; std::endl;

  typedef int mytype;

  thrust::device_vector&lt;mytype&gt; data(N*K);
  thrust::device_vector&lt;mytype&gt; sums(N);

  thrust::sequence(data.begin(),data.end());

  // method 1
  thrust::reduce_by_key(thrust::device,
                        
  thrust::make_transform_iterator(thrust::counting_iterator&lt;int&gt;(0),  _1/K),
                        
  thrust::make_transform_iterator(thrust::counting_iterator&lt;int&gt;(N*K),_1/K),
                        data.begin(),
                        thrust::discard_iterator&lt;int&gt;(),
                        sums.begin(),
                        thrust::equal_to&lt;int&gt;(),
                        thrust::minimum&lt;mytype&gt;());

  // method 2 (bad)
  for (int i=0; i&lt;N; i++) {
    int res = thrust::reduce(data.begin()+K*i, data.begin()+K*i+K,std::numeric_limits&lt;mytype&gt;::max(),thrust::minimum&lt;mytype&gt;());
  }

  // just print the first 10 results
  thrust::copy_n(sums.begin(),10,std::ostream_iterator&lt;mytype&gt;(std::cout, &quot;,&quot;));
  std::cout &lt;&lt; std::endl;

  return 0;
}

The results are shown below with the estimated runtime measured via sdkTimer. As can be seen, method 1 with reduce_by_key is much~ faster than the second one with for loop.

[sangjun@newmaster01 05_thrust]$ ./exe 1000 100
N 1000 K 100
 - Elapsed time: 0.00008 sec 
 - Elapsed time: 0.02266 sec 
0,100,200,300,400,500,600,700,800,900,
[sangjun@newmaster01 05_thrust]$ ./exe 1000 256
N 1000 K 256
 - Elapsed time: 0.00008 sec 
 - Elapsed time: 0.02084 sec 
0,256,512,768,1024,1280,1536,1792,2048,2304,
[sangjun@newmaster01 05_thrust]$ ./exe 100000 100
N 100000 K 100
 - Elapsed time: 0.00016 sec 
 - Elapsed time: 1.98978 sec 
0,100,200,300,400,500,600,700,800,900,
[sangjun@newmaster01 05_thrust]$ ./exe 100000 256
N 100000 K 256
 - Elapsed time: 0.00027 sec 
 - Elapsed time: 1.92896 sec 
0,256,512,768,1024,1280,1536,1792,2048,2304,

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

如何使用CUDA Thrust进行分段归约？

问题

答案1

nlohmann/json：如何将字符串转换为列表并遍历该列表？

将对象与键值匹配模式组合在一起。

Is the code below well formed, in particular regarding aliasing rules?

使用C++17从另一个构造函数调用不同参数类型的构造函数

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论