When using thrust, is it legitimate to create a std::array inside a __host__ __device__ functor?

Question

I wrote some toy code to test a few ideas:

#include <thrust/transform.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/reduce.h>
#include <thrust/fill.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>
#include <iostream>
#include <iterator>
#include <array>
#include <vector>

#define N 20

struct func {
        __host__ __device__
        float operator()(float x) { return x*2; }
};

template <typename S>
struct O {
        // std::array member; copied by value into the functor that thrust ships to the device
        const std::array<float,2> a;
        O(std::array<float,2> a): a(a) {}

        S f;
        __host__ __device__
        float operator()(float &v) {
                // std::array created locally inside the (potentially device-side) call
                std::array<int,3> b = {2,3,4};
                int tmp;
                for (int i=0; i<3; i++) {
                        // reduce over b, invoked from inside the functor
                        tmp = thrust::reduce(thrust::device,b.begin(),b.end(),0);
                        printf("%d",tmp);
                }
                return a[0]*v + a[1] + f(a[0]);
        }
};

int main(void) {

        thrust::host_vector<float> _v1(N);
        thrust::device_vector<float> v1 = _v1, v2;
        thrust::fill(v1.begin(),v1.end(),12);
        v2.resize(N);

        std::array<float,2> a{1,2};
        auto c_itor = thrust::make_counting_iterator(0);   // unused
        thrust::transform(v1.begin(),v1.end(),v2.begin(),O<func>(a));

        thrust::copy(v2.begin(),v2.end(),std::ostream_iterator<float>(std::cout," "));

}

This code compiles and runs fine with nvcc --expt-relaxed-constexpr -std=c++17. As one can see, std containers such as std::array appear inside a __host__ __device__ functor. What I want to know is:

  1. Is writing the code this way legitimate? (In terms of efficiency, not syntactic validity.)
  2. Since the code runs correctly, where are the std objects stored? (On the device or on the host?)

Answer 1

Score: 2

The special case of using std::array with C++17 or higher and --expt-relaxed-constexpr works because std::array is a very thin wrapper around a C-style array, and in C++17 all the member functions you used are constexpr. As far as I can tell, every member function except std::array::fill and std::array::swap is constexpr as of C++17; those two got the constexpr treatment in C++20.
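
As a minimal sketch of what that means in practice (my own illustration, assuming nvcc -std=c++17 --expt-relaxed-constexpr; the kernel name and values are made up): the begin()/end() calls below are constexpr host member functions of std::array, which is exactly why nvcc accepts them in device code.

#include <array>
#include <cstdio>

__global__ void sum_kernel() {
        // The array lives entirely on the device (registers or local memory).
        std::array<int,3> b = {2,3,4};
        int sum = 0;
        for (auto it = b.begin(); it != b.end(); ++it)  // begin()/end() are constexpr in C++17
                sum += *it;
        printf("%d\n", sum);  // prints 9
}

int main() {
        sum_kernel<<<1,1>>>();
        cudaDeviceSynchronize();
        return 0;
}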

So in terms of performance your code should behave the same as if you had used float a[2] and int b[3]. This means the values are stored in registers if possible (this depends on loop unrolling for b and on register pressure in general). That is fine as long as you don't go overboard with the size of the arrays. See e.g. this answer for a deeper discussion of arrays, registers and local memory.
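
To make the equivalence concrete, here is a hypothetical rewrite of the functor from the question using raw C arrays (a sketch only: O_raw, a0 and a1 are names I made up, and the thrust::reduce call is replaced by a plain loop for brevity). It should compile to essentially the same device code as the std::array version.

// Drop-in replacement for struct O in the program above.
template <typename S>
struct O_raw {
        float a[2];   // same two floats as the std::array member
        S f;

        O_raw(float a0, float a1) : a{a0, a1} {}

        __host__ __device__
        float operator()(float &v) {
                int b[3] = {2,3,4};   // small fixed-size array, likely kept in registers
                int tmp = 0;
                for (int i = 0; i < 3; i++)
                        tmp += b[i];
                printf("%d", tmp);
                return a[0]*v + a[1] + f(a[0]);
        }
};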

Other Containers / Alternatives:

For other STL containers that use dynamic memory you probably won't be as lucky in terms of member functions being constexpr. The HPC nvc++ compiler (formerly the PGI C++ compiler) does not need __device__ markers, so in theory one can use a lot more STL functionality in device code, but in most cases that is a bad idea in terms of performance. STL functions must also still conform to CUDA's C++ Language Restrictions.
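
For contrast, a sketch (my example, not from the question) of the kind of thing that does not work with nvcc: std::vector allocates on the host heap and its member functions are not constexpr in C++17, so nvcc refuses to instantiate the call for the device path and complains about a host function being called from device code.

#include <vector>

struct bad {
        __device__
        float operator()(float v) {
                std::vector<float> tmp(3, v);   // host-only constructor -> compile error on the device path
                return tmp[0];
        }
};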

Nvidia is developing its own C++ standard library implementation with its own device extensions in libcu++. There are no containers yet, but they might come in the future. For hash tables there is the cuCollections library (WIP).
