When using thrust, is it legitimate to create a std::array inside a __host__ __device__ functor?
Question
I wrote some toy code to test a few ideas:
#include <thrust/transform.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>
#include <iostream>
#include <iterator>   // std::ostream_iterator
#include <array>
#include <vector>
#define N 20
struct func {
        __host__ __device__
        float operator()(float x) { return x*2; }
};
template <typename S>
struct O {
        const std::array<float,2> a;
        O(std::array<float,2> a): a(a) {}
        S f;
        __host__ __device__
        float operator()(float &v) {
                std::array<int,3> b = {2,3,4};
                int tmp;
                for (int i=0; i<3; i++) {
                        tmp = thrust::reduce(thrust::device,b.begin(),b.end(),0);
                        printf("%d",tmp);
                }
                return a[0]*v + a[1] + f(a[0]);
        }
};
int main(void) {
        thrust::host_vector<float> _v1(N);
        thrust::device_vector<float> v1 = _v1, v2;
        thrust::fill(v1.begin(),v1.end(),12);
        v2.resize(N);
        std::array<float,2> a{1,2};
        auto c_itor = thrust::make_counting_iterator(0);
        thrust::transform(v1.begin(),v1.end(),v2.begin(),O<func>(a));
        thrust::copy(v2.begin(),v2.end(),std::ostream_iterator<float>(std::cout," "));
}
This code compiles and runs fine with nvcc --expt-relaxed-constexpr -std=c++17. As you can see, std containers such as std::array appear inside a __host__ __device__ functor. What I want to know is:
- Is this way of writing it legitimate? (in terms of efficiency, not syntactic validity)
- Since the code runs correctly, where are the std objects stored? (device or host)
Answer 1
Score: 2
The special case of using std::array with C++17 or later and --expt-relaxed-constexpr works because std::array is a very thin wrapper around a C-style array, and in C++17 all of the member functions you use are constexpr. As far as I know, every member function except std::array::fill and std::array::swap is constexpr in C++17; those two only got the constexpr treatment in C++20.
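As a minimal sketch of where that constexpr boundary shows up (assuming nvcc with -std=c++17 and --expt-relaxed-constexpr; the helper function name first_element is made up for illustration):
#include <array>
// Sketch: constexpr members of std::array (e.g. operator[]) are callable from
// device code under --expt-relaxed-constexpr, while std::array::fill is not
// constexpr until C++20, so nvcc would flag it as a host-only call here.
__host__ __device__ float first_element()
{
        std::array<float, 2> a{1.0f, 2.0f};
        // a.fill(0.0f);   // not constexpr in C++17 -> rejected in device code
        return a[0];       // operator[] is constexpr in C++17 -> fine
}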
So in terms of performance your code should behave the same as if you had used float a[2] and int b[3]. That means the values are kept in registers where possible (this depends on the loop over b being unrolled and on register pressure in general), which is fine as long as you don't go overboard with the size of the arrays. See e.g. this answer for a deeper discussion of arrays, registers, and local memory.
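To illustrate that equivalence, here is a minimal sketch of the question's functor rewritten with plain C-style arrays (the name O_raw is made up for illustration; the nested thrust::reduce is replaced by a plain loop):
#include <thrust/transform.h>
#include <thrust/device_vector.h>
#include <cstdio>
// Sketch: the same functor with float a[2] and int b[3] instead of std::array.
// With C++17 and --expt-relaxed-constexpr both versions should compile to
// essentially the same code; the small arrays end up in registers if possible.
struct O_raw {
        float a[2];                      // coefficients, copied by value to the device
        __host__ __device__
        float operator()(float v) const {
                int b[3] = {2, 3, 4};    // small fixed-size array
                int tmp = 0;
                for (int i = 0; i < 3; i++)
                        tmp += b[i];     // plain loop instead of the nested thrust::reduce
                printf("%d", tmp);       // same debug output as the question's functor
                return a[0]*v + a[1] + 2*a[0];   // f(x) = 2*x from the question, inlined
        }
};
// usage sketch:
// thrust::device_vector<float> v1(20, 12.0f), v2(20);
// thrust::transform(v1.begin(), v1.end(), v2.begin(), O_raw{{1.0f, 2.0f}});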
Other Containers / Alternatives:
For other STL containers that use dynamic memory you probably won't be as lucky in terms of member functions being constexpr. The HPC nvc++ compiler (formerly the PGI C++ compiler) does not need __device__ markers, so in theory one can use a lot more STL functionality in device code, but in most cases that is a bad idea in terms of performance. STL functions must also still conform to CUDA's C++ Language Restrictions.
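A minimal sketch of the usual workaround for dynamically sized data, which the answer does not spell out: keep it in a thrust::device_vector on the host side and hand the functor only a raw device pointer and a size (the names scale_by_table and coeffs are made up for illustration):
#include <thrust/transform.h>
#include <thrust/device_vector.h>
#include <vector>
// Sketch: instead of a dynamic STL container inside the functor, the data
// lives in a thrust::device_vector and the functor receives a raw pointer.
struct scale_by_table {
        const float *table;   // raw device pointer, valid while the vector is alive
        int n;
        __host__ __device__
        float operator()(float v) const {
                float s = 0.0f;
                for (int i = 0; i < n; i++)
                        s += table[i];
                return s*v;
        }
};
// usage sketch:
// std::vector<float> h = {0.5f, 0.25f, 0.25f};
// thrust::device_vector<float> coeffs(h.begin(), h.end());
// thrust::device_vector<float> in(20, 1.0f), out(20);
// thrust::transform(in.begin(), in.end(), out.begin(),
//         scale_by_table{thrust::raw_pointer_cast(coeffs.data()), (int)coeffs.size()});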
Nvidia is developing its own C++ standard library implementation, libcu++, with its own device-side extensions. There are no containers yet, but they might come in the future. For hash tables there is the cuCollections library (work in progress).

