When using thrust, is it legitimate to create a std::array inside a __host__ __device__ functor?
Question
I wrote a toy code to test some ideas
#include <thrust/transform.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>
#include <iostream>
#include <array>
#include <vector>

#define N 20

struct func {
    __host__ __device__
    float operator()(float x) { return x*2; }
};

template <typename S>
struct O {
    const std::array<float,2> a;
    O(std::array<float,2> a): a(a) {}
    S f;
    __host__ __device__
    float operator()(float &v) {
        std::array<int,3> b = {2,3,4};
        int tmp;
        for (int i=0; i<3; i++) {
            tmp = thrust::reduce(thrust::device,b.begin(),b.end(),0);
            printf("%d",tmp);
        }
        return a[0]*v + a[1] + f(a[0]);
    }
};

int main(void) {
    thrust::host_vector<float> _v1(N);
    thrust::device_vector<float> v1 = _v1, v2;
    thrust::fill(v1.begin(),v1.end(),12);
    v2.resize(N);
    std::array<float,2> a{1,2};
    auto c_itor = thrust::make_counting_iterator(0);
    thrust::transform(v1.begin(),v1.end(),v2.begin(),O<func>(a));
    thrust::copy(v2.begin(),v2.end(),std::ostream_iterator<float>(std::cout," "));
}
This code runs fine when compiled with nvcc --expt-relaxed-constexpr -std=c++17. One can see that several std containers, such as std::array, appear in a __host__ __device__ functor. What I want to know is:
- Is writing it this way reasonable? (in terms of efficiency, not grammatical validity)
- Since the code runs correctly, where are the std objects stored? (device or host)
Answer 1
Score: 2
The special case of using std::array with C++17 or higher and --expt-relaxed-constexpr works because std::array is a very thin wrapper around a C-style array and, with C++17, all the member functions that you used are constexpr. I think all member functions except std::array::fill and std::array::swap are constexpr as of C++17; those two got the constexpr treatment with C++20.
So for performance considerations your code should perform the same as when using float a[2] and int b[3]. This means that the values are stored in registers if possible (this depends on loop unrolling for b and on register pressure in general). This is fine as long as you don't go overboard with the size of the arrays. See e.g. this answer for a deeper discussion of arrays, registers and local memory.
Other Containers / Alternatives:
For other STL containers that use dynamic memory you probably won't be as lucky in terms of member functions being constexpr. The HPC nvc++ compiler (formerly the PGI C++ compiler) does not need __device__ markers, so in theory one can use a lot more STL functionality in device code, but in most cases that is a bad idea in terms of performance. STL functions must also still conform to CUDA's C++ language restrictions.
Nvidia is developing its own C++ standard library implementation with its own device extensions in libcu++. There are no containers yet, but they might come in the future. For hash tables there is the cuCollections library (WIP).