AVX/AVX2寄存器加载数组末尾数据时如何避免越界?

huangapple go评论65阅读模式
英文:

How to go not out of bounds when loading data from the end of an array into AVX/AVX2 registers?

问题

If I know I have e.g. at least 4 doubles sitting at given (aligned) location in memory, double *d, I can simply do __m256d x = _mm256_load_pd(&d[i]), i.e. load them into an AVX(2) register.

The question is: How do I correctly handle cases where there aren't 4 doubles left at the given location, i.e. I'd theoretically access the array out of bounds?

One solution that I have been using so far is to only allocate memory of multiples of 4 * 8 bytes in this specific case. Alternatively, for cases where I do not control the memory allocation completely, I have also been playing with stuff like this, assuming that this probably not the way to go:

static __m256d inline _load_256d(size_t diff, size_t i, double *d){

    if (diff == 4) {
        return _mm256_load_pd(&d[i]);
    }
    if (diff == 3) {
        return _mm256_set_pd(0.0, d[i+2], d[i+1], d[i]);
    }
    if (diff == 2) {
        return _mm256_set_pd(0.0, 0.0, d[i+1], d[i]);
    }
    return _mm256_set_pd(0.0, 0.0, 0.0, d[i]);

}

What is the (a) canonical solution for a case like this?

英文:

If I know I have e.g. at least 4 doubles sitting at given (aligned) location in memory,
double *d, I can simply do __m256d x = _mm256_load_pd(&d[i]), i.e. load them into an AVX(2) register.

The question is: How do I correctly handle cases where there aren't 4 doubles left at the given location, i.e. I'd theoretically access the array out of bounds?

One solution that I have been using so far is to only allocate memory of multiples of 4 * 8 bytes in this specific case. Alternatively, for cases where I do not control the memory allocation completely, I have also been playing with stuff like this, assuming that this probably not the way to go:

static __m256d inline _load_256d(size_t diff, size_t i, double *d){

    if (diff == 4) {
        return _mm256_load_pd(&d[i]);
    }
    if (diff == 3) {
        return _mm256_set_pd(0.0, d[i+2], d[i+1], d[i]);
    }
    if (diff == 2) {
        return _mm256_set_pd(0.0, 0.0, d[i+1], d[i]);
    }
    return _mm256_set_pd(0.0, 0.0, 0.0, d[i]);

}

What is the (a) canonical solution for a case like this?

答案1

得分: 2

对于读取操作,假设整体向量的起始位置对齐,只需读取整个SIMD块并忽略不需要的元素。硬件设计使得如果一个块的第一个字节可读,那么该块的所有字节都是可读的(因为用于映射和保护内存的页面会对齐到至少与SIMD块对齐一样大的边界)。

对于写入操作,没有一个标准的答案;根据情况有多个选择,包括:

  • 要求调用软件提供填充,以便可以始终执行整块存储,即使只有最后一个块中使用了一个元素。
  • 使用一条带有掩码的指令,以指定要更新的元素。
  • 编写与主循环分开的代码来处理最后一个块,使用标量元素或部分块的指令(例如,使用16字节的SIMD指令而不是32字节的指令)。
  • 使用非对齐存储将一个以目标向量结束的完整块存储。这将重叠在前一个块中的元素,因此可以将其存储两次(如果计算允许的话),或者合并(根据需要加载、置换、存储)。(这也需要小心处理整个向量小于一个完整块的情况。)
  • 如果应用程序是单线程的(因此可以确定没有其他可能写入同一块的代码正在执行),则读取最后一个块,将更改的元素合并,并写入最后一个块。
英文:

For reads, and assuming the start of the overall vector is aligned, one simply reads the entire SIMD block and ignores the undesired elements. The hardware design is such that if the first byte of a block is readable, all bytes of the block are readable (because the pages used to map and protect memory are aligned to boundaries at least as large as the SIMD block alignments).

For writes, there is no canonical answer; there are multiple options depending on circumstances, including:

  • Require the calling software to provide padding so that whole-block stores can always be performed even if only one element is used in the last block.
  • Use an instruction that stores with a mask to specify which elements are updated.
  • Write code separate from the main loop to handle the last block using instructions for scalar elements or partial blocks (e.g., 16-byte SIMD instructions instead of 32-byte instructions).
  • Use an unaligned store to store a whole block that ends where the destination vector ends. This will overlap elements in the prior block, so they can either be stored twice (if the computation permits) or merged (load, permute as necessary, store). (This also requires taking care to handle the case where the entire vector is less than a full block.)
  • If the application is single-threaded (so it is known no other code that could be writing to the same block is executing), read the last block, merge in the changed elements, and write the last block.

huangapple
  • 本文由 发表于 2023年4月19日 17:04:14
  • 转载请务必保留本文链接:https://go.coder-hub.com/76052647.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定