How can I access a storage buffer using a non-aligned offset?

Question

I have a storage buffer that contains packed pixels in BGR format (3 bytes per pixel).

I would like to write a simple compute shader that writes each pixel to an RGBA texture.

However, I could not find a way to access a non-aligned address in a shader (in either GLSL or HLSL).

HLSL, for example, has ByteAddressBuffer, but its Load functions require addresses to be 4-byte aligned.

ByteAddressBuffer inputBuffer : register(t0); // bgr, 3 bytes per pixel
RWTexture2D<float4> outputTexture : register(u1); // rgba unorm texture

[numthreads(16, 16, 1)]
void main(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    // Calculate the coordinates of the current thread in the output texture
    uint2 texCoord = dispatchThreadID.xy;

    // Calculate the starting byte offset of the pixel in the input buffer
    uint byte_offset = (texCoord.x + texCoord.y * 1024) * 3; // 1024 is image width, 3 bytes per pixel

    // Read the B, G, and R values from the input buffer - doesn't work
    uint bgr_value = inputBuffer.Load(byte_offset);
    ...
}

How can I achieve this?

Answer 1

Score: 2

To store and access data smaller than 4 bytes in a shader storage buffer, you need to enable VK_KHR_8bit_storage or VK_KHR_16bit_storage (after checking that the device supports it; as far as I can tell, just about any modern desktop GPU does, though support on mobile GPUs looks bleaker). Using the std430 layout should then let you access individual bytes of your pixel array. In GLSL, the data types are uint8_t etc. Don't forget to add #extension GL_EXT_shader_8bit_storage: require.

That said, I'd suggest processing multiple pixels per thread instead (e.g. four, loading 3*4 = 12 bytes at once). Loading individual bytes may carry a performance penalty that may or may not matter for your use case.

Answer 2

Score: 1

Indeed, ByteAddressBuffer requires loads to be 4-byte aligned (anything else is undefined behaviour).

If you cannot use extensions (I don't think there is one in DirectX), you have two options:

  • If you process one pixel per thread, check whether the pixel's byte address is 4-byte aligned. If it is, load it as is; if not, perform two aligned loads and merge the pixel's bytes with some bit-shift magic.
  • Since you are using a compute shader, each thread is allowed to write several pixels, so you can load 4 pixels in a row (4 * 3 = 12 bytes, so 4-byte alignment is guaranteed) and then write those 4 pixels.
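The first option can be sketched roughly as follows. This is only an illustration under stated assumptions: LoadBgr and IMAGE_WIDTH are hypothetical names that do not appear in the original post, rows are assumed to be tightly packed, and the buffer is assumed to hold little-endian words (byte at the lowest address in the low bits). The helper rounds the byte offset down to a 4-byte boundary and issues a second aligned load only when the 3-byte pixel straddles two words.

```hlsl
ByteAddressBuffer inputBuffer : register(t0);      // packed BGR, 3 bytes per pixel
RWTexture2D<float4> outputTexture : register(u1);  // RGBA unorm texture

static const uint IMAGE_WIDTH = 1024; // assumption: same width as in the question

// Hypothetical helper: returns the 3 bytes starting at any byte offset
// in the low 24 bits of the result, using only 4-byte-aligned loads.
uint LoadBgr(uint byteOffset)
{
    uint alignedOffset = byteOffset & ~3u;   // round down to a multiple of 4
    uint shift = (byteOffset & 3u) * 8u;     // bit position of the pixel's first byte
    uint lo = inputBuffer.Load(alignedOffset);

    if (shift <= 8u)
    {
        // All three bytes live inside the first word.
        return (lo >> shift) & 0x00FFFFFFu;
    }

    // The pixel straddles two words: merge the tail of the first
    // word with the head of the next one.
    uint hi = inputBuffer.Load(alignedOffset + 4u);
    return ((lo >> shift) | (hi << (32u - shift))) & 0x00FFFFFFu;
}

[numthreads(16, 16, 1)]
void main(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    uint2 texCoord = dispatchThreadID.xy;

    // Byte offset of this pixel: 3 bytes per pixel, rows tightly packed.
    uint byteOffset = (texCoord.y * IMAGE_WIDTH + texCoord.x) * 3u;
    uint bgr = LoadBgr(byteOffset);

    // Byte 0 is B, byte 1 is G, byte 2 is R.
    float b = float(bgr & 0xFFu) / 255.0;
    float g = float((bgr >> 8u) & 0xFFu) / 255.0;
    float r = float((bgr >> 16u) & 0xFFu) / 255.0;

    outputTexture[texCoord] = float4(r, g, b, 1.0);
}
```

The branch on shift avoids the second load for pixels that fit in one word. Whether that beats always loading two words is something to measure on the target hardware.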

I remember implementing both, and the second one was much faster.
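A minimal sketch of the second option, with each thread converting four horizontally adjacent pixels from one 12-byte Load3. The names IMAGE_WIDTH and UnpackBgr are hypothetical, and the sketch assumes tightly packed rows and an image width that is a multiple of 4, so that every group of four pixels starts on a 4-byte boundary.

```hlsl
ByteAddressBuffer inputBuffer : register(t0);      // packed BGR, 3 bytes per pixel
RWTexture2D<float4> outputTexture : register(u1);  // RGBA unorm texture

static const uint IMAGE_WIDTH = 1024; // assumption: multiple of 4, as in the question

// Hypothetical helper: converts a BGR value held in the low 24 bits to RGBA.
float4 UnpackBgr(uint bgr)
{
    float b = float(bgr & 0xFFu);
    float g = float((bgr >> 8u) & 0xFFu);
    float r = float((bgr >> 16u) & 0xFFu);
    return float4(r, g, b, 255.0) / 255.0;
}

[numthreads(16, 16, 1)]
void main(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    // Each thread covers pixels x .. x+3 of row y.
    uint2 firstCoord = uint2(dispatchThreadID.x * 4u, dispatchThreadID.y);

    // 4 pixels * 3 bytes = 12 bytes = 3 aligned words.
    uint byteOffset = (firstCoord.y * IMAGE_WIDTH + firstCoord.x) * 3u;
    uint3 w = inputBuffer.Load3(byteOffset);

    // Byte layout: w.x = bytes 0..3, w.y = bytes 4..7, w.z = bytes 8..11.
    uint p0 = w.x & 0x00FFFFFFu;                         // bytes 0, 1, 2
    uint p1 = (w.x >> 24u) | ((w.y & 0x0000FFFFu) << 8u); // bytes 3, 4, 5
    uint p2 = (w.y >> 16u) | ((w.z & 0x000000FFu) << 16u); // bytes 6, 7, 8
    uint p3 = w.z >> 8u;                                  // bytes 9, 10, 11

    outputTexture[firstCoord + uint2(0u, 0u)] = UnpackBgr(p0);
    outputTexture[firstCoord + uint2(1u, 0u)] = UnpackBgr(p1);
    outputTexture[firstCoord + uint2(2u, 0u)] = UnpackBgr(p2);
    outputTexture[firstCoord + uint2(3u, 0u)] = UnpackBgr(p3);
}
```

With this layout the dispatch needs only a quarter as many threads in x, for example Dispatch(IMAGE_WIDTH / 4 / 16, height / 16, 1) when both dimensions divide evenly.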

huangapple
  • Posted on June 5, 2023 at 18:31:40
  • When reposting, please keep this link: https://go.coder-hub.com/76405552.html