How strided memcpy(3) works in libvpx
Question
I'm trying to understand the following function in libvpx (vp8/common/reconinter.c):
    void vp8_copy_mem16x16_c(unsigned char *src, int src_stride,
                             unsigned char *dst, int dst_stride) {
      int r;
      for (r = 0; r < 16; ++r) {
        memcpy(dst, src, 16);
        src += src_stride;
        dst += dst_stride;
      }
    }
(8x8 and 8x4 versions also exist in the same source file.)
It copies 16 bytes from src to dst 16 times, but after each copy it adds a custom stride to both src and dst. Without prior knowledge of computer graphics or DSP, I am very confused by these functions: what is the point of supporting custom strides for src and dst? What are some examples or benefits of using such functions rather than just copying all 16 x 16 bytes at once?
Thank you very much!
Update: to be clear, vp8_copy_mem16x16_c is re-defined as vp8_copy_mem16x16 during the build stage when a vector-optimized version is not available on the target platform.
Answer 1
Score: 2
If I understand correctly, your question is what the stride is for.
In the context of libvpx, there are two major use cases for it:
- Encoding individual blocks of the source stream. If you have an entire image, you can extract a block efficiently by passing a source stride equal to the image's row stride (image width plus any row padding) and a destination stride of 16 for a tightly packed block (or whatever your algorithm needs). With this function, the stride is the distance in bytes between the starts of consecutive rows, because memcpy itself does not advance the pointers. To be clear, most video encoding and decoding operations work on square or rectangular blocks. JPEG is an example of this, but all mp4 and VP8/9 operations are also block-based. This is a very basic, very frequently used operation.
- While most APIs allow non-power-of-two images, efficient memory access, especially on the GPU, pretty much requires power-of-two dimensions (or at least some alignment padding). The source and the destination can have different requirements of this kind, which is where the two separate stride arguments come into play.
In general, however, there is a third use case for strides: sprite blitting. Similar to the first point above, you can very efficiently blit sprites to textures (and/or to the screen, if there is no double buffering) by using strides to copy memory.
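As a rough illustration of the block-extraction case, here is a minimal sketch in C. It restates the loop from the question under the name copy_mem16x16 so that it compiles on its own. The image dimensions, fill_test_image and extract_block are hypothetical names made up for this example and are not part of libvpx.

```c
#include <string.h>

/* Same loop as the function quoted in the question, restated here
 * (with a const source pointer) so the sketch is self-contained. */
static void copy_mem16x16(const unsigned char *src, int src_stride,
                          unsigned char *dst, int dst_stride) {
  for (int r = 0; r < 16; ++r) {
    memcpy(dst, src, 16);
    src += src_stride;
    dst += dst_stride;
  }
}

/* Hypothetical image: 60 visible pixels per row, rows padded to 64 bytes. */
enum { IMG_WIDTH = 60, IMG_HEIGHT = 40, IMG_STRIDE = 64 };

/* Give every visible pixel a value derived from its position;
 * padding bytes at the end of each row are set to 0. */
static void fill_test_image(unsigned char *img) {
  for (int y = 0; y < IMG_HEIGHT; ++y) {
    for (int x = 0; x < IMG_STRIDE; ++x) {
      unsigned char v = (unsigned char)((y * 7 + x) & 0xFF);
      img[y * IMG_STRIDE + x] = (x < IMG_WIDTH) ? v : 0;
    }
  }
}

/* Copy the 16x16 block whose top-left pixel is (x0, y0) into a tightly
 * packed 256-byte buffer. The source stride is the image's row stride;
 * the destination stride is 16 because the block rows are contiguous. */
static void extract_block(const unsigned char *img, int x0, int y0,
                          unsigned char block[16 * 16]) {
  copy_mem16x16(img + y0 * IMG_STRIDE + x0, IMG_STRIDE, block, 16);
}
```

The caller is assumed to keep the block inside the image (x0 + 16 <= IMG_WIDTH and y0 + 16 <= IMG_HEIGHT). The sketch does no bounds checking.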
Answer 2
Score: 2
Consider two two-dimensional arrays with 16-byte elements, say M16 A[1024][1280] and M16 B[1024][1600], and suppose you want to copy a column from array B to array A, as in:
    int AColumn = 37;
    int BColumn = 46;
    for (int i = 0; i < 1024; ++i)
        A[i][AColumn] = B[i][BColumn];
The elements of A this loop operates on are A[0][AColumn], A[1][AColumn], A[2][AColumn], and so on. Since the width of A is 1280 elements, successive elements in the loop are 1280 elements apart in memory, which is 1280 × 16 = 20,480 bytes.
Similarly, successive elements of B in the loop are 1600 elements apart, which is 1600 × 16 = 25,600 bytes.
Thus, if we call vp8_copy_mem16x16_c with a src_stride of 25,600 and a dst_stride of 20,480, it can copy a column from B into a column of A, 16 elements per call, since the function's loop runs 16 times. (For src, we pass the address of the first source element, &B[0][BColumn], and, for dst, the address of the first destination element, &A[0][AColumn], each converted to unsigned char *.)
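A minimal sketch of that column copy in C, with B as the source and A as the destination. The array dimensions are shrunk to keep the example small, and M16 is assumed to be a plain 16-byte struct. M16, the dimensions, copy_mem16x16 and copy_column are all hypothetical names for this illustration, not libvpx definitions.

```c
#include <string.h>

/* Hypothetical 16-byte element type. */
typedef struct { unsigned char bytes[16]; } M16;
_Static_assert(sizeof(M16) == 16, "M16 is assumed to be exactly 16 bytes");

/* Shrunken stand-ins for A[1024][1280] and B[1024][1600]. */
enum { ROWS = 16, A_COLS = 5, B_COLS = 7 };
static M16 A[ROWS][A_COLS];
static M16 B[ROWS][B_COLS];

/* Same loop as the function quoted in the question. */
static void copy_mem16x16(const unsigned char *src, int src_stride,
                          unsigned char *dst, int dst_stride) {
  for (int r = 0; r < 16; ++r) {
    memcpy(dst, src, 16);
    src += src_stride;
    dst += dst_stride;
  }
}

/* Copy 16 consecutive elements of column b_col of B into column a_col
 * of A. Each stride is the size in bytes of one whole row of the
 * corresponding array, so every step moves down exactly one row. */
static void copy_column(int a_col, int b_col) {
  copy_mem16x16((const unsigned char *)&B[0][b_col], (int)sizeof B[0],
                (unsigned char *)&A[0][a_col], (int)sizeof A[0]);
}
```

Here sizeof B[0] is 7 × 16 = 112 bytes and sizeof A[0] is 5 × 16 = 80 bytes, playing the roles of 25,600 and 20,480 in the full-size arrays.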
Different selections of strides could copy a column of one array into a row of another, or vice versa. vp8_copy_mem16x16_c is a generalized “copy 16-byte chunks at some regular spacing in memory to destinations at some regular spacing in memory” routine that can operate on rows, columns, alternating elements (such as every second element of a column), and other arrangements.
For another example, consider struct { M16 m; RGB p; int i; } B[1024]; and M16 A[1024]. Because the function takes the source first, we could extract the M16 members of the first 16 structures in B to the homogeneous M16 array A with vp8_copy_mem16x16_c((unsigned char *)&B[0].m, sizeof *B, (unsigned char *)A, sizeof *A);.
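The struct-member extraction can be sketched in C as follows. The layouts of M16 and RGB are assumptions made only so the example compiles, and Record, copy_mem16x16 and extract_m16 are hypothetical names. Note that one call handles only 16 records, because the loop in the quoted function runs 16 times.

```c
#include <string.h>

/* Hypothetical layouts, chosen only for this illustration. */
typedef struct { unsigned char bytes[16]; } M16;
typedef struct { unsigned char r, g, b; } RGB;
typedef struct { M16 m; RGB p; int i; } Record;

/* Same loop as the function quoted in the question. */
static void copy_mem16x16(const unsigned char *src, int src_stride,
                          unsigned char *dst, int dst_stride) {
  for (int r = 0; r < 16; ++r) {
    memcpy(dst, src, 16);
    src += src_stride;
    dst += dst_stride;
  }
}

/* Gather the m member of 16 consecutive records into a packed M16 array.
 * The source stride is the size of one whole Record, so each step skips
 * over the p and i members; the destination stride is the size of one
 * M16, so the output is contiguous. */
static void extract_m16(const Record *records, M16 *out) {
  copy_mem16x16((const unsigned char *)&records[0].m,
                (int)sizeof records[0],
                (unsigned char *)out, (int)sizeof out[0]);
}
```

To process all 1024 records, a caller would invoke extract_m16 64 times, advancing both pointers by 16 elements each time.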
Answer 3
Score: 0
This copies a 16x16 square block between two images (i.e. 2D arrays).
The intended usage is to set src and dst to the starting positions of the source and destination blocks, and to set each stride to the width of the entire image.
The function also takes two separate strides for src and dst, so that the source and destination images do not have to be the same width.
Note
"Width" should really be "stride" here because "width" is the valid/visible size of each scanline but "stride" is the allocated size of the scanline. From a memory point of view, it's the stride that matters here, not width.
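To make the width/stride distinction concrete, here is a small C sketch that copies a block between two images whose rows are padded differently. The Image struct, pixel_at and copy_block are hypothetical helpers invented for this example, and copy_mem16x16 restates the loop from the question.

```c
#include <string.h>

/* Hypothetical 8-bit image: width is the visible pixels per row,
 * stride is the allocated bytes per row (stride >= width). */
typedef struct {
  unsigned char *data;
  int width;
  int height;
  int stride;
} Image;

/* Same loop as the function quoted in the question. */
static void copy_mem16x16(const unsigned char *src, int src_stride,
                          unsigned char *dst, int dst_stride) {
  for (int r = 0; r < 16; ++r) {
    memcpy(dst, src, 16);
    src += src_stride;
    dst += dst_stride;
  }
}

/* Address of pixel (x, y): rows are stride bytes apart, not width. */
static unsigned char *pixel_at(const Image *img, int x, int y) {
  return img->data + y * img->stride + x;
}

/* Copy the 16x16 block at (sx, sy) in src to (dx, dy) in dst.
 * Each image contributes its own stride, so the two images may have
 * different widths and different row padding. */
static void copy_block(const Image *src, int sx, int sy,
                       Image *dst, int dx, int dy) {
  copy_mem16x16(pixel_at(src, sx, sy), src->stride,
                pixel_at(dst, dx, dy), dst->stride);
}
```

Only the strides enter the address arithmetic. The width fields matter just for deciding whether the block lies inside the visible area, which this sketch leaves to the caller.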