2023年6月22日 11:21:08go评论78阅读模式

英文:

Trying to understand the new sorting algorithm from AlphaDev: why does my assembly code not work as expected?

问题

以下是您要翻译的内容：

在nature.com上有一篇最新的文章，标题为《使用深度强化学习发现更快的排序算法》（Faster sorting algorithms discovered using deep reinforcement learning），其中谈到了AlphaDev发现了一种更快的排序算法。这引起了我的兴趣，我一直在努力理解这个发现。

关于这个主题的其他文章有：

以下是原始sort3算法与AlphaDev发现的改进算法的伪代码：

原始伪代码

Memory [0] = A
Memory [1] = B
Memory [2] = C

mov Memory[0] P  // P = A
mov Memory[1] Q  // Q = B
mov Memory[2] R  // R = C

mov R S
cmp P R
cmovg P R  // R = max(A, C)
cmovl P S  // S = min(A, C)
mov S P    // P = min(A, C)
cmp S Q
cmovg Q P  // P = min(A, B, C)
cmovg S Q  // Q = max(min(A, C), B)

mov P Memory[0]  // = min(A, B, C)
mov Q Memory[1]  // = max(min(A, C), B)
mov R Memory[2]  // = max(A, C)

AlphaDev伪代码

Memory [0] = A
Memory [1] = B
Memory [2] = C

mov Memory[0] P  // P = A
mov Memory[1] Q  // Q = B
mov Memory[2] R  // R = C

mov R S
cmp P R
cmovg P R  // R = max(A, C)
cmovl P S  // S = min(A, C)

cmp S Q
cmovg Q P  // P = min(A, B)
cmovg S Q  // Q = max(min(A, C), B)

mov P Memory[0]  // = min(A, B)
mov Q Memory[1]  // = max(min(A, C), B)
mov R Memory[2]  // = max(A, C)

改进的重点在于省略了单个移动命令mov S P。为了帮助理解，我编写了以下汇编代码。然而，我的测试表明，当A=3，B=2和C=1时，排序算法不起作用，但当A=3，B=1和C=2时，它确实起作用。

这是在Ubuntu 20.04桌面上编写、编译和运行的：

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.6 LTS
Release:        20.04
Codename:       focal

$ nasm -v
NASM version 2.14.02

$ ld -v
GNU ld (GNU Binutils for Ubuntu) 2.34

我的汇编代码测试...

; -----------------------------------------------------------------
;
; sort_test.asm
;
; AlphaDev排序算法测试
;
; 我的研究发现，当A=3，B=2，C=1时，AlphaDev删除'mov S P'不起作用，但当A=3，B=1，C=2时，它起作用。
;
; 输出：将a、b和c的排序值以带有空格的方式打印到stdout
;
; 编译和运行：
;
; nasm -f elf32 sort_test.asm && ld -m elf_i386 sort_test.o -o sort_test && ./sort_test
;
; -----------------------------------------------------------------

global _start

section .data
  a equ 3
  b equ 2
  c equ 1

section .bss
  buffer resb 5

section .text
_start:
; ------------------- ; AlphaDev伪代码

  mov eax, a          ; P = A
  mov ecx, b          ; Q = B
  mov edx, c          ; R = C
  mov ebx, edx        ; mov R S

  cmp eax, edx        ; cmp P R
  cmovg edx, eax      ; cmovg P R  // R = max(A, C)
  cmovl ebx, eax      ; cmovl P S  // S = min(A, C)

; 原始排序算法中有以下行，
; 但AlphaDev确定它是不必要的
;  mov eax, ebx       ; mov S P   // p = min(A, C)

  cmp ebx, ecx        ; cmp S Q
  cmovg eax, ecx      ; cmovg Q P  // P = min(A, B)
  cmovg ecx, ebx      ; cmovg S Q  // Q = max(min(A, C), B)

; 将值添加到带有空格的缓冲区
  add eax, 30h
  mov [buffer], eax
  mov [buffer+1], byte 0x20

  add ecx, 30h
  mov [buffer+2], ecx
  mov [buffer+3], byte 0x20

  add edx, 30h
  mov [buffer+4], edx

; 将缓冲区写入stdout
  mov eax, 4      ; sys_write系统调用
  mov ebx, 1      ; stdout文件描述符
  mov ecx, buffer ; 要写入的缓冲区
  mov edx, 5      ; 要写入的字节数
  int 0x80

  mov eax, 1      ; sys_exit系统调用
  mov ebx, 0      ; 退出状态0
  int 0x80

我已经运行了这个测试，以在命令行上打印排序的结果，但我还使用了gdb逐行调试这个可执行文件。在调试过程中，我清楚地看到寄存器"A"，也就是"P"，也就是"eax"，在*A=3，B=2和C=1

英文:

There is a recent publication at nature.com, Faster sorting algorithms discovered using deep reinforcement learning, where it talks about AlphaDev discovering a faster sorting algorithm. This caught my interest, and I've been trying to understand the discovery.

答案1

得分: 8

以下是您要翻译的内容：

TL:DR: they're confusingly only showing the last 2 of 3 comparators in a 3-element sorting network, not a complete 3-element sort. This is presented very misleadingly, including in the diagram in their paper.

我会使用AT&T语法（例如在使用GCC汇编的.s文件中的cmovg %ecx, %eax），这样操作数顺序可以与伪代码匹配，目的地在右边。

You're correct, I had a look at the article and the 3-element pseudocode doesn't sort correctly when C is the smallest element. I know x86-64 asm backwards can forwards, and I don't just mean Intel vs. AT&T syntax Even looking at the real code, not just the comments, there's no way for the smallest element to end up in memory[0] = P if it started in R = memory[2] = C.

我认为你说得对，我看了一下这篇文章，当C是最小的元素时，这个3元素的伪代码排序是不正确的。我对x86-64汇编非常了解，我不只是指Intel和AT&T语法的区别:P 即使查看真正的代码，而不仅仅是注释，也无法使最小的元素最终出现在memory[0] = P中，如果它起始于R = memory[2] = C的话。

I opened the article before really reading what your question was asking, and noticed that problem myself after skimming the article until getting to the part about the actual improvement, so I haven't looked at your attempt to replicate it. But I didn't have any bias towards seeing a problem in it, I just wanted to understand it myself. There aren't any instructions writing P that read from values that could contain the starting R value, so there's no way it can get that value.

在真正阅读你的问题之前，我打开了这篇文章，我在快速浏览文章直到看到实际改进部分后，注意到了这个问题，所以我还没有看你尝试复制它的部分。但是，我并没有偏向于发现其中的问题，我只是想自己理解它。没有任何指令写入P，它们读取的值可能包含起始的R值，所以它无法获得那个值。

The article indirectly links their paper published in Nature (Faster sorting algorithms discovered using deep reinforcement learning by Daniel J. Mankowitz, et. al.) The full text is there in the Nature link.

这篇文章间接链接到他们在自然杂志上发表的论文（使用深度强化学习发现更快的排序算法，作者是Daniel J. Mankowitz等人）。自然杂志的链接中有完整的文章内容。

They use the same image of code in the actual paper, but with some explanatory text and diagram in terms of a 3-element sorting network.

他们在实际论文中使用了相同的代码图像，但附有一些解释性文本和关于3元素排序网络的图表。

>
>
> Figure 3a presents an optimal sorting network for three elements (see Methods for an overview of sorting networks). We will explain how AlphaDev has improved the circled network segment. There are many variants of this structure that are found in sorting networks of various sizes, and the same argument applies in each case.
>
> The circled part of the network (last two comparators) can be seen as a sequence of instructions that takes an input sequence ⟨A, B, C⟩ and transforms each input as shown in Table 2a (left). However, a comparator on wires B and C precedes this operator and therefore input sequences where B ≤ C are guaranteed. This means that it is enough to compute min(A, B) as the first output instead of min(A, B, C) as shown in Table 2a (right). The pseudocode difference between Fig. 3b,c demonstrates how the AlphaDev swap move saves one instruction each time it is applied.

>
>
> 图3a呈现了三个元素的最佳排序网络（有关排序网络的概述，请参见方法部分）。我们将解释AlphaDev如何改进了环形网络段。在各种大小的排序网络中都可以找到这种结构的许多变种，每种情况都适用相同的论点。
>
> 网络的圆圈部分（最后两个比较器）可以看作是一系列指令，它接受输入序列⟨A, B, C⟩并将每个输入转换成表2a（左侧）中所示的样式。**然而，在这个运算符之前，B和C上的比较器出现，因此可以保证输入序列中B≤C。**这意味着计算min(A, B)作为第一个输出就足够了，而不是像表2a（右侧）中所示的min(A, B, C)。图3b,c之间的伪代码差异演示了AlphaDev交换操作如何在每次应用时节省一个指令。

So this pseudocode is just for the circled part of the sorting network, the last 2 of 3 compare-and-swap steps. In their blog article, and even in other parts of the paper like Table 2, they make it sound like this is the whole sort, not just the last 2 steps. The pseudocode even confusingly starts with values in memory, which wouldn't be the case after conditionally swapping B and C to ensure B <= C.

因此，这个伪代码仅用于排序网络的圆圈部分，即3个比较和交换步骤中的最后2个。 在他们的博客文章中，甚至在论文的其他部分，如表2中，

英文:

I'd have used AT&T syntax (like cmovg %ecx, %eax in a .s file assembled with GCC) so the operand order can match the pseudocode, destination on the right.

They use the same image of code in the actual paper, but with some explanatory text and diagram in terms of a 3-element sorting network.

Also, it's unlikely just a mov instruction is a huge speedup in a 3-element sort. https://stackoverflow.com/questions/44169342/can-x86s-mov-really-be-free-why-cant-i-reproduce-this-at-all - it's never free (it costs front-end bandwidth), but it has zero latency on most recent microarchitectures other than Ice Lake. I'm guessing this wasn't the case where they got a 70% speedup!

With AVX SIMD instructions like vpminsd dst, src1, src2 (https://www.felixcloutier.com/x86/pminsd:pminsq) / vpmaxsd to do min and max of Signed Dword (32-bit) elements with a non-destructive separate destination, there's no saving except critical-path latency. min(B, prev_result) is still just one instruction, no separate register-copy (vmovdqa xmm0, xmm1) needed like it could be with just SSE4.1 if you were doing a sorting-network. But latency could perhaps be significant when building a sorting network out of shuffles and SIMD min/max comparators, which last I heard was the state of the art in integer sorting for large integer or FP arrays on x86-64, not just saving a mov in scalar cmov code!

But lots of programs are compiled not to assume AVX is available, because unfortunately it's not universally supported, missing on some low-power x86 CPUs from as recently as the past couple years, and on Pentium / Celeron CPUs before Ice Lake (so maybe as recent as 2018 or so for low-budget desktop CPUs.)

Their paper in Nature mentions SIMD sorting networks, but points out that libc++ std::sort doesn't take advantage of it, even for the case where the input is an array of float or int, rather than classes with an overloaded operator <.

This 3-element tweak is a micro-optimization, not a "new sorting algorithm". It might still save latency on AArch64, but only instructions on x86

It's nice that AI can find these micro-optimizations, but they wouldn't be nearly as surprising if presented as having a choice between selecting from min(A,C) or min(B,C) because the latter is what B actually is at that point.

Avoiding register-copy instructions with careful choice of higher-level source is something humans can do, e.g. the choice of _mm_movehl_ps merge destination (first source operand) in my 2016 answer on https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-sse-vector-sum-or-other-reduction - see the comment on the compiler-generated asm # note the reuse of shuf, avoiding a movaps.

Previous work in automated micro-optimization includes STOKE, a stochastic superoptimizer that randomly tries instruction sequences hoping to find cheap sequences that match the outputs of a test function you give it. The search space is so large that it tends to miss possible sequences when it takes more than 3 or 4 instructions (STOKE's own page says it's not production-ready, just a research prototype). So AI is helpful. And it's a lot of work to look at asm by hand for possible missed optimizations that could be fixed by tweaking the source.

But at least for this 3-element subproblem, it is just a micro-optimization, not truly algorithmically new. It's still just a 3-comparator sorting network. One that compiles more cheaply for x86-64, which is nice. But on some 3-operand ISAs with a separate destination for their equivalent of cmov, like AArch64's csel dst, src1, src2, flag_condition conditional-select, there's no mov to save. It could still save latency on the critical path, though.

Their paper in Nature also shows an algorithmic difference for sorting a variable number of elements, where the >= 3 cases both start by sorting the first 3. Maybe this helps branch prediction since that work can be in flight while a final branch on len > 3 is resolving to see whether they need to do a simplified 4-element sort that can assume the first 3 elements are sorted. They say "It is this part of the routine that results in significant latency savings." (They also call this a "fundamentally new" algorithm, which I presume is true for the problem of using sorting networks on short unknown-length inputs.)

答案2

得分: 4

Deepmind的自然文章中AlphaDev Sort3算法的伪代码存在一个错误，影响了在第一个数字大于另外两个相等数字时的排序。也就是说，A大于B=C（满足前提条件B<=C）。例如，如果输入是(2,1,1)，则输出是(2,1,2)；未排序且实际上损坏了。

错误出现在第12行：cmovg Q P，应该是cmovge Q P，以确保在这种情况下覆盖P。重新引入原始的mov S P指令也可以解决问题，但会破坏更改的目的。
第3页d、e、f的图3中的伪代码也适用于Sort8片段（cmovl R T应该是cmovle R T）。

AlphaDev的Github上的Sort3代码以及自然文章中的补充G中的代码没有这个问题，而且很可能LLVM libc++库中的代码也没有这个问题。

逐步说明：

# 在第一个显示的步骤之前（在B，C比较之后）
P = 2
Q = 1
R = 1

# 经过第一步后
  # P=2仍然没有被此步骤修改，仍然只是A
  # Q=1仍然只是B
R = 2   # max(A,C) = max(A, max(B, C)) 因为B<=C
S = 1   # min(A,C) = min(A, max(B, C))

 cmp S, Q(B)  是 cmp 1, 1  # 大于条件是假的
# 经过最后一步后
P = 2 由于cmovg Q(1), P(2)没有改变   !!!! 错误在这里
Q = 1 由于cmovg S(1), Q(1)没有改变
# R = 2仍然没有改变

cmovge Q, P 会复制1而不是保留2。对于升序排序，我们希望PQR = 1,1,2。

英文:

The pseudocode for the AlphaDev Sort3 algorithm in Deepmind's Nature article has a bug that affects the sorting of three numbers when the first one is larger than the other two equal ones. That is, A>B=C (which satisfies the precondition B<=C). For example, if the input is (2,1,1), the output is (2,1,2); not sorted and actually corrupted.

The bug is in line 12: cmovg Q P, which should be cmovge Q P, to ensure P is overwritten in this case. Reintroducing the original mov S P instruction would also fix the problem, but would defeat the purpose of the change.
A similar bug and fix apply to the pseudocode for the Sort8 snippet in figure 3 d,e,f of the article (cmovl R T should be cmovle R T).

The Sort3 code on AlphaDev’s Github and in Supplement G to the Nature article do not have this problem, and presumably neither does the code in the LLVM libc++ library.

Step by step:

# before first shown step  (after the B,C comparator)
P = 2
Q = 1
R = 1

# after first step
  # P=2 still  unmodified by this step, still just A
  # Q=1 still  still just B
R = 2   # max(A,C) = max(A, max(B, C)) since B&lt;=C
S = 1   # min(A,C) = min(A, max(B, C))

 cmp S, Q(B)  is  cmp 1, 1  # the Greater condition is false
# after final step
P = 2 unchanged by cmovg Q(1), P(2)   !!!! Bug here
Q = 1 unchanged by cmovg S(1), Q(1)
# R = 2 still

cmovge Q, P would have copied the 1 instead of keeping the 2. For an ascending sort, we wanted PQR = 1,1,2

答案3

得分: 1

我添加到AlphaDev维基百科文章了这篇博客文章：

理解DeepMind的排序算法

它将伪汇编翻译成了实际汇编：


/	move37.S
	    .equ	P,%rax
	    .equ	Q,%rcx
	    .equ	R,%rdx
	    .equ	S,%rsi
move37:	mov	(%rdi),P
	    mov	8(%rdi),Q
	    mov	16(%rdi),R
	    mov	R,S
	    cmp	P,R
	    cmovg	P,R
	    cmovl	P,S
	    cmp	S,Q
	    cmovg	Q,P
	    cmovg	S,Q
	    mov	R,(%rdi)
	    mov	Q,8(%rdi)
	    mov	P,16(%rdi)
	    ret
	    .type	move37,@function
	    .size	move37,.-move37
	    .globl	move37

英文:

I aded to the AlphaDev wikipedia article this blog post:

Understanding DeepMind's Sorting Algorithm

It translates the pseudo-assembly into actual assembly:


/	move37.S
	    .equ	P,%rax
	    .equ	Q,%rcx
	    .equ	R,%rdx
	    .equ	S,%rsi
move37:	mov	(%rdi),P
	    mov	8(%rdi),Q
	    mov	16(%rdi),R
	    mov	R,S
	    cmp	P,R
	    cmovg	P,R
	    cmovl	P,S
	    cmp	S,Q
	    cmovg	Q,P
	    cmovg	S,Q
	    mov	R,(%rdi)
	    mov	Q,8(%rdi)
	    mov	P,16(%rdi)
	    ret
	    .type	move37,@function
	    .size	move37,.-move37
	    .globl	move37

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

尝试理解AlphaDev的新排序算法：为什么我的汇编代码不按预期工作？

问题

答案1

This 3-element tweak is a micro-optimization, not a "new sorting algorithm". It might still save latency on AArch64, but only instructions on x86

答案2

答案3

如何在Excel中对分隔的列进行排序？

按第一个列表的顺序对列表的各个部分进行排序。

如何在R中对具有相同列数值的行进行排序

如何按月份分组而不失去按年排序？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论