Lock correctness with compiler optimizations (C / gcc)
Question
When discussing concurrency with my professor, he mentioned that potential compiler optimizations around locks (re-ordering instructions, optimizing accesses, etc.), like pthread_mutex_lock(), could cause problems. The reason this doesn’t occur (according to him) is that the compiler has to treat function calls like a black box and is unable to optimize accesses or re-order around it, as it does not know what the function may do to global state. He said that if you turn on something like link-time optimization, gcc -flto, you may find that your locks suddenly stop working.
He is a fairly reputable source, an old-school GNU guy who developed a lot of GNU core-utils and has even worked on gcc, but I am left wondering how this can be true, when the compiler is able to optimize things like memcpy to a single instruction, effectively looking across module boundaries without link-time optimization. Could it not then do the same to something like pthread_mutex_lock() and optimize accesses around it?
If this is true and the compiler can peer into the pthread_mutex_lock() function and sees it does not alter a certain variable, so it optimizes accesses, could this affect correctness, and if so how can such a core function be left so vulnerable to the possibility of not working correctly? Does this mean there has to be some other method employed to tell the compiler not to optimize accesses to these variables, such as the volatile construct? This gets even trickier when considering static locks that are module-specific, since a function defined in a different module could not possibly access it.
Answer 1
Score: 4
> The reason this doesn’t occur (according to him) is that the compiler has to treat function calls like a black box and is unable to optimize accesses or re-order around it,
That's one reason, certainly.
Another reason that it might not occur specifically with the GNU toolchain is that GNU's link-time optimization applies only to functions that are compiled with the -flto option, and therefore carry with them additional information on which the optimization relies. All one needs to prevent GCC's link-time optimization from messing with pthread_mutex_lock() is to have a version of that function that does not carry the LTO information. (This ordinarily would be the responsibility of the system, not of the application developer.)
> if you turn on something like link-time optimization, gcc -flto, you may find that your locks suddenly stop working.
Possibly. And if you do see that, then it constitutes a flaw in your C implementation, whether you attribute it to compiler, linker, library, or all of the above. This is not to say that you should discount the possibility, but rather that it is something that people working on these tools are out to avoid and / or fix, so the likelihood of running into such an issue goes down over time.
> I am left wondering how this can be true, when the compiler is able to optimize things like memcpy to a single instruction, effectively looking across module boundaries without link-time optimization. Could it not then do the same to something like pthread_mutex_lock() and optimize accesses around it?
What you're talking about now is optimization of specific functions. For example, the compiler knows all about memcpy() in particular, based on its specifications, and in some cases based on being part of the same, integrated C implementation. It can optimize (say) some memcpy calls at compile time because it knows what that function is supposed to do, and it recognizes usage idioms that it can optimize without actually looking into the library at all.
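As a rough illustration of the kind of usage idiom meant here, consider a fixed-size memcpy used to read or write an integer through a byte buffer. The helper names load_u32 and store_u32 are made up for this sketch. With optimization enabled, GCC and Clang typically treat such a call as a builtin and emit a plain load or store rather than a library call, though the exact output depends on the compiler version, flags, and target.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value from a possibly unaligned
 * byte buffer. The size is a compile-time constant, so the compiler can
 * expand memcpy from its specification alone, without ever looking at
 * the library's implementation. */
static uint32_t load_u32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical helper: the inverse, storing a 32-bit value into a
 * byte buffer. */
static void store_u32(unsigned char *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}
```

Compiling with something like gcc -O2 -S and reading the assembly is one way to check what a particular toolchain actually does with these calls.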
A compiler could, in principle, do the same with pthread_mutex_lock(), but this would not be a problem because such specific-function optimizations are aware of (and rely on) the semantics of the function involved. There's no reason to think that such an optimization of pthread_mutex_lock() would fail to preserve that function's well-documented memory-order semantics.
> If this is true and the compiler can peer into the pthread_mutex_lock() function and sees it does not alter a certain variable, so it optimizes accesses, could this affect correctness,
Can compilers have bugs? Yes, they can and do.
Do current versions of any C compilers have a specific bug along those lines? I don't know.
> and if so how can such a core function be left so vulnerable to the possibility of not working correctly?
Nobody is leaving functions open to such failures. To the extent that opportunities for such incorrect optimizations exist, "people" are interested in and motivated to fix their compilers, linkers, and libraries to close those holes.
Understand also that it is common for new optimizations to be tested carefully for an extended period -- sometimes years and multiple compiler versions -- before being designated safe for production use. If ever they are.
> Does this mean there has to be some other method employed to tell the compiler not to optimize accesses to these variables, such as the volatile construct?
No. Generally speaking, you should rely on functions and language constructs to behave according to their documentation. That goes especially for functions defined by the C language itself, and almost as much for functions defined by the platform's core specifications, such as POSIX on a POSIX-conforming system.
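To make that concrete, here is a minimal sketch of a shared counter that relies only on the documented behaviour of the POSIX mutex functions. The names counter_lock, worker, run_counter_demo, and MAX_THREADS are invented for the example. The counter is a plain long, with no volatile and no atomics, and it should be built with thread support enabled (for example gcc -pthread).

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16          /* arbitrary cap for this sketch */

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;            /* plain long: no volatile, no atomics */

struct worker_arg { long iterations; };

static void *worker(void *p)
{
    const struct worker_arg *arg = p;
    for (long i = 0; i < arg->iterations; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;              /* only touched while the lock is held */
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Runs nthreads workers and returns the final count, or -1 on error. */
static long run_counter_demo(int nthreads, long iterations)
{
    pthread_t tids[MAX_THREADS];
    struct worker_arg arg = { iterations };
    int started = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;

    counter = 0;
    for (; started < nthreads; started++)
        if (pthread_create(&tids[started], NULL, worker, &arg) != 0)
            break;
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);   /* join also synchronizes memory */

    return started == nthreads ? counter : -1;
}
```

On a conforming implementation, a call such as run_counter_demo(4, 50000) is expected to return 200000 at any optimization level, because the lock and unlock calls are specified to synchronize memory between threads.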
Also, generally speaking, you should approach compiler options with care and diligence. Some produce non-conforming behavior by design. Some come with caveats. And if I see an option that comes with such lengthy documentation as GCC's -flto does, I usually take it as a sign that it is an expert feature that I shouldn't mess with unless I've invested the time and effort to make myself an expert with it.
Answer 2
Score: 3
In the very old days, yes, multithreading had to rely on things like opaque function boundaries to serve as a makeshift memory barrier. It's true that LTO would break that. Fortunately, this was before LTO was widely implemented.
In modern times, compilers support explicit memory barriers. For gcc in particular, there are several possible approaches:
- At the most basic level, you can (ab)use gcc's extended asm. The statement asm("" : : : "memory"); will force the compiler to assume that any object in memory could be read or written, so that the compiler cannot reorder load and store instructions around it. This could be used in conjunction with an inline asm barrier instruction, which prevents the CPU itself from reordering the visibility of loads and stores.
- Later, gcc introduced a series of synchronization builtins. So you could insert __sync_synchronize() into the code, which would likewise prevent compiler reordering and insert barrier instructions as needed. More likely you'd already be using one of the other read-modify-write primitives there, which also insert a full barrier.
- From C11 onward, the language has a formal memory model that in essence provides guarantees as to exactly what kinds of reorderings are allowed. This includes standard functions like atomic_thread_fence that the compiler must handle specially, again producing a barrier of an appropriate kind, so as to inhibit unwanted reordering.
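The three approaches listed above might look roughly like this in code. This is a sketch only: the macro names compiler_barrier and legacy_full_barrier and the publish / try_consume pair are invented for illustration, and the asm form is a GCC/Clang extension, spelled __asm__ here so that it also compiles in strict ISO modes.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* 1. Compiler-only barrier (GCC/Clang extension): emits no instruction,
 *    but stops the compiler from moving memory accesses across it. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* 2. Legacy GCC builtin: a full barrier for both compiler and CPU. */
#define legacy_full_barrier() __sync_synchronize()

/* 3. C11 fences, used here for one-way message passing. */
static int payload;             /* ordinary, non-atomic data */
static atomic_bool ready;       /* static storage: starts out false */

static void publish(int value)
{
    payload = value;
    /* Release fence: the payload write is ordered before the flag store. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

static bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return false;
    /* Acquire fence: the flag load is ordered before the payload read. */
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return true;
}
```

If one thread calls publish(42) and another later sees try_consume return true, the fences guarantee that the consumer reads 42 from payload, even though payload itself is neither atomic nor volatile.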
So, using at least one of these methods, the implementation of pthread_mutex_lock would include a memory barrier. Then it can be inlined as much as you like, whether at compile or at link time, and the barrier will still be respected, preventing any reorderings that would violate its semantics.
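To show why inlining is harmless in that situation, here is a toy spinlock built on C11 atomics. It is not how any real pthread_mutex_lock is written; the toy_lock type and its functions are hypothetical and exist only for this sketch. The ordering guarantees live in the atomic operations themselves, so they survive no matter how aggressively the functions are inlined.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy lock for illustration only: it spins instead of sleeping. */
typedef struct { atomic_flag held; } toy_lock;

/* Returns true if the lock was free and is now held by the caller. */
static bool toy_trylock(toy_lock *l)
{
    /* test_and_set returns the previous value; false means we got it.
     * Acquire ordering keeps the critical section's accesses from
     * being moved above this point. */
    return !atomic_flag_test_and_set_explicit(&l->held,
                                              memory_order_acquire);
}

static void toy_lock_acquire(toy_lock *l)
{
    while (!toy_trylock(l))
        ;   /* busy-wait; a real mutex would block in the kernel */
}

static void toy_unlock(toy_lock *l)
{
    /* Release ordering keeps the critical section's accesses from
     * being moved below this point. */
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

A lock object would be initialized as toy_lock l = { ATOMIC_FLAG_INIT }; before first use. Production mutexes add fairness, sleeping, and error checking on top of this basic acquire/release pattern.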
Answer 3
Score: 2
To give your prof a break, what they said has been true for the last 70+ years (50 years for C). Will it remain true? Tough question.
The language "C" is no longer a lightweight systems programming language. Thanks to the relentless efforts of a series of Karen committees, it now includes precise specifications for unrelated libraries, and behaviour that favours micro-optimising compilers over a general systems programming language. If one were unkind, one might say they are taking out the failure of their preferred languages (Pascal, Modula-2, Euclid, soon-to-be Rust) on "C" by undermining it; but that is a side issue.
The compiler can recognize memcpy because it is defined by standard C. Pthread_mutex_mumble is not defined by standard C, so it can't. (Not yet, at least; the standard keeps growing.)
In general, the compiler cannot look beyond its compilation unit (= source file) to get hints about what is going on. This changed a fair bit in 1991 when Plan 9 introduced "Link-Optimising-Compilers". Of course, the Plan 9 folks were just interested in better technology, not winning micro benchmarks.
Fast forward 22 years, and it is all benchmarks, baby; compilers have recently caught up to link optimising, so now simply hiding behind an external function call is not enough.
Fortunately, pthread_mutex_lock, for most implementations, looks something like:
id = get_my_thread_id();
if (cswap(mutex, 0, id) != 0) {
_syscall_mutex_wait(mutex);
}
return 0;
which would be incomprehensible to a link-time optimiser, so your prof's initial statement stands.
As a final point: since pthread_mutex_lock forces acquire-release semantics at the minimum, and possibly a full discard in the case of a system call, it guarantees that the various updates made while the lock is held achieve global visibility before another locker can acquire it. This is extremely important.