August 3, 2021, 17:14:02 · go

Why does the race detector report a race condition, here?

Question

I am using Go's race detector (the -race flag), and it reports some race conditions that I think should not be reported. I've created this sample code to explain my findings. Please do not comment on the goal of this example, as it has no goal other than to explain the issue.

This code:

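package main

import (
	"fmt"
	"time"
)

var count int

func main() {
	go update()
	for {
		fmt.Println(count) // unsynchronized read of count
		time.Sleep(time.Second)
	}
}

func update() {
	for {
		time.Sleep(time.Second)
		count++ // unsynchronized write from another goroutine
	}
}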

is reported as having a race condition.

While this code:

package main

import (
	"fmt"
	"sync"
	"time"
)

var count int
var mutex sync.RWMutex

func main() {
	go update()
	for {
		mutex.RLock() // shared lock for the read
		fmt.Println(count)
		mutex.RUnlock()
		time.Sleep(time.Second)
	}
}

func update() {
	for {
		time.Sleep(time.Second)
		mutex.Lock() // exclusive lock for the write
		count++
		mutex.Unlock()
	}
}

is not reported as having any race condition.

My question is why?
There is no bug in the first code.
The main function is reading a variable that another goroutine is updating.
There is no potential hidden bug here.
The second code mutex does not provide any different behavior.

Where am I wrong here?

Answer 1

Score: 4

Your code contains a very clear race.

Your for loop reads count while the other goroutine may be updating it, with no synchronization between the two. That's the definition of a data race.

> The main function is reading a variable that another goroutine is updating.

Yes, exactly. That's what a race is.

> The second code mutex does not provide any different behavior.

Yes, it does. It prevents the variable from being read and written at the same time from different goroutines.
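
As a rough illustration of what the lock buys you, here is a minimal sketch that wraps the counter from the question in a small type and runs a bounded writer and reader concurrently. The `counter` type and the `run` helper are names made up for this example (they are not in the question), and the loops are bounded only so the program terminates. Under these assumptions it should run cleanly with the -race flag.

```go
package main

import (
	"fmt"
	"sync"
)

// counter is a hypothetical wrapper: an int guarded by a sync.RWMutex,
// mirroring the question's second snippet.
type counter struct {
	mu sync.RWMutex
	n  int
}

// inc takes the exclusive lock, so no reader can observe a write in progress.
func (c *counter) inc() {
	c.mu.Lock()
	c.n++
	c.mu.Unlock()
}

// get takes the shared lock, so reads are ordered with respect to writes.
func (c *counter) get() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.n
}

// run starts one writer goroutine that increments rounds times while the
// caller reads concurrently, then returns the final value.
func run(rounds int) int {
	var c counter
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 0; i < rounds; i++ {
			c.inc()
		}
	}()
	for i := 0; i < rounds; i++ {
		_ = c.get() // concurrent read, synchronized by the lock
	}
	wg.Wait()
	return c.get()
}

func main() {
	fmt.Println(run(1000)) // prints 1000
}
```

Removing the Lock/RLock calls from this sketch reintroduces the same unsynchronized read and write as in the first snippet of the question.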

Answer 2

Score: 2

You need to draw a distinction between a synchronization bug and a data race. A synchronization bug is a property of the code, whereas a data race is a property of a particular execution of the program. The latter is a manifestation of the former, but is in general not guaranteed to occur.

> There is no bug in the first code. The main function is reading a variable that another goroutine is updating. There is no potential hidden bug here.

The race detector only detects data races, not synchronization bugs. It may miss some data races (false negatives), but it never reports false positives:

> The race detector is a powerful tool for checking the correctness of concurrent programs. It will not issue false positives, so take its warnings seriously.

In other words, when the race detector reports a data race, you can be sure that your code contains at least one synchronization bug. You need to fix such bugs; otherwise, all bets are off.

Lo and behold, your first code snippet does indeed contain a synchronization bug: the package-level variable count is read (by main) and updated (by update, started as a goroutine) concurrently without any synchronization. Here is a relevant passage of the Go Memory Model:

> Programs that modify data being simultaneously accessed by multiple goroutines must serialize such access.
> To serialize access, protect the data with channel operations or other synchronization primitives such as those in the sync and sync/atomic packages.

Using a reader/writer mutual-exclusion lock, as you did in your second snippet, fixes your synchronization bug.
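
The quoted passage also mentions sync/atomic and channel operations as ways to serialize access. A minimal sketch of both follows; the function names `atomicCount` and `channelCount` are invented for this example, and the loops are bounded so the program exits. It is one possible shape, not the only correct one.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// atomicCount (hypothetical helper) increments a shared counter from several
// goroutines using sync/atomic instead of a mutex.
func atomicCount(workers, perWorker int) int64 {
	var count int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				atomic.AddInt64(&count, 1) // atomic read-modify-write
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&count) // atomic read
}

// channelCount (hypothetical helper) keeps the counter private to the
// updating goroutine and publishes each new value over a channel, so the
// variable itself is never shared.
func channelCount(n int) int {
	updates := make(chan int)
	go func() {
		count := 0
		for i := 0; i < n; i++ {
			count++
			updates <- count // the send synchronizes with the receive
		}
		close(updates)
	}()
	last := 0
	for v := range updates {
		last = v
	}
	return last
}

func main() {
	fmt.Println(atomicCount(4, 1000), channelCount(10)) // prints 4000 10
}
```

For a single integer like count, the atomic version is usually the lightest option; the channel version fits better when the reader should react to each update rather than poll.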

> The second code mutex does not provide any different behavior.

You just got lucky, when you executed the first program, that no data race occurred. In general, you have no guarantee.

Answer 3

Score: 1

This is off topic for Go (and the sample Go code won't trigger the problem even on x86 CPUs), but I have a demonstration proof, from roughly a decade ago at this point, that "torn reads" can produce inconsistent values even if the read and write operations are done with LOCK CMPXCHG8B, on some x86 CPUs (I think we were using early Haswell implementations).

The particular conditions that trigger this are a little difficult to set up. We had a custom allocator that had a bug: it only did four-byte alignment.[1] We then had a "lock-free" (single locking instruction) algorithm to add entries to a queue, with single-writer multi-reader semantics.

It turns out that LOCK CMPXCHG8B instructions "work" on misaligned pointers as long as they do not cross page boundaries. As soon as they do, though, the readers can see a torn read, in which they get half the old value and half the new value, when a writer is doing an atomic write.
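
To make the failure condition concrete, here is a small sketch of the address arithmetic involved. It assumes 4 KiB pages (typical on x86, but an assumption here), and the helpers `aligned` and `crossesPage` are hypothetical names for this example; it only classifies addresses and does not reproduce the torn read itself.

```go
package main

import "fmt"

// pageSize is an assumption for this sketch: 4 KiB pages.
const pageSize = 4096

// aligned reports whether addr is a multiple of size.
// size must be a power of two.
func aligned(addr, size uintptr) bool {
	return addr&(size-1) == 0
}

// crossesPage reports whether a size-byte object starting at addr
// spans two pages.
func crossesPage(addr, size uintptr) bool {
	return addr/pageSize != (addr+size-1)/pageSize
}

func main() {
	// 4088: 8-byte aligned, stays within one page.
	// 4092: only 4-byte aligned, straddles the boundary at 4096.
	// 4096: 8-byte aligned, starts the next page.
	for _, addr := range []uintptr{4088, 4092, 4096} {
		fmt.Printf("addr=%d aligned8=%v crossesPage=%v\n",
			addr, aligned(addr, 8), crossesPage(addr, 8))
	}
}
```

An 8-byte object that is 8-byte aligned can never straddle a page boundary, which is why a four-byte-aligned allocator only hits the problem occasionally.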

The result was an extremely difficult-to-track-down bug, where the system would run well for hours or even days before tripping over one of these. I finally diagnosed it by observing the data patterns, and eventually tracked the problem down to the allocator.

[1] Whether this is a bug depends on how one uses the allocated objects, but we were using them as 8-byte-wide pointers with LOCK CMPXCHG8B instructions.
