Meaning of "don't move data over channels, move ownership of data over channels"
Question
I'm learning that Golang channels are actually slower than many alternatives provided by the language. Of course, they are really easy to grasp but because they are a high level structure, they come with some overhead.
Reading some articles about it, I found someone benchmarking channels here. He basically says that channels can transfer 10 MB/s, which of course must be dependent on his hardware. He then says something that I haven't completely understood:
> If you just want to move data quickly using channels then moving it 1
> byte at a time is not sensible. What you really do with a channel is
> move ownership of the data, in which case the data rate can be
> effectively infinite, depending on the size of data block you
> transfer.
I've seen this "move ownership of data" in several places but I haven't seen a solid example illustrating how to do it instead of moving the data itself.
I wanted to see an example in order to understand this best practice.
Answer 1
Score: 5
Moving data over a channel:

    c := make(chan [1000]int)

    // spawn some goroutines that read from this channel

    var data [1000]int
    // populate the data

    // write data to the channel
    c <- data

The potential problem here, as you mentioned, is that you're moving a lot of data, so you might be doing an excessive amount of memory copying.
You could prevent that by sending a reference type, such as a pointer or slice, over the channel:

    c := make(chan []int)

    // spawn some goroutines that read from this channel

    var data [1000]int
    // populate the data

    // write a reference to data to the channel
    c <- data[:]

So we just did the exact same data transfer, but reduced the memory copying, right? Well, here's a potential problem: you sent a reference to `data` over the channel, but that `data` value continues to be accessible in the current scope, even after the send:

    // write a reference to data to the channel
    c <- data[:]

    // start messing with data
    data[0] = 999
    data[1] = 1234
    // ...

This code might have just introduced a potential data race, because whoever read that slice from the channel might be working on it at the same time as you start modifying it.
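To make the difference between the two sends concrete, here is a minimal sketch. The helper name `firstAfterMutation` and the small array size are assumptions made for illustration. It runs in a single goroutine with buffered channels, so it has no race of its own; it only shows that an array sent on a channel is copied, while a slice sent on a channel still points at the sender's array.

```go
package main

import "fmt"

// firstAfterMutation (hypothetical helper) sends the same data once as an
// array value and once as a slice, mutates the original after each send, and
// reports what the receiving side sees in element 0.
func firstAfterMutation() (fromArray, fromSlice int) {
	// Buffered so that one goroutine can send and then receive.
	arrays := make(chan [4]int, 1)
	slices := make(chan []int, 1)

	var data [4]int
	data[0] = 1

	arrays <- data // copies all four elements into the channel
	data[0] = 999  // does not affect the copy already sent
	gotArray := <-arrays

	data[0] = 1
	slices <- data[:] // copies only the slice header; the array is shared
	data[0] = 999     // visible to whoever holds the slice
	gotSlice := <-slices

	return gotArray[0], gotSlice[0]
}

func main() {
	a, s := firstAfterMutation()
	fmt.Println(a, s) // prints "1 999"
}
```

With a real receiver running in another goroutine, the second mutation is exactly the unsynchronized write that the race detector (`go run -race`) is designed to report.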
The idea of passing ownership is that after you give out a reference to something, you are also conceding ownership of that thing, and will no longer use it. So long as we don't use `data` after giving out the reference (sending the slice on the channel), we have properly passed ownership.
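One way to picture this is a round trip, sketched below under the assumption of a hypothetical worker called `doubler` and a helper called `roundTrip`. The block is only touched by whichever goroutine currently holds it: the sender stops using it at the send and resumes only after receiving it back.

```go
package main

import "fmt"

// doubler (hypothetical worker) owns each block only between receiving it
// and sending it back.
func doubler(in <-chan []int, out chan<- []int) {
	for block := range in {
		for i := range block {
			block[i] *= 2
		}
		out <- block // hand ownership back; block is not touched after this
	}
	close(out)
}

// roundTrip lends a block of n ints to doubler and sums it once it returns.
func roundTrip(n int) int {
	in := make(chan []int)
	out := make(chan []int)
	go doubler(in, out)

	block := make([]int, n)
	for i := range block {
		block[i] = i
	}

	in <- block   // ownership moves to doubler; only the slice header is copied
	block = <-out // ownership comes back; reading the block is safe again
	close(in)

	sum := 0
	for _, v := range block {
		sum += v
	}
	return sum
}

func main() {
	fmt.Println(roundTrip(1000)) // prints 999000
}
```

The cost of each channel operation here does not depend on `n`, which is the sense in which the data rate grows with the size of the block being handed over.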
This problem is an extension of the general problem of shared state. Unlike Rust, for example, Go doesn't have language constructs to properly control shared state. To reduce the chances of these errors, you could apply some strategies:
- Avoid passing references on channels: In the above example, the problem occurred once we started passing the data by reference, with a slice. This was only done to reduce the amount of memory copying. Unless there is a pragmatic reason for this optimization (a worthwhile performance difference was measured), it can be avoided entirely. Still, some data types in Go are inherently references (e.g., maps and slices). If these types must be passed on a channel, then other strategies can be used.
- Separate the data creation logic into functions: In the example above, we could refactor the code:

      func sendData(c chan []int) {
          var data [1000]int
          // populate the data

          // write a reference to data to the channel
          c <- data[:]
      }

      c := make(chan []int)

      // spawn some goroutines that read from this channel

      // send some data
      sendData(c)

  The possibility of incorrectly using `data` still exists, but now it's isolated to a small function with a clear intent. In theory, the isolation should make the code easier to understand, make the correct use of `data` more obvious, and leave fewer changes that could interact with it.
- Don't mix data pipelines with persistent state: By data pipeline, I mean two or more concurrent routines between which data flows via channels. Expanding on the previous point, create owned references as close as possible to where they enter the data pipeline. Keep the gap between where a goroutine receives data and where it uses it or sends it on as tight as possible. Under the general rules of ownership, you can only transfer ownership of something that you currently own in full. Because of this rule, you should avoid, as much as possible, sending a reference on a channel unless you created the referenced data immediately before the send. If you hold a reference to any persistent or global state, it becomes much harder to ensure that ownership is respected.
By keeping the creation of the reference and the transfer of ownership in an isolated, package-level function, it should be harder to make errors. Then the only ways to violate the ownership rule are to:
- Leak the reference to global state
  - Try to eliminate global variables and global state
- Leak the reference to a reference type parameter's state
  - Don't take any reference type parameters in data sending functions
- Modify the referenced data after sending the reference
  - Put the send operation at the very end of the function. If necessary, you could put the send inside a deferred call.
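The checklist above can be sketched as a producer function. The names `sendBlock` and `sumBlocks` are hypothetical. The producer takes only value parameters plus a send-only channel, creates the data itself, touches no global state, and sends as its final statement.

```go
package main

import "fmt"

// sendBlock (hypothetical producer) builds a block from value parameters
// only and sends it as its last statement, so nothing in this function can
// touch the block after ownership has moved.
func sendBlock(out chan<- []int, n, seed int) {
	block := make([]int, n) // created here, so this function fully owns it
	for i := range block {
		block[i] = seed + i
	}
	out <- block
}

// sumBlocks starts one producer per block and sums everything it receives.
func sumBlocks(blocks, n int) int {
	out := make(chan []int)
	for b := 0; b < blocks; b++ {
		go sendBlock(out, n, b*n)
	}

	total := 0
	for b := 0; b < blocks; b++ {
		// After the receive, this goroutine is the block's only owner.
		for _, v := range <-out {
			total += v
		}
	}
	return total
}

func main() {
	fmt.Println(sumBlocks(4, 250)) // prints 499500
}
```

Declaring the parameter as `chan<- []int` is a small extra guard: the producer cannot receive from the channel, so it cannot accidentally take back a block it has already given away.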
There's no perfect solution to eliminate all shared state issues (even in Rust they sometimes exist in practice), but I hope these strategies will help you think about how to tackle this problem.
Answer 2
Score: 2
Hymns For Disco's answer is good, but I find writing good short answers an interesting challenge sometimes, and I think I have an analogy that will help here.
Imagine your data as occupying warehouses, each one a large city block in size. You have a thousand warehouses, scattered across many countries and cities.
You have five highly-skilled technicians, each of whom can do one thing really well. You need all the technicians to operate on all the data. Unfortunately, each technician hates the other four and will do no work (and maybe even try to kill the others) if any of them are present in the warehouse.
One way to deal with this is to build five extra warehouses, and put one technician in each of the five new warehouses. You can then ship the entire contents of each of the 1000 warehouses to the various spares, one at a time, and then move the contents back once each technician has finished with it; and you can optimize the warehouse-content moves somewhat by, perhaps, moving content to work-warehouse #1, then to #2, then to #3, etc., and only moving it back to its original warehouse after it's ready to leave #5. But obviously this requires a whole lot of shipping and logistics and takes huge amounts of time and money for all this bulk movement. It will be years, and lots of money, before everything is done, even if each technician can completely deal with a whole warehouse in just one day.
Alternatively, you can ship the five technicians around. Send them to warehouses (WHs) 1-5. When tech#1 is done with WH#1, move him to WH#2 unless tech#2 is still there; move him to WH#6 if that's the next free one instead.
We're moving the small and light "person who does the work" around, not the big and heavy "things that occupy the space where the person works". The overall cost is much lower. We do have to take care not to accidentally let the technicians encounter each other though.
Note, too, that this fancy solution—of being careful about who has access to which data at what time—doesn't help if the data themselves are small and light and easy to move around. In the small-data case we might as well move the data, instead of the workers.
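In Go terms, the analogy might look like the sketch below. The names `warehouse`, `technician` and `runPipeline` are hypothetical, and only two technicians are shown. Each goroutine is a technician, and what travels over the channels is a pointer to a warehouse, never its contents. Exactly one technician holds a given pointer at any time, so they never meet.

```go
package main

import "fmt"

// warehouse stands in for a block of data too large to copy around cheaply.
type warehouse struct {
	items  [1 << 16]int
	visits int
}

// technician works on each warehouse it receives and then passes the
// pointer on; it does not touch the warehouse after the send.
func technician(work func(*warehouse), in <-chan *warehouse, out chan<- *warehouse) {
	for w := range in {
		work(w)
		w.visits++
		out <- w
	}
	close(out)
}

// runPipeline pushes n warehouses through two technicians and returns the
// total number of visits and the first item of the last warehouse seen.
func runPipeline(n int) (int, int) {
	source := make(chan *warehouse)
	mid := make(chan *warehouse)
	done := make(chan *warehouse)

	addOne := func(w *warehouse) {
		for i := range w.items {
			w.items[i]++
		}
	}
	timesTen := func(w *warehouse) {
		for i := range w.items {
			w.items[i] *= 10
		}
	}
	go technician(addOne, source, mid)
	go technician(timesTen, mid, done)

	go func() {
		for i := 0; i < n; i++ {
			source <- &warehouse{} // only the pointer is sent
		}
		close(source)
	}()

	visits, first := 0, 0
	for w := range done {
		visits += w.visits
		first = w.items[0]
	}
	return visits, first
}

func main() {
	fmt.Println(runPipeline(3)) // prints "6 10"
}
```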