How does shared memory vs message passing handle large data structures?

Question

In looking at Go and Erlang's approach to concurrency, I noticed that they both rely on message passing.

This approach obviously alleviates the need for complex locks because there is no shared state.

However, consider the case of many clients wanting parallel read-only access to a single large data structure in memory -- like a suffix array.

My questions:

  • Will using shared state be faster and use less memory than message passing, as locks will mostly be unnecessary because the data is read-only, and only needs to exist in a single location?

  • How would this problem be approached in a message passing context? Would there be a single process with access to the data structure and clients would simply need to sequentially request data from it? Or, if possible, would the data be chunked to create several processes that hold chunks?

  • Given the architecture of modern CPUs & memory, is there much difference between the two solutions -- i.e., can shared memory be read in parallel by multiple cores -- meaning there is no hardware bottleneck that would otherwise make both implementations roughly perform the same?

Answer 1

Score: 31

One thing to realise is that the Erlang concurrency model does NOT actually specify that the data in messages must be copied between processes; it states only that sending messages is the sole way to communicate and that there is no shared state. Since all data is immutable, which is fundamental, an implementation may very well not copy the data but just send a reference to it, or use a combination of both methods. As always, there is no single best solution, and there are trade-offs to be made when choosing how to do it.

The BEAM uses copying, except for large binaries where it sends a reference.

Answer 2

Score: 28

  • Yes, shared state could be faster in this case, but only if you can forgo the locks, and that is only doable if the data is absolutely read-only. If it's 'mostly read-only' then you need a lock (unless you manage to write lock-free structures; be warned that they're even trickier than locks), and then you'd be hard-pressed to make it perform as fast as a good message-passing architecture.

  • Yes, you could write a 'server process' to share it. With really lightweight processes, it's no heavier than writing a small API to access the data. Think of it like an object (in the OOP sense) that 'owns' the data. Splitting the data into chunks to enhance parallelism (called 'sharding' in DB circles) helps in big cases (or if the data is on slow storage).

  • Even if NUMA is getting mainstream, you still have more and more cores per NUMA cell. And a big difference is that a message can be passed between just two cores, while a lock has to be flushed from cache on ALL cores, limiting it to the inter-cell bus latency (even slower than RAM access). If anything, shared state with locks is becoming less and less feasible.

In short: get used to message passing and server processes; it's all the rage.

Edit: revisiting this answer, I want to add a phrase found in Go's documentation:
> share memory by communicating, don't communicate by sharing memory.

The idea is this: when you have a block of memory shared between threads, the typical way to avoid concurrent access is to use a lock to arbitrate. The Go style is instead to pass a message containing the reference; a thread only accesses the memory when it receives the message. This relies on some measure of programmer discipline, but it results in very clean-looking code that can be easily proofread, and is therefore relatively easy to debug.

The advantage is that you don't have to copy big blocks of data on every message, and you don't have to flush caches as with some lock implementations. It's still somewhat early to say whether this style leads to higher-performance designs or not (especially since the current Go runtime is somewhat naive about thread scheduling).
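As a minimal, hypothetical Go sketch of the 'server process' idea above (none of these names come from the answer): a single goroutine owns the read-only data, and clients query it purely through channels, so no locks are involved.

```go
// Sketch: a goroutine "owns" a read-only dataset and serves lookups over
// channels. Only the owner goroutine ever touches the slice, so no
// synchronization beyond the channels themselves is needed.
package main

import "fmt"

// request asks for the value at an index; the reply channel lets the
// owner send the answer back to the caller.
type request struct {
	index int
	reply chan string
}

// serve is the owner process: only this goroutine accesses data.
func serve(data []string, requests <-chan request) {
	for req := range requests {
		req.reply <- data[req.index]
	}
}

func main() {
	data := []string{"alpha", "beta", "gamma"}
	requests := make(chan request)
	go serve(data, requests)

	reply := make(chan string)
	requests <- request{index: 1, reply: reply}
	fmt.Println(<-reply) // prints "beta"
}
```

To regain parallelism for a truly large structure, you could start several such owner goroutines, each holding one shard of the data, and route each request to the shard that owns its index.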

Answer 3

Score: 15

In Erlang, all values are immutable - so there's no need to copy a message when it's sent between processes, as it cannot be modified anyway.

In Go, message passing is by convention: nothing prevents you from sending someone a pointer over a channel and then modifying the data it points to; only convention does. So, once again, there's no need to copy the message.
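A small illustrative sketch (my own, not from the answer) of what "message passing by convention" means in Go: only a pointer crosses the channel, so the payload is never copied, and nothing but discipline stops the sender from mutating the data afterwards.

```go
// Sketch: sending a pointer over a channel transfers only the pointer
// (a few bytes), not the payload. The receiver sees the same memory.
package main

import "fmt"

type big struct {
	payload [1 << 16]byte // a large value we'd rather not copy
}

func main() {
	ch := make(chan *big, 1)
	b := &big{}
	b.payload[0] = 42
	ch <- b // only the pointer crosses the channel

	received := <-ch
	fmt.Println(received.payload[0]) // prints 42
	fmt.Println(received == b)       // prints true: same memory, no copy
}
```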

Answer 4

Score: 13

Most modern processors use variants of the MESI cache-coherency protocol. Thanks to its Shared state, passing read-only data between different threads is very cheap. Modifying shared data is very expensive, though, because all other caches that store this cache line must invalidate it.

So if you have read-only data, it is very cheap to share it between threads instead of copying with messages. If you have read-mostly data, it can be expensive to share between threads, partly because of the need to synchronize access, and partly because writes destroy the cache friendly behavior of the shared data.

Immutable data structures can be beneficial here. Instead of changing the actual data structure, you simply make a new one that shares most of the old data, but with the things changed that you need changed. Sharing a single version of it is cheap, since all the data is immutable, but you can still update to a new version efficiently.
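A hypothetical Go sketch of that idea (all names invented for illustration): the data is split into chunks, and an "update" builds a new version that copies only the chunk it touches, sharing every other chunk with the old version.

```go
// Sketch of a persistent (path-copied) structure: versions are immutable
// views over shared chunks; set() copies one chunk and shares the rest.
package main

import "fmt"

// version is an immutable view: a slice of pointers to shared chunks.
type version []*[4]int

// set returns a new version with old[chunk][i] = v, never mutating old.
func set(old version, chunk, i, v int) version {
	next := make(version, len(old))
	copy(next, old)         // copies only the chunk pointers, not the chunks
	modified := *old[chunk] // copy the one chunk we change
	modified[i] = v
	next[chunk] = &modified
	return next
}

func main() {
	v1 := version{{1, 2, 3, 4}, {5, 6, 7, 8}}
	v2 := set(v1, 1, 0, 50)

	fmt.Println(v1[1][0])       // prints 5: the old version is untouched
	fmt.Println(v2[1][0])       // prints 50: the new version sees the update
	fmt.Println(v1[0] == v2[0]) // prints true: the unchanged chunk is shared
}
```

Because old versions are never mutated, readers can keep using v1 concurrently with no locks while writers publish v2.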

Answer 5

Score: 5

What is a large data structure?

One person's large is another person's small.

Last week I talked to two people. One person was making embedded devices and used the word "large" - I asked him what it meant, and he said over 256 KBytes. Later in the same week a guy was talking about media distribution; he used the word "large", I asked him what he meant, and he thought for a bit and said "won't fit on one machine", say 20-100 TBytes.

In Erlang terms, "large" could mean "won't fit into RAM" - so with 4 GBytes of RAM, data structures over 100 MBytes might be considered large, and copying a 500 MByte data structure might be a problem. Copying small data structures (say < 10 MBytes) is never a problem in Erlang.

Really large data structures (i.e. ones that won't fit on one machine) have to be
copied and "striped" over several machines.

So I guess you have the following:

Small data structures are no problem - since they are small, processing times are fast, copying is fast, and so on (just because they are small).

Big data structures are a problem - because they don't fit on one machine, copying is essential.

Answer 6

Score: 4

Note that your questions are technically nonsensical, because message passing can use shared state, so I shall assume that you mean message passing with deep copying to avoid shared state (as Erlang currently does).

> Will using shared state be faster and use less memory than message passing, as locks will mostly be unnecessary because the data is read-only, and only needs to exist in a single location?

Using shared state will be a lot faster.

> How would this problem be approached in a message passing context? Would there be a single process with access to the data structure and clients would simply need to sequentially request data from it? Or, if possible, would the data be chunked to create several processes that hold chunks?

Either approach can be used.

> Given the architecture of modern CPUs & memory, is there much difference between the two solutions -- i.e., can shared memory be read in parallel by multiple cores -- meaning there is no hardware bottleneck that would otherwise make both implementations roughly perform the same?

Copying is cache unfriendly and, therefore, destroys scalability on multicores because it worsens contention for the shared resource that is main memory.

Ultimately, Erlang-style message passing is designed for concurrent programming whereas your questions about throughput performance are really aimed at parallel programming. These are two quite different subjects and the overlap between them is tiny in practice. Specifically, latency is typically just as important as throughput in the context of concurrent programming and Erlang-style message passing is a great way to achieve desirable latency profiles (i.e. consistently low latencies). The problem with shared memory then is not so much synchronization among readers and writers but low-latency memory management.

Answer 7

Score: 3

One solution that has not been presented here is master-slave replication. If you have a large data structure, you can replicate changes to it out to all slaves, which then perform the update on their own copies.

This is especially interesting if one wants to scale to several machines that don't even have the possibility of sharing memory without very artificial setups (an mmap of a block device that reads/writes from a remote computer's memory?).

A variant of this is to have a transaction manager that one asks nicely to update the replicated data structure, and that makes sure it serves one and only one update request at a time. This is closer to the mnesia model of master-master replication of mnesia table data, which qualifies as a "large data structure".

Answer 8

Score: 3

The problem at the moment is indeed that locking and cache-line coherency can be as expensive as copying a simpler data structure (e.g. a few hundred bytes).

Most of the time, a cleverly written multi-threaded algorithm that tries to eliminate most of the locking will be faster - and a lot faster with modern lock-free data structures - especially when you have a well-designed cache system like Sun's Niagara chip-level multi-threading.

If your system/problem cannot easily be broken down into a few simple data accesses, then you have a problem. And not all problems can be solved by message passing. This is why there are still some Itanium-based supercomputers sold: they have terabytes of shared RAM and up to 128 CPUs working on the same shared memory. They are an order of magnitude more expensive than a mainstream x86 cluster with the same CPU power, but you don't need to break down your data.

Another reason, not mentioned so far, is that programs can become much easier to write and maintain when you use multi-threading. Message passing and the shared-nothing approach make them even more maintainable.

As an example, Erlang was never designed to make things faster, but instead to use a large number of threads to structure complex data and event flows.

I guess this was one of the main points in its design. In the web world of Google, you usually don't care about performance - as long as it can run in parallel in the cloud. And with message passing, you can ideally just add more computers without changing the source code.

Answer 9

Score: 1

Usually, message-passing languages (this is especially easy in Erlang, since it has immutable variables) optimise away the actual data copying between processes (local processes only, of course: you'll want to think about your network distribution pattern wisely), so this isn't much of an issue.

Answer 10

Score: 0

The other concurrent paradigm is STM, software transactional memory. Clojure's refs are getting a lot of attention. Tim Bray has a good series exploring Erlang's and Clojure's concurrency mechanisms:

http://www.tbray.org/ongoing/When/200x/2009/09/27/Concur-dot-next

http://www.tbray.org/ongoing/When/200x/2009/12/01/Clojure-Theses

Posted by huangapple on 2009-11-26 01:14:08 · Original link: https://go.coder-hub.com/1798455.html