The purpose of asynchronous IO / reactive programming
Question
First of all, some assumptions on which the question is based:
Synchronous IO: When I need to perform a read IO operation, I make a read system call on the file descriptor. The CPU goes into privileged mode and kernel code executes, which (via the device driver) asks the device to retrieve my data and puts my thread into a BLOCKED state. Finally, the scheduler is run and another thread takes the CPU core which my thread was running on.
The device processes the request on its own. Once it is done, it triggers an interrupt on the CPU. The interrupt handler executes kernel code which sets the status of my thread to READY (this is obviously a big simplification). My thread now has a chance to be scheduled when the kernel scheduler runs.
Asynchronous IO: I perform a system call, the system call asks the device to retrieve data. Now, instead of setting the thread state, the system call returns a special marker to indicate that the data is not yet ready. The thread continues its execution.
We usually do not use such a system call directly, but instead use some wrapper function (provided by a library) that takes a callback as a parameter. This library also spawns a thread that selects on the file descriptors from all calls (epoll, kqueue ...). Once some fds can be interacted with, this thread schedules an appropriate callback on some kind of thread pool of worker threads (which run event loops / task loops).
If some of the above is not right I am more than happy to be corrected!
Now, onto the questions:
1. Does asynchronous IO have any performance / resource benefits?
To my knowledge, in contrast to a full context switch, switching between threads is quite inexpensive. The CPU will still be fully utilized if there is enough work (another thread will be scheduled).
Here are the things I can think of:
- memory utilization - fewer threads means less memory allocated for stacks and thread-related data structures in kernel code.
- scheduling overhead - I guess scheduling of the threads by kernel might be quite complex.
But there are also some things that I think might hurt the performance with async IO:
- We perform more syscalls in total (one for requesting an operation, another one for awaiting the result)
- The callbacks need to be scheduled onto workers
- Jumping to arbitrary locations when executing callbacks might mess with cache?
2. Do reactive programming / coroutines, which take this idea even further (all code runs as events on the worker threads), have any performance benefit?
3. Why do we actually do reactive programming?
It really just seems to me that Reactive Programming builds an additional layer of abstraction on top of something that is already supposed to be an abstraction for developers to work with (processes and threads), which brings a lot of additional complexity.
Sometimes it might seem to make sense, for example if we assume that we want to have a separate UI thread. The problem is that, from my perspective, this pattern is basically an alternative approach to synchronization - we would be able to accomplish the same by just firing up a thread that acquires a UI lock.
I just fail to see what it is in the traditional approach to concurrency that has led to the creation of the reactive programming frameworks.
I will be really grateful for all explanations and sources which touch on this.
Answer 1
Score: 2
> The device processes the request on its own
This is not quite true in practice. Depending on the target device (as well as the OS and the driver implementation), the request may or may not be fully offloaded. In practice, a kernel thread is generally responsible for completing the request (in interaction with the IO scheduler). The actual code executed by this kernel thread and its actual behaviour are platform-dependent (an HDD, a fast NVMe SSD and an HPC NIC will clearly not behave the same way). For example, a DMA request with a polling strategy can be used to avoid hardware interrupts on low-latency devices (since interrupts are generally slow). Anyway, this work is done on the OS side and users should not care much about it besides its impact on CPU usage and on latency/throughput. What matters is that requests are performed serially and the thread is de-scheduled during the IO operation.
> I perform a system call, the system call asks the device to retrieve data. Now, instead of setting the thread state, the system call returns a special marker to indicate that the data is not yet ready. The thread continues its execution.
The state of asynchronous IO is complex in practice and its implementation is also platform-dependent. An API can provide asynchronous IO functions while the underlying OS does not support them. One common strategy is to spawn a progress thread that polls the state of the IO request. This is not an efficient solution, but it can be better than synchronous IO depending on the actual application (explained later). In fact, the OS can even provide a standard API for that while not fully supporting asynchronous IO in its own kernel, so an intermediate layer is responsible for hiding these discrepancies! On top of that, the target device also matters, depending on the target platform.
One (old) way to do asynchronous IO is to do non-blocking IO combined with polling functions like `select` or `poll`. In this case, an application can start multiple requests and then wait for them. It can even do some useful computation before waiting for the completion of the target requests. This is significantly better than doing one request at a time, especially for high-latency IO requests like waiting for a network message from Tokyo to Paris to be received (lasting for at least 32 ms due to the speed of light, but likely >100 ms in practice). That being said, there are several issues with this approach:
- it is hard to overlap the latency with computation well (because of many unknowns like the latency, the computational speed and the amount of computation)
- it scales poorly because every request is scanned whenever one request is ready (not to mention that the number of descriptors is often limited and that it uses a lot more OS resources than it should).
- it makes applications less maintainable due to polling loops. In many cases, this polling loop is put in a separate thread (or even a pool of threads) at the expense of a higher latency (due to additional context switches and cache misses). This strategy can actually be the one implemented by asynchronous IO libraries.
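The "start several requests, then wait for them" pattern described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the three "requests" are simulated with local socket pairs, and `wait_for_all` and `demo` are hypothetical helper names, not part of any library.

```python
import select
import socket


def wait_for_all(readers, timeout=1.0):
    """Collect one message per socket, in whatever order they become ready."""
    pending = set(readers)
    results = {}
    while pending:
        # select() is handed the whole pending set on every call, which is
        # the linear scan that makes this approach scale poorly.
        ready, _, _ = select.select(list(pending), [], [], timeout)
        if not ready:
            raise TimeoutError("no descriptor became ready in time")
        for sock in ready:
            results[sock] = sock.recv(4096)
            pending.discard(sock)
    return results


def demo():
    # Three simulated in-flight requests (assumption: local socket pairs
    # stand in for remote peers).
    pairs = [socket.socketpair() for _ in range(3)]
    try:
        for reader, _writer in pairs:
            reader.setblocking(False)
        # Replies arrive out of order; the caller does not need to care.
        for index in (2, 0, 1):
            pairs[index][1].sendall(f"reply {index}".encode())
        results = wait_for_all([reader for reader, _writer in pairs])
        return [results[reader].decode() for reader, _writer in pairs]
    finally:
        for reader, writer in pairs:
            reader.close()
            writer.close()
```

Any useful computation would go between sending the requests and calling `wait_for_all`, which is exactly the part that is hard to size well in a real application.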
In order to solve these issues, event-based asynchronous IO functions can be used instead. A good example is `epoll` (more specifically the edge-triggered interface). It is meant to avoid the useless scan of many waiting requests and to focus only on the ones that are ready. As a result, it scales better (O(n) time for `select`/`poll` versus O(1) for `epoll`). There is no need for an active probing loop, only event-based code doing similar things. This part can be hidden by user-side high-level software libraries. In fact, software libraries are also critical for writing portable code since OSes have different asynchronous interfaces. For example, `epoll` is only for Linux, `kqueue` is for BSD, and Windows uses yet another method. Also, one thing to keep in mind is that `epoll_wait` is a blocking call, so while there can be more than one request pending, there is still a final synchronous wait operation. Putting it in a thread to make this operation asynchronous from the user's point of view can typically decrease performance (mainly latency).
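A minimal sketch of the event-based style, using Python's standard `selectors` module, is shown below. `DefaultSelector` picks `epoll` on Linux and `kqueue` on BSD/macOS where available, but note that it works in level-triggered mode rather than the edge-triggered mode mentioned above. `run_event_loop` and `demo` are hypothetical names, and the peers are again simulated with local socket pairs.

```python
import selectors
import socket


def run_event_loop(sel, expected_events, timeout=1.0):
    """Dispatch callbacks for ready descriptors until enough events were handled."""
    handled = 0
    while handled < expected_events:
        # Blocking wait: only the descriptors that are ready are returned.
        events = sel.select(timeout)
        if not events:
            break  # nothing became ready in time; give up rather than spin
        for key, _mask in events:
            callback = key.data  # the callback stored at registration time
            callback(key.fileobj)
            handled += 1
    return handled


def demo():
    received = []
    sel = selectors.DefaultSelector()
    pairs = [socket.socketpair() for _ in range(3)]
    try:
        for index, (reader, _writer) in enumerate(pairs):
            reader.setblocking(False)

            def on_readable(sock, index=index):
                received.append((index, sock.recv(4096).decode()))

            sel.register(reader, selectors.EVENT_READ, data=on_readable)
        # Only one of the three registered descriptors gets data.
        pairs[1][1].sendall(b"hello")
        run_event_loop(sel, expected_events=1)
        return received
    finally:
        sel.close()
        for reader, writer in pairs:
            reader.close()
            writer.close()
```

The call to `sel.select` is still a blocking wait, which mirrors the point about `epoll_wait`: several requests can be pending, but something eventually has to wait for them.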
On POSIX systems, there is the AIO API specifically designed for asynchronous operations (based on callbacks). That being said, the standard Linux implementation of AIO emulates asynchronous IO using threads internally because the kernel did not have any fully asynchronous interface to do that until recently. In the end, this is not much better than using threads yourself to process asynchronous IO requests. In fact, AIO can be slower because it performs more kernel calls. Fortunately, Linux recently introduced a new kernel interface for asynchronous IO: `io_uring`. This new interface is the best on Linux. It is not meant to be used directly (as it is very low-level). Note that `io_uring` is pretty new so AFAIK it is not used by many high-level libraries yet.
In the end, an asynchronous call from a high-level library can result in several system calls or context switches. When used, completion threads can also have a strong impact on CPU usage, the latency of the operation, cache misses, etc. This is why asynchronous IO is not always so great in practice performance-wise, not to mention that asynchronous IO often requires the target application to be implemented pretty differently.
> Does asynchronous IO have any performance / resource benefits?
This depends on the use case, but asynchronous IO can drastically improve the performance of a wide range of applications. Actually, all applications able to start multiple requests simultaneously can benefit from asynchronous IO, especially when the target requests last for a while (HDD requests, network ones, etc.). If you are working with high-latency devices, this is the key point and you can forget about all other overheads, which are negligible (e.g. an HDD seek takes about a dozen milliseconds while a context switch generally takes a few microseconds, that is, at least 2 orders of magnitude less). For low-latency devices, the story is more complex because the many overheads may not be negligible: the best approach is to try it on your specific platform.
As for the provided points that might hurt performance, they depend on the underlying interface used and possibly on the device (and so on the platform).
For example, nothing forces the implementation to call callbacks on different threads. The cache misses caused by callbacks are probably the least of your problems after doing a system call, which is far more expensive, not to mention that modern CPUs have pretty big caches nowadays. Thus, unless you have a very large set of callbacks to call or very large callback code, you should not see a statistically significant performance impact due to this point.
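To make the orders of magnitude concrete, here is a back-of-the-envelope model. It is a deliberate simplification and an assumption on my part: it supposes that all requests overlap perfectly and that each one costs a fixed software overhead. The function names and the sample numbers (10 ms of latency, 5 µs of overhead per request) are illustrative only.

```python
def sequential_time(n_requests, latency_s):
    """Total time when each request waits for the previous one to finish."""
    return n_requests * latency_s


def overlapped_time(n_requests, latency_s, per_request_overhead_s):
    """Total time when all requests are in flight at once (idealized model)."""
    return latency_s + n_requests * per_request_overhead_s


if __name__ == "__main__":
    n = 100
    latency = 10e-3   # assumed device latency per request
    overhead = 5e-6   # assumed syscall/context-switch cost per request
    print(f"sequential: {sequential_time(n, latency):.4f} s")
    print(f"overlapped: {overlapped_time(n, latency, overhead):.4f} s")
```

Under these assumptions the software overhead only starts to matter once the device latency drops to the same order as the per-request overhead, which is the low-latency case discussed above.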
With interfaces like `io_uring`, the number of system calls is not really a problem anymore. In fact, AFAIK, `io_uring` will likely perform better than all other interfaces. For example, you can create a chain of IO operations, avoiding some callbacks and the ping-pong between the user application and the kernel. Besides, `io_uring_enter` can wait for an IO request and submit a new one at the same time.
> Does reactive programming / coroutines, which takes this idea even further (all code runs as events on the worker threads) have any performance benefit?
With coroutines, nothing is run in a separate system thread. This is a common misunderstanding. A coroutine is a function that can be paused. The pause is based on a continuation mechanism: registers, including the code pointer, are temporarily stored in memory (pause) so they can be restored later (restart). Such an operation happens in the same thread. Coroutines typically also have their own stack. Coroutines are similar to fibers.
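The pause/restart behaviour on a single thread can be illustrated with Python generators. Note the assumption: generators are a stackless flavour of coroutine (they save their local state on the generator object rather than owning a full stack), so this is a sketch of the idea rather than of the stackful coroutines mentioned above. `worker` and `run_round_robin` are hypothetical names.

```python
import threading


def worker(name, steps, log):
    for step in range(steps):
        log.append((name, step, threading.get_ident()))
        # Pause here: the position and local variables are kept on the
        # generator object until someone resumes it with next().
        yield


def run_round_robin(coroutines):
    """Resume each coroutine in turn until all of them have finished."""
    queue = list(coroutines)
    while queue:
        coroutine = queue.pop(0)
        try:
            next(coroutine)          # restart from the last pause point
            queue.append(coroutine)  # not finished yet, run it again later
        except StopIteration:
            pass                     # finished, drop it


def demo():
    log = []
    run_round_robin([worker("a", 2, log), worker("b", 2, log)])
    return log
```

Every log entry carries the same thread identifier, and the two workers interleave step by step, which is the point being made: interleaving without any additional system thread.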
Mechanisms similar to coroutines (continuations) are used to implement asynchronous functions in programming languages (some may argue that they are actually coroutines). For example, `async`/`await` in C# (and many other languages) does that. In this case, an asynchronous function can start an IO request and be paused while it is waiting on it, so that another asynchronous function can start another IO request, until there is no asynchronous function left to run. The language runtime can then wait for IO requests to be completed and then restart the asynchronous functions that were awaiting them. Such a mechanism makes asynchronous programming much easier. It is not meant to make things fast (despite using asynchronous IO). In fact, coroutines/continuations have a slight overhead so they can be slower than using a low-level asynchronous API, but the overhead is generally much smaller than the IO request latency and even generally smaller than that of a context switch (or even a system call).
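The same idea looks like this with Python's `asyncio`, used here instead of C# only to keep the examples in one language. The `fetch` function is hypothetical and `asyncio.sleep` stands in for a real IO request.

```python
import asyncio
import threading


async def fetch(name, delay_s, log):
    log.append(("start", name, threading.get_ident()))
    # The function is paused here; the event loop runs other tasks meanwhile.
    await asyncio.sleep(delay_s)
    log.append(("done", name, threading.get_ident()))
    return name.upper()


async def main(log):
    # Both requests are started before either one completes.
    return await asyncio.gather(fetch("a", 0.02, log), fetch("b", 0.01, log))


def demo():
    log = []
    results = asyncio.run(main(log))
    return results, log
```

Both `start` entries are logged before any `done` entry, and all entries come from the same thread: the overlap comes from pausing and resuming functions, not from extra threads.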
I am not very familiar with reactive programming but AFAIK it is meant to simplify the implementation of programs having a large set of dependent operations with incremental updates. This seems pretty orthogonal to asynchronous programming to me. A good implementation can benefit from asynchronous operations but this is not the main goal of this approach. The benefit of the approach is to update only the things that need to be updated, in a declarative way, and no more. Incremental updates are critical for performance as recomputing the whole dataset can be much more expensive than recomputing a small part, depending on the target application. This is especially true in GUIs.
One thing to keep in mind is that asynchronous programming can improve performance thanks to concurrency, but this is only useful if the latency of the target asynchronous operation can be mitigated. Making compute-bound code concurrent is useless performance-wise (and most likely even detrimental due to the concurrency overhead) because there is no latency issue (assuming you are not operating at the granularity of CPU instructions).
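As a rough illustration of "only update what needs to be updated", here is a toy dependency graph. It is a sketch under my own assumptions, not how any particular reactive framework works; `Cell` and `Derived` are hypothetical classes.

```python
class Cell:
    """An input value that notifies its dependents when it changes."""

    def __init__(self, value):
        self.value = value
        self.dependents = []

    def set(self, value):
        if value == self.value:
            return  # nothing changed, so nothing downstream is recomputed
        self.value = value
        for dependent in self.dependents:
            dependent.recompute()


class Derived:
    """A value declared as a function of some cells."""

    def __init__(self, func, *sources):
        self.func = func
        self.sources = sources
        self.recomputations = 0
        for source in sources:
            source.dependents.append(self)
        self.recompute()

    def recompute(self):
        self.recomputations += 1
        self.value = self.func(*(source.value for source in self.sources))


def demo():
    price, quantity, title = Cell(10), Cell(3), Cell("cart")
    total = Derived(lambda p, q: p * q, price, quantity)
    header = Derived(lambda t: t.upper(), title)
    price.set(12)  # only `total` depends on `price`
    return total.value, total.recomputations, header.recomputations
```

Changing `price` recomputes `total` but leaves `header` untouched, which is the incremental-update property in miniature.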
> Why do we actually do reactive programming?
As said above, it is a good solution for performing incremental updates of a complex dataflow. Not all applications benefit from this.
Programming models are like tools: developers need to pick the best one to address the specific problems of a given application. Otherwise, it is a recipe for disaster. Unfortunately, it is not rare for people to use programming models that are not well suited to their needs. There are many reasons for this (historical, psychological, technical, etc.) but this is too broad a topic and this answer is already pretty big.
Note that using threads to do asynchronous operations is generally not a great idea. This is one way to implement asynchronous programming, and not an efficient one (especially without a thread pool). It often introduces more issues than it solves. For example, you may need to protect variables with locks (or any other synchronization mechanism) to avoid race conditions; care about (low-level) operations that cannot be executed on a separate thread; and consider the overheads of TLS, the ones due to cache misses, inter-core communication, possible context switches and NUMA effects (not to mention that the target cores can be sleeping, operating at a lower frequency, etc.).
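For comparison, a thread-based version of "run several operations concurrently" might look like the sketch below. It uses the standard `ThreadPoolExecutor`; the function name and the shared counter are hypothetical, chosen only to show where a lock becomes necessary.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def sum_with_threads(n_tasks, n_workers=4):
    total = 0
    lock = threading.Lock()

    def task(value):
        nonlocal total
        # Shared state touched from several threads needs protection;
        # without the lock this read-modify-write could race.
        with lock:
            total += value

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # list() forces all tasks to complete and surfaces any exception.
        list(pool.map(task, range(n_tasks)))
    return total
```

The lock is exactly the kind of extra care the paragraph above refers to: the single-threaded event-loop and coroutine examples do not need one because only one piece of code runs at a time.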
Answer 2
Score: 2
> Does asynchronous IO have any performance / resource benefits?
In a way. As you noted, asynchronous code can often be slower than synchronous code. There's more overhead in terms of setting up the callback structures and whatnot.
However, asynchronous code is more scalable, precisely because it doesn't block threads unnecessarily. Experiments on web servers running real-world-ish code showed a significant increase in scalability when switching from synchronous to asynchronous code.
In summary, asynchronous code isn't about performance, but about scalability.
> Why do we actually do reactive programming?
Reactive programming is quite different. Asynchronous code is still pull-based; i.e., your app requests some I/O operation, and some time later that operation completes. Reactive code is push-based; a more natural example would be something like a listening socket or a WebSocket connection that can push commands at any time.
With reactive code, the code defines how it reacts to incoming events. The structure of the code is more declarative rather than imperative. Reactive frameworks have a way to declare how to react to events, "subscribe" to those events, and then "unsubscribe" from the events when done.
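The subscribe/unsubscribe shape described above can be sketched as follows. `EventStream` is a hypothetical minimal class written for this illustration, not the API of any specific reactive framework.

```python
class EventStream:
    """A push-based source: it calls its subscribers whenever an event arrives."""

    def __init__(self):
        self._handlers = []

    def subscribe(self, handler):
        self._handlers.append(handler)

        def unsubscribe():
            if handler in self._handlers:
                self._handlers.remove(handler)

        return unsubscribe

    def push(self, event):
        # Iterate over a copy so handlers may unsubscribe while being called.
        for handler in list(self._handlers):
            handler(event)


def demo():
    seen = []

    def on_event(event):
        seen.append(event)

    stream = EventStream()
    unsubscribe = stream.subscribe(on_event)  # declare how to react
    stream.push("open")
    stream.push("message")
    unsubscribe()                             # done reacting
    stream.push("ignored")
    return seen
```

The subscriber never asks for the next event; the source decides when `on_event` runs, which is the push-based behaviour contrasted with pull-based asynchronous calls.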
It's possible to structure asynchronous code as reactive (the I/O request is the "subscription", there is only one event, which is the completion of that request, followed by an unsubscription). But this is not normally done for all asynchronous code; it's only normal if there's already a significant amount of code using the declarative reactive style and that code wants to do asynchronous work while keeping the same style.
Any asynchronous code can be written in a reactive style, but that's not normally done for complexity/maintainability reasons. Reactive code tends to be more difficult to understand and maintain.