Project loom: what makes the performance better when using virtual threads?

Question

To give some context here, I have been following Project Loom for some time now. I have read The state of Loom. I have done asynchronous programming.

Asynchronous programming (provided by Java NIO) returns the thread to the thread pool when a task waits, and it goes to great lengths to not block threads. This gives a large performance gain: we can now handle many more requests, as they are not directly bound by the number of OS threads. But what we lose here is the context. The same task is no longer associated with just one thread, and all the context is lost once we dissociate tasks from threads. Exception traces do not provide very useful information, and debugging is difficult.

In comes Project Loom with virtual threads that become the single unit of concurrency. And now you can perform a single task on a single virtual thread.

It's all fine until now, but the article goes on to state, with Project Loom:

> A simple, synchronous web server will be able to handle many more requests without requiring more hardware.

I don't understand how we get performance benefits with Project Loom over asynchronous APIs. The asynchronous APIs make sure not to keep any thread idle. So what does Project Loom do to make it more efficient and performant than asynchronous APIs?

EDIT

Let me rephrase the question. Let's say we have an HTTP server that takes in requests and does some CRUD operations against a backing persistent database. Say this HTTP server handles a lot of requests: 100K RPM. There are two ways of implementing this:

  1. The HTTP server has a dedicated pool of threads. When a request comes in, a thread carries the task up until it reaches the DB, where the task has to wait for the response from the DB. At this point, the thread is returned to the thread pool and goes on to do other tasks. When the DB responds, the task is picked up again by some thread from the thread pool, which returns an HTTP response.
  2. The HTTP server just spawns a virtual thread for every request. If there is IO, the virtual thread just waits for it to complete and then returns the HTTP response. Basically, there is no pooling business going on for the virtual threads.

Given that the hardware and the throughput remain the same, would any one solution fare better than the other in terms of response times or handling more throughput?

My guess is that there would not be any difference w.r.t performance.

Answer 1

Score: 16

We don't get a benefit over the asynchronous API. What we potentially get is performance similar to asynchronous code, but with synchronous code.
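To make that concrete, here is a minimal sketch of the same request handled both ways. It assumes JDK 21 or later (for `Executors.newVirtualThreadPerTaskExecutor()`). The names `queryDbAsync`, `queryDbBlocking`, `handleAsync` and `handleSync` are hypothetical, and the "database" is only simulated with a 50 ms delay.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class SameWorkTwoStyles {
    // Hypothetical non-blocking "database" call: the result arrives about 50 ms later,
    // and no worker thread is blocked per request during the delay.
    static CompletableFuture<String> queryDbAsync(String id) {
        return CompletableFuture.supplyAsync(
                () -> "row-" + id,
                CompletableFuture.delayedExecutor(50, TimeUnit.MILLISECONDS));
    }

    // Hypothetical blocking "database" call: the calling thread waits about 50 ms.
    static String queryDbBlocking(String id) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return "row-" + id;
    }

    // Asynchronous style: the response is built in a chained stage.
    static CompletableFuture<String> handleAsync(String id) {
        return queryDbAsync(id).thenApply(row -> "200 OK " + row);
    }

    // Synchronous style: plain blocking code, meant to run on a virtual thread.
    static String handleSync(String id) {
        String row = queryDbBlocking(id);
        return "200 OK " + row;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(handleAsync("1").join());

        // One virtual thread per submitted task; close() waits for the tasks to finish.
        try (ExecutorService virtualThreads = Executors.newVirtualThreadPerTaskExecutor()) {
            System.out.println(virtualThreads.submit(() -> handleSync("1")).get());
        }
    }
}
```

Both variants produce the same response. The difference is in how the code reads and in what a stack trace looks like, not in what the hardware is asked to do.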

Answer 2

Score: 9

The answer by @talex puts it crisply. Adding further to it.

Loom is more about a native concurrency abstraction, which additionally helps one write asynchronous code. Given that it's a VM-level abstraction rather than just a code-level one (like what we have been doing until now with CompletableFuture etc.), it lets one implement asynchronous behavior with reduced boilerplate.

With Loom, a more powerful abstraction is the savior. We have seen repeatedly how abstraction with syntactic sugar lets one write programs effectively, whether it was functional interfaces in JDK 8 or for-comprehensions in Scala.

With Loom, there isn't a need to chain multiple CompletableFutures (to save on resources); one can write the code synchronously. With each blocking operation encountered (ReentrantLock, I/O, JDBC calls), the virtual thread gets parked. And because these are lightweight threads, the context switch is far cheaper than it is for kernel threads.

When a virtual thread blocks, the carrier thread that was running its body is put to work executing some other virtual thread. So effectively, the carrier thread is not sitting idle but doing other work, and a carrier comes back to continue the original virtual thread whenever it is unparked. This is just like how a thread pool would work, except that here a single carrier thread in a way executes the bodies of multiple virtual threads, switching from one to another when they block.

We get the same behavior (and hence performance) as manually written asynchronous code, but without the boilerplate needed to do the same thing.
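A small experiment can illustrate the parking behavior described above. This is a sketch that assumes JDK 21 or later. The class and method names are made up for illustration, and `Thread.sleep` stands in for any blocking operation.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class ManyBlockedVirtualThreads {
    // Starts `tasks` virtual threads that each block for `blockFor`,
    // and returns how many of them ran to completion.
    static int runBlockingTasks(int tasks, Duration blockFor) {
        AtomicInteger completed = new AtomicInteger();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                executor.submit(() -> {
                    try {
                        // Blocking here parks the virtual thread; its carrier is free to run others.
                        Thread.sleep(blockFor);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    completed.incrementAndGet();
                });
            }
        } // close() waits until every submitted task has finished
        return completed.get();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int done = runBlockingTasks(10_000, Duration.ofMillis(100));
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // The elapsed time depends on the machine, so no particular figure is claimed here.
        System.out.println(done + " tasks finished in " + elapsedMs + " ms");
    }
}
```

If each sleeping task held on to an OS thread, 10,000 of them would need 10,000 OS threads. Here they share a small set of carrier threads, which is the same economy an asynchronous design gets by hand.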


Consider the case of a web framework where there is one thread pool to handle I/O and another for executing HTTP requests. For simple HTTP requests, one might serve the request from the HTTP pool thread itself. But if there are any blocking or CPU-heavy operations, we let that activity happen asynchronously on a separate thread.

This thread would collect the information from an incoming request, spawn a CompletableFuture, and chain it with a pipeline (a read from the database as one stage, followed by a computation on the result, followed by another stage to write back to the database, web service calls, etc.). Each one is a stage, and the resultant CompletableFuture is returned to the web framework.
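As a rough sketch of such a pipeline (not how Play or any particular framework implements it), the stages might be chained as below. `readBalance` and `writeBalance` are hypothetical stand-ins for database calls, simulated with `CompletableFuture.delayedExecutor` (JDK 9+).

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;

class RequestPipeline {
    // Simulated I/O latency: work submitted here starts about 20 ms later.
    static final Executor SIMULATED_IO =
            CompletableFuture.delayedExecutor(20, TimeUnit.MILLISECONDS);

    // Hypothetical database read; always reports a balance of 100.
    static CompletableFuture<Integer> readBalance(String account) {
        return CompletableFuture.supplyAsync(() -> 100, SIMULATED_IO);
    }

    // Hypothetical database write; echoes back what was "saved".
    static CompletableFuture<String> writeBalance(String account, int newBalance) {
        return CompletableFuture.supplyAsync(() -> account + "=" + newBalance, SIMULATED_IO);
    }

    // Stage 1: read. Stage 2: compute. Stage 3: write. Stage 4: build the response.
    static CompletableFuture<String> handle(String account, int deposit) {
        return readBalance(account)
                .thenApply(balance -> balance + deposit)
                .thenCompose(updated -> writeBalance(account, updated))
                .thenApply(saved -> "200 OK " + saved);
    }

    public static void main(String[] args) {
        // The framework would attach a callback; join() is used here only to print the result.
        System.out.println(handle("acct-1", 25).join());
    }
}
```

`thenCompose` is used for the write stage because that stage itself returns a future; using `thenApply` there would produce a nested `CompletableFuture<CompletableFuture<String>>`.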

When the resultant future is complete, the web framework relays the results back to the client. This is how Play Framework and others have been dealing with it, providing isolation between the HTTP thread-handling pool and the execution of each request. But if we dive deeper, why is it that we do this?

One core reason is to use resources effectively, particularly around blocking calls. Hence we chain with thenApply etc. so that no thread is blocked on any activity, and we do more with fewer threads.

This works great, but it is quite verbose. Debugging is indeed painful, and if one of the intermediary stages results in an exception, the control flow goes haywire, requiring further code to handle it.

With Loom, we write synchronous code and let someone else decide what to do when blocked, rather than sleep and do nothing.
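The point about exceptions can be sketched as follows. The names are hypothetical, and the "work" is just parsing and doubling a number so that a failure in an intermediate stage is easy to trigger.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

class PipelineFailure {
    // Asynchronous style: a failure in the parsing stage skips the next stage
    // and reaches exceptionally(), typically wrapped in a CompletionException.
    static CompletableFuture<String> handleAsync(String input) {
        return CompletableFuture.supplyAsync(() -> input)
                .thenApply(Integer::parseInt)
                .thenApply(n -> "200 OK " + (n * 2))
                .exceptionally(ex -> {
                    // Unwrap the wrapper, if present, to find the real cause.
                    Throwable cause = (ex instanceof CompletionException && ex.getCause() != null)
                            ? ex.getCause()
                            : ex;
                    return "500 " + cause.getClass().getSimpleName();
                });
    }

    // Synchronous style: the same logic with an ordinary try/catch.
    static String handleSync(String input) {
        try {
            int n = Integer.parseInt(input);
            return "200 OK " + (n * 2);
        } catch (NumberFormatException e) {
            return "500 " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(handleAsync("21").join());   // 200 OK 42
        System.out.println(handleAsync("oops").join()); // 500 NumberFormatException
        System.out.println(handleSync("oops"));         // 500 NumberFormatException
    }
}
```

In the synchronous version the stack trace of the `NumberFormatException` runs straight through `handleSync`, whereas in the chained version it passes through the `CompletableFuture` machinery instead of the code that built the pipeline.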

Answer 3

Score: 4

  1. The HTTP server has a dedicated pool of threads ....
    How big of a pool? (Number of CPUs)*N + C? With N>1 one can fall back to anti-scaling, as lock contention extends latency; whereas N=1 can under-utilize available bandwidth. There is a good analysis here.

  2. The HTTP server just spawns...
    That would be a very naive implementation of this concept. A more realistic one would strive to draw from a dynamic pool which keeps one real thread for every blocked system call + one for every real CPU. At least that is what the folks behind Go came up with.
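For the sizing question in item 1, the arithmetic is simple enough to write down. This is only an illustration: `poolSize` is a hypothetical helper, and the values N = 2 and C = 1 are arbitrary, not recommendations.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PoolSizing {
    // (Number of CPUs) * N + C, as in the formula above.
    static int poolSize(int cpus, int n, int c) {
        return cpus * n + c;
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        int size = poolSize(cpus, 2, 1); // N = 2 and C = 1 are arbitrary illustration values
        ExecutorService pool = Executors.newFixedThreadPool(size);
        System.out.println("CPUs=" + cpus + ", pool size=" + size);
        pool.shutdown();
    }
}
```

Which N and C work well is exactly the tuning problem described in item 1, so any real values would have to come from measurement.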

The crux is to keep the {handlers, callbacks, completions, virtual threads, goroutines : all PEAs in a pod} from fighting over internal resources; thus they do not lean on system-based blocking mechanisms until absolutely necessary. This falls under the banner of lock avoidance, and might be accomplished with various queuing strategies (see libdispatch), etc. Note that this leaves the PEA divorced from the underlying system thread, because they are internally multiplexed between them. This is your concern about divorcing the concepts. In practice, you pass around your favourite language's abstraction of a context pointer.

As 1 indicates, there are tangible results that can be directly linked to this approach, and a few intangibles. Locking is easy -- you just make one big lock around your transactions and you are good to go. That doesn't scale, but fine-grained locking is hard: hard to get working, and hard to choose the fineness of the grain. When to use { locks, CVs, semaphores, barriers, ... } is obvious in textbook examples; a little less so in deeply nested logic. Lock avoidance makes that, for the most part, go away, and be limited to contended leaf components like malloc().

I maintain some skepticism, as the research typically shows a poorly scaled system, which is transformed into a lock avoidance model, then shown to be better. I have yet to see one which unleashes some experienced developers to analyze the synchronization behavior of the system, transform it for scalability, then measure the result. But even if that were a win, experienced developers are a rare(ish) and expensive commodity; the heart of scalability is really financial.

huangapple
  • Posted on August 12, 2020, 14:02:58
  • Please keep this link when reposting: https://go.coder-hub.com/63370669.html