How many tasks run concurrently in one executor, and how does Spark handle multithreading among tasks in one executor?
Question
In Spark, how many tasks are executed in parallel at a time? Discussions are found in
https://stackoverflow.com/questions/37528047/how-are-stages-split-into-tasks-in-spark and
https://stackoverflow.com/questions/25836316/how-dag-works-under-the-covers-in-rdd
But I did not find a clear conclusion there.
Consider the following scenarios (assume spark.task.cpus = 1, and ignore the vcore concept for simplicity):
- 10 executors (2 cores/executor), 10 partitions => I think the number of concurrent tasks at a time is 10
- 10 executors (2 cores/executor), 2 partitions => I think the number of concurrent tasks at a time is 2
- 10 executors (2 cores/executor), 20 partitions => I think the number of concurrent tasks at a time is 20
- 10 executors (1 core/executor), 20 partitions => I think the number of concurrent tasks at a time is 10
Am I correct? Regarding the 3rd case, will it be 20 considering multi-threading (i.e. 2 threads because there are 2 cores) inside one executor?
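For concreteness, here is a minimal Scala sketch of case 3 (the app name, configuration values, and dataset are made up for illustration). The upper bound on concurrently running tasks is executors * spark.executor.cores / spark.task.cpus, while the number of tasks actually created for a stage equals the number of partitions:

    import org.apache.spark.sql.SparkSession

    // Case 3 above: 10 executors x 2 cores each, 20 partitions (illustrative values only).
    val spark = SparkSession.builder()
      .appName("concurrency-illustration")
      .config("spark.executor.instances", "10") // 10 executors
      .config("spark.executor.cores", "2")      // 2 cores per executor
      .config("spark.task.cpus", "1")           // 1 core per task (the default)
      .getOrCreate()

    // 20 partitions => 20 tasks per stage.
    val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 20)

    // Task slots = executors * cores per executor / task cpus = 10 * 2 / 1 = 20,
    // so in case 3 all 20 tasks of a stage can run at the same time.
    println(rdd.getNumPartitions) // 20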
UPDATE1
If the 3rd case is correct, it means:
- when there are idle cores inside an executor, Spark may automatically decide to run multiple task threads in that executor
- when there is only one core in the executor, multithreading won't happen in that executor.
If this is true, isn't the behavior of Spark inside an executor a bit uncertain (single-threaded vs. multi-threaded)?
Note that the code shipped from the driver to the executors may not have considered atomicity problems, e.g. by using the synchronized keyword.
How is this handled by Spark?
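To make the atomicity concern concrete, this is the kind of closure I have in mind (a contrived sketch reusing the spark session from the sketch above; nothing in it guards the captured buffer with synchronized):

    import scala.collection.mutable.ArrayBuffer

    val data = spark.sparkContext.parallelize(1 to 100, numSlices = 4)

    // Mutable driver-side state captured by the closure, with no synchronization at all.
    val buffer = ArrayBuffer.empty[Int]

    // This closure is shipped from the driver to the executors as-is; if two task
    // threads in one executor really shared one `buffer`, these updates could race.
    data.foreach { x =>
      buffer += x
    }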
Answer 1
Score: 1
I think you are right; it depends on the number of executors and the number of cores. One partition creates one task, which runs on one core.
Answer 2
Score: 1
I think all 4 cases are correct, and the 4th case makes sense in reality ("overbooking" cores). We should normally aim for a factor of 2 to 4 for the number of partitions, i.e. the number of partitions should equal 2 to 4 times the total number of CPU cores in the cluster.
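A minimal sketch of that rule of thumb (spark and someRdd below are placeholders for an existing SparkSession and RDD; on standalone/YARN deployments sc.defaultParallelism reflects the total number of executor cores):

    // Rule of thumb: number of partitions = 2x to 4x the total CPU cores in the cluster.
    val sc = spark.sparkContext
    val targetPartitions = sc.defaultParallelism * 3      // pick a factor between 2 and 4
    val wellSized = someRdd.repartition(targetPartitions) // spread work across all task slots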
Regarding threading, 2 tasks running concurrently in one executor should not have multi-threading issues, as each task handles its own RDD partition.
If spark.task.cpus = 2 is set, which means 2 CPU cores per task, then IMO there might be race-condition problems (if there are vars involved), but usually we handle immutable values like RDDs, so there should hardly be any issues either.
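A small sketch of why a captured variable does not become a shared-memory race between task threads: on a cluster, each task works on its own deserialized copy of the closure (and of the captured variable), so concurrent tasks never race on one variable, and the driver's copy is not updated either; accumulators are the supported way to aggregate across tasks. The local[2] master below merely emulates one executor with 2 task threads (in local mode the closure behaviour can differ, as the Spark docs note), and the app name is made up:

    import org.apache.spark.{SparkConf, SparkContext}

    // local[2]: one JVM acting as a single executor with 2 task threads.
    val sc = new SparkContext(new SparkConf().setAppName("closure-copies").setMaster("local[2]"))

    var counter = 0
    val nums = sc.parallelize(1 to 100, numSlices = 4)

    // On a cluster each task receives its own copy of the closure (and of `counter`),
    // so tasks do not race on the same variable, but the driver's `counter` stays 0.
    nums.foreach(x => counter += x)

    // Accumulators are the supported way to aggregate a value across tasks.
    val sum = sc.longAccumulator("sum")
    nums.foreach(x => sum.add(x))
    println(sum.value) // 5050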