What is the purpose of Apache Spark job, task and stage?
Question
I am learning about Apache Spark and want to understand the meaning of each task created by the jobs recorded in the Spark history server.
For example, an application I wrote creates 17 jobs, of which job 0 runs for 10 minutes and consists of 2384 small tasks. Is it possible to find out what these 2384 tasks mean?
I also found a picture of the DAG for the job and want to understand the relationship between the DAG and the tasks, specifically between the DAG in the attached image and the 2384 tasks: https://i.stack.imgur.com/Azva4.png
Answer 1
Score: 2
First of all, we need to understand a couple of concepts in Spark. An application is a combination of several jobs, each job is a sequence of stages that depend on each other, and each stage is made up of tasks. Tasks are the smallest unit of work in Spark; they are responsible for performing the actual computation on data partitions.
To understand the meaning of each task, you need to understand the operations you have performed in your Spark application. Here's a breakdown of the relationship between your application, jobs, stages, and tasks (a minimal sketch follows the list):

- Application: The top-level Spark program you have written.
- Jobs: Your application is divided into multiple jobs based on actions like count(), save(), or collect(). Each job corresponds to an action.
- Stages: A job is divided into stages based on operations that can be executed in parallel. Stage boundaries are created by wide transformations that require a shuffle, such as reduceByKey() or groupBy(); narrow transformations such as map() and filter() are pipelined together within a stage.
- Tasks: Each stage is further divided into tasks that can be executed in parallel on different data partitions, one task per partition.
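Here is a minimal PySpark sketch of that hierarchy; the app name, data, and partition counts are arbitrary illustrations, not details taken from your application:

```python
# Minimal PySpark sketch of the application/job/stage/task hierarchy.
# All names and sizes here are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("job-stage-task-demo").getOrCreate()
sc = spark.sparkContext

# An RDD with 8 partitions: every stage that scans it runs 8 tasks.
rdd = sc.parallelize(range(1_000_000), numSlices=8)

# Narrow transformations (map, filter) are pipelined into one stage.
pairs = rdd.map(lambda x: (x % 100, x)).filter(lambda kv: kv[1] % 2 == 0)

# reduceByKey is a wide transformation: it needs a shuffle, so it starts
# a new stage.
totals = pairs.reduceByKey(lambda a, b: a + b)

# count() is an action: it submits one job, made of two stages --
# stage 0 with 8 tasks (one per input partition) and stage 1 with one
# task per shuffle partition.
print(totals.count())
```

In the Spark UI this shows up as exactly one job for the count() action, with the shuffle drawn as the boundary between its two stages.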
To understand the meaning of the 2384 tasks in job 0 of the application you are referring to, you should analyze the transformations and actions applied in that specific job. This will help you figure out the operations each task is performing.
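Since the number of tasks in a stage equals the number of partitions that stage processes, a quick way to account for a large task count is to check partition counts directly. A small sketch, reusing the SparkSession from above and assuming a hypothetical input path:

```python
# The task count of a stage equals the number of partitions it processes,
# so 2384 tasks usually means 2384 partitions somewhere in the job.
df = spark.read.parquet("/path/to/input")  # hypothetical path

# Partitions of the scanned data -> tasks in the scan stage.
print(df.rdd.getNumPartitions())

# Stages that run after a shuffle use spark.sql.shuffle.partitions tasks
# (200 by default).
print(spark.conf.get("spark.sql.shuffle.partitions"))
```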
Now, the Directed Acyclic Graph (DAG) is a graph representation of the sequence of operations (transformations and actions) in your Spark application. In other words, it is a representation of the entire computation process. Spark uses the DAG to plan execution: it divides the computation into stages, which are then assigned to different worker nodes in the cluster. Each node in the DAG represents an RDD operation, and the edges between the nodes represent the dependencies between these operations.
The relationship between DAG and Task is as follows (see the inspection sketch after this list):

- The DAG represents the overall flow of operations in your Spark application, from the input data to the final result.
- Tasks are the individual units of work that correspond to a specific operation on a partition of data.
- The DAG scheduler in Spark divides the entire computation into stages based on the dependencies between operations.
- These stages are further divided into tasks that are executed in parallel on different data partitions.
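You can also inspect this structure outside the Spark UI. A quick sketch, reusing `totals` and `spark` from the examples above (the exact output format varies between Spark versions):

```python
from pyspark.sql.functions import col

# The lineage the DAG is built from; the indentation groups in the output
# correspond to stage boundaries. (PySpark returns bytes here, hence the
# decode -- this detail varies between versions.)
print(totals.toDebugString().decode("utf-8"))

# For the DataFrame API, explain() prints the physical plan; each
# "Exchange" node marks a shuffle, i.e. a stage boundary in the DAG.
spark.range(1000).groupBy((col("id") % 10).alias("bucket")).count().explain()
```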
In summary, tasks are the smallest units of work in your Spark application, and the DAG represents the sequence of operations and dependencies between these tasks. Understanding the operations in your application and the corresponding tasks will help you grasp the meaning of each task in your jobs.
You can check the following book; it is a great resource for understanding Spark concepts in depth: Learning Spark, 2nd Edition by Jules S. Damji and Brooke Wenig.
https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/