2023年8月9日 18:08:12go评论211阅读模式

英文:

Spark driver stopped unexpectedly (Databricks)

问题

我在Azure Databricks中有一个Python笔记本，其中包含一个包含137次迭代的for循环。对于每次迭代，它使用dbutils.notebook.run调用另一个Scala笔记本。Scala笔记本通过对MongoDB数据库的查询创建一个DataFrame。我在Scala笔记本中使用df.createOrReplaceGlobalTempView("<<view_name>>")创建一个全局临时视图，因为我需要从Python笔记本中恢复这些数据并进行处理。从Python笔记本中读取的代码如下所示：

global_temporary_database = spark.conf.get("spark.sql.globalTempDatabase")
for _ in range(137):
    dbutils.notebook.run(path="<<path_to_scala_notebook>>", timeout_seconds=600, arguments=<<current_configuration>>)
    # 恢复数据并删除全局临时视图
    df = table(f"{global_temporary_database}.<<view_name>>")
    spark.catalog.dropGlobalTempView("<<view_name>>")
    # 进行一些处理，如过滤行和重命名列

这对于有限次数的迭代是有效的。然而，当我尝试运行整个循环时，我会遇到以下错误：

The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached

我尝试使用time.sleep()在迭代之间添加一些延迟，以避免过载集群。我还尝试在每次迭代后使用spark.catalog.clearCache()，但它不起作用。

以下是集群规格：

Databricks Runtime版本：9.1 LTS（包括Apache Spark 3.1.2，Scala 2.12）
2个工作节点：61 GB内存，8个核心
1个驱动节点：16 GB内存，4个核心

不幸的是，我需要使用2个笔记本，因为我正在使用Scala库对DataFrame应用一些操作，然后我需要在笔记本之间共享这些数据，所以无法避免这一部分。

任何帮助将不胜感激。

英文:

I have a Python notebook in Azure Databricks which performs a for loop with 137 iterations. For each iteration, it calls another Scala notebook using dbutils.notebook.run. The Scala notebook creates a DataFrame from a query to a MongoDB database. I create a global temporary view in the Scala notebook using df.createOrReplaceGlobalTempView("<<view_name>>") because I need to recover this data from the Python notebook and keep processing it. The code to read from the Python notebook looks like this:

global_temporary_database = spark.conf.get(&quot;spark.sql.globalTempDatabase&quot;)
for _ in range(137):
    dbutils.notebook.run(path=&quot;&lt;&lt;path_to_scala_notebook&gt;&gt;&quot;, timeout_seconds=600, arguments=&lt;&lt;current_configuration&gt;&gt;)
    # Recover the data and drop the global temporary view
    df = table(f&quot;{global_temporary_database}.&lt;&lt;view_name&gt;&gt;&quot;)
    spark.catalog.dropGlobalTempView(&quot;&lt;&lt;view_name&gt;&gt;&quot;)
    # Do some processing like filtering rows and rename columns

This works for a limited number of iterations. However, when I try to run the whole loop I get the following error:

The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached

I've tried to use time.sleep() to add some delay between iterations to avoid overloading the cluster. I've also tried to use spark.catalog.clearCache() after each iteration but it doesn't work.

Below are the cluster specs:

Databricks Runtime Version: 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12)
2 Workers: 61 GB Memory, 8 Cores
1 Driver: 16 GB Memory, 4 Cores

Unfortunately, I need to use 2 notebooks because I'm using a Scala library to apply some operations to the DataFrames, and then I need to share this data between notebooks, so there's no way to avoid that part.

Any help would be appreciated.

答案1

得分: 1

我在复制问题时在我的环境中遇到了相同的错误：

Spark驱动程序意外停止（Databricks）

我从代码中删除了 spark.stop() 并重新运行代码。它成功运行而没有任何错误：

Spark驱动程序意外停止（Databricks）

英文:

I got the same error in my environment while replicating the issue:

Spark驱动程序意外停止（Databricks）

I removed spark.stop() from the code and run the code it again. It run successfully without any error:

Spark驱动程序意外停止（Databricks）

For more information you can check below page:

https://kb.databricks.com/jobs/driver-unavailable#:~:text=One%20common%20cause%20for%20this,to%20frequent%20full%20garbage%20collection.

答案2

得分: 0

我通过只调用一次我的Scala笔记本，并将所有的循环逻辑移到Scala笔记本中来解决了这个问题，这有点繁琐。但我猜调用笔记本137次并产生大量开销并不是一个好主意。

执行时间显著提高。现在只需要大约20分钟，而以前需要1小时才能最终停止驱动程序并完成。

英文:

I resolved this by calling my Scala notebook only once and moving all the loop logic into the Scala notebook, which was a bit tedious. But I guess it's not a good idea to call a notebook 137 times as it produces a lot of overhead.

The execution time has improved significantly. Now it takes around 20 minutes, whereas previously it took 1 hour to eventually stop the driver and not finish.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Spark驱动程序意外停止（Databricks）

问题

答案1

答案2

Spark Structured Streaming 中的多重聚合

Communication link failure. Failed to connect to server. Reason: HTTP Response code: 403

如何使用正则表达式解决这个Pyspark代码块

最快的方式找到在最后 ‘x’ 分钟内修改的文件。

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论