Databricks: Issue while creating spark data frame from pandas
Question
I have a pandas data frame which I want to convert into a Spark data frame. Usually I use the code below to create a Spark data frame from pandas, but all of a sudden I started to get the error below. I am aware that pandas has removed iteritems(), but my current pandas version is 2.0.0; I also tried installing a lower version and creating the Spark df, but I still get the same error. The error is raised inside the Spark function. What is the solution for this? Which pandas version should I install in order to create a Spark df? I also tried changing the runtime of the Databricks cluster and re-running, but I still get the same error.
import pandas as pd
spark.createDataFrame(pd.DataFrame({'i':[1,2,3],'j':[1,2,3]}))
Error:
UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
'DataFrame' object has no attribute 'iteritems'
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
warn(msg)
AttributeError: 'DataFrame' object has no attribute 'iteritems'
Answer 1
Score: 26
It's related to the Databricks Runtime (DBR) version used - the Spark versions shipped with DBR 12.2 and below rely on the .iteritems
function to construct a Spark DataFrame from a Pandas DataFrame. This issue was fixed in Spark 3.4, which is available as DBR 13.x.
If you can't upgrade to DBR 13.x, then you need to downgrade Pandas to the latest 1.x version (1.5.3 right now) by using the %pip install -U pandas==1.5.3
command in your notebook. Although it's better to use the Pandas version shipped with your DBR - it is tested for compatibility with the other packages in the DBR.
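A rough sketch of that workflow as notebook cells (the %pip line is a Databricks notebook magic and the restart call only ensures the freshly installed pandas is picked up; on Databricks a spark session already exists, so getOrCreate simply returns it):

# Cell 1: pin pandas to the latest 1.x release (notebook magic, shown as a comment)
# %pip install -U pandas==1.5.3

# Cell 2: restart the Python process so the downgraded pandas is loaded
# dbutils.library.restartPython()

# Cell 3: the original conversion should now work on DBR 12.2 and below
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
pdf = pd.DataFrame({'i': [1, 2, 3], 'j': [1, 2, 3]})
spark.createDataFrame(pdf).show()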
Answer 2
Score: 13
I couldn't change package versions, but it looks like this was a name change only.
So I did

df.iteritems = df.items

and spark.createDataFrame(df) works now.
Sure, it's ugly, and it will break my notebook when I move to a cluster with a newer DBR, but it works for now.
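A minimal, self-contained sketch of that instance-level patch (pdf here is just an illustrative name, and getOrCreate returns the notebook's existing session on Databricks):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

pdf = pd.DataFrame({'i': [1, 2, 3], 'j': [1, 2, 3]})
# pandas 2.x removed DataFrame.iteritems; alias it back to items on this one
# instance so the older Spark code path can still iterate over its columns
pdf.iteritems = pdf.items

spark.createDataFrame(pdf).show()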
Answer 3
Score: 2
The Arrow optimization is failing because of the missing 'iteritems' attribute.
You should try disabling the Arrow optimization in your Spark session and creating the DataFrame without it.
Here is how it would work:
import pandas as pd
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder \
.appName("Pandas to Spark DataFrame") \
.getOrCreate()
# Disable Arrow optimization
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
# Create a pandas DataFrame
pdf = pd.DataFrame({'i': [1, 2, 3], 'j': [1, 2, 3]})
# Convert pandas DataFrame to Spark DataFrame
sdf = spark.createDataFrame(pdf)
# Show the Spark DataFrame
sdf.show()
It should work, but if you want to keep the Arrow optimization you can also downgrade your pandas version, like this: pip install pandas==1.2.5
Answer 4
Score: 2
This issue occurs with pandas versions >= 2.0: in pandas 2.0, the .iteritems function was removed.
There are two solutions for this issue.
- Downgrade the pandas version to < 2. For example,
pip install -U pandas==1.5.3
- Use the latest Spark version, i.e. 3.4
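A purely illustrative check of which fix applies in a given environment, assuming pyspark is importable on the driver and that only the combination pandas >= 2.0 with Spark < 3.4 is affected:

import pandas as pd
import pyspark

# .iteritems was removed in pandas 2.0 and Spark stopped relying on it in 3.4,
# so the AttributeError only appears for pandas >= 2.0 together with Spark < 3.4
pandas_major = int(pd.__version__.split(".")[0])
spark_version = tuple(int(x) for x in pyspark.__version__.split(".")[:2])

if pandas_major >= 2 and spark_version < (3, 4):
    print("Affected: downgrade pandas to <2 or move to Spark 3.4+ (DBR 13.x)")
else:
    print("Not affected by the iteritems removal")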
Answer 5
Score: 1
If you want to keep the pandas version that you have, try this:
import pandas as pd
pd.DataFrame.iteritems = pd.DataFrame.items
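Because this aliases iteritems on the DataFrame class itself, it covers every DataFrame created afterwards in the same Python process, unlike the per-instance patch in Answer 2. A short usage sketch, assuming the usual Spark session (getOrCreate returns the existing one on Databricks):

import pandas as pd
from pyspark.sql import SparkSession

# module-level shim: restore iteritems as an alias of items for all DataFrames
pd.DataFrame.iteritems = pd.DataFrame.items

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame(pd.DataFrame({'i': [1, 2, 3], 'j': [1, 2, 3]})).show()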