Databricks: Issue while creating spark data frame from pandas

Question

I have a pandas DataFrame that I want to convert into a Spark DataFrame. I normally use the code below to do this, but it suddenly started failing with the error shown. I am aware that pandas has removed iteritems(); my current pandas version is 2.0.0. I also tried installing an earlier version before creating the Spark DataFrame, but I still get the same error. The error is raised inside the Spark function. What is the solution? Which pandas version should I install in order to create a Spark DataFrame? I also tried changing the Databricks cluster runtime and rerunning, but I still get the same error.

import pandas as pd
spark.createDataFrame(pd.DataFrame({'i':[1,2,3],'j':[1,2,3]}))

Error:
UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  'DataFrame' object has no attribute 'iteritems'
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warn(msg)
AttributeError: 'DataFrame' object has no attribute 'iteritems'

Answer 1

Score: 26

This is related to the Databricks Runtime (DBR) version in use: the Spark versions shipped in DBR 12.2 and below rely on the .iteritems function to construct a Spark DataFrame from a pandas DataFrame. This was fixed in Spark 3.4, which is available as DBR 13.x.

If you can't upgrade to DBR 13.x, you need to downgrade pandas to the latest 1.x version (1.5.3 right now) by running %pip install -U pandas==1.5.3 in your notebook. That said, it's better to use the pandas version shipped with your DBR, since it was tested for compatibility with the other packages in DBR.
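
To make the version rule above concrete, here is a minimal sketch of a check for the affected combination (pandas 2.0 or later together with Spark older than 3.4). The helper names major_minor and is_affected are hypothetical, invented for illustration, and the version strings are passed in by hand so the snippet runs without Spark installed.

```python
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '3.3.2' or '2.0.0rc1'."""
    parts = version.split(".")
    minor_text = parts[1] if len(parts) > 1 else "0"
    # Keep only the leading digits so a suffix such as 'rc1' does not break int().
    digits = ""
    for ch in minor_text:
        if not ch.isdigit():
            break
        digits += ch
    return int(parts[0]), int(digits or "0")


def is_affected(pandas_version: str, spark_version: str) -> bool:
    """True when pandas no longer has iteritems (>= 2.0) but Spark still calls it (< 3.4)."""
    return major_minor(pandas_version) >= (2, 0) and major_minor(spark_version) < (3, 4)


print(is_affected("2.0.0", "3.3.2"))  # pandas 2.0.0 with Spark 3.3.2
print(is_affected("1.5.3", "3.3.2"))  # downgraded pandas
print(is_affected("2.0.0", "3.4.0"))  # upgraded Spark
```

In a notebook, the two arguments could be taken from pd.__version__ and spark.version instead of literals.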

Answer 2

Score: 13

I couldn't change package versions, but it looks like this was a name change only.

So I did

df.iteritems = df.items

and spark.createDataFrame(df) works now.

Sure, it's ugly, and it will break my notebook when I move to a cluster with a new DBR, but it works for now.
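
One way to make this workaround less intrusive is to scope the alias to the conversion call and undo it afterwards. The sketch below is only an illustration of that idea: the context manager name iteritems_alias is hypothetical, and the alias is added only when pandas does not already provide iteritems.

```python
from contextlib import contextmanager

import pandas as pd


@contextmanager
def iteritems_alias():
    """Temporarily expose DataFrame.iteritems as an alias of DataFrame.items."""
    missing = not hasattr(pd.DataFrame, "iteritems")
    if missing:
        pd.DataFrame.iteritems = pd.DataFrame.items
    try:
        yield
    finally:
        # Remove the alias only if this context manager added it.
        if missing:
            del pd.DataFrame.iteritems


pdf = pd.DataFrame({"i": [1, 2, 3], "j": [1, 2, 3]})
with iteritems_alias():
    # In a notebook, this is where spark.createDataFrame(pdf) would go.
    columns = [name for name, _ in pdf.iteritems()]
print(columns)
```

On a runtime where pandas still has iteritems, the context manager changes nothing.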

Answer 3

Score: 2

The Arrow optimization is failing because of the missing 'iteritems' attribute.
You should try disabling the Arrow optimization in your Spark session and creating the DataFrame without it.

Here is how it would work:

import pandas as pd
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("Pandas to Spark DataFrame") \
    .getOrCreate()

# Disable Arrow optimization
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")

# Create a pandas DataFrame
pdf = pd.DataFrame({'i': [1, 2, 3], 'j': [1, 2, 3]})

# Convert pandas DataFrame to Spark DataFrame
sdf = spark.createDataFrame(pdf)

# Show the Spark DataFrame
sdf.show()

This should work. Alternatively, if you want to keep the Arrow optimization, you can downgrade your pandas version, for example with pip install pandas==1.2.5

Answer 4

Score: 2

This issue occurs with pandas versions >= 2.0. In pandas 2.0, the .iteritems function was removed.

There are two solutions for this issue:

  1. Downgrade pandas to a version below 2. For example,

pip install -U pandas==1.5.3

  2. Use the latest Spark version, i.e. 3.4

Answer 5

Score: 1

If you want to keep the pandas version you have, try this:

import pandas as pd
pd.DataFrame.iteritems = pd.DataFrame.items
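
A slightly more defensive variant of the same patch adds the alias only when it is missing, so the same notebook cell stays harmless on a runtime whose pandas still has iteritems. This is a sketch, not a tested Databricks recipe; the final lines only check that the alias iterates over columns and do not call Spark.

```python
import pandas as pd

# pandas 2.0 removed DataFrame.iteritems; add it back as an alias of items
# only if it is absent, so an existing implementation is left untouched.
if not hasattr(pd.DataFrame, "iteritems"):
    pd.DataFrame.iteritems = pd.DataFrame.items

pdf = pd.DataFrame({"i": [1, 2, 3], "j": [1, 2, 3]})
column_names = [name for name, _ in pdf.iteritems()]
print(column_names)  # ['i', 'j']
```

Because the alias is set on the class rather than on one object, it applies to every DataFrame created in the session, including ones built before the patch ran.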

Posted by huangapple on April 4, 2023 at 15:32:51
Original link: https://go.coder-hub.com/75926636.html