Join两个PySpark DataFrames,并在列名相似时从一个DataFrame获取一些列。

huangapple go评论65阅读模式
英文:

Join two PySpark DataFrames and get some of the columns from one DataFrame when column names are similar

问题

我想加入2个PySpark数据框。但是,我想要一个数据框的所有列,以及第二个数据框的一些列。关键是两个数据框中有一个列具有相似的名称。

示例数据框:

# 准备数据
data_1 = [
    (1, "Italy", "Europe"),
    (2, "Italy", "Europe"),
    (3, "Germany", None),
    (4, "Iran", "Asia"),
    (5, "China", "Asia"),
    (6, "China", None),
    (7, "Japan", "Asia"),
    (8, "France", None),
]

# 创建数据框
columns = ["Code", "Country", "Continent"]
df_1 = spark.createDataFrame(data=data_1, schema=columns)
df_1.show(truncate=False)

Join两个PySpark DataFrames,并在列名相似时从一个DataFrame获取一些列。

# 准备数据
data_2 = [
    (1, "Italy", "EUR", 11),
    (2, "Germany", "EUR", 12),
    (3, "China", "CNY", 13),
    (4, "Japan", "JPY", 14),
    (5, "France", "EUR", 15),
    (6, "Taiwan", "TWD", 16),
    (7, "USA", "USD", 17),
    (8, "India", "INR", 18),
]

# 创建数据框
columns = ["Code", "Country", "Currency", "Sales"]
df_2 = spark.createDataFrame(data=data_2, schema=columns)
df_2.show(truncate=False)

Join两个PySpark DataFrames,并在列名相似时从一个DataFrame获取一些列。

我想要第一个数据框的所有列以及第二个数据框的"Currency"列。

当我使用左连接:

output = df_1.join(df_2, ["Country"], "left")
output.show()

现在,在连接操作后有两个名称为"Code"的列。

Join两个PySpark DataFrames,并在列名相似时从一个DataFrame获取一些列。

使用删除列:

output = df_1.join(df_2, ["Country"], "left").drop('Code', 'Sales')
output.show()

两个名为"Code"的列都被删除了。但是,我想保留第一个数据框中的"Code"列。

有没有办法解决这个问题?

另一个问题是如何在连接操作后使"Code"列成为结果数据框中最左边的列。

如果你需要帮助,请告诉我。

英文:

I want to join 2 PySpark DataFrames. But, I want all columns from one DataFrame, and some of columns from the 2nd DataFrame. The point is that there is a column with similar name in both DataFrames.

Sample DataFrames:

# Prepare Data
data_1 = [
    (1, "Italy", "Europe"),
    (2, "Italy", "Europe"),
    (3, "Germany", None),
    (4, "Iran", "Asia"),
    (5, "China", "Asia"),
    (6, "China", None),
    (7, "Japan", "Asia"),
    (8, "France", None),
]

# Create DataFrame
columns = ["Code", "Country", "Continent"]
df_1 = spark.createDataFrame(data=data_1, schema=columns)
df_1.show(truncate=False)

Join两个PySpark DataFrames,并在列名相似时从一个DataFrame获取一些列。

# Prepare Data
data_2 = [
    (1, "Italy", "EUR", 11),
    (2, "Germany", "EUR", 12),
    (3, "China", "CNY", 13),
    (4, "Japan", "JPY", 14),
    (5, "France", "EUR", 15),
    (6, "Taiwan", "TWD", 16),
    (7, "USA", "USD", 17),
    (8, "India", "INR", 18),
]

# Create DataFrame
columns = ["Code", "Country", "Currency", "Sales"]
df_2 = spark.createDataFrame(data=data_2, schema=columns)
df_2.show(truncate=False)

Join两个PySpark DataFrames,并在列名相似时从一个DataFrame获取一些列。

I want all columns of the 1st DataFrame and only column "Currency" from the 2nd DataFrame.
When I use left join:

output = df_1.join(df_2, ["Country"], "left")
output.show()

Now, there are two columns with name "Code" after Join operation.

Join两个PySpark DataFrames,并在列名相似时从一个DataFrame获取一些列。

Using drop columns:

output = df_1.join(df_2, ["Country"], "left").drop('Code', 'Sales')
output.show()

Join两个PySpark DataFrames,并在列名相似时从一个DataFrame获取一些列。

Both columns named "Code" are dropped. But, I want to keep column "Code" from the 1st DataFrame.

Any idea how to solve this issue?

Another question is how to make column "Code" as the left-most column in the resulting DataFrame after Join operation.

答案1

得分: 1

如果你不需要来自df_2的列,你可以在加入之前将它们删除,就像这样:

output = df_1.join(
    df_2.select('Country', 'Currency'),
    ['Country'], 'left'
)

请注意,你也可以通过指定它们来自的数据框来消除具有相同名称的两列的歧义。例如,df_1['Code']。所以在你的情况下,在加入之后,而不是使用drop,你可以使用这种方法来保留只有来自df_1Currency列的列:

output = df_1\
    .join(df_2, ['Country'], 'left')\
    .select([df_1[c] for c in df_1.columns] + ['Currency'])
英文:

If you don't need columns from df_2, you can drop them before the join like this:

output = df_1.join(
    df_2.select('Country', 'Currency'),
    ['Country'], 'left'
)

Note that you can also disambiguate two columns with the same name by specifying the dataframe they come from. e.g. df_1['Code']. So in your case, after the join, instead of using drop, you could use that to keep only the columns from df_1 and the Currency column:

output = df_1\
    .join(df_2, ['Country'], 'left')\
    .select([df_1[c] for c in df_1.columns] + ['Currency'])

huangapple
  • 本文由 发表于 2023年2月6日 15:25:06
  • 转载请务必保留本文链接:https://go.coder-hub.com/75358407.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定