Join two PySpark DataFrames and get some of the columns from one DataFrame when column names are similar
Question
I want to join two PySpark DataFrames. However, I want all columns from the first DataFrame and only some columns from the second. The catch is that there is a column with the same name in both DataFrames.
Sample DataFrames:
# Prepare Data
data_1 = [
    (1, "Italy", "Europe"),
    (2, "Italy", "Europe"),
    (3, "Germany", None),
    (4, "Iran", "Asia"),
    (5, "China", "Asia"),
    (6, "China", None),
    (7, "Japan", "Asia"),
    (8, "France", None),
]
# Create DataFrame
columns = ["Code", "Country", "Continent"]
df_1 = spark.createDataFrame(data=data_1, schema=columns)
df_1.show(truncate=False)
# Prepare Data
data_2 = [
    (1, "Italy", "EUR", 11),
    (2, "Germany", "EUR", 12),
    (3, "China", "CNY", 13),
    (4, "Japan", "JPY", 14),
    (5, "France", "EUR", 15),
    (6, "Taiwan", "TWD", 16),
    (7, "USA", "USD", 17),
    (8, "India", "INR", 18),
]
# Create DataFrame
columns = ["Code", "Country", "Currency", "Sales"]
df_2 = spark.createDataFrame(data=data_2, schema=columns)
df_2.show(truncate=False)
I want all columns of the 1st DataFrame and only the "Currency" column from the 2nd.
When I use a left join:
output = df_1.join(df_2, ["Country"], "left")
output.show()
Now there are two columns named "Code" after the join operation.
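For reference, the joined output puts the join key first, then the remaining columns of each side, so "Code" shows up twice. Roughly (a sketch of the layout, given the sample data above):
+-------+----+---------+----+--------+-----+
|Country|Code|Continent|Code|Currency|Sales|
+-------+----+---------+----+--------+-----+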
Using drop to remove columns:
output = df_1.join(df_2, ["Country"], "left").drop('Code', 'Sales')
output.show()
Both columns named "Code" are dropped, but I want to keep the "Code" column from the 1st DataFrame.
Any idea how to solve this issue?
Another question is how to make the "Code" column the left-most column in the resulting DataFrame after the join operation.
Answer 1
Score: 1
If you don't need columns from df_2, you can drop them before the join like this:
output = df_1.join(
    df_2.select('Country', 'Currency'),
    ['Country'], 'left'
)
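Note that with this version the join key, Country, ends up as the left-most column. To get "Code" left-most (the second question), you can re-select the columns in the order you want; a minimal sketch, assuming the column names above:
# Join, then re-project the columns in the desired order
output = df_1.join(
    df_2.select('Country', 'Currency'),
    ['Country'], 'left'
).select('Code', 'Country', 'Continent', 'Currency')
output.show()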
Note that you can also disambiguate two columns with the same name by specifying the dataframe they come from, e.g. df_1['Code']. So in your case, after the join, instead of using drop, you could use that to keep only the columns from df_1 plus the Currency column:
output = df_1 \
    .join(df_2, ['Country'], 'left') \
    .select([df_1[c] for c in df_1.columns] + ['Currency'])
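Since df_1.columns lists "Code" first, this select also makes "Code" the left-most column in the result (Code, Country, Continent, Currency), which takes care of the second question as well.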