February 6, 2023, 15:25:06

Join two PySpark DataFrames and get some of the columns from one DataFrame when column names are similar

Question

I want to join two PySpark DataFrames, keeping all columns from the first DataFrame and only some columns from the second. The complication is that both DataFrames contain a column with the same name.

Sample DataFrames:

# Assumes an active SparkSession named `spark`; created here so the snippet runs on its own
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Prepare Data
data_1 = [
    (1, "Italy", "Europe"),
    (2, "Italy", "Europe"),
    (3, "Germany", None),
    (4, "Iran", "Asia"),
    (5, "China", "Asia"),
    (6, "China", None),
    (7, "Japan", "Asia"),
    (8, "France", None),
]

# Create DataFrame
columns = ["Code", "Country", "Continent"]
df_1 = spark.createDataFrame(data=data_1, schema=columns)
df_1.show(truncate=False)

# Prepare Data
data_2 = [
    (1, "Italy", "EUR", 11),
    (2, "Germany", "EUR", 12),
    (3, "China", "CNY", 13),
    (4, "Japan", "JPY", 14),
    (5, "France", "EUR", 15),
    (6, "Taiwan", "TWD", 16),
    (7, "USA", "USD", 17),
    (8, "India", "INR", 18),
]

# Create DataFrame
columns = ["Code", "Country", "Currency", "Sales"]
df_2 = spark.createDataFrame(data=data_2, schema=columns)
df_2.show(truncate=False)

I want all columns from the first DataFrame and only the "Currency" column from the second.

When I use a left join:

output = df_1.join(df_2, ["Country"], "left")
output.show()

After the join, the result contains two columns named "Code".

If I then drop columns:

output = df_1.join(df_2, ["Country"], "left").drop('Code', 'Sales')
output.show()

Both columns named "Code" are dropped, but I want to keep the "Code" column from the first DataFrame.

Any idea how to solve this?

A second question: how can I make "Code" the left-most column of the resulting DataFrame after the join?

Answer 1

Score: 1

If you don't need the other columns from df_2, you can select only the ones you need before the join, like this:

output = df_1.join(
    df_2.select('Country', 'Currency'),
    ['Country'], 'left'
)

Note that you can also disambiguate two columns with the same name by specifying the DataFrame they come from, e.g. df_1['Code']. So in your case, instead of using drop after the join, you could use that to keep only the columns from df_1 plus the Currency column. Since the select lists df_1's columns in their original order, "Code" also ends up as the left-most column:

output = df_1\
    .join(df_2, ['Country'], 'left')\
    .select([df_1[c] for c in df_1.columns] + ['Currency'])
