In a Spark dataframe, add columns from one df to another without creating combinations of matching rows
Question
My data frames are like the following:
df1 = spark.createDataFrame([(1, "a"), (1, "b"), (1, "c")], ("col1", "col2"))
+----+----+
|col1|col2|
+----+----+
| 1| a|
| 1| b|
| 1| c|
+----+----+
df2 = spark.createDataFrame([(1, "k1"), (1, "k2"), (1, "k3"),(1,"k4")], ("col1", "col3"))
+----+----+
|col1|col3|
+----+----+
| 1| k1|
| 1| k2|
| 1| k3|
| 1| k4|
+----+----+
I want to generate
df3 = spark.createDataFrame([(1, "a", "k1"), (1, "b", "k2"), (1, "c", "k3"),(1, None, "k4")], ("col1", "col2", "col3"))
i.e., desired output:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| a| k1|
| 1| b| k2|
| 1| c| k3|
| 1|null| k4|
+----+----+----+
I tried df1.join(df2, on='col1', how="leftouter")
and got:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| a| k4|
| 1| a| k3|
| 1| a| k2|
| 1| a| k1|
| 1| b| k4|
| 1| b| k3|
| 1| b| k2|
| 1| b| k1|
| 1| c| k4|
| 1| c| k3|
| 1| c| k2|
| 1| c| k1|
+----+----+----+
I looked into https://stackoverflow.com/questions/41049490/merge-rows-from-one-dataframe-that-do-not-match-specific-columns-in-another-data . That is very nearly what I want, but it uses pandas dataframes. I'm not sure that switching from a Spark dataframe to a pandas dataframe just for this one operation is a good idea. Is there a native PySpark way of doing this?
I need help with the transformation that gives the desired output.
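For anyone reproducing this: the snippets above assume an active SparkSession bound to the name spark. A minimal, assumed local setup (the application name below is arbitrary) could look like this:
from pyspark.sql import SparkSession

# Assumed local setup; the application name is arbitrary
spark = SparkSession.builder.appName("pair-rows-by-position").getOrCreate()

# Recreate the two input dataframes shown in the question
df1 = spark.createDataFrame([(1, "a"), (1, "b"), (1, "c")], ["col1", "col2"])
df2 = spark.createDataFrame([(1, "k1"), (1, "k2"), (1, "k3"), (1, "k4")], ["col1", "col3"])
df1.show()
df2.show()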
Answer 1
Score: 0
Joining on col1 alone pairs every row of df1 with every row of df2 within each key, which is why you got 3 × 4 = 12 rows. Instead, you can use a window to create a rank column that captures the order of the rows within each of your two dataframes, join the two dataframes on your first column together with this rank column, and finally drop the rank column, as follows:
from pyspark.sql import Window
from pyspark.sql import functions as F

# Number the rows within each col1 group, ordered by the value column
window1 = Window.partitionBy("col1").orderBy("col2")
window2 = Window.partitionBy("col1").orderBy("col3")
ranked_df1 = df1.withColumn("rank", F.row_number().over(window1))
ranked_df2 = df2.withColumn("rank", F.row_number().over(window2))

# Pair rows that share both col1 and rank; a full outer join keeps unmatched rows
result_df = ranked_df1.join(
    ranked_df2,
    on=['col1', 'rank'],
    how='full_outer'
).drop('rank')
With the df1 and df2 dataframes defined as in your question, you get the following result_df dataframe:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|1 |a |k1 |
|1 |b |k2 |
|1 |c |k3 |
|1 |null|k4 |
+----+----+----+
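As a quick sanity check (assuming the df3 dataframe from the question is also defined in the session), you can confirm that result_df contains exactly the desired rows using DataFrame.exceptAll (available since Spark 2.4):
# Both differences must be empty for the two dataframes to match row for row
assert result_df.exceptAll(df3).count() == 0
assert df3.exceptAll(result_df).count() == 0
Note that row_number pairs rows by the alphabetical order of col2 and col3 within each col1 group, which happens to give the desired pairing here; if the rows should be paired by some other order, adjust the orderBy columns of the two windows accordingly.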