PySpark: How to join two different datasets with different conditions on different columns?


Question

I want to join these two datasets in PySpark, using different conditions on different columns, to obtain a single dataset:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create the Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Create the first dataset, df1
data1 = [("AB2", "AB1", "AB6", "jean"),
        ("AB4", "AB3", "AB7", "shein"),
        ("AB9", "AB5", "AB8", "patrick")]

columns1 = ["rc1", "rc2", "rc3", "resp"]
df1 = spark.createDataFrame(data=data1, schema=columns1)

# Create the second dataset, df2
data2 = [("AB1", "Normal"),
        ("AB4", "Expand"),
        ("AB3", "Small"),
        ("AB6", "Big"),
        ("AB8", "First"),
        ("AB2", "Dock"),
        ("AB7", "Missing"),
        ("AB9", "Package"),
        ("AB5", "Wrong")]

columns2 = ["Key", "description"]
df2 = spark.createDataFrame(data=data2, schema=columns2)

# My attempt: join the two datasets (this fails, see the note below)
final_df = df1.join(df2, col("rc1") == col("Key")).join(df2, col("rc2") == col("Key")).join(df2, col("rc3") == col("Key"))

# Select the needed columns and show the result
final_df.select("rc1", "rc2", "rc3", "resp", "description", "description", "description").show()
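
As written, the attempt above raises an ambiguous-reference error: df2 appears three times in the join, so col("Key") in the later join conditions (and the three "description" columns in the select) cannot be resolved. One common way to disambiguate is to alias each copy of df2; a minimal sketch (the d1/d2/d3 aliases are illustrative, not from the original question):

from pyspark.sql.functions import col

# Give each copy of df2 its own alias so the three joins can be
# told apart when referencing Key and description.
d1, d2, d3 = df2.alias("d1"), df2.alias("d2"), df2.alias("d3")

final_df = (df1
    .join(d1, col("rc1") == col("d1.Key"))
    .join(d2, col("rc2") == col("d2.Key"))
    .join(d3, col("rc3") == col("d3.Key"))
    .select(col("d1.description").alias("rc1"),
            col("d2.description").alias("rc2"),
            col("d3.description").alias("rc3"),
            "resp"))
final_df.show()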

The first dataset df1 >>>

| rc1 | rc2 | rc3 | resp    |
|-----|-----|-----|---------|
| AB2 | AB1 | AB6 | jean    |
| AB4 | AB3 | AB7 | shein   |
| AB9 | AB5 | AB8 | patrick |

The second dataset df2 >>>

| Key | description |
|-----|-------------|
| AB1 | Normal      |
| AB4 | Expand      |
| AB3 | Small       |
| AB6 | Big         |
| AB8 | First       |
| AB2 | Dock        |
| AB7 | Missing     |
| AB9 | Package     |
| AB5 | Wrong       |

The final dataset I want to obtain by joining df1 & df2 >>>

| rc1     | rc2    | rc3     | resp    |
|---------|--------|---------|---------|
| Dock    | Normal | Big     | jean    |
| Expand  | Small  | Missing | shein   |
| Package | Wrong  | First   | patrick |

Please, can you help me join df1 & df2?

Answer 1

Score: 1


See the below implementation -

df = (
       # Each join looks up one code column: match it against df2.Key,
       # drop the matched Key and the original code column, then rename
       # the looked-up description to take its place.
       df1.join(df2, df1.rc1 == df2.Key, 'inner').drop("Key","rc1")
          .withColumnRenamed('description', 'rc1')
          .join(df2, df1.rc2 == df2.Key, 'inner').drop("Key","rc2")
          .withColumnRenamed('description', 'rc2')
          .join(df2, df1.rc3 == df2.Key, 'inner').drop("Key","rc3")
          .withColumnRenamed('description', 'rc3')
          .select("rc1","rc2","rc3","resp")
     )

df.show()

+-------+------+-------+-------+
|    rc1|   rc2|    rc3|   resp|
+-------+------+-------+-------+
|   Dock|Normal|    Big|   jean|
|Package| Wrong|  First|patrick|
| Expand| Small|Missing|  shein|
+-------+------+-------+-------+
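
If there were more code columns, the same join-drop-rename step generalizes to a loop; a minimal sketch assuming the df1 and df2 from the question (the substitute helper name is illustrative):

from functools import reduce

def substitute(df, c):
    # Replace one code column: join it against df2.Key, drop the matched
    # Key and the code column, then rename the looked-up description to
    # take the code column's name.
    return (df.join(df2, df[c] == df2.Key, 'inner')
              .drop("Key", c)
              .withColumnRenamed('description', c))

cols = ["rc1", "rc2", "rc3"]
df = reduce(substitute, cols, df1).select(*cols, "resp")
df.show()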

Answer 2

Score: 0


Code

Define a list of columns where you want to substitute values. Create a stack expression and stack the dataframe. Then join it with df2 to substitute the values based on the common Key. Finally, pivot the dataframe to reshape it back.

from pyspark.sql import functions as F

cols = ['rc1', 'rc2', 'rc3']
# Builds: stack(3, 'rc1', rc1, 'rc2', rc2, 'rc3', rc3) as (rc, Key)
expr = f"stack({len(cols)}, %s) as (rc, Key)" % ', '.join(f"'{c}', {c}" for c in cols)
result = (
    df1.selectExpr('resp', expr)      # long format: one row per (resp, rc, Key)
    .join(df2, on='Key', how='left')  # look up each Key's description
    .drop('Key')
    .groupBy('resp')                  # pivot back: one row per resp,
    .pivot('rc')                      # one column per original rc column
    .agg(F.first('description'))
)

Result

+-------+-------+------+-------+
|   resp|    rc1|   rc2|    rc3|
+-------+-------+------+-------+
|   jean|   Dock|Normal|    Big|
|patrick|Package| Wrong|  First|
|  shein| Expand| Small|Missing|
+-------+-------+------+-------+
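
To see what the stack step produces, the long-format intermediate can be shown before the join and pivot; assuming the same df1 and expr as above, it looks like this (row order may vary):

df1.selectExpr('resp', expr).show()

+-------+---+---+
|   resp| rc|Key|
+-------+---+---+
|   jean|rc1|AB2|
|   jean|rc2|AB1|
|   jean|rc3|AB6|
|  shein|rc1|AB4|
|  shein|rc2|AB3|
|  shein|rc3|AB7|
|patrick|rc1|AB9|
|patrick|rc2|AB5|
|patrick|rc3|AB8|
+-------+---+---+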
