Pyspark: How to join two different datasets with different conditions on different columns?


Question

I want to join these two datasets in PySpark, with a different join condition for each column, to obtain a single dataset:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create the Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Create the first dataset, df1
data1 = [("AB2", "AB1", "AB6", "jean"),
         ("AB4", "AB3", "AB7", "shein"),
         ("AB9", "AB5", "AB8", "patrick")]

columns1 = ["rc1", "rc2", "rc3", "resp"]
df1 = spark.createDataFrame(data=data1, schema=columns1)

# Create the second dataset, df2
data2 = [("AB1", "Normal"),
         ("AB4", "Expand"),
         ("AB3", "Small"),
         ("AB6", "Big"),
         ("AB8", "First"),
         ("AB2", "Dock"),
         ("AB7", "Missing"),
         ("AB9", "Package"),
         ("AB5", "Wrong")]

columns2 = ["Key", "description"]
df2 = spark.createDataFrame(data=data2, schema=columns2)

# My attempt: join df2 three times, once per column. This fails with an
# AnalysisException, because after the first join the result already
# contains "Key" and "description", so col("Key") is ambiguous.
final_df = df1.join(df2, col("rc1") == col("Key")).join(df2, col("rc2") == col("Key")).join(df2, col("rc3") == col("Key"))

# Selecting "description" three times is equally ambiguous
final_df.select("rc1", "rc2", "rc3", "resp", "description", "description", "description").show()
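A quick check shows the problem: after the first join, the joined frame already contains df2's columns, so the second join cannot resolve col("Key") uniquely:

print(df1.join(df2, col("rc1") == col("Key")).columns)
# ['rc1', 'rc2', 'rc3', 'resp', 'Key', 'description']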
For clarity, here are the datasets and the expected result:

The first dataset df1 >>>

| rc1 | rc2 | rc3 | resp    |
|-----|-----|-----|---------|
| AB2 | AB1 | AB6 | jean    |
| AB4 | AB3 | AB7 | shein   |
| AB9 | AB5 | AB8 | patrick |

The second dataset df2 >>>

| Key | description |
|-----|-------------|
| AB1 | Normal      |
| AB4 | Expand      |
| AB3 | Small       |
| AB6 | Big         |
| AB8 | First       |
| AB2 | Dock        |
| AB7 | Missing     |
| AB9 | Package     |
| AB5 | Wrong       |

The final dataset I want to obtain by joining df1 & df2 >>>

| rc1     | rc2    | rc3     | resp    |
|---------|--------|---------|---------|
| Dock    | Normal | Big     | jean    |
| Expand  | Small  | Missing | shein   |
| Package | Wrong  | First   | patrick |

Please, can you help me join df1 & df2?

Answer 1

Score: 1


See the implementation below -

# Chain one join per column: each join swaps a code column (rc*) for its
# description, dropping "Key" and the original column right away so the
# next join has no ambiguous references.
df = (
       df1.join(df2, df1.rc1 == df2.Key, 'inner').drop("Key", "rc1")
          .withColumnRenamed('description', 'rc1')
          .join(df2, df1.rc2 == df2.Key, 'inner').drop("Key", "rc2")
          .withColumnRenamed('description', 'rc2')
          .join(df2, df1.rc3 == df2.Key, 'inner').drop("Key", "rc3")
          .withColumnRenamed('description', 'rc3')
          .select("rc1", "rc2", "rc3", "resp")
     )

df.show()

+-------+------+-------+-------+
|    rc1|   rc2|    rc3|   resp|
+-------+------+-------+-------+
|   Dock|Normal|    Big|   jean|
|Package| Wrong|  First|patrick|
| Expand| Small|Missing|  shein|
+-------+------+-------+-------+
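If there were many such columns, the same chained-join idea can be written as a loop. This is a minimal sketch under the question's schema; the `_desc` suffix and the broadcast hint are additions for illustration, not part of the answer above:

from pyspark.sql import functions as F

df = df1
for c in ["rc1", "rc2", "rc3"]:
    # Rename the lookup table's columns so each join resolves unambiguously;
    # broadcast df2 since it is a small lookup table.
    lookup = F.broadcast(
        df2.withColumnRenamed("Key", c).withColumnRenamed("description", f"{c}_desc")
    )
    df = df.join(lookup, on=c, how="inner")

# Surface the descriptions under the original column names
df = df.select(
    F.col("rc1_desc").alias("rc1"),
    F.col("rc2_desc").alias("rc2"),
    F.col("rc3_desc").alias("rc3"),
    "resp",
)
df.show()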

Answer 2

Score: 0


Code

Define a list of columns where you want to substitute values. Create a stack expression and stack the dataframe. Then join it with df2 to substitute the values based on the common Key. Finally, pivot the dataframe to reshape it back.

from pyspark.sql import functions as F

cols = ['rc1', 'rc2', 'rc3']
# Builds "stack(3, 'rc1', rc1, 'rc2', rc2, 'rc3', rc3) as (rc, Key)"
pairs = ', '.join(f"'{c}', {c}" for c in cols)
expr = f"stack({len(cols)}, {pairs}) as (rc, Key)"
result = (
    df1.selectExpr('resp', expr)       # unpivot to long format: (resp, rc, Key)
    .join(df2, on='Key', how='left')   # look up each code's description
    .drop('Key')
    .groupBy('resp')
    .pivot('rc')                       # reshape back to one row per resp
    .agg(F.first('description'))
)

result.show()

Result

+-------+-------+------+-------+
|   resp|    rc1|   rc2|    rc3|
+-------+-------+------+-------+
|   jean|   Dock|Normal|    Big|
|patrick|Package| Wrong|  First|
|  shein| Expand| Small|Missing|
+-------+-------+------+-------+
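For intuition, this is what the intermediate long-format frame produced by the stack expression looks like before the join and pivot (a quick inspection; row order may vary):

df1.selectExpr('resp', expr).show()
# +-------+---+---+
# |   resp| rc|Key|
# +-------+---+---+
# |   jean|rc1|AB2|
# |   jean|rc2|AB1|
# |   jean|rc3|AB6|
# |  shein|rc1|AB4|
# |  shein|rc2|AB3|
# |  shein|rc3|AB7|
# |patrick|rc1|AB9|
# |patrick|rc2|AB5|
# |patrick|rc3|AB8|
# +-------+---+---+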

huangapple
  • Posted on 2023-04-06 22:01:15
  • When republishing, please retain the original link: https://go.coder-hub.com/75950400.html