Pyspark : How to join two different datasets with different conditions on different columns?

Question

I want to join these two datasets in Pyspark, obtaining one dataset based on different conditions on different columns:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# create the Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# create the first dataset, df1
data1 = [("AB2", "AB1", "AB6", "jean"),
         ("AB4", "AB3", "AB7", "shein"),
         ("AB9", "AB5", "AB8", "patrick")]
columns1 = ["rc1", "rc2", "rc3", "resp"]
df1 = spark.createDataFrame(data=data1, schema=columns1)

# create the second dataset, df2
data2 = [("AB1", "Normal"),
         ("AB4", "Expand"),
         ("AB3", "Small"),
         ("AB6", "Big"),
         ("AB8", "First"),
         ("AB2", "Dock"),
         ("AB7", "Missing"),
         ("AB9", "Package"),
         ("AB5", "Wrong")]
columns2 = ["Key", "description"]
df2 = spark.createDataFrame(data=data2, schema=columns2)

# my attempt: join the two datasets
final_df = df1.join(df2, col("rc1") == col("Key")).join(df2, col("rc2") == col("Key")).join(df2, col("rc3") == col("Key"))

# select the needed columns and show the result
# (this fails: joining df2 three times makes "Key" and "description" ambiguous)
final_df.select("rc1", "rc2", "rc3", "resp", "description", "description", "description").show()
The first dataset df1 >>>
| rc1 | rc2 | rc3 | resp|
|-----|-----|-----|-----|
| AB2 | AB1 | AB6 | jean|
| AB4 | AB3 | AB7 |shein|
| AB9 | AB5 | AB8 |patrick|
The second dataset df2 >>>
| Key | description |
| --- | ------------|
| AB1 | Normal |
| AB4 | Expand |
| AB3 | Small       |
| AB6 | Big |
| AB8 | First |
| AB2 | Dock |
| AB7 | Missing |
| AB9 | Package |
| AB5 | Wrong |
The final dataset, from joining df1 & df2, should be >>>
| rc1 | rc2 | rc3 | resp |
| --------| ------ | ------- | ------- |
| Dock | Normal | Big | jean |
| Expand | Small | Missing | shein |
| Package | Wrong | First | Patrick |
Please, can you help me join df1 & df2?
Answer 1

Score: 1
See the implementation below -
df = (
    # replace each rc column with its description, one join at a time:
    # join on the key, drop the old column, then rename description to it
    df1.join(df2, df1.rc1 == df2.Key, 'inner').drop("Key", "rc1")
       .withColumnRenamed('description', 'rc1')
       .join(df2, df1.rc2 == df2.Key, 'inner').drop("Key", "rc2")
       .withColumnRenamed('description', 'rc2')
       .join(df2, df1.rc3 == df2.Key, 'inner').drop("Key", "rc3")
       .withColumnRenamed('description', 'rc3')
       .select("rc1", "rc2", "rc3", "resp")
)
df.show()
+-------+------+-------+-------+
| rc1| rc2| rc3| resp|
+-------+------+-------+-------+
| Dock|Normal| Big| jean|
|Package| Wrong| First|patrick|
| Expand| Small|Missing| shein|
+-------+------+-------+-------+
Answer 2

Score: 0
Code

Define a list of the columns whose values you want to substitute. Create a stack expression and stack the dataframe. Then join it with df2 to substitute the values based on the common Key. Finally, pivot the dataframe to reshape it back.
from pyspark.sql import functions as F

cols = ['rc1', 'rc2', 'rc3']

# stack(3, 'rc1', rc1, 'rc2', rc2, 'rc3', rc3) as (rc, Key)
# unpivots the rc columns into one (rc, Key) row each
expr = f"stack({len(cols)}, %s) as (rc, Key)" % ', '.join(f"'{c}', {c}" for c in cols)

result = (
    df1.selectExpr('resp', expr)
    .join(df2, on='Key', how='left')  # substitute each key by its description
    .drop('Key')
    .groupBy('resp')
    .pivot('rc')                      # reshape back to one row per resp
    .agg(F.first('description'))
)
Result
+-------+-------+------+-------+
| resp| rc1| rc2| rc3|
+-------+-------+------+-------+
| jean| Dock|Normal| Big|
|patrick|Package| Wrong| First|
|  shein| Expand| Small|Missing|
+-------+-------+------+-------+