pyspark join with multiple conditions for different rows in table dfs
Question
I want to join two tables in PySpark based on a set of conditions.
I have two dataframes (GR_df and HK_df) for table A and table B.
Three columns are common to both dataframes:
GR_df.columnx = HK_df.columnh,
GR_df.columny = HK_df.columni,
GR_df.columnz = HK_df.columnj
The join should be performed based on the join conditions below:
joincond1 = (GR_df.columnx == HK_df.columnh) & (GR_df.columny == HK_df.columni) & (GR_df.columnz == HK_df.columnj)
joincond2 = (GR_df.columnx == HK_df.columnh) & (GR_df.columny == HK_df.columni)
joincond3 = (GR_df.columnx == HK_df.columnh) & (GR_df.columnz == HK_df.columnj)
joincond4 = GR_df.columnx == HK_df.columnh
If joincond1 returns exactly one matching row for a given set of column values in GR_df and HK_df, write that row to the final dataframe.
If joincond1 returns multiple rows for a given set of column values, write only the distinct records to the final dataframe.
If there is no matching row for a value, fall back to joincond2, then joincond3, and finally joincond4.
If a record satisfies none of the conditions, it is skipped.
I know I may need to iterate over the dataframe rows, but I am not sure how to write this logic in PySpark or Spark SQL. Can anyone help me?
GR_df -
columnx  columny  columnz  Other columns
1        a        aa       -
2        b        bb       -
3        c        cc       -
4        d        dd       -
5        l        uu       -
HK_df -
columnh  columni  columnj  Other columns
1        a        aa       -
2        b        zz       -
3        m        cc       -
4        i        jj       -
final df -
columnh  columni  columnj  columns_GR_df  columns_HK_df
1        a        aa       -              -
2        b        zz       -              -
3        m        cc       -              -
4        i        jj       -              -
Answer 1
Score: 1
I think you're looking for an inner join, with the conditions combined by the OR operator. To drop duplicates, apply distinct() after the join:
output_df = GR_df.join(HK_df, joincond1 | joincond2 | joincond3, "inner").distinct()
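
For completeness, a minimal runnable sketch of this approach, assuming the column names and sample rows from the question (the sample data and condition definitions here are my own reconstruction, not part of the original answer):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data reconstructed from the question's tables.
GR_df = spark.createDataFrame(
    [(1, "a", "aa"), (2, "b", "bb"), (3, "c", "cc"), (4, "d", "dd"), (5, "l", "uu")],
    ["columnx", "columny", "columnz"],
)
HK_df = spark.createDataFrame(
    [(1, "a", "aa"), (2, "b", "zz"), (3, "m", "cc"), (4, "i", "jj")],
    ["columnh", "columni", "columnj"],
)

# Parentheses are required around each comparison: & binds tighter than ==.
joincond1 = (GR_df.columnx == HK_df.columnh) & (GR_df.columny == HK_df.columni) & (GR_df.columnz == HK_df.columnj)
joincond2 = (GR_df.columnx == HK_df.columnh) & (GR_df.columny == HK_df.columni)
joincond3 = (GR_df.columnx == HK_df.columnh) & (GR_df.columnz == HK_df.columnj)

output_df = GR_df.join(HK_df, joincond1 | joincond2 | joincond3, "inner").distinct()
output_df.show()
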
Depending on what you want to happen when both cond1 and cond2 are satisfied (exclusive or?), your join condition may become:
(joincond1 & ~joincond2 & ~joincond3) | (~joincond1 & joincond2 & ~joincond3) | (~joincond1 & ~joincond2 & joincond3)
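
Note that a single OR-combined join still does not capture the tiered fallback the question describes (try joincond1 first, and only fall back to joincond2 for rows that found no match, and so on down to joincond4). Below is a sketch of one way to implement that fall-through, using a left_anti join to carry the still-unmatched GR_df rows to the next tier; the tiers list and variable names are my own, not from the original answer:

from pyspark.sql import functions as F

# The four tiers, strictest first. Column names are unique across the two
# dataframes, so F.col() resolves unambiguously inside the join condition.
tiers = [
    (F.col("columnx") == F.col("columnh")) & (F.col("columny") == F.col("columni")) & (F.col("columnz") == F.col("columnj")),
    (F.col("columnx") == F.col("columnh")) & (F.col("columny") == F.col("columni")),
    (F.col("columnx") == F.col("columnh")) & (F.col("columnz") == F.col("columnj")),
    F.col("columnx") == F.col("columnh"),
]

remaining = GR_df      # GR_df rows not yet matched by any tier
parts = []

for cond in tiers:
    # Distinct matches at this tier go to the result.
    parts.append(remaining.join(HK_df, cond, "inner").distinct())
    # left_anti keeps only the rows with no partner at this tier,
    # so they fall through to the next, looser condition.
    remaining = remaining.join(HK_df, cond, "left_anti")

final_df = parts[0]
for part in parts[1:]:
    final_df = final_df.unionByName(part)

GR_df rows that match no tier never enter final_df, which matches the requirement that unmatched records be skipped.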