pyspark中使用多个条件连接不同行的表dfs:

huangapple go评论78阅读模式
英文:

pyspark join with multiple conditions for different rows in in table dfs

问题

我要加入两个表格,根据pyspark中的一些条件

我有两个数据帧(GR_df和HK_df)分别用于表A和表B
这两个数据帧中有三列是共同的。

GR_df.columnx = HK_df.columnh,

GR_df.columny = HK_df.columni,

GR_df.columnz = HK_df.columnj

现在,根据下面提到的连接条件执行连接:

joincond1 = (GR_df.columnx = HK_df.columnh & GR_df.columny = HK_df.columni & GR_df.columnz = HK_df.columnj),

joincond2 = (GR_df.columnx = HK_df.columnh & GR_df.columny = HK_df.columni),

joincond3 = (GR_df.columnx = HK_df.columnh & GR_df.columnz = HK_df.columnj),

joincond4 = (GR_df.columnx = HK_df.columnh),

如果每个列值的GR_df和HK_df中只有一个匹配行符合joincond1,则将其写入最终数据帧

如果每个列值的GR_df和HK_df中由joincond1返回多行,则将不同的记录写入最终数据帧

如果GR_df和Hk_df中的值没有匹配行,则需要将条件更改为joincond2等,直到条件4

如果一条记录不满足任何条件,将跳过该记录。

我知道我需要遍历数据帧行,但我不确定如何在pyspark或spark-sql中编写此问题的逻辑,有人可以帮助我。

GR_df-

columnx	columny	columnz	其他列

 1	      a	      aa	  -
 2	      b	      bb	  -
 3	      c	      cc	  -
 4	      d	      dd	  -
 5	      l	      uu	  -

HK_df -

columnh	 columni columnj 其他列

 1	       a	   aa	      -
 2	       b	   zz	      -
 3	       m	   cc	      -
 4	       i	   jj	      -

最终数据帧 -

columnh  columni  columnj  columns_GR_df  columns_HK_df

   1	   a	    aa	       -	          -
   2	   b	    zz	       -	          -
   3	   m	    cc	       -	          -
   4	   i	    jj	       -	          -
英文:

I want to join two tables based on some conditions in pyspark

I have 2 dataframe (GR_df and HK_df) for table A and table B
there are three columns which are common in both the dataframes.

GR_df.columnx = HK_df.columnh,

GR_df.columny = HK_df.columni,

GR_df.columnz = HK_df.columnj

now, join will be perform based on the join conditions mentioned below:

joincond1 = (GR_df.columnx = HK_df.columnh & GR_df.columny = HK_df.columni & GR_df.columnz = HK_df.columnj),

joincond2 = (GR_df.columnx = HK_df.columnh & GR_df.columny = HK_df.columni),

joincond3 = (GR_df.columnx = HK_df.columnh & GR_df.columnz = HK_df.columnj),

joincond4 = (GR_df.columnx = HK_df.columnh),

If there is only one matching row for joincond1 for each value of columns in GR_df and HK_df
then write that to final df

If there is multiple rows return by joincond1 for each value of columns in GR_df and HK_df
then write distinct records to final df

If there is no matching row for a value in GR_df and Hk_df need to change the condition to joincond2 and so on till condition 4

If a record doesn't satisfy any condition that record will be skipped.

I know I need to iterate over dataframe rows but I am not sure How I can write a logic in pyspark or spark-sql for this problem can anyone help me.

GR_df-

columnx	columny	columnz	Other columns

 1	      a	      aa	  -
 2	      b	      bb	  -
 3	      c	      cc	  -
 4	      d	      dd	  -
 5	      l	      uu	  -

HK_df -

columnh	 columni columnj Other columns

 1	       a	   aa	      -
 2	       b	   zz	      -
 3	       m	   cc	      -
 4	       i	   jj	      -

final df -

columnh  columni  columnj  columns_GR_df  columns_HK_df

   1	   a	    aa	       -	          -
   2	   b	    zz	       -	          -
   3	   m	    cc	       -	          -
   4	   i	    jj	       -	          -

答案1

得分: 1

我认为你正在寻找一个内连接,条件由"or"运算符组合。要在连接后删除重复项,我们可以在连接之后使用distinct():

output_df = GR_df.join(HK_df, joincond1 | joincond2 | joincond3, "inner").distinct()

根据当cond1和cond2都满足时你想要发生什么(异或?),你的连接条件可能会变成:

(joincond1 & ~joincond2 & ~joincond3) | (~joincond1 & joincond2 & ~joincond3) | (~joincond1 & ~joincond2 & joincond3)
英文:

I think you're looking for an inner join, with the conditions combined by the or operator. To drop the duplicates after, we use distinct() after the join:

output_df=GR_df.join(HK_df, joincond1 | joincond2 | joincond 3, "inner").distinct()

Depending on what you want to happen when both cond1 and cond2 are satisfied (exclusive or?), your join condition may become:

(joincond1 & ~joincond2 & ~joincond3) | (~joincond1 & joincond2 & ~joincond3) | (~joincond1 & ~joincond2 & joincond3)

huangapple
  • 本文由 发表于 2023年6月12日 16:32:57
  • 转载请务必保留本文链接:https://go.coder-hub.com/76454845.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定