pyspark join with multiple conditions for different rows in table dfs
Question
I want to join two tables in PySpark based on a set of conditions.
I have two dataframes (GR_df and HK_df) for table A and table B.
Three columns are common to both dataframes:
GR_df.columnx = HK_df.columnh,
GR_df.columny = HK_df.columni,
GR_df.columnz = HK_df.columnj
The join should be performed based on the join conditions below:
joincond1 = (GR_df.columnx == HK_df.columnh) & (GR_df.columny == HK_df.columni) & (GR_df.columnz == HK_df.columnj)
joincond2 = (GR_df.columnx == HK_df.columnh) & (GR_df.columny == HK_df.columni)
joincond3 = (GR_df.columnx == HK_df.columnh) & (GR_df.columnz == HK_df.columnj)
joincond4 = GR_df.columnx == HK_df.columnh
If joincond1 returns exactly one matching row for a given set of column values in GR_df and HK_df, write that row to the final dataframe.
If joincond1 returns multiple rows for a given set of column values, write only the distinct records to the final dataframe.
If there is no matching row for a value, fall back to joincond2, then joincond3, and finally joincond4.
If a record satisfies none of the conditions, it is skipped.
I know I may need to iterate over the dataframe rows, but I am not sure how to write this logic in PySpark or Spark SQL. Can anyone help me?
GR_df -
columnx  columny  columnz  Other columns
1        a        aa       -
2        b        bb       -
3        c        cc       -
4        d        dd       -
5        l        uu       -
HK_df -
columnh  columni  columnj  Other columns
1        a        aa       -
2        b        zz       -
3        m        cc       -
4        i        jj       -
final df -
columnh  columni  columnj  columns_GR_df  columns_HK_df
1        a        aa       -              -
2        b        zz       -              -
3        m        cc       -              -
4        i        jj       -              -
Answer 1
Score: 1
I think you're looking for an inner join, with the conditions combined by the OR operator. To drop duplicates, apply distinct() after the join:
output_df = GR_df.join(HK_df, joincond1 | joincond2 | joincond3, "inner").distinct()
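
For completeness, a minimal runnable sketch of this approach, assuming the column names and sample rows from the question (the sample data and condition definitions here are my own reconstruction, not part of the original answer):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data reconstructed from the question's tables.
GR_df = spark.createDataFrame(
    [(1, "a", "aa"), (2, "b", "bb"), (3, "c", "cc"), (4, "d", "dd"), (5, "l", "uu")],
    ["columnx", "columny", "columnz"],
)
HK_df = spark.createDataFrame(
    [(1, "a", "aa"), (2, "b", "zz"), (3, "m", "cc"), (4, "i", "jj")],
    ["columnh", "columni", "columnj"],
)

# Parentheses are required around each comparison: & binds tighter than ==.
joincond1 = (GR_df.columnx == HK_df.columnh) & (GR_df.columny == HK_df.columni) & (GR_df.columnz == HK_df.columnj)
joincond2 = (GR_df.columnx == HK_df.columnh) & (GR_df.columny == HK_df.columni)
joincond3 = (GR_df.columnx == HK_df.columnh) & (GR_df.columnz == HK_df.columnj)

output_df = GR_df.join(HK_df, joincond1 | joincond2 | joincond3, "inner").distinct()
output_df.show()
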
Depending on what you want to happen when both cond1 and cond2 are satisfied (exclusive or?), your join condition may become:
(joincond1 & ~joincond2 & ~joincond3) | (~joincond1 & joincond2 & ~joincond3) | (~joincond1 & ~joincond2 & joincond3)
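
Note that a single OR-combined join still does not capture the tiered fallback the question describes (try joincond1 first, and only fall back to joincond2 for rows that found no match, and so on down to joincond4). Below is a sketch of one way to implement that fall-through, using a left_anti join to carry the still-unmatched GR_df rows to the next tier; the tiers list and variable names are my own, not from the original answer:

from pyspark.sql import functions as F

# The four tiers, strictest first. Column names are unique across the two
# dataframes, so F.col() resolves unambiguously inside the join condition.
tiers = [
    (F.col("columnx") == F.col("columnh")) & (F.col("columny") == F.col("columni")) & (F.col("columnz") == F.col("columnj")),
    (F.col("columnx") == F.col("columnh")) & (F.col("columny") == F.col("columni")),
    (F.col("columnx") == F.col("columnh")) & (F.col("columnz") == F.col("columnj")),
    F.col("columnx") == F.col("columnh"),
]

remaining = GR_df      # GR_df rows not yet matched by any tier
parts = []

for cond in tiers:
    # Distinct matches at this tier go to the result.
    parts.append(remaining.join(HK_df, cond, "inner").distinct())
    # left_anti keeps only the rows with no partner at this tier,
    # so they fall through to the next, looser condition.
    remaining = remaining.join(HK_df, cond, "left_anti")

final_df = parts[0]
for part in parts[1:]:
    final_df = final_df.unionByName(part)

GR_df rows that match no tier never enter final_df, which matches the requirement that unmatched records be skipped.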