How to join pandas DataFrames with multiple columns and conditions like PySpark

Question

I have this join in a PySpark script.

    d = d.join(p, [
        d.p_hash == p.hash,
        d.dy >= p.mindy,
        d.dy <= p.maxdy,
    ], "left") \
    .drop(p.hash) \
    .drop(p.mindy) \
    .drop(p.maxdy)

The variables 'd' and 'p' are Spark DataFrames.
Is there any way I could do this in pandas?

Answer 1

Score: 1

Yes, you can simply do the merge and filter the data frame with your condition, then drop the unwanted columns.

    # Left-merge on the hash key, keep only the rows inside the range,
    # then drop the helper columns that came from p
    d = d.merge(p, left_on=['p_hash'], right_on=['hash'], how='left')
    d = d[(d['dy'] >= d['mindy']) & (d['dy'] <= d['maxdy'])]
    d = d.drop(['hash', 'mindy', 'maxdy'], axis=1)

Merge in pandas isn't quite like in PySpark; it doesn't support conditional joins directly. Also note that filtering after a left merge drops the rows of d that had no match or fell outside the range, so the result above behaves like an inner join on the condition; a workaround that keeps those rows is sketched below.
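If you need to keep every row of d, as the PySpark "left" join does, one workaround is to null out the columns brought in from p on the rows where the range condition fails, instead of filtering those rows away. A minimal sketch, assuming each hash in d matches at most one row of p; the sample data and p's extra column 'val' are made up for illustration:

    import pandas as pd

    # Hypothetical sample frames mirroring the question's columns
    d = pd.DataFrame({'p_hash': ['a', 'b', 'c'], 'dy': [5, 20, 99]})
    p = pd.DataFrame({'hash': ['a', 'b'], 'mindy': [1, 30],
                      'maxdy': [10, 40], 'val': ['x', 'y']})

    # Left-merge on the equality key only
    m = d.merge(p, left_on='p_hash', right_on='hash', how='left')

    # Rows satisfying the range condition; comparisons against NaN are
    # False, so rows of d with no hash match count as failures too
    matched = (m['dy'] >= m['mindy']) & (m['dy'] <= m['maxdy'])

    # Null out the columns that came from p rather than dropping the rows,
    # which keeps every row of d
    p_cols = [c for c in p.columns if c not in d.columns]
    for c in p_cols:
        m[c] = m[c].where(matched)

    m = m.drop(['hash', 'mindy', 'maxdy'], axis=1)

One caveat: if a hash in d matches several rows of p that all fail the range check, this leaves duplicate all-null rows for that key, so a drop_duplicates() pass may be needed afterwards.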

You can also review the answers here: How to do/workaround a conditional join in python Pandas?
