How to join pandas dataframe with multiple columns and conditions like pyspark
Question
I have this join in a PySpark script:
d = d.join(p, [
    d.p_hash == p.hash,
    d.dy >= p.mindy,
    d.dy <= p.maxdy,
], "left") \
    .drop(p.hash) \
    .drop(p.mindy) \
    .drop(p.maxdy)
The variables 'd' and 'p' are Spark DataFrames.
Is there any way I could do this in pandas?
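For reference, here is a self-contained sketch of the same join on toy data (the column names come from the snippet above; the example values are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Toy frames mirroring the question's column names; the values are assumptions.
d = spark.createDataFrame([("a", 5), ("b", 20), ("c", 7)], ["p_hash", "dy"])
p = spark.createDataFrame([("a", 1, 10), ("b", 1, 10)], ["hash", "mindy", "maxdy"])

# Left join on an equality plus a range condition, then drop p's key columns.
out = (
    d.join(p, [d.p_hash == p.hash, d.dy >= p.mindy, d.dy <= p.maxdy], "left")
     .drop(p.hash)
     .drop(p.mindy)
     .drop(p.maxdy)
)
out.show()
# Row "b" is kept but gets no match from p, since dy=20 falls outside [1, 10].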
Answer 1
Score: 1
Yes, you can simply do the merge, filter the DataFrame with your conditions, and then drop the unwanted columns:
# Left-merge on the hash columns; the merge alone cannot express the range condition.
d = d.merge(p, left_on=['p_hash'], right_on=['hash'], how='left')
# Apply the range condition as a post-merge filter.
d = d[(d['dy'] >= d['mindy']) & (d['dy'] <= d['maxdy'])]
# Drop the columns that came from p.
d = d.drop(['hash', 'mindy', 'maxdy'], axis=1)
merge in pandas isn't quite like join in PySpark; it doesn't support conditional joins directly.
You can also review the answers here: How to do/workaround a conditional join in python Pandas?
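One caveat: the post-merge filter evaluates to False on the NaN values a left merge produces, so unmatched rows of d are dropped and the result behaves like an inner join. If you need true left-join semantics, one workaround (a minimal sketch, assuming d's index uniquely identifies its rows; the toy data is made up) is to re-attach the rows that found no valid match:

import pandas as pd

# Toy frames mirroring the question's column names; the values are assumptions.
d = pd.DataFrame({"p_hash": ["a", "b", "c"], "dy": [5, 20, 7]})
p = pd.DataFrame({"hash": ["a", "b"], "mindy": [1, 1], "maxdy": [10, 10]})

# An inner merge plus the range filter yields the rows with a valid match.
m = d.reset_index().merge(p, left_on="p_hash", right_on="hash")
m = m[(m["dy"] >= m["mindy"]) & (m["dy"] <= m["maxdy"])]

# Re-attach d rows with no valid match, as a Spark left join would keep them.
no_match = d.reset_index()[~d.index.isin(m["index"])]
out = (
    pd.concat([m, no_match], ignore_index=True)
      .drop(columns=["index", "hash", "mindy", "maxdy"])
)
# Row "b" survives (with NaN for p's columns before the drop), since dy=20 is out of range.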