Pandas Dataframe merge to get only non-existing records

Question

I'm trying to merge two DataFrames to get only the records from the first one (df) that don't already exist in the second one (df_existing).

Both DataFrames have the same columns:
symbolid
timeframeid
datetime
open
high
low
close
volume

This snippet used to work fine, as far as I know:

df2 = (df.merge(df_existing,
                on=['symbolid', 'timeframeid', 'datetime'],
                how='left',
                indicator=True)
         .query('_merge == "left_only"')
         .drop(columns='_merge')
       )

The result now contains every non-join column twice, suffixed _x or _y depending on which DataFrame it came from.

The desired outcome is a DataFrame with the same columns as the originals, without the rows whose symbolid, timeframeid and datetime already exist in df_existing.
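
A minimal reproduction of the symptom, using made-up sample data (the values and the reduced column set are assumptions for illustration only). Because close and volume exist in both frames and are not join keys, merge keeps both copies and tells them apart with its default _x/_y suffixes:

```python
import pandas as pd

keys = ['symbolid', 'timeframeid', 'datetime']

# Hypothetical sample data; only two of the price columns are included for brevity.
df = pd.DataFrame({
    'symbolid': [1, 1, 2],
    'timeframeid': [5, 5, 5],
    'datetime': pd.to_datetime(['2023-07-10 09:00', '2023-07-10 09:05',
                                '2023-07-10 09:00']),
    'close': [10.0, 10.5, 20.0],
    'volume': [100, 150, 200],
})
# Pretend the first row has already been stored.
df_existing = df.iloc[[0]].copy()

merged = df.merge(df_existing, on=keys, how='left', indicator=True)

# Non-key columns shared by both frames come back as close_x/close_y, etc.
print(merged.columns.tolist())
```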

Answer 1

Score: 1

The suffixes appear because the non-key columns exist in both DataFrames. When the merge is only used to align the two DataFrames, you can avoid them by passing just the key columns of df_existing:

cols = ['symbolid', 'timeframeid', 'datetime']

df2 = (df.merge(df_existing[cols],
                on=cols, how='left',
                indicator=True)
         .query('_merge == "left_only"')
         .drop(columns = '_merge')
       )
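
To check the behaviour, here is the same pattern run on a small, made-up dataset (the sample values are assumptions, and only two of the price columns are included). It is a sketch rather than a definitive recipe; the point is that the result keeps df's columns unchanged and drops the row already present in df_existing:

```python
import pandas as pd

cols = ['symbolid', 'timeframeid', 'datetime']

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    'symbolid': [1, 1, 2],
    'timeframeid': [5, 5, 5],
    'datetime': pd.to_datetime(['2023-07-10 09:00', '2023-07-10 09:05',
                                '2023-07-10 09:00']),
    'close': [10.0, 10.5, 20.0],
    'volume': [100, 150, 200],
})
# Pretend the first row has already been stored.
df_existing = df.iloc[[0]].copy()

df2 = (df.merge(df_existing[cols],      # key columns only, so no name clashes
                on=cols, how='left',
                indicator=True)
         .query('_merge == "left_only"')  # keep rows with no match in df_existing
         .drop(columns='_merge')
       )

print(df2)
```

Note that a column-on-column merge builds a fresh integer index, so if df carries a meaningful index it would need to be restored separately.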

Alternative with pop and loc to filter and drop in a single step:

cols = ['symbolid', 'timeframeid', 'datetime']

df2 = (df.merge(df_existing[cols],
                on=cols, how='left',
                indicator=True)
         .loc[lambda d: d.pop('_merge').eq('left_only')]
       )
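
As a sanity check, the sketch below runs the pop/loc variant next to the query/drop variant on made-up data (the sample values are assumptions) and compares the results. pop removes _merge from the merged frame and returns it, so the boolean mask and the column removal happen in one expression:

```python
import pandas as pd

cols = ['symbolid', 'timeframeid', 'datetime']

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    'symbolid': [1, 1, 2],
    'timeframeid': [5, 5, 5],
    'datetime': pd.to_datetime(['2023-07-10 09:00', '2023-07-10 09:05',
                                '2023-07-10 09:00']),
    'close': [10.0, 10.5, 20.0],
    'volume': [100, 150, 200],
})
df_existing = df.iloc[[0]].copy()

# Variant 1: filter with query, then drop the indicator column.
df2_query = (df.merge(df_existing[cols], on=cols, how='left', indicator=True)
               .query('_merge == "left_only"')
               .drop(columns='_merge'))

# Variant 2: pop the indicator column and use it as the row mask.
df2_pop = (df.merge(df_existing[cols], on=cols, how='left', indicator=True)
             .loc[lambda d: d.pop('_merge').eq('left_only')])

# Both variants should select the same rows and columns.
print(df2_pop.equals(df2_query))
```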

huangapple
  • Posted on July 11, 2023 at 04:02:43
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76656976.html