Retain the original number of rows of the left dataset when merging values from two different datasets in pandas
Question
Data

df1

ID   stat
AA1  exzone
BB2  exzone5
CC4  limit5

df2

name  state
AA1   NY
AA1   NY
AA1   NY
AA1   NY
BB2   GA
BB2   GA
BB2   GA
CC4   CA
CC4   CA

Desired

name  stat     state
AA1   exzone   NY
BB2   exzone5  GA
CC4   limit5   CA

Doing
out = pd.merge(df1, df2, left_on=['ID'], right_on=['name'], how='left')

However, the script above produces an exploded output and does not retain the row count of the original left DataFrame. Any suggestions are appreciated.
Answer 1
Score: 1
A left merge does not guarantee that the result has the same number of rows as the original left DataFrame. It guarantees that every left key is preserved, even if it is absent from the right DataFrame. In your case, the duplicated keys on the right force the merge to produce one output row for every matching pair of left and right rows.
You need to remove the duplicates from the right DataFrame first:
out = pd.merge(df1, df2.drop_duplicates(), left_on=['ID'], right_on=['name'], how='left')
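For reference, here is a small self-contained sketch that reproduces both the exploded result and the fix. The DataFrame construction is my reconstruction of the sample tables in the question, and the names exploded and out are only illustrative. The last step, which drops ID and reorders the columns, is an assumption about the layout shown under "Desired".

```python
import pandas as pd

# Sample data rebuilt from the tables in the question.
df1 = pd.DataFrame({
    "ID": ["AA1", "BB2", "CC4"],
    "stat": ["exzone", "exzone5", "limit5"],
})
df2 = pd.DataFrame({
    "name": ["AA1"] * 4 + ["BB2"] * 3 + ["CC4"] * 2,
    "state": ["NY"] * 4 + ["GA"] * 3 + ["CA"] * 2,
})

# Plain left merge: each left row is repeated once per matching right row.
exploded = pd.merge(df1, df2, left_on="ID", right_on="name", how="left")
print(len(exploded))  # 9 rows (4 + 3 + 2) instead of 3

# Dropping the fully duplicated right-hand rows first leaves one row per key.
out = pd.merge(df1, df2.drop_duplicates(), left_on="ID", right_on="name", how="left")

# Assumed final layout: keep name, stat, state and drop the redundant ID column.
out = out.drop(columns="ID")[["name", "stat", "state"]]
print(out)
```

Note that drop_duplicates() with no arguments only removes rows that are identical in every column, so this works here because each name maps to exactly one state.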
If for some reason you have several different states per name, you will need to aggregate them in some other way (pick the first, pick the last, combine the unique states into a single string, etc.), or accept duplicated rows.
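As a rough sketch of two of those aggregation options, assuming a hypothetical variant of df2 in which AA1 appears with two different states (this data is made up for illustration and is not from the question):

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": ["AA1", "BB2", "CC4"],
    "stat": ["exzone", "exzone5", "limit5"],
})
# Hypothetical data: AA1 now maps to both NY and NJ.
df2 = pd.DataFrame({
    "name": ["AA1", "AA1", "AA1", "BB2", "BB2", "CC4"],
    "state": ["NY", "NJ", "NY", "GA", "GA", "CA"],
})

# Option 1: keep only the first state seen for each name.
first_state = df2.drop_duplicates(subset="name", keep="first")

# Option 2: join the unique states of each name into a single string.
joined_state = (
    df2.groupby("name")["state"]
       .agg(lambda s: ", ".join(s.unique()))
       .reset_index()
)

# Either way the right side now has one row per key, so the row count is kept.
out_first = pd.merge(df1, first_state, left_on="ID", right_on="name", how="left")
out_joined = pd.merge(df1, joined_state, left_on="ID", right_on="name", how="left")
print(out_first)
print(out_joined)
```

Which option is appropriate depends on what the extra states mean in your data; using keep="last" instead of keep="first" would pick the last state rather than the first.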