保留原始数据集中的行数,在Pandas中将两个不同数据集的值进行配对时。

huangapple go评论72阅读模式
英文:

Retain original number of rows with dataset to be matched, when pairing values from two different datasets in Pandas

问题

Data

df1

ID                  stat
AA1                 exzone
BB2                 exzone5
CC4                 limit5

df2

name                state
AA1                 NY
AA1                 NY
AA1                 NY
AA1                 NY
BB2                 GA
BB2                 GA
BB2                 GA
CC4                 CA
CC4                 CA

Desired

name                stat          state
AA1                 exzone        NY
BB2                 exzone5       GA
CC4                 limit5        CA

Doing

out = pd.merge(df1, df2, left_on=['ID'], right_on=['name'], how='left')

然而,上面的脚本输出结果爆炸了,并且不保留原始左侧数据框的行数。欢迎任何建议。

英文:

Data

df1

ID                  stat
AA1                 exzone
BB2                 exzone5
CC4                 limit5

df2

name                state
AA1                 NY
AA1                 NY
AA1                 NY
AA1                 NY
BB2                 GA
BB2                 GA
BB2                 GA
CC4                 CA
CC4                 CA

Desired

name                stat          state
AA1                 exzone        NY
BB2                 exzone5       GA
CC4                 limit5        CA

Doing

out = pd.merge(df1,df2, left_on=['ID'], right_on= ['name'], how="left")

however, the above script is giving an exploded output and does not retain the original Left dataframe row count. Any suggestion is appreciated.

答案1

得分: 1

左连接并不意味着结构将与原始左DataFrame相同。它意味着所有左侧的键将被保留,即使在右侧DataFrame中不存在。在你的情况下,右侧的重复键会强制merge计算所有行的组合。

你首先需要移除重复项:

out = pd.merge(df1, df2.drop_duplicates(), left_on=['ID'], right_on=['name'], how='left')

如果出现某种原因,你有每个名称的多个不同状态,你应该找到另一种方法来聚合(选择第一个、最后一个、将唯一状态组合为单个字符串等),或者接受有重复行的情况。

英文:

A left merge doesn't mean that the structure will be identical to that of the original left DataFrame. It means that all the left keys will be preserved, even if absent from the right DataFrame. In your case the duplicated keys on the right force the merge to compute all combinations of the rows.

You need to first remove the duplicates:

out = pd.merge(df1, df2.drop_duplicates(), left_on=['ID'], right_on= ['name'], how="left")

If for some reason you have several different states per name, you should find another way to aggregate (pick the first, last, combine the unique states as a single string, etc.), or accept to have row duplications.

huangapple
  • 本文由 发表于 2023年7月14日 02:13:40
  • 转载请务必保留本文链接:https://go.coder-hub.com/76682210.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定