Replace comma-separated values in a dataframe with values from another dataframe

Question


This is my first question on StackOverflow, so please pardon me if I am not clear enough. I usually find my answers here, but this time I had no luck. Maybe I am being dense, but here we go.

I have two pandas dataframes formatted as follows:

df1

    +------------+-------------+
    | References | Description |
    +------------+-------------+
    | 1,2        | Descr 1     |
    | 3          | Descr 2     |
    | 2,3,5      | Descr 3     |
    +------------+-------------+

df2

    +--------+--------------+
    | Ref_ID | ShortRef     |
    +--------+--------------+
    | 1      | Smith (2006) |
    | 2      | Mike (2009)  |
    | 3      | John (2014)  |
    | 4      | Cole (2007)  |
    | 5      | Jill (2019)  |
    | 6      | Tom (2007)   |
    +--------+--------------+

Basically, Ref_ID in df2 contains the IDs that make up the comma-separated strings in the References field of df1.

What I would like to do is replace the values in the References field of df1 so that it looks like this:

    +-------------------------------------+-------------+
    | References                          | Description |
    +-------------------------------------+-------------+
    | Smith (2006); Mike (2009)           | Descr 1     |
    | John (2014)                         | Descr 2     |
    | Mike (2009);John (2014);Jill (2019) | Descr 3     |
    +-------------------------------------+-------------+

So far, I have only had to deal with columns and IDs in a 1-to-1 relationship, and the approach from this question works perfectly for that:
https://stackoverflow.com/questions/53818434/pandas-replacing-values-by-looking-up-in-an-another-dataframe

But I cannot get my head around this slightly different problem. The only solution I could think of is to loop with nested for and if statements that compare every string in df1 against df2 and make the substitution.

I am afraid this would be very slow, as I have about 2000 unique Ref_IDs and I have to repeat this operation on several columns similar to References.

Is anyone willing to point me in the right direction?

Many thanks in advance.

Answer 1

Score: 3


You can use list comprehensions and dictionary lookups; I don't think this will be too slow.

First, build a fast-to-access mapping from ID to short reference:

    # Cast Ref_ID to str so the keys match the pieces produced by split(',') below
    mapping_dict = df2.assign(Ref_ID=df2['Ref_ID'].astype(str)).set_index('Ref_ID')['ShortRef'].to_dict()

Then split each References string on commas:

    df1_values = [v.split(',') for v in df1['References']]

Finally, look up each ID in the dictionary and join the results back into a single string:

    # Assigning the list directly avoids index misalignment if df1 has a non-default index
    df1['References'] = [';'.join(mapping_dict[v] for v in values) for values in df1_values]

Is this usable, or is it too slow?
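Since the question mentions several columns like References, the dictionary lookup above could be wrapped in a small helper. The following is only a sketch under some assumptions: the function name `replace_ids` and its parameters are made up for illustration, IDs missing from the mapping are left unchanged, and missing cells are passed through untouched.

```python
import pandas as pd


def replace_ids(series, mapping, sep_in=",", sep_out=";"):
    """Replace each delimited ID in `series` with its label from `mapping`.

    Hypothetical helper: IDs that are not in `mapping` are kept as they are,
    and missing cells (NaN/None) are returned unchanged.
    """
    def convert(cell):
        if pd.isna(cell):
            return cell
        ids = [part.strip() for part in str(cell).split(sep_in)]
        return sep_out.join(mapping.get(i, i) for i in ids)

    return series.map(convert)


df1 = pd.DataFrame({"References": ["1,2", "3", "2,3,5"],
                    "Description": ["Descr 1", "Descr 2", "Descr 3"]})
df2 = pd.DataFrame({"Ref_ID": [1, 2, 3, 4, 5, 6],
                    "ShortRef": ["Smith (2006)", "Mike (2009)", "John (2014)",
                                 "Cole (2007)", "Jill (2019)", "Tom (2007)"]})

# String keys, because splitting a string always yields strings
mapping = dict(zip(df2["Ref_ID"].astype(str), df2["ShortRef"]))

# List every column that stores comma-separated IDs here
for col in ["References"]:
    df1[col] = replace_ids(df1[col], mapping)

print(df1["References"].tolist())
```

The dictionary is built once and reused for every column, so each cell costs only a few lookups regardless of how many Ref_IDs exist.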

Answer 2

Score: 3


Let's try this:

    import pandas as pd

    df1 = pd.DataFrame({'Reference': ['1,2', '3', '1,3,5'],
                        'Description': ['Descr 1', 'Descr 2', 'Descr 3']})
    df2 = pd.DataFrame({'Ref_ID': [1, 2, 3, 4, 5, 6],
                        'ShortRef': ['Smith (2006)',
                                     'Mike (2009)',
                                     'John (2014)',
                                     'Cole (2007)',
                                     'Jill (2019)',
                                     'Tom (2007)']})

    # Split into lists of IDs, explode to one ID per row, map each ID to its
    # ShortRef, then regroup by the original row index.
    df1['Reference2'] = (df1['Reference'].str.split(',')
                         .explode()
                         .map(df2.assign(Ref_ID=df2.Ref_ID.astype(str))
                                 .set_index('Ref_ID')['ShortRef'])
                         .groupby(level=0).agg(list))

Output:

      Reference Description                                Reference2
    0       1,2     Descr 1               [Smith (2006), Mike (2009)]
    1         3     Descr 2                             [John (2014)]
    2     1,3,5     Descr 3  [Smith (2006), John (2014), Jill (2019)]

Thanks to @Datanovice for the update. To get a semicolon-separated string instead of a list, aggregate with ';'.join:

    df1['Reference2'] = (df1['Reference'].str.split(',')
                         .explode()
                         .map(df2.assign(Ref_ID=df2.Ref_ID.astype(str))
                                 .set_index('Ref_ID')['ShortRef'])
                         .groupby(level=0).agg(';'.join))

Output:

      Reference Description                            Reference2
    0       1,2     Descr 1              Smith (2006);Mike (2009)
    1         3     Descr 2                           John (2014)
    2     1,3,5     Descr 3  Smith (2006);John (2014);Jill (2019)
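One thing to watch with the explode/map approach: an ID that has no match in df2 maps to NaN, and joining strings with NaN fails. A minimal sketch of one way to handle this follows, assuming unmatched IDs should simply be dropped. The ID `99` and the variable names `lookup` and `mapped` are invented for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"Reference": ["1,2", "3", "1,3,99"],  # 99 is not in df2
                    "Description": ["Descr 1", "Descr 2", "Descr 3"]})
df2 = pd.DataFrame({"Ref_ID": [1, 2, 3, 4, 5, 6],
                    "ShortRef": ["Smith (2006)", "Mike (2009)", "John (2014)",
                                 "Cole (2007)", "Jill (2019)", "Tom (2007)"]})

# Lookup Series indexed by the ID as a string
lookup = df2.assign(Ref_ID=df2["Ref_ID"].astype(str)).set_index("Ref_ID")["ShortRef"]

# One row per ID; the original row index is repeated for each piece
mapped = df1["Reference"].str.split(",").explode().map(lookup)

# Drop unmatched IDs (NaN) before joining, then regroup by the original row
df1["Reference2"] = mapped.dropna().groupby(level=0).agg(";".join)

print(df1["Reference2"].tolist())
```

A row whose IDs are all unmatched would end up as NaN in `Reference2`, because the assignment aligns on the index and that row has no group.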

Answer 3

Score: 1

Another solution is to use str.get_dummies and dot:

    # 'Reference' is the column name from Answer 2's sample frame;
    # the question's df1 calls it 'References'.
    df3 = (df1.set_index('Description').Reference.str.get_dummies(',')
              .reindex(columns=df2.Ref_ID.astype(str).values, fill_value=0))
    df_final = (df3.dot(df2.ShortRef.values + ';').str.strip(';').rename('References')
                   .reset_index())

Output:

      Description                           References
    0     Descr 1             Smith (2006);Mike (2009)
    1     Descr 2                          John (2014)
    2     Descr 3  Mike (2009);John (2014);Jill (2019)
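The dot trick relies on Python's string arithmetic: multiplying a string by 1 keeps it, multiplying by 0 yields an empty string, and adding strings concatenates them. The plain-Python sketch below mimics what happens for a single row of the 0/1 indicator matrix. The function name `dot_strings` is made up for illustration and is not part of pandas.

```python
labels = ["Smith (2006)", "Mike (2009)", "John (2014)",
          "Cole (2007)", "Jill (2019)", "Tom (2007)"]

# Indicator row for the string "2,3,5": one flag per Ref_ID 1..6
indicator_row = [0, 1, 1, 0, 1, 0]


def dot_strings(flags, labels, sep=";"):
    """Emulate one row of the indicator-matrix dot product on strings."""
    # flag * text is text when flag == 1 and '' when flag == 0
    pieces = [flag * (label + sep) for flag, label in zip(flags, labels)]
    # Concatenate the pieces and remove the trailing separator
    return "".join(pieces).strip(sep)


print(dot_strings(indicator_row, labels))
```

Because the result is assembled column by column, the names come out in df2's order rather than the order in which the IDs were written, and an ID repeated within one cell appears only once.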

huangapple
  • Posted on January 7, 2020, 02:19:40
  • Please keep this link when reposting: https://go.coder-hub.com/59617019.html