英文:
Replace comma-separated values in a dataframe with values from another dataframe
问题
I understand your request. Here is the translated content you provided:
这是我在StackOverflow上的第一个问题,所以如果我不够清楚,请原谅。通常我能在这里找到答案,但这一次我没有运气。也许我太迟钝了,但让我们开始吧。
我有两个格式如下的pandas数据帧
df1
+------------+-------------+
| References | Description |
+------------+-------------+
| 1,2 | Descr 1 |
| 3 | Descr 2 |
| 2,3,5 | Descr 3 |
+------------+-------------+
df2
+--------+--------------+
| Ref_ID | ShortRef |
+--------+--------------+
| 1 | Smith (2006) |
| 2 | Mike (2009) |
| 3 | John (2014) |
| 4 | Cole (2007) |
| 5 | Jill (2019) |
| 6 | Tom (2007) |
+--------+--------------+
基本上,df2 中的 Ref_ID 包含在 df1 的 References 字段中形成的字符串
我想要做的是将df1中的 References 字段中的值替换为如下所示:
+-------------------------------------+-------------+
| References | Description |
+-------------------------------------+-------------+
| Smith (2006); Mike (2009) | Descr 1 |
| John (2014) | Descr 2 |
| Mike (2009);John (2014);Jill (2019) | Descr 3 |
+-------------------------------------+-------------+
到目前为止,我只需要处理具有1对1关系的列和ID,这完美地运行。
https://stackoverflow.com/questions/53818434/pandas-replacing-values-by-looking-up-in-an-another-dataframe
但是,我无法理解这个稍微不同的问题。我能想到的唯一解决方法是重新迭代循环,将df1的每个字符串与df2进行比较并进行替换。
我担心这将非常慢,因为我有大约2000个唯一的Ref_ID,我必须在类似于References的几个列中重复执行此操作。
是否有人愿意指导我走向正确的方向?
非常感谢您提前的帮助。
英文:
this is my first question on StackOverflow, so please pardon if I am not clear enough. I usually find my answers here but this time I had no luck. Maybe I am being dense, but here we go.
I have two pandas dataframes formatted as follows
df1
+------------+-------------+
| References | Description |
+------------+-------------+
| 1,2 | Descr 1 |
| 3 | Descr 2 |
| 2,3,5 | Descr 3 |
+------------+-------------+
df2
+--------+--------------+
| Ref_ID | ShortRef |
+--------+--------------+
| 1 | Smith (2006) |
| 2 | Mike (2009) |
| 3 | John (2014) |
| 4 | Cole (2007) |
| 5 | Jill (2019) |
| 6 | Tom (2007) |
+--------+--------------+
Basically, Ref_ID in df2 contains IDs that form the string contained in the field References in df1
What I would like to do is to replace values in the References field in df1 so it looks like this:
+-------------------------------------+-------------+
| References | Description |
+-------------------------------------+-------------+
| Smith (2006); Mike (2009) | Descr 1 |
| John (2014) | Descr 2 |
| Mike (2009);John (2014);Jill (2019) | Descr 3 |
+-------------------------------------+-------------+
So far, I had to deal with columns and IDs with a 1-1 relationship, and this works perfectly
https://stackoverflow.com/questions/53818434/pandas-replacing-values-by-looking-up-in-an-another-dataframe
But I cannot get my mind around this slightly different problem. The only solution I could think of is to re-iterate a for and if cycles that compare every string of df1 to df2 and make the substitution.
This would be, I am afraid, very slow as I have ca. 2000 unique Ref_IDs and I have to repeat this operation in several columns similar to the References one.
Anyone is willing to point me in the right direction?
Many thanks in advance.
答案1
得分: 3
你可以使用一些列表推导和字典查找,我认为这不会太慢。
首先,创建一个快速访问的 id 到 short_ref 的映射:
mapping_dict = df2.set_index('Ref_ID')['ShortRef'].to_dict()
然后,让我们通过逗号分割引用:
df1_values = [v.split(',') for v in df1['References']]
最后,我们可以迭代并进行字典查找,然后再拼接成字符串:
df1['References'] = pd.Series([';'.join([mapping_dict[v] for v in values]) for values in df1_values])
这个方法可行吗,还是会太慢?
英文:
you can use some list comprehension and dict lookups and I dont think this will be too slow
First, making a fast-to-access mapping for id to short_ref
mapping_dict = df2.set_index('Ref_ID')['ShortRef'].to_dict()
Then, lets split references by commas
df1_values = [v.split(',') for v in df1['References']]
Finally, we can iterate over and do dictionary lookups, before concatenating back to strings
df1['References'] = pd.Series([';'.join([mapping_dict[v] for v in values]) for values in df1_values])
Is this usable or is it too slow?
答案2
得分: 3
Here is the translated code:
让我们尝试一下:
df1 = pd.DataFrame({'Reference':['1,2','3','1,3,5'], 'Description':['Descr 1', 'Descr 2', 'Descr 3']})
df2 = pd.DataFrame({'Ref_ID':[1,2,3,4,5,6], 'ShortRef':['Smith (2006)',
'Mike (2009)',
'John (2014)',
'Cole (2007)',
'Jill (2019)',
'Tom (2007)']})
df1['Reference2'] = (df1['Reference'].str.split(',')
.explode()
.map(df2.assign(Ref_ID=df2.Ref_ID.astype(str))
.set_index('Ref_ID')['ShortRef'])
.groupby(level=0).agg(list))
输出:
Reference Description Reference2
0 1,2 Descr 1 [Smith (2006), Mike (2009)]
1 3 Descr 2 [John (2014)]
2 1,3,5 Descr 3 [Smith (2006), John (2014), Jill (2019)]
@Datanovice 感谢更新。
df1['Reference2'] = (df1['Reference'].str.split(',')
.explode()
.map(df2.assign(Ref_ID=df2.Ref_ID.astype(str))
.set_index('Ref_ID')['ShortRef'])
.groupby(level=0).agg(';'.join))
输出:
Reference Description Reference2
0 1,2 Descr 1 Smith (2006);Mike (2009)
1 3 Descr 2 John (2014)
2 1,3,5 Descr 3 Smith (2006);John (2014);Jill (2019)
I've translated the provided code for you.
英文:
Let's try this:
df1 = pd.DataFrame({'Reference':['1,2','3','1,3,5'], 'Description':['Descr 1', 'Descr 2', 'Descr 3']})
df2 = pd.DataFrame({'Ref_ID':[1,2,3,4,5,6], 'ShortRef':['Smith (2006)',
'Mike (2009)',
'John (2014)',
'Cole (2007)',
'Jill (2019)',
'Tom (2007)']})
df1['Reference2'] = (df1['Reference'].str.split(',')
.explode()
.map(df2.assign(Ref_ID=df2.Ref_ID.astype(str))
.set_index('Ref_ID')['ShortRef'])
.groupby(level=0).agg(list))
Output:
Reference Description Reference2
0 1,2 Descr 1 [Smith (2006), Mike (2009)]
1 3 Descr 2 [John (2014)]
2 1,3,5 Descr 3 [Smith (2006), John (2014), Jill (2019)]
@Datanovice thanks for the update.
df1['Reference2'] = (df1['Reference'].str.split(',')
.explode()
.map(df2.assign(Ref_ID=df2.Ref_ID.astype(str))
.set_index('Ref_ID')['ShortRef'])
.groupby(level=0).agg(';'.join))
Output:
Reference Description Reference2
0 1,2 Descr 1 Smith (2006);Mike (2009)
1 3 Descr 2 John (2014)
2 1,3,5 Descr 3 Smith (2006);John (2014);Jill (2019)
答案3
得分: 1
Another solution is using str.get_dummies
and dot
df3 = (df1.set_index('Description').Reference.str.get_dummies(',')
.reindex(columns=df2.Ref_ID.astype(str).values, fill_value=0))
df_final = (df3.dot(df2.ShortRef.values+';').str.strip(';').rename('References')
.reset_index())
Out[462]:
Description References
0 Descr 1 Smith (2006);Mike (2009)
1 Descr 2 John (2014)
2 Descr 3 Mike (2009);John (2014);Jill (2019)
英文:
Another solution is using str.get_dummies
and dot
df3 = (df1.set_index('Description').Reference.str.get_dummies(',')
.reindex(columns=df2.Ref_ID.astype(str).values, fill_value=0))
df_final = (df3.dot(df2.ShortRef.values+';').str.strip(';').rename('References')
.reset_index())
Out[462]:
Description References
0 Descr 1 Smith (2006);Mike (2009)
1 Descr 2 John (2014)
2 Descr 3 Mike (2009);John (2014);Jill (2019)
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论