January 7, 2020, 02:19:40

Replace comma-separated values in a dataframe with values from another dataframe

Question

This is my first question on StackOverflow, so please pardon me if I am not clear enough. I usually find my answers here, but this time I had no luck. Maybe I am being dense, but here we go.

I have two pandas dataframes formatted as follows:

df1

+------------+-------------+
| References | Description |
+------------+-------------+
| 1,2        | Descr 1     |
| 3          | Descr 2     |
| 2,3,5      | Descr 3     |
+------------+-------------+

df2

+--------+--------------+
| Ref_ID |   ShortRef   |
+--------+--------------+
|      1 | Smith (2006) |
|      2 | Mike (2009)  |
|      3 | John (2014)  |
|      4 | Cole (2007)  |
|      5 | Jill (2019)  |
|      6 | Tom (2007)   |
+--------+--------------+

Basically, each value in the References field of df1 is a comma-separated string of IDs taken from Ref_ID in df2.

What I would like to do is replace the values in the References field of df1 so that it looks like this:

+-------------------------------------+-------------+
|             References              | Description |
+-------------------------------------+-------------+
| Smith (2006);Mike (2009)            | Descr 1     |
| John (2014)                         | Descr 2     |
| Mike (2009);John (2014);Jill (2019) | Descr 3     |
+-------------------------------------+-------------+

So far, I have only had to deal with columns and IDs in a 1-to-1 relationship, and this approach works perfectly for that case:
https://stackoverflow.com/questions/53818434/pandas-replacing-values-by-looking-up-in-an-another-dataframe

But I cannot get my mind around this slightly different problem. The only solution I could think of is to write nested for loops and if checks that compare every string in df1 against df2 and make the substitution.

I am afraid this would be very slow, as I have ca. 2000 unique Ref_IDs and I have to repeat this operation in several columns similar to the References one.

Is anyone willing to point me in the right direction?

Many thanks in advance.

Answer 1

Score: 3

You can use list comprehensions and dict lookups; I don't think this will be too slow.

First, build a fast-to-access mapping from ID to short reference. The keys are cast to strings so that they match the pieces produced by splitting References:

mapping_dict = df2.assign(Ref_ID=df2['Ref_ID'].astype(str)).set_index('Ref_ID')['ShortRef'].to_dict()

Then, let's split the references by commas:

df1_values = [v.split(',') for v in df1['References']]

Finally, we can iterate over the lists and do the dictionary lookups, before joining the results back into strings:

df1['References'] = [';'.join(mapping_dict[v] for v in values) for values in df1_values]

Is this usable or is it too slow?
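
For reference, here is a minimal self-contained sketch of the same dictionary-lookup idea, using the data from the question. The helper name replace_ids and its sep_in/sep_out parameters are made up for illustration, and keeping unknown IDs unchanged is an assumption about the desired behavior, not something the question specifies.

```python
import pandas as pd

df1 = pd.DataFrame({
    "References": ["1,2", "3", "2,3,5"],
    "Description": ["Descr 1", "Descr 2", "Descr 3"],
})
df2 = pd.DataFrame({
    "Ref_ID": [1, 2, 3, 4, 5, 6],
    "ShortRef": ["Smith (2006)", "Mike (2009)", "John (2014)",
                 "Cole (2007)", "Jill (2019)", "Tom (2007)"],
})

# Keys are cast to str so they match the pieces produced by str.split.
mapping = dict(zip(df2["Ref_ID"].astype(str), df2["ShortRef"]))


def replace_ids(cell, mapping, sep_in=",", sep_out=";"):
    """Hypothetical helper: map each ID in a delimited string to its label.

    IDs that are missing from the mapping are kept unchanged (an assumption).
    """
    ids = [part.strip() for part in cell.split(sep_in)]
    return sep_out.join(mapping.get(i, i) for i in ids)


# Assign a plain list so the result does not depend on df1's index labels.
df1["References"] = [replace_ids(cell, mapping) for cell in df1["References"]]
print(df1)
```

Since the mapping is built once and each lookup is a constant-time dict access, the cost should grow roughly with the total number of IDs in the column rather than with the size of df2.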

Answer 2

Score: 3

Let's try this:

import pandas as pd

df1 = pd.DataFrame({'Reference':['1,2','3','1,3,5'], 'Description':['Descr 1', 'Descr 2', 'Descr 3']})
df2 = pd.DataFrame({'Ref_ID':[1,2,3,4,5,6], 'ShortRef':['Smith (2006)',
                                                       'Mike (2009)',
                                                       'John (2014)',
                                                       'Cole (2007)',
                                                       'Jill (2019)',
                                                       'Tom (2007)']})
df1['Reference2'] = (df1['Reference'].str.split(',')
                                     .explode()
                                     .map(df2.assign(Ref_ID=df2.Ref_ID.astype(str))
                                             .set_index('Ref_ID')['ShortRef'])
                                     .groupby(level=0).agg(list))

Output:

  Reference Description                                Reference2
0       1,2     Descr 1               [Smith (2006), Mike (2009)]
1         3     Descr 2                             [John (2014)]
2     1,3,5     Descr 3  [Smith (2006), John (2014), Jill (2019)]

@Datanovice thanks for the update.

df1['Reference2'] = (df1['Reference'].str.split(',')
                                     .explode()
                                     .map(df2.assign(Ref_ID=df2.Ref_ID.astype(str))
                                             .set_index('Ref_ID')['ShortRef'])
                                     .groupby(level=0).agg(';'.join))

Output:

  Reference Description                            Reference2
0       1,2     Descr 1              Smith (2006);Mike (2009)
1         3     Descr 2                           John (2014)
2     1,3,5     Descr 3  Smith (2006);John (2014);Jill (2019)
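
The question mentions repeating the operation over several similar columns. As a rough sketch, the split/explode/map/groupby chain above could be wrapped in a function and applied to each column in turn. The function name map_id_column and the second column SeeAlso are invented for illustration. The sketch assumes every ID exists in df2 (an unknown ID would become NaN and make the join fail) and that Series.explode is available (pandas 0.25 or later).

```python
import pandas as pd

df1 = pd.DataFrame({
    "References": ["1,2", "3", "2,3,5"],
    "SeeAlso": ["4", "1,6", "5"],  # hypothetical second ID column
    "Description": ["Descr 1", "Descr 2", "Descr 3"],
})
df2 = pd.DataFrame({
    "Ref_ID": [1, 2, 3, 4, 5, 6],
    "ShortRef": ["Smith (2006)", "Mike (2009)", "John (2014)",
                 "Cole (2007)", "Jill (2019)", "Tom (2007)"],
})

# Build the lookup Series once; its index holds the IDs as strings.
lookup = (df2.assign(Ref_ID=df2["Ref_ID"].astype(str))
             .set_index("Ref_ID")["ShortRef"])


def map_id_column(col, lookup, sep_in=",", sep_out=";"):
    """Hypothetical helper: replace delimited IDs in a column with labels."""
    # One row per ID; the original row label is repeated in the index.
    exploded = col.str.split(sep_in).explode().str.strip()
    labels = exploded.map(lookup)
    # Collapse back to one row per original row label.
    return labels.groupby(level=0).agg(sep_out.join)


for name in ["References", "SeeAlso"]:
    df1[name] = map_id_column(df1[name], lookup)

print(df1)
```

Building lookup once outside the function avoids recomputing the string conversion of Ref_ID for every column.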

Answer 3

Score: 1

Another solution is to use str.get_dummies and dot:

df3 = (df1.set_index('Description').Reference.str.get_dummies(',')
          .reindex(columns=df2.Ref_ID.astype(str).values, fill_value=0))
df_final = (df3.dot(df2.ShortRef.values+';').str.strip(';').rename('References')
               .reset_index())
Out[462]:
  Description                           References
0     Descr 1             Smith (2006);Mike (2009)
1     Descr 2                          John (2014)
2     Descr 3  Mike (2009);John (2014);Jill (2019)
