March 21, 2023, 02:15:50

Automatically extract the parts shared between strings into a new dataframe in Python

Question

I have a data frame like this:

import pandas as pd

d = {'col1': ["url/a/b/c/d", "url/b/c/d", "url/j/k", "url/t/y", "url/r/a/y"],
     'id':   [1, 2, 3, 4, 5]}
df = pd.DataFrame(data=d)

I want to create another dataframe, based on the original one, that keeps only the parts of the strings that also appear in other rows.

My idea was to split on each / and then compare the first row of the dataframe with the rest of the dataframe (and so on for every row) to check for equality. The result for my initial example would therefore be:

result = {'col1': [["a", "b", "c", "d"], ["b", "c", "d"], [""], ["y"], ["a", "y"]],
          'id':   [1, 2, 3, 4, 5]}
df_result = pd.DataFrame(data=result)

However, I could not write this function without getting errors. Any ideas?

Answer 1

Score: 2

You can extract all the wanted parts (several methods are possible), keep only the values that are duplicated, aggregate them back into one list per row, then reindex to add the missing empty lists:

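df['col1'] = (df['col1']
 .str.extractall('/([^/]+)')[0]            # one row per segment that follows a "/"
 .loc[lambda x: x.duplicated(keep=False)]  # keep segments that occur more than once
 .groupby(level=0).agg(list)               # back to one list per original row
 .reindex(df.index, fill_value=[])         # rows without any match get an empty list
 )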

Output:

           col1  id
0  [a, b, c, d]   1
1     [b, c, d]   2
2            []   3
3           [y]   4
4        [a, y]   5
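
Since several methods are possible, here is a sketch of one alternative that follows the "split on each /" idea from the question and returns a new dataframe instead of overwriting col1. The helper name shared_segments and the intermediate names (long, rows_per_segment) are made up for illustration, and the code assumes the first segment ("url") is a common prefix that should be ignored. Unlike the duplicated(keep=False) approach, it counts the number of distinct rows a segment appears in, so a segment repeated only inside a single row is not treated as shared.

```python
import pandas as pd


def shared_segments(df: pd.DataFrame, column: str = "col1") -> pd.DataFrame:
    """Hypothetical helper: list, per row, the segments also found in another row."""
    # One line per (row, segment). The first segment ("url") is assumed to be
    # a common prefix and is skipped.
    long = (
        df[column].str.split("/").str[1:]
        .explode()
        .dropna()
        .rename("segment")
        .rename_axis("row")
        .reset_index()
    )
    # Count the distinct rows in which each segment occurs.
    rows_per_segment = long.groupby("segment")["row"].nunique()
    shared = rows_per_segment[rows_per_segment > 1].index
    # Rebuild one list per row, keeping the original segment order.
    lists = long[long["segment"].isin(shared)].groupby("row")["segment"].agg(list)
    out = df.copy()
    # Rows without any shared segment each get their own fresh empty list.
    out[column] = pd.Series([lists.get(i, []) for i in df.index], index=df.index)
    return out


df = pd.DataFrame({
    "col1": ["url/a/b/c/d", "url/b/c/d", "url/j/k", "url/t/y", "url/r/a/y"],
    "id": [1, 2, 3, 4, 5],
})
df_result = shared_segments(df)
print(df_result)
```

On the example data this should give the same lists as the output shown above, while leaving the original df untouched.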
