Extracting the parts shared between strings into a new DataFrame in Python

Question

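I have a DataFrame like this:

    import pandas as pd

    d = {'col1': ["url/a/b/c/d", "url/b/c/d", "url/j/k", "url/t/y", "url/r/a/y"],
         'id':   [1, 2, 3, 4, 5]}
    df = pd.DataFrame(data=d)

I want to create another DataFrame from it that keeps only the parts of each string that also appear in other rows.

My idea was to split each string on / and then compare the first row with all the other rows (and so on for every row) to find the parts that match. For the example above, the result would be:

    result = {'col1': [["a", "b", "c", "d"], ["b", "c", "d"], [""], ["y"], ["a", "y"]],
              'id':   [1, 2, 3, 4, 5]}
    df_result = pd.DataFrame(data=result)

I have not been able to write this function without errors. Any ideas?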

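Answer 1

Score: 2

You can extract all the wanted parts (several methods are possible), keep only the values that occur more than once, and then reindex to add empty lists for the rows that are now missing:

    df['col1'] = (df['col1']
     .str.extractall('/([^/]+)')[0]            # one row per segment after each '/'
     .loc[lambda x: x.duplicated(keep=False)]  # keep segments that occur more than once
     .groupby(level=0).agg(list)               # collect them back into one list per original row
     .reindex(df.index, fill_value=[])         # restore rows that had no repeated segment
     )

Output:

           col1  id
0  [a, b, c, d]   1
1     [b, c, d]   2
2            []   3
3           [y]   4
4        [a, y]   5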
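
For comparison, here is a minimal plain-Python sketch of the idea described in the question: split on / and check each part against the other rows. It is not part of the answer above. The function name `shared_segments` is hypothetical, and the sketch assumes every string starts with a prefix (here `url`) before the first /. Unlike `duplicated(keep=False)`, it counts a segment once per row, so a segment repeated only inside a single string is not treated as shared.

```python
from collections import Counter


def shared_segments(urls):
    """For each string, keep the path segments that also occur in another string."""
    # Split on "/" and drop the leading prefix (assumed to be present in every string).
    parts = [u.split("/")[1:] for u in urls]
    # Count in how many distinct rows each segment appears.
    rows_per_segment = Counter(seg for row in parts for seg in set(row))
    # Keep a segment only if at least two rows contain it.
    return [[seg for seg in row if rows_per_segment[seg] > 1] for row in parts]


urls = ["url/a/b/c/d", "url/b/c/d", "url/j/k", "url/t/y", "url/r/a/y"]
print(shared_segments(urls))
# [['a', 'b', 'c', 'd'], ['b', 'c', 'd'], [], ['y'], ['a', 'y']]
```

With the DataFrame from the question, something like `df['col1'] = shared_segments(df['col1'].tolist())` should store one list per row, since the returned list has the same length as the column.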

Posted by huangapple on March 21, 2023 at 02:15:50
When reposting, please keep the link to this article: https://go.coder-hub.com/75793914.html