Automatically extract the parts shared between strings into a new dataframe in Python
Question
I have a data frame like this:
import pandas as pd

d = {'col1': ["url/a/b/c/d", "url/b/c/d", "url/j/k", "url/t/y", "url/r/a/y"],
     'id': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data=d)
I want to create another dataframe, based on the original one, that keeps only the parts of the strings that are repeated in other rows.
My idea was to split each string on / and then compare the segments of the first row with those of every other row (and so on for all rows) to check for equality. For the example above, the result would be:
result = {'col1': [["a", "b", "c", "d"], ["b", "c", "d"], [""], ["y"], ["a", "y"]],
          'id': [1, 2, 3, 4, 5]}
df_result = pd.DataFrame(data=result)
I also could not write this function without getting errors. Any ideas?
Answer 1
Score: 2
You can extract all the wanted parts (several methods are possible), keep only the values that are duplicated anywhere in the column, then use reindex to add back the rows that ended up with no matches as empty lists:
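df['col1'] = (df['col1']
    # one row per path segment that follows a "/", indexed by (row, match)
    .str.extractall('/([^/]+)')[0]
    # keep every segment that occurs more than once in the whole column
    .loc[lambda x: x.duplicated(keep=False)]
    # collect the surviving segments back into one list per original row
    .groupby(level=0).agg(list)
    # restore rows that had no repeated segment, as empty lists
    .reindex(df.index, fill_value=[])
)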
Output:
           col1  id
0  [a, b, c, d]   1
1     [b, c, d]   2
2            []   3
3           [y]   4
4        [a, y]   5
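
For comparison, here is a minimal plain-Python sketch of the split-and-compare idea from the question. It is not the method used in the answer above: the helper name repeated_segments is hypothetical, and it assumes every string starts with a single prefix such as "url" that should be dropped. Like the pandas chain, it treats a segment as repeated if it appears at least twice anywhere in the column, so a segment that occurs twice within the same row would also be kept.

```python
from collections import Counter


def repeated_segments(urls):
    """Hypothetical helper: for each string, keep the segments seen more than once overall."""
    # Split on "/" and drop the first element (assumed to be the "url" prefix).
    split_rows = [u.split("/")[1:] for u in urls]
    # Count every segment across all rows.
    counts = Counter(seg for row in split_rows for seg in row)
    # Keep, in original order, only the segments that occur at least twice.
    return [[seg for seg in row if counts[seg] > 1] for row in split_rows]


urls = ["url/a/b/c/d", "url/b/c/d", "url/j/k", "url/t/y", "url/r/a/y"]
result = repeated_segments(urls)
print(result)
```

To put the result back into a dataframe, one option is to wrap it in a Series that reuses the original index, for example pd.Series(repeated_segments(df['col1']), index=df.index). Rows without any repeated segment come back as empty lists, which matches the answer's output rather than the [""] shown in the question.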