Concatenating similar items in a list - Python
Question
I have a list of similar and unique words. Similar words appear together in one string, separated by " | ".
input = ["car | cat", "cat | caat", "car | caar", "dog", "ant | ants"]
I want the following output, which shows that car, cat, caat, and caar are all similar, instead of repeated pairs of similar words. The target output is:
output = ["car | cat | caat | caar", "dog", "ant | ants"]
So far, I've managed to get ["car | cat | caat | caar", "dog", "ant", "ants"]. But I want to keep "ant | ants" intact, since it has no word in common with any other pair.
Can someone write Python code to solve this problem?
Edit:
Here is the code for my attempt, but please don't feel that you have to use the same approach.
from collections import Counter

def concat_common_words(input):
    my_list = input
    split_my_list = [x.split(" | ") for x in my_list]
    flat_my_list = [i for j in split_my_list for i in j]
    count_my_list = Counter(flat_my_list)
    # words that occur in more than one place
    common = [k for k, v in count_my_list.items() if v > 1]
    target_my_list = [x for x in my_list if any(c in x for c in common)]
    flat_target_my_list = set(sf for sfs in target_my_list for sf in sfs.split(" | "))
    merged = [" | ".join(flat_target_my_list)] \
        + list(set(flat_my_list) - flat_target_my_list)
    return merged

concat_common_words(["car | cat", "cat | caat", "car | caar", "dog", "ant | ants"])
It returns ["car | cat | caat | caar", "dog", "ant", "ants"]. But as I mentioned, I want to keep "ant | ants" intact.
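One possible reason the attempt splits "ant | ants" is that the leftover items are rebuilt from the flattened word set rather than from the original strings. A minimal sketch that keeps the same Counter-based idea but returns untouched items as they were given is shown here. The parameter name `items` and the helper variables are illustrative, not from the original attempt, and like the original it assumes all overlapping pairs belong to a single group.

```python
from collections import Counter

def concat_common_words(items):
    split_items = [x.split(" | ") for x in items]
    flat = [w for words in split_items for w in words]
    # words that occur more than once across the whole list
    common = {w for w, n in Counter(flat).items() if n > 1}
    # items sharing at least one repeated word get merged together
    target = [words for words in split_items if common.intersection(words)]
    # dict.fromkeys removes duplicates while keeping first-seen order
    merged_words = list(dict.fromkeys(w for words in target for w in words))
    # everything else is kept as the original, unsplit string
    untouched = [x for x, words in zip(items, split_items)
                 if not common.intersection(words)]
    return ([" | ".join(merged_words)] if merged_words else []) + untouched

result = concat_common_words(
    ["car | cat", "cat | caat", "car | caar", "dog", "ant | ants"]
)
print(result)
# ['car | cat | caat | caar', 'dog', 'ant | ants']
```

This still collapses every overlapping item into one string, so it would not separate two independent clusters of similar words.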
Answer 1
Score: 1
# I would create a set() for each group e.g. car | cat
# when adding a new group I would then merge with any existing group if
# they intersect.
data = ["car | cat", "cat | caat", "car | caar", "dog", "ant | ants"]
groups = []
for item in data:
    words = set(item.split(" | "))
    to_remove = []
    for existing_group in groups:
        if words.intersection(existing_group):
            words.update(existing_group)
            to_remove.append(existing_group)
    for removal in to_remove:
        groups.remove(removal)
    groups.append(words)
# convert groups back to pipe separated
final_groups = [" | ".join(group) for group in groups]
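Because the groups above are sets, the order of words inside each joined string and the position of a merged group in the list are not guaranteed to match the target output. A sketch of the same merge-on-intersection idea using lists, so that first-seen order is preserved, might look like this. The function name `merge_groups` and its `sep` parameter are assumptions made for illustration.

```python
def merge_groups(items, sep=" | "):
    groups = []  # each group is a list of words in first-seen order
    for item in items:
        # de-duplicate the words of this item, keeping their order
        words = list(dict.fromkeys(item.split(sep)))
        # existing groups that share at least one word with this item
        hits = [g for g in groups if not set(g).isdisjoint(words)]
        if not hits:
            groups.append(words)
            continue
        # fold everything into the earliest overlapping group
        target = hits[0]
        for other in hits[1:]:
            target.extend([w for w in other if w not in target])
            groups.remove(other)
        target.extend([w for w in words if w not in target])
    return [sep.join(g) for g in groups]

data = ["car | cat", "cat | caat", "car | caar", "dog", "ant | ants"]
print(merge_groups(data))
# ['car | cat | caat | caar', 'dog', 'ant | ants']
```

An item that bridges two existing groups, such as "b | c" after "a | b" and "c | d", merges them into a single group at the position of the first one.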
Answer 2
Score: 1
If you want to use the Levenshtein distance, proceed as follows:
from Levenshtein import distance as lev

data = ["car | cat", "cat | caat", "car | caar", "dog", "ant | ants"]
# set the desired threshold
threshold = 1
# create a unique set of trimmed strings
dataset = set([e.strip() for ls in [s.split('|') for s in data] for e in ls])
# create a list of dicts to track which strings are already taken
dl = [{'name': s, 'taken': False} for s in dataset]
dd = []
for i in range(0, len(dl)):
    # check whether it is not taken
    if dl[i]['taken'] is False:
        ds = set()
        dl[i]['taken'] = True
        ds.add(dl[i]['name'])
        for j in range(i + 1, len(dl)):
            # check whether it is not taken and satisfies the distance threshold
            if dl[j]['taken'] is False and lev(dl[i]['name'], dl[j]['name']) <= threshold:
                dl[j]['taken'] = True
                ds.add(dl[j]['name'])
        dd.append(' | '.join(ds))
print(dd)
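# The output depends on set iteration order, so the grouping can differ between runs.
# Each word ends up in exactly one group, and only words within the threshold of the
# first word of a group are added to it.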
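If an order-independent result is wanted, one option is to treat every pair of words within the threshold as linked and then collect the connected groups. The sketch below does this with the standard library only. The functions `levenshtein` and `group_by_distance` are written here for illustration and are not part of the `Levenshtein` package. Note that linking is transitive, so two words in the same group can be further apart than the threshold.

```python
from itertools import combinations

def levenshtein(a, b):
    """Edit distance by dynamic programming (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def group_by_distance(items, threshold=1, sep=" | "):
    # unique trimmed words in first-seen order
    words = list(dict.fromkeys(w.strip() for s in items for w in s.split("|")))
    parent = {w: w for w in words}  # union-find forest

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    # link every pair of words that is within the threshold
    for a, b in combinations(words, 2):
        if levenshtein(a, b) <= threshold:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[rb] = ra

    # collect words under their root, keeping first-seen order
    groups = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)
    return [sep.join(g) for g in groups.values()]

data = ["car | cat", "cat | caat", "car | caar", "dog", "ant | ants"]
print(group_by_distance(data))
# ['car | cat | caat | caar', 'dog', 'ant | ants']
```

Comparing all pairs is quadratic in the number of unique words, which is fine for short lists but may be slow for large vocabularies.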