Concatenating similar items in a list – Python

Question

I have a list of similar and unique words. Similar words appear together in one string, separated by " | ".

input = ["car | cat", "cat | caat", "car | caar", "dog", "ant | ants"]

I want to get the following output, which shows that car, cat, caat, and caar are all similar, instead of repeating pairs of similar words. The target output is:

output = ["car | cat | caat | caar", "dog", "ant | ants"]

So far, I've managed to get ["car | cat | caat | caar", "dog", "ant", "ants"]. But I want to keep "ant | ants" intact, since it shares no word with any other pair.

Can someone write Python code to solve this problem?

Edit:

Here is the code for my attempt, but please don't feel you have to use the same approach.

from collections import Counter

def concat_common_words(input):
    my_list = input
    split_my_list = [x.split(" | ") for x in my_list]

    flat_my_list = [i for j in split_my_list for i in j]

    count_my_list = Counter(flat_my_list)

    common = [k for k, v in count_my_list.items() if v > 1]

    target_my_list = [x for x in my_list if any(c in x for c in common)]

    flat_target_my_list = set(sf for sfs in target_my_list for sf in sfs.split(" | "))

    merged = [" | ".join(flat_target_my_list)] \
    + list(set(flat_my_list) - flat_target_my_list) 

    return merged
concat_common_words(["car | cat", "cat | caat", "car | caar", "dog", "ant | ants"])

It returns ["car | cat | caat | caar", "dog", "ant", "ants"]. But as I mentioned, I want to keep "ant | ants" intact.

Answer 1

Score: 1

# I would create a set() for each group e.g. car | cat
# when adding a new group I would then merge with any existing group if
# they intersect.

data = ["car | cat", "cat | caat", "car | caar", "dog", "ant | ants"]


groups = []
for item in data:
    words = set(item.split(" | "))
    to_remove = []
    for existing_group in groups:
        if words.intersection(existing_group):
            words.update(existing_group)
            to_remove.append(existing_group)
    for removal in to_remove:
        groups.remove(removal)
    groups.append(words)

# convert groups back to pipe separated
final_groups = [" | ".join(group) for group in groups]
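If the order of the words matters (sets do not preserve it), the same merge idea can be sketched with lists instead. This is only an illustration under stated assumptions: `merge_groups` is a hypothetical helper name that does not appear in the answer above, and it assumes words are separated by "|" with optional surrounding spaces.

```python
def merge_groups(items):
    """Merge strings that share a word, keeping first-seen word order.

    Hypothetical helper for illustration; assumes "|" separates the words.
    """
    groups = []  # each group is a list of words in first-seen order
    for item in items:
        words = [w.strip() for w in item.split("|")]
        # Existing groups that share at least one word with this item
        hits = [g for g in groups if any(w in g for w in words)]
        if hits:
            target = hits[0]
            # Fold every other matching group into the first one
            for other in hits[1:]:
                for w in other:
                    if w not in target:
                        target.append(w)
                groups.remove(other)
        else:
            target = []
            groups.append(target)
        # Add the words of the current item that are still missing
        for w in words:
            if w not in target:
                target.append(w)
    return [" | ".join(g) for g in groups]


data = ["car | cat", "cat | caat", "car | caar", "dog", "ant | ants"]
print(merge_groups(data))
# ['car | cat | caat | caar', 'dog', 'ant | ants']
```

Because a merged group stays at the position of the earliest group it absorbed, the result follows the order in which words first appear in the input.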

Answer 2

Score: 1

If you want to use the Levenshtein distance, proceed as follows:

from Levenshtein import distance as lev

data = ["car | cat", "cat | caat", "car | caar", "dog", "ant | ants"]

# set the desired threshold
threshold = 1

# create a unique set of trimmed strings
dataset = set([e.strip() for ls in [s.split('|') for s in data] for e in ls])

# create a list of dicts to track which strings are already taken
dl = [{'name': s, 'taken': False} for s in dataset]

dd = []

for i in range(0, len(dl)):
    # check whether it is not taken
    if dl[i]['taken'] is False:
        ds = set()
        dl[i]['taken'] = True
        ds.add(dl[i]['name'])
        for j in range(i + 1, len(dl)):
            # check whether it is not taken and satisfies the distance threshold
            if dl[j]['taken'] is False and lev(dl[i]['name'], dl[j]['name']) <= threshold:
                dl[j]['taken'] = True
                ds.add(dl[j]['name'])
        dd.append(' | '.join(ds))

print(dd)

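# Note: sets are unordered, so the groups and their order can vary between runs.
# Each word ends up in exactly one group. For example, if 'car' is visited first,
# the groups are {car, cat, caar}, {caat}, {dog} and {ant, ants}.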
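The loop above compares candidates only with the first word of each group, so the result depends on the order in which words are visited. One way to make the grouping order-independent is to treat "distance <= threshold" as a link and collect connected groups with a small union-find. The sketch below is an assumption-laden illustration, not the answer's method: `edit_distance` and `group_by_distance` are hypothetical names, and a plain-Python edit distance stands in for the `Levenshtein` package so that the example has no third-party dependency.

```python
def edit_distance(a, b):
    """Dynamic-programming Levenshtein distance (pure Python stand-in)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def group_by_distance(items, threshold=1):
    """Group words linked, directly or through a chain, by a small distance."""
    # Unique words in first-seen order
    words = list(dict.fromkeys(
        w.strip() for item in items for w in item.split("|")))
    parent = list(range(len(words)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if edit_distance(words[i], words[j]) <= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Keep the smaller index as root so output order is stable
                    parent[max(ri, rj)] = min(ri, rj)

    groups = {}
    for i, w in enumerate(words):
        groups.setdefault(find(i), []).append(w)
    return [" | ".join(g) for g in groups.values()]


data = ["car | cat", "cat | caat", "car | caar", "dog", "ant | ants"]
print(group_by_distance(data))
# ['car | cat | caat | caar', 'dog', 'ant | ants']
```

Note that this chains matches: "car" and "caat" are two edits apart, yet they share a group because each is one edit from "cat". Like the answer above, it groups by spelling distance alone and ignores which words were paired in the input.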

huangapple
  • Published on March 3, 2023, 23:21:17
  • Please keep this link when reposting: https://go.coder-hub.com/75628900.html