2023-03-03 23:21:17

Concatenating similar items in a list - Python

Question

I have a list of similar and unique words. Similar words appear together in one string, separated by " | ".

input = ["car | cat", "cat | caat", "car | caar", "dog", "ant | ants"]

I want the following output, which shows that car, cat, caat, and caar are all similar, instead of repeated pairs of similar words. The target output is:

output= ["car | cat | caat | caar", "dog" , "ant | ants"]

So far, I've managed to get ["car | cat | caat | caar", "dog", "ant", "ants"]. But I want to keep "ant | ants" intact since it doesn't have any word in common with any other pairs.

Can someone write Python code to solve this problem?

Edit:

Here is the code for my attempt, though you don't need to follow the same approach.

from collections import Counter

def concat_common_words(input):
    my_list = input
    # split each entry into its individual words
    split_my_list = [x.split(" | ") for x in my_list]
    flat_my_list = [i for j in split_my_list for i in j]
    # words that occur in more than one entry
    count_my_list = Counter(flat_my_list)
    common = [k for k, v in count_my_list.items() if v > 1]
    # entries that contain at least one common word
    target_my_list = [x for x in my_list if any(c in x for c in common)]
    flat_target_my_list = set(sf for sfs in target_my_list for sf in sfs.split(" | "))
    # one merged string for the common entries, plus every remaining word on its own
    merged = [" | ".join(flat_target_my_list)] \
        + list(set(flat_my_list) - flat_target_my_list)
    return merged

concat_common_words(["car | cat", "cat | caat", "car | caar", "dog", "ant | ants"])

It returns ["car | cat | caat | caar", "dog", "ant", "ants"]. But as I mentioned, I want to keep "ant | ants" intact.

Answer 1

Score: 1

# I would create a set() for each group, e.g. car | cat.
# When adding a new group, I would then merge it with any existing group
# it intersects.
data = ["car | cat", "cat | caat", "car | caar", "dog", "ant | ants"]
groups = []
for item in data:
    words = set(item.split(" | "))
    to_remove = []
    for existing_group in groups:
        if words.intersection(existing_group):
            words.update(existing_group)
            to_remove.append(existing_group)
    for removal in to_remove:
        groups.remove(removal)
    groups.append(words)
# convert groups back to pipe-separated strings
final_groups = [" | ".join(group) for group in groups]
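
The snippet above stores each group as a set, so the word order inside a joined string is not guaranteed. As a minimal sketch, assuming words should keep their order of first appearance, the same merge idea can be written with lists. The names `merge_groups` and `sep` are hypothetical and introduced only for this example.

```python
def merge_groups(items, sep=" | "):
    """Merge pipe-separated entries that share at least one word.

    Words inside each merged group keep their order of first appearance.
    A merged group is placed at the position of the latest entry that
    extended it, so group order can differ from the input order.
    """
    groups = []  # each group is a list of unique words
    for item in items:
        # split on the bare pipe and trim, so "a|b" and "a | b" both work
        words = [w.strip() for w in item.split("|")]
        merged = []
        remaining = []
        for group in groups:
            if set(group) & set(words):
                merged.extend(group)      # absorb every overlapping group
            else:
                remaining.append(group)   # untouched groups stay as they are
        merged.extend(words)
        # dict.fromkeys drops duplicates while keeping insertion order
        remaining.append(list(dict.fromkeys(merged)))
        groups = remaining
    return [sep.join(group) for group in groups]


data = ["car | cat", "cat | caat", "car | caar", "dog", "ant | ants"]
print(merge_groups(data))
# ['car | cat | caat | caar', 'dog', 'ant | ants']
```

Because every overlapping group is absorbed in the same pass, an entry that links two previously separate groups joins them into one.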

Answer 2

Score: 1

If you want to use the Levenshtein distance, proceed as follows:

from Levenshtein import distance as lev

data = ["car | cat", "cat | caat", "car | caar", "dog", "ant | ants"]
# set the desired threshold
threshold = 1
# create a unique set of trimmed strings
dataset = set(e.strip() for s in data for e in s.split('|'))
# create a list of dicts to track strings that are already taken
dl = [{'name': s, 'taken': False} for s in dataset]
dd = []
for i in range(len(dl)):
    # check whether it is not taken
    if dl[i]['taken'] is False:
        ds = set()
        dl[i]['taken'] = True
        ds.add(dl[i]['name'])
        for j in range(i + 1, len(dl)):
            # check whether it is not taken and within the distance threshold
            if dl[j]['taken'] is False and lev(dl[i]['name'], dl[j]['name']) <= threshold:
                dl[j]['taken'] = True
                ds.add(dl[j]['name'])
        dd.append(' | '.join(ds))

print(dd)
# output reported in the original answer (grouping and order depend on
# set iteration order, so an actual run may differ):
# ['caar | cat | caat', 'caar | car', 'car', 'ant | ants']
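
The `Levenshtein` import above comes from a third-party package. If installing it is not an option, the edit distance can be computed in plain Python. The sketch below is a standard two-row dynamic-programming version; the function name `levenshtein` is an assumption of this example, not part of the answer.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between a and b.

    Counts single-character insertions, deletions, and substitutions.
    """
    # keep the shorter string in the inner loop so each row is small
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,          # delete ca
                current[j - 1] + 1,       # insert cb
                previous[j - 1] + cost,   # substitute ca with cb (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("car", "cat"))   # 1
print(levenshtein("caar", "cat"))  # 2
```

With a threshold of 1, "caar" and "cat" are therefore not direct neighbours, so which words end up grouped depends on which word is visited first.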
