Trying to remove duplicates from a list of dictionaries based on a condition

Question

I have this list of dictionaries:

list_dict = [
    {'title':'abc defg hij', 'situation':'other'},
    {'title':'c defg', 'situation':'other'},
    {'title':'defg hij', 'situation':'other'},
    {'title':'defg hij', 'situation':'deleted'}]

I'm trying to remove every dictionary whose title shares recurring elements with another title AND has the same situation, keeping only the one with the longest string in the title key.

The desired output would be as follows:

[{'title':'abc defg hij', 'situation':'other'},
 {'title':'defg hij', 'situation':'deleted'}]

Answer 1

Score: 1

I'm assuming that by "has some recurring elements in the title", you mean "is a substring of any other title" (within a given situation).
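
To make that reading concrete, here is a minimal sketch that applies Python's `in` operator to the titles from the question. The helper name `is_redundant` is hypothetical and only illustrates the assumed interpretation; it is not part of the solution below.

```python
def is_redundant(title, other_titles):
    """Return True if title is a substring of any *different* title."""
    return any(title != other and title in other for other in other_titles)

# Titles that share the situation 'other' in the question's data
titles_other = ['abc defg hij', 'c defg', 'defg hij']

for title in titles_other:
    print(repr(title), is_redundant(title, titles_other))
```

Under this interpretation only 'abc defg hij' survives in the 'other' group, because both shorter titles occur inside it.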

I'm assuming also that you're dealing with relatively small datasets so you won't be concerned with a quadratic algorithm for eliminating redundant strings. Nothing fancy – just construct a set of compatible strings adding one string at a time, checking for substrings:

def find_distinct_strs(all_strs):
    distinct_strs = set()

    for new_str in all_strs:
        # new_str is redundant if a kept string already contains it
        if any(new_str in existing_str for existing_str in distinct_strs):
            continue
        # new_str supersedes every kept string it contains; rebuild the set
        # rather than removing from it while iterating
        distinct_strs = {s for s in distinct_strs if s not in new_str}
        distinct_strs.add(new_str)

    return list(distinct_strs)

You can then group all the entries by situation, find the distinct titles, and construct a suitably thinned list:

from itertools import groupby

def filter_list_dict(list_dict):
    # groupby only merges adjacent entries, so this assumes list_dict is
    # already grouped by situation, as in the example data
    return [
        dict(title=title, situation=situation)
        for situation, entries in groupby(list_dict, lambda entry: entry["situation"])
        for title in find_distinct_strs(entry["title"] for entry in entries)
    ]

Test the output:

> list_dict = [
    {'title':'abc defg hij', 'situation':'other'},
    {'title':'c defg', 'situation':'other'},
    {'title':'defg hij', 'situation':'other'},
    {'title':'defg hij', 'situation':'deleted'}
]
> print(filter_list_dict(list_dict))
[{'title': 'abc defg hij', 'situation': 'other'},
   {'title': 'defg hij', 'situation': 'deleted'}]
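
One caveat: `groupby` only merges adjacent entries, so the approach above relies on entries with the same situation sitting next to each other in the list. As a sketch of an order-independent variant, the entries can first be bucketed in a dictionary, and the titles processed longest-first so that each one only needs to be checked against titles already kept. The names `keep_longest_titles` and `dedupe_by_situation` are hypothetical, and this is an alternative under the same substring assumption, not the original answer's method.

```python
from collections import defaultdict

def keep_longest_titles(titles):
    """Keep only titles that are not substrings of an already kept title."""
    kept = []
    # dict.fromkeys drops exact duplicates while preserving input order.
    # Longest first: a title can only be contained in one at least as long.
    for title in sorted(dict.fromkeys(titles), key=len, reverse=True):
        if not any(title in longer for longer in kept):
            kept.append(title)
    return kept

def dedupe_by_situation(list_dict):
    # Bucket titles by situation; dicts keep first-seen key order (Python 3.7+)
    titles_by_situation = defaultdict(list)
    for entry in list_dict:
        titles_by_situation[entry['situation']].append(entry['title'])
    return [
        {'title': title, 'situation': situation}
        for situation, titles in titles_by_situation.items()
        for title in keep_longest_titles(titles)
    ]

# Same data as the question, shuffled so equal situations are not adjacent
shuffled = [
    {'title': 'defg hij', 'situation': 'other'},
    {'title': 'defg hij', 'situation': 'deleted'},
    {'title': 'c defg', 'situation': 'other'},
    {'title': 'abc defg hij', 'situation': 'other'},
]
print(dedupe_by_situation(shuffled))
```

This is still quadratic in the number of titles per situation, which matches the small-dataset assumption stated earlier.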

huangapple
  • Published on April 11, 2023 at 03:21:03
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75980051.html