Trying to remove double based on condition from list of dictionaries
Question
I have this list of dictionaries:
list_dict = [
{'title':'abc defg hij', 'situation':'other'},
{'title':'c defg', 'situation':'other'},
{'title':'defg hij', 'situation':'other'},
{'title':'defg hij', 'situation':'deleted'}]
I'm trying to remove every dictionary whose title shares recurring elements with another title AND has the same situation, keeping only the one with the longest string in the title key.
The desired output would be as follows:
[{'title':'abc defg hij', 'situation':'other'},
{'title':'defg hij', 'situation':'deleted'}]
Answer 1
Score: 1
I'm assuming that by "has some recurring elements in the title", you mean "is a substring of any other title" (within a given situation).
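To make that assumption concrete, here is a small sketch of the substring test applied to the three 'other' titles from the question. The helper name is_substring_of_other is hypothetical and only serves this illustration; it relies on Python's `in` operator, which checks for a contiguous substring.

```python
def is_substring_of_other(title, titles):
    """Return True if `title` is contained in some other, different title."""
    return any(title != other and title in other for other in titles)

# Titles from the question that share the situation 'other'
titles = ['abc defg hij', 'c defg', 'defg hij']

for title in titles:
    print(repr(title), is_substring_of_other(title, titles))
# 'abc defg hij' False
# 'c defg' True
# 'defg hij' True
```

Under this reading, only 'abc defg hij' survives among the 'other' entries, which matches the desired output.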
I'm assuming also that you're dealing with relatively small datasets so you won't be concerned with a quadratic algorithm for eliminating redundant strings. Nothing fancy – just construct a set of compatible strings adding one string at a time, checking for substrings:
def find_distinct_strs(all_strs):
    distinct_strs = set()
    for new_str in all_strs:
        # Iterate over a copy, because the set may be modified inside the loop
        for existing_str in list(distinct_strs):
            if new_str in existing_str:
                # new_str is redundant, go to next
                break
            elif existing_str in new_str:
                # new_str supersedes existing_str
                distinct_strs.remove(existing_str)
        else:
            # No existing string contains new_str, so keep it
            distinct_strs.add(new_str)
    return list(distinct_strs)
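If a deterministic result order matters (a set does not guarantee one), a possible variant is to sort the strings by length first. This is a sketch of a different technique, not the function above: the name find_distinct_strs_sorted is hypothetical, and it is still quadratic in the worst case.

```python
def find_distinct_strs_sorted(all_strs):
    """Keep only strings that are not contained in an already kept string."""
    kept = []
    # Longest first: a string can only be contained in one at least as long,
    # so every possible container has been seen before its substrings.
    for candidate in sorted(all_strs, key=len, reverse=True):
        if not any(candidate in longer for longer in kept):
            kept.append(candidate)
    return kept

print(find_distinct_strs_sorted(['abc defg hij', 'c defg', 'defg hij']))
# ['abc defg hij']
```

The result is ordered from longest to shortest, and exact duplicates are dropped because a string is always a substring of itself.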
You can then group all the entries by situation, find the distinct titles, and construct a suitably thinned list:
from itertools import groupby

def filter_list_dict(list_dict):
    # groupby only merges consecutive entries with the same key, so
    # list_dict is assumed to be already ordered by situation
    return [
        dict(title=title, situation=situation)
        for situation, entries in groupby(list_dict, lambda entry: entry["situation"])
        for title in find_distinct_strs(entry["title"] for entry in entries)
    ]
Test the output:
> list_dict = [
{'title':'abc defg hij', 'situation':'other'},
{'title':'c defg', 'situation':'other'},
{'title':'defg hij', 'situation':'other'},
{'title':'defg hij', 'situation':'deleted'}
]
> print(filter_list_dict(list_dict))
[{'title': 'abc defg hij', 'situation': 'other'},
{'title': 'defg hij', 'situation': 'deleted'}]