Python: How to remove/discard a string from a set of strings using case-insensitive match?

Question

I have a case from Wikidata where the string "Articles containing video clips" shows up in a set of categories and needs to be removed. The trouble is that it also shows up in other sets as "articles containing video clips" (lowercase "a").

The simple/safe way to remove it seems to be

   setA.discard("Articles containing video clips")
   setA.discard("articles containing video clips")

Perfectly adequate, but doesn't scale in complex cases. Is there any way to do this differently, other than the obvious loop or list/set comprehension using, say, casefold for the comparison?

  unwantedString = 'Articles containing video clip'
  setA = {'tsunami', 'articles containing video clip'}

  reducedSetA = {nonmatch for nonmatch in setA
                 if nonmatch.casefold() != unwantedString.casefold()}

  print(reducedSetA)
  # {'tsunami'}

Note that this is not a string replacement situation - it is removal of a string from a set of strings.

Answer 1

Score: 0

The problem with a set comprehension like yours is that it turns an O(1) removal into an O(N) operation, since you have to evaluate item.casefold() != unwantedString.casefold() for every item in the set.

One way around this is to keep a dictionary that maps each casefolded key to the set of original strings sharing that key. When you want to delete an element, look up every element with the same casefolded value and delete those as well. You could write a class to handle this, along these lines:

class EasyRemoveSet(set):
    """A set whose remove()/discard() drop every element sharing a key."""

    def __init__(self, *args, key_func=str.casefold, **kwargs):
        super().__init__(*args, **kwargs)
        self.__key_func = key_func
        # Maps key_func(elem) -> set of stored elements with that key.
        self.__lookup = {}
        self.__add_to_lookup(self)

    def __add_to_lookup(self, elems):
        for elem in elems:
            self.__lookup.setdefault(self.__key_func(elem), set()).add(elem)

    def add(self, elem):
        super().add(elem)
        self.__add_to_lookup([elem])

    def remove(self, elem):
        # Raises KeyError if no stored element shares elem's key.
        elems_to_remove = self.__lookup.pop(self.__key_func(elem))
        for e in elems_to_remove:
            super().remove(e)

    def discard(self, elem):
        # Like remove(), but silently does nothing when the key is absent.
        elems_to_remove = self.__lookup.pop(self.__key_func(elem), [])
        for e in elems_to_remove:
            super().discard(e)

    def clear(self):
        super().clear()
        self.__lookup = {}

Then, you can do:

setA = EasyRemoveSet(["abc", "Abc", "def", "DeF", "ABC", "abC", "DEF", "abc"])
print(setA) # EasyRemoveSet({'abc', 'DEF', 'DeF', 'ABC', 'abC', 'def', 'Abc'})

setA.remove("Abc")
print(setA) # EasyRemoveSet({'DEF', 'DeF', 'def'})

The keyword-only argument key_func allows you to specify a callable whose return value will be used as the key to identify duplicates. For example, if you wanted to use this class for integers, and remove negative and positive integers in one go:

num_set = EasyRemoveSet([1, 2, 3, 4, 5, -1, -2, -3, -4, -5], key_func=abs)
print(num_set)
# EasyRemoveSet({1, 2, 3, 4, 5, -2, -5, -4, -3, -1})

num_set.discard(-5)
print(num_set)
# EasyRemoveSet({1, 2, 3, 4, -2, -4, -3, -1})
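
One caveat with subclassing set is that only the overridden methods keep the lookup in sync; other mutators such as update() or |= would bypass it. As a rough sketch of one way around that (not part of the class above), the same idea can be built on collections.abc.MutableSet, which derives the in-place operators from add() and discard(). The name KeyedSet and its attributes are hypothetical, chosen only for this illustration.

```python
from collections.abc import MutableSet


class KeyedSet(MutableSet):
    """Set-like container where discard() removes every element sharing a key."""

    def __init__(self, iterable=(), key_func=str.casefold):
        self._key_func = key_func
        self._groups = {}  # key -> set of original elements with that key
        for elem in iterable:
            self.add(elem)

    def __contains__(self, elem):
        # Exact membership: the element itself must be stored.
        return elem in self._groups.get(self._key_func(elem), ())

    def __iter__(self):
        for group in self._groups.values():
            yield from group

    def __len__(self):
        return sum(len(group) for group in self._groups.values())

    def add(self, elem):
        self._groups.setdefault(self._key_func(elem), set()).add(elem)

    def discard(self, elem):
        # Dropping the whole group is a single dictionary operation.
        self._groups.pop(self._key_func(elem), None)


s = KeyedSet(["abc", "Abc", "def"])
s |= {"ABC", "DEF"}   # |= is provided by MutableSet and goes through add()
s.discard("aBc")      # removes "abc", "Abc" and "ABC" together
print(sorted(s))      # ['DEF', 'def']
```

Note that the inherited remove() checks exact membership first, so in this sketch discard() is the method to use for case-insensitive removal.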

Answer 2

Score: 0

You can also use regex.

import re

unwantedStrings = {"Articles containing video clip", "asdf"}
setA = {"tsunami", "articles containing video clip", "asdf", "asdfasdf", "asdfasddf"}

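# Remove the unwanted strings from the set. re.escape() keeps any regex
# metacharacters in the unwanted strings from being treated as pattern syntax.
regex = re.compile("|".join("^" + re.escape(s) + "$" for s in unwantedStrings), re.IGNORECASE)
reducedSetA = {x for x in setA if not regex.search(x)}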

print(reducedSetA)
# {'tsunami', 'asdfasddf', 'asdfasdf'}

The code above removes only exact matches. If you also want to remove "asdfasdf" because it contains the unwanted string "asdf", change the regex line to this:

...
regex = re.compile("|".join(map(re.escape, unwantedStrings)), re.IGNORECASE)
...
# {'tsunami'}
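
For exact matches only, a regex may not be needed at all. The sketch below is one possible alternative, assuming a hypothetical helper named remove_casefolded: it casefolds the unwanted strings once into a set and then filters in a single pass, so each membership test is a hash lookup. It also relies on str.casefold(), which handles some cases that simple case-insensitive matching may not, such as "ß" versus "ss".

```python
def remove_casefolded(strings, unwanted):
    """Return a new set without any string that casefold-matches an unwanted one."""
    # Casefold the unwanted strings once so each check below is a set lookup.
    unwanted_keys = {s.casefold() for s in unwanted}
    return {s for s in strings if s.casefold() not in unwanted_keys}


unwanted = {"Articles containing video clip", "asdf"}
set_a = {"tsunami", "articles containing video clip", "asdf", "asdfasdf", "asdfasddf"}

reduced = remove_casefolded(set_a, unwanted)
print(sorted(reduced))  # ['asdfasddf', 'asdfasdf', 'tsunami']
```

This is still O(N) in the size of the set being filtered, but it handles any number of unwanted strings in the same pass.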

huangapple
  • Published on March 7, 2023, 05:58:19
  • When reposting, be sure to keep the link to this article: https://go.coder-hub.com/75656211.html