March 7, 2023 05:58:19

Python: How to remove/discard a string from a set of strings using case-insensitive match?

Question

I have a case from Wikidata where the string "Articles containing video clips" shows up in a set of categories and needs to be removed. Trouble is, it also shows up in other sets as "articles containing video clips" (lowercase "a").

The simple/safe way to remove it seems to be

   setA.discard("Articles containing video clips")
   setA.discard("articles containing video clips")

Perfectly adequate, but doesn't scale in complex cases. Is there any way to do this differently, other than the obvious loop or list/set comprehension using, say, casefold for the comparison?

  unwantedString = 'Articles containing video clip'
  setA = {'tsunami', 'articles containing video clip'}

  reducedSetA = {nonmatch for nonmatch in setA
                 if nonmatch.casefold() != unwantedString.casefold()}

  print(reducedSetA)
  # {'tsunami'}

Note that this is not a string replacement situation - it is removal of a string from a set of strings.

Answer 1

Score: 0

The problem with implementing this using a set comprehension, as you do, is that it turns an O(1) operation (set.discard is a single hash lookup) into an O(N) one, since you need to check item.casefold() != unwantedString.casefold() for every item in the set.

One way to work around this is to keep a dictionary that maps each casefolded key to the set of original strings that share it. When you want to delete an element, look up all elements that have the same casefolded value and delete those too. You could write a class to handle this that would look like so:

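class EasyRemoveSet(set):
    """A set that removes every element sharing a key (by default, the casefolded string)."""

    def __init__(self, *args, key_func=str.casefold, **kwargs):
        super().__init__(*args, **kwargs)
        self.__key_func = key_func
        # Maps key_func(elem) -> set of elements in this set that share that key.
        self.__lookup = {}
        self.__add_to_lookup(self)

    def __add_to_lookup(self, elems):
        for elem in elems:
            self.__lookup.setdefault(self.__key_func(elem), set()).add(elem)

    def add(self, elem):
        super().add(elem)
        self.__add_to_lookup([elem])

    def remove(self, elem):
        # Raises KeyError if no element shares elem's key, mirroring set.remove.
        elems_to_remove = self.__lookup.pop(self.__key_func(elem))
        for e in elems_to_remove:
            super().remove(e)

    def discard(self, elem):
        # Silently does nothing if no element shares elem's key, mirroring set.discard.
        elems_to_remove = self.__lookup.pop(self.__key_func(elem), [])
        for e in elems_to_remove:
            super().discard(e)

    def clear(self):
        super().clear()
        self.__lookup = {}

    # Note: other mutating methods (update, pop, |=, ...) are not overridden
    # here, so using them would leave the lookup out of date.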

Then, you can do:

setA = EasyRemoveSet(["abc", "Abc", "def", "DeF", "ABC", "abC", "DEF", "abc"])
print(setA)  # EasyRemoveSet({'abc', 'DEF', 'DeF', 'ABC', 'abC', 'def', 'Abc'})

setA.remove("Abc")
print(setA)  # EasyRemoveSet({'DEF', 'DeF', 'def'})

The keyword-only argument key_func allows you to specify a callable whose return value will be used as the key to identify duplicates. For example, if you wanted to use this class for integers, and remove negative and positive integers in one go:

num_set = EasyRemoveSet([1, 2, 3, 4, 5, -1, -2, -3, -4, -5], key_func=abs)
print(num_set)
# EasyRemoveSet({1, 2, 3, 4, 5, -2, -5, -4, -3, -1})

num_set.discard(-5)
print(num_set)
# EasyRemoveSet({1, 2, 3, 4, -2, -4, -3, -1})
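
If subclassing set feels heavy, the same idea can be sketched with a plain dictionary index kept beside an ordinary set. The helper names build_index and discard_casefold are hypothetical, introduced only for illustration, and the sketch assumes the set is modified only through these helpers (otherwise the index goes stale).

```python
from collections import defaultdict


def build_index(strings, key_func=str.casefold):
    """Map each key (casefolded string by default) to the originals sharing it."""
    index = defaultdict(set)
    for s in strings:
        index[key_func(s)].add(s)
    return index


def discard_casefold(strings, index, unwanted, key_func=str.casefold):
    """Remove every case variant of `unwanted` from `strings`, in place."""
    # pop() with a default never raises, so this behaves like set.discard.
    variants = index.pop(key_func(unwanted), set())
    strings.difference_update(variants)


categories = {
    "tsunami",
    "Articles containing video clips",
    "articles containing video clips",
}
index = build_index(categories)

discard_casefold(categories, index, "ARTICLES CONTAINING VIDEO CLIPS")
print(categories)  # {'tsunami'}
```

Building the index costs one pass over the set; after that, each removal is a single dictionary lookup plus the removal of the matching variants.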

Answer 2

Score: 0

You can also use regex.

import re

unwantedStrings = {"Articles containing video clip", "asdf"}
setA = {"tsunami", "articles containing video clip", "asdf", "asdfasdf", "asdfasddf"}

# Remove the unwanted strings from the set. re.escape keeps any regex
# metacharacters in the unwanted strings literal.
regex = re.compile("|".join(map(lambda s: "^" + re.escape(s) + "$", unwantedStrings)), re.IGNORECASE)
reducedSetA = set(filter(lambda x: not regex.search(x), setA))

print(reducedSetA)
# {'tsunami', 'asdfasddf', 'asdfasdf'}

The above code removes only exact matches. If you also want to remove "asdfasdf" (and "asdfasddf") because they contain the unwanted string "asdf", change the regex line to this:

...
regex = re.compile("|".join(map(re.escape, unwantedStrings)), re.IGNORECASE)
...
# {'tsunami'}
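
One caveat: re.IGNORECASE is not quite the same as comparing casefold() results. When only exact, case-insensitive matches are needed, a casefolded set of the unwanted strings avoids regex entirely. The sketch below is an alternative, not part of the answer above; the name unwanted_keys and the "strasse"/"Straße" entries are added purely for illustration.

```python
import re

unwantedStrings = {"Articles containing video clip", "asdf", "strasse"}
setA = {"tsunami", "articles containing video clip", "asdf", "asdfasdf", "Straße"}

# Casefold the unwanted strings once; each membership test is then a hash lookup.
unwanted_keys = {s.casefold() for s in unwantedStrings}
reducedSetA = {s for s in setA if s.casefold() not in unwanted_keys}
print(reducedSetA)  # {'tsunami', 'asdfasdf'} (element order may vary)

# casefold() maps "ß" to "ss"; re.IGNORECASE does not perform this full case folding.
print("Straße".casefold() == "strasse")                  # True
print(re.fullmatch("strasse", "Straße", re.IGNORECASE))  # None
```

This still visits every element of setA once, but it handles any number of unwanted strings without building a pattern.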
