Why is this python regular expression not ignoring accents?

Question

I am using the following regular expression as a filter in an application that connects to a MongoDB database:

{"$regex": re.compile(r'\b' + re.escape(value) + r'\b', re.IGNORECASE | re.UNICODE)}

The regular expression meets my search criteria, but it has one problem: it does not ignore accents. For example:

The database entry is: "Escobar, el patrón del mal Colombia historia".

And I search for "El patron".

I get no results because the accent on the letter o prevents the record from matching. How can I fix this? I thought the re.UNICODE flag would make the search ignore accents.

Answer 1

Score: 2

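Because o and ó are different characters. re.UNICODE does not do what you think it does: it only makes \w, \W, \b, \B, \d, \D, \s and \S follow Unicode rules instead of ASCII-only rules, and it is already the default for str patterns in Python 3. It never treats an accented letter as equal to its unaccented form. You can read about it here: https://docs.python.org/3/library/re.html#re.ASCII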
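To make this concrete, here is a small check using the entry and search term from the question. It is only an illustration; the variable names are mine and not part of the original code.

```python
import re

entry = "Escobar, el patrón del mal Colombia historia"
value = "El patron"

# Same pattern as in the question, with and without re.UNICODE.
with_flag = re.compile(r'\b' + re.escape(value) + r'\b', re.IGNORECASE | re.UNICODE)
without_flag = re.compile(r'\b' + re.escape(value) + r'\b', re.IGNORECASE)

# Neither pattern matches: "o" (U+006F) and "ó" (U+00F3) are different code points.
match_with = with_flag.search(entry)
match_without = without_flag.search(entry)

# Searching with the accent included does match, case-insensitively.
accented = re.compile(r'\b' + re.escape("El patrón") + r'\b', re.IGNORECASE)
accented_match = accented.search(entry)

print(match_with, match_without, accented_match.group())
```

The flag makes no difference to the result here, which is consistent with it being the default for str patterns.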

You can solve this by preprocessing the strings, converting all such characters to their ASCII counterparts, before searching with the regex. See: https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-normalize-in-a-python-unicode-string
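A minimal sketch of that preprocessing, assuming both the search term and the text being searched are available in Python. The helper names strip_accents and accent_insensitive_search are hypothetical; the approach (NFKD decomposition, then dropping combining marks) is one common way to remove accents, not the only one.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks, e.g. 'ó' -> 'o'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_insensitive_search(value: str, text: str):
    """Hypothetical helper: whole-word, case- and accent-insensitive search."""
    pattern = re.compile(r'\b' + re.escape(strip_accents(value)) + r'\b', re.IGNORECASE)
    # Both sides must be normalized, otherwise 'patron' still cannot match 'patrón'.
    return pattern.search(strip_accents(text))


entry = "Escobar, el patrón del mal Colombia historia"
match = accent_insensitive_search("El patron", entry)
print(match.group())
```

Note that with a MongoDB $regex filter the pattern is evaluated against the stored values, so stripping accents from the search term alone would not be enough. One option, stated here as an assumption rather than something from the original answer, is to store an accent-stripped copy of the field alongside the original and run the filter against that copy.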

huangapple
  • Published on June 1, 2023, 00:44:15
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76375697.html