Why is this Python regular expression not ignoring accents?
Question
I am using the following regular expression as a filter in an application that connects to a MongoDB database:
{"$regex": re.compile(r'\b' + re.escape(value) + r'\b', re.IGNORECASE | re.UNICODE)}
The regular expression meets my search criteria, but it does not ignore accents. For example:
The database entry is: "Escobar, el patrón del mal Colombia historia".
And I search for "El patron".
I do not get any results because the accent on the letter "ó" prevents the record from matching. How can I fix this? I thought the re.UNICODE flag would make the search ignore accents.
Answer 1
Score: 2
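Because o and ó are different characters. re.UNICODE does not do what you think it does: it only makes classes such as \w and \b follow Unicode rules (already the default for str patterns in Python 3), and it does not fold accented letters onto their base letters. You can read about it here: https://docs.python.org/3/library/re.html#re.ASCII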
You can solve this by preprocessing the strings, converting accented characters to their ASCII counterparts, before searching with the regex. See: https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-normalize-in-a-python-unicode-string
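As a rough sketch of that preprocessing step, the snippet below decomposes each string with unicodedata.normalize and drops the combining marks before applying the same word-boundary regex from the question. The helper names strip_accents and matches are hypothetical, chosen only for this illustration, and the matching here runs locally in Python rather than inside MongoDB.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    """Hypothetical helper: drop combining marks, e.g. 'patrón' -> 'patron'."""
    # NFD splits 'ó' into 'o' followed by a combining acute accent (U+0301).
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is non-zero for combining marks, so those are skipped.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(value: str, entry: str) -> bool:
    """Hypothetical helper: accent-insensitive whole-word search of value in entry."""
    pattern = re.compile(
        r"\b" + re.escape(strip_accents(value)) + r"\b",
        re.IGNORECASE,
    )
    # Both sides must be normalized, otherwise 'patron' still cannot match 'patrón'.
    return pattern.search(strip_accents(entry)) is not None


entry = "Escobar, el patrón del mal Colombia historia"
print(matches("El patron", entry))
print(matches("El patrón", entry))
```

Note that a $regex filter is evaluated by the MongoDB server against the stored text, so normalizing only the search term is not enough. One option, stated here as an assumption about your schema rather than a requirement, is to store an accent-stripped copy of the text in an extra field and run the regex against that field.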