Python Docx using literal '?' in regex within quotes

Question

I am writing code using the python-docx library to find and highlight all quotes in the text. It works unless there is a question mark inside the quote: it finds "How are you?", but for "Hello? How are you?" re.search() returns None, and the location.span() call below raises an AttributeError.

I tried adjusting the regex, including “([\w?]*?)” and adding \? in a few attempts, but nothing seems to work. The regex below passes a regex checker with "Hello? How are you?", for example, but my program does not work.

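import re
from docx import Document

document = Document(filepath)  # filepath: path to the .docx file

def highlight_quotes(document):
    for paragraph in document.paragraphs:
        # Capture the text between each pair of curly quotes
        if matches := re.findall(r'“(.*?)”', paragraph.text):
            quotes = []
            for i in range(len(matches)):
                # The quote text itself is used as a regex pattern here
                location = re.search(matches[i], paragraph.text)
                # Raises AttributeError when location is None
                start_index, end_index = location.span()
                quotes.append((start_index, end_index))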

Answer 1

Score: 1

The problem is that you're trying to use a portion of the text as a regular expression when you do re.search(matches[i], paragraph.text). But it's not a regular expression, and if it contains characters that have special meaning in regular expressions (e.g. ?) it will not match correctly, or may even raise an exception.
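A small sketch of that failure, using only the standard re module. The sample sentence and variable names are made up for illustration; the quote strings are the ones from the question.

```python
import re

text = 'She said “Hello? How are you?” and left.'

# As a pattern, "o?" and "u?" mean "optional o" and "optional u",
# so nothing in the pattern can match the literal "?" after "Hello".
whole_quote = re.search('Hello? How are you?', text)
print(whole_quote)  # None

# The shorter quote does match, but only because the final "u" is optional;
# the matched text stops before the question mark.
short_quote = re.search('How are you?', text)
print(short_quote.group())  # How are you

# Other characters can make the pattern invalid altogether.
try:
    re.search('Wait (what', text)
except re.error as exc:
    print('invalid pattern:', exc)
```

This is why the shorter quote appeared to work in the question: the search succeeded, but the span it returned left out the trailing question mark.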

You could use re.escape() to escape all the special characters, but this still wouldn't be correct. If the same quote appears more than once in a paragraph, re.search() returns the position of the first occurrence every time.
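To illustrate the duplicate problem, here is a short sketch with an invented sentence that contains the same quote twice. With re.escape() the search no longer fails, yet both quotes map to the same span.

```python
import re

text = '“Hello?” he called. “Hello?” she echoed.'

# findall returns only the captured text of each quote
matches = re.findall(r'“(.*?)”', text)

# Escaping makes "?" literal, but each search starts again from the
# beginning of the text and stops at the first occurrence.
spans = [re.search(re.escape(m), text).span() for m in matches]

print(matches)
print(spans)  # both entries are the same span
```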

If you want to get the positions of the matches, use re.finditer() -- instead of returning the texts, it returns match objects, from which you can get the spans of each match.

import re

def highlight_quotes(document):
    for paragraph in document.paragraphs:
        # finditer yields match objects, so each span comes from its own match
        quotes = [match.span() for match in re.finditer(r'“(.*?)”', paragraph.text)]
        # rest of code
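The same idea can be tried on a plain string, without a document. This is a minimal sketch: the helper names find_quote_spans and find_inner_spans are hypothetical and not part of python-docx. Note that match.span() covers the quotation marks as well, while match.span(1) covers only the captured text between them.

```python
import re

QUOTE_PATTERN = re.compile(r'“(.*?)”')

def find_quote_spans(text):
    """Return (start, end) offsets of each quoted passage, quote marks included."""
    return [m.span() for m in QUOTE_PATTERN.finditer(text)]

def find_inner_spans(text):
    """Return (start, end) offsets of the text between the quote marks."""
    return [m.span(1) for m in QUOTE_PATTERN.finditer(text)]

text = '“Hello?” he called. “Hello?” she echoed.'

for start, end in find_quote_spans(text):
    print(start, end, text[start:end])

for start, end in find_inner_spans(text):
    print(start, end, text[start:end])
```

Each quote gets its own span here, including the repeated one, and question marks need no special handling because the quote text is never reused as a pattern.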

huangapple
  • Posted on June 29, 2023, 01:04:46
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76575320.html