Question

I am writing code using the Python docx library to find and highlight all quotes within the text. It works unless there is a question mark inside the quote: it can find "How are you?", but on "Hello? How are you?" the search returns None and location.span() below raises an AttributeError.

I tried fiddling with the regex, including “([\w?]*?)” and adding \? in some attempts, but nothing seems to work. For example, my regex below passes the regex checker with "Hello? How are you?", but my program will not work.

import re
from docx import Document

document = Document(filepath)
def highlight_quotes(document):
    for paragraph in document.paragraphs:
        if matches := re.findall(r'“(.*?)”', paragraph.text):
            quotes = []
            for i in range(len(matches)):
                # The quote text is used as a regex pattern here, so "?" is not literal
                location = re.search(matches[i], paragraph.text)
                # location is None for "Hello? How are you?", so .span() fails
                start_index, end_index = location.span()
                quotes.append((start_index, end_index))

Answer 1

Score: 1

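The problem is that re.search(matches[i], paragraph.text) uses a portion of the text as a regular expression. But it's not a regular expression, and if it contains characters that have special meaning in regular expressions (e.g. ?) it will not match correctly, or may even raise an exception. In "Hello? How are you?", the first ? makes the preceding o optional instead of matching a literal question mark, so the pattern finds nothing, re.search() returns None, and location.span() raises AttributeError. "How are you?" only appears to work: the pattern matches "How are you" without the final question mark.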
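A minimal sketch of the failure, using a plain string in place of paragraph.text (the sample sentence is made up for illustration):

```python
import re

text = '“Hello? How are you?” she asked.'
quote = "Hello? How are you?"

# Used as a pattern, "o?" means "an optional o", so the regex expects a
# space right after "Hello" (or "Hell"); the text has a "?" there instead.
result = re.search(quote, text)
print(result)  # None

# The shorter quote seems to work, but "u?" is just an optional "u":
# the match stops before the literal question mark.
short_match = re.search("How are you?", text)
print(short_match.group())  # How are you
```

Calling .span() on the first result is what produces the AttributeError from the question.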

You could use re.escape() to escape all the special characters, but this still wouldn't be correct. If there are any duplicate matches, it will only return the position of the first one.
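To illustrate the duplicate problem, here is a small sketch on a made-up sentence that contains the same quote twice:

```python
import re

text = '“Hi?” he said. “Hi?” she replied.'
quotes = re.findall(r'“(.*?)”', text)

# re.escape() makes "?" literal, so the search no longer fails,
# but re.search() always reports the first occurrence in the text.
spans = [re.search(re.escape(q), text).span() for q in quotes]
print(spans)  # [(1, 4), (1, 4)]
```

Both quotes get the span of the first occurrence, so the second quote would never be highlighted.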

If you want to get the positions of the matches, use re.finditer(). Instead of returning the texts, it returns match objects, from which you can get the span of each match.

import re

def highlight_quotes(document):
    for paragraph in document.paragraphs:
        # One (start, end) tuple per quote, including the quotation marks
        quotes = [match.span() for match in re.finditer(r'“(.*?)”', paragraph.text)]
        # rest of code
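The same idea can be tried without a .docx file. The sketch below works on a plain string; the helper name quote_spans and its include_marks parameter are hypothetical, added only for illustration. Note that match.span() covers the quotation marks as well, while match.span(1) covers only the text between them, which is what the original re.search() approach was returning.

```python
import re

QUOTE_RE = re.compile(r'“(.*?)”')

def quote_spans(text, include_marks=True):
    """Return (start, end) for every curly-quoted passage in text."""
    # Group 0 is the whole match; group 1 is the text inside the marks.
    group = 0 if include_marks else 1
    return [m.span(group) for m in QUOTE_RE.finditer(text)]

text = '“Hello? How are you?” “Hi?”'
print(quote_spans(text))                       # [(0, 21), (22, 27)]
print(quote_spans(text, include_marks=False))  # [(1, 20), (23, 26)]
```

Because the spans come directly from the match objects, question marks and repeated quotes need no special handling.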
