Python Docx using literal '?' in regex within quotes
Question
I am writing code using the Python docx library to find and highlight all quotes within the text. It works except when there is a question mark within the quote: it can find "How are you?", but on "Hello? How are you?" re.search returns None and location.span() below throws an AttributeError.
I tried fiddling with the regex, including “([\w?]*?)” and \?, but nothing seems to work. For example, my regex below passes the regex checker with "Hello? How are you?", but my program will not work.
document = Document(filepath)

def highlight_quotes(document):
    for paragraph in document.paragraphs:
        if matches := re.findall(r'“(.*?)”', paragraph.text):
            quotes = []
            for i in range(len(matches)):
                location = re.search(matches[i], paragraph.text)
                start_index, end_index = location.span()
                quotes.append((start_index, end_index))
Answer 1
Score: 1
The problem is that you're trying to use a portion of the text as a regular expression when you do re.search(matches[i], paragraph.text). But it's not a regular expression, and if it contains characters that have special meaning in regular expressions (e.g. ?) it will not match correctly, or may even raise an exception.
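To make the failure concrete, here is a minimal sketch that uses plain strings in place of paragraph.text (the sample sentences are made up for illustration). It shows why the first quote in the question appeared to work while the second returned None.

```python
import re

text = '“Hello? How are you?” she asked.'

# The captured quote text, exactly as re.findall returns it.
quote = re.findall(r'“(.*?)”', text)[0]

# Used as a pattern, "o?" means "an optional o" and "u?" means "an optional u",
# so the pattern never matches the literal question marks in the text.
broken = re.search(quote, text)
print(broken)  # None, so broken.span() would raise AttributeError

# A quote whose only "?" is at the end still matches, but the match stops
# short of the question mark because "u?" consumed only the "u".
partial = re.search('How are you?', '“How are you?”')
print(partial.group())
```

So the "working" case was already returning a span one character too short; it simply did not crash.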
You could use re.escape() to escape all the special characters, but this still wouldn't be correct. If there are any duplicate matches, it will only return the position of the first one.
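A small sketch of that duplicate problem, again with an invented sample string standing in for paragraph.text:

```python
import re

text = '“Hi?” she said. “Hi?” he replied.'
matches = re.findall(r'“(.*?)”', text)

# re.escape makes each "?" literal, so the search now succeeds...
spans = [re.search(re.escape(m), text).span() for m in matches]

# ...but re.search always scans from the start of the string,
# so both lookups report the position of the first "Hi?".
print(spans)
```

Both entries in spans are identical, so the second quote would never be highlighted.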
If you want to get the positions of the matches, use re.finditer(): instead of returning the texts, it returns match objects, from which you can get the spans of each match.
def highlight_quotes(document):
    for paragraph in document.paragraphs:
        quotes = [match.span() for match in re.finditer(r'“(.*?)”', paragraph.text)]
        # rest of code
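The same idea can be tried without a .docx file. The sketch below is an illustration only: quote_spans and QUOTE_PATTERN are hypothetical names, and the sample string stands in for paragraph.text.

```python
import re

# Same pattern as above, compiled once for reuse.
QUOTE_PATTERN = re.compile(r'“(.*?)”')

def quote_spans(text):
    """Return (start, end) offsets of every curly-quoted passage in text."""
    return [match.span() for match in QUOTE_PATTERN.finditer(text)]

sample = '“Hello? How are you?” she asked. “Hello? How are you?”'
spans = quote_spans(sample)

# Slicing with each span recovers the quoted passage, quote marks included.
quoted = [sample[start:end] for start, end in spans]
print(spans)
print(quoted)
```

Each occurrence gets its own span, even though the two quotes are identical and contain question marks. If the highlight should exclude the quotation marks themselves, match.span(1) gives the offsets of the captured group instead of the whole match.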