How to fix this regex to properly match any string starting with two ##

Question

I want to match all strings that start with exactly two # characters (##) and do some substitution. That means a string starting with more than two, say ###, should not match, and a string starting with just one # should not match either.

import re
text = '''
# some one string
Describe your writing briefly here, what ihow many people are you looking for?
## some section two string
Describe your writing briefly here, what ihow many people are you looking for?Describe your writing briefly here, what ihow many people are you looking for?
Describe your writing briefly here, what ihow many people are you looking for?
## some other section two string with question sign?
Describe your writing briefly here, what ihow many people are you looking for? containing all keyword arguments except for those corresponding to a formal parameter. This may be combined with a formal parameter of the form *name (described in the next subsection) which receives a tuple containing the positional arguments beyond the formal parameter list. (*name must occur before **name.) For example, if we define a function like this
## some other section with . and : colon
Describe your writing briefly here, what ihow many people are you looking for?Describe your writing briefly here, what ihow many people are you looking for?
'''
pattern = r"##(.+?.*)"
list_with_sections_ = list(dict.fromkeys(re.findall(pattern, text)))
print(list_with_sections_)
if list_with_sections_:
    for item in list_with_sections_:
        text = re.sub(item, f'<a href="#" class="section-header title" id="{item.replace(" ", "-").strip()}_">{item}</a>', text)
print(text)

This seems to work, but re.sub behaves inconsistently when a string ends with a question mark or contains some other special character. For instance, when a match ends with a question mark (?), re.sub leaves an additional ? after the <a> tag.

Output when I run the above: [screenshot not preserved]

Answer 1

Score: 2

This issue is caused by how the '?' character is treated in regex.
Here: text = re.sub(item, f'<a href="#" class="section-header title..." you treat item (which is essentially a part of the input text and may contain a '?' character) as a regex pattern. But '?' has a special meaning in regex patterns (it makes the preceding character optional), so you end up matching the relevant piece of text without the ? at the end, and that ? is left behind after the substitution.
You can address this by escaping the special characters in item like this: text = re.sub(re.escape(item), f'<a href="#" class="section-header title..."
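To make the effect concrete, here is a minimal sketch on a shortened sample text (the sample text, the `<X>` placeholder, and the names `HEADING` and `to_anchor` are illustrative assumptions, not part of the original code). The first part compares the unescaped and escaped substitutions. The second part shows one possible alternative: a single `re.sub` pass with a pattern anchored at the start of the line and a negative lookahead, so that only headings with exactly two `#` are rewritten. The `id` format in that part is simplified compared to the question's code.

```python
import re

text = """# some one string
## some section two string
body text?
## section with question sign?
### a level-three heading
"""

# Part 1: why the stray "?" appears.
# Used as a pattern, the trailing "?" makes the final "n" optional, so the
# match stops before the literal "?" and that "?" stays in the text.
item = " section with question sign?"
unescaped = re.sub(item, "<X>", text)
escaped = re.sub(re.escape(item), "<X>", text)

# Part 2 (assumed alternative): match the heading line directly.
# ^##      two hashes at the start of a line (re.MULTILINE)
# (?!#)    not followed by a third hash
# [ \t]*   optional spacing, then the title up to the end of the line
HEADING = re.compile(r"^##(?!#)[ \t]*(.+)$", re.MULTILINE)

def to_anchor(match):
    """Build the replacement; a function return value is inserted literally."""
    title = match.group(1).strip()
    slug = title.replace(" ", "-")  # simplified id, an assumption
    return f'## <a href="#" class="section-header title" id="{slug}">{title}</a>'

result = HEADING.sub(to_anchor, text)

print(unescaped)
print(escaped)
print(result)
```

Because the replacement in the second part comes from a function, the title is never interpreted as a regex or as a replacement template, so no escaping is needed there. Titles containing HTML-special characters would still need `html.escape` before being placed in the tag.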

huangapple
  • Posted on March 7, 2023 at 22:10:01
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75663063.html