Python regex pattern building

Question

I'm trying to incrementally build the following regex pattern in Python from reusable pattern components. I expected the pattern p to match all of the text in lines, but it ends up matching only the first line.

import re
nbr = re.compile(r'\d+')
string = re.compile(r'(\w+[ \t]+)*(\w+)')
p1 = re.compile(rf"{string.pattern}\s+{nbr.pattern}\s+{string.pattern}")
p2 = re.compile(rf"{nbr.pattern}\s+{string.pattern}")
p1orp2 = re.compile(rf"{p1.pattern}|{p2.pattern}")
p = re.compile(rf"({p1orp2.pattern}\n)+")

lines = (f"aaaa 100284 aaaa\n"
         f"aaaa 365870 bbbb\n"
         f"757166 cccc\n"
         f"111054 cccc\n"
         f"999657 dddd\n"
         f"999 eeee\n"
         f"2955 ffff\n")

match = p.search(lines)
print(match)
print(match.group(0))

Here's what gets printed:
<re.Match object; span=(0, 16), match='aaaa 100284 aaaa'>
aaaa 100284 aaaa

Answer 1

Score: 0

One issue with the pattern is that a repeated capturing group keeps only its last repetition: in string, the group (\w+[ \t]+)* ends up holding just the last "word plus whitespace" chunk it matched, not the whole sequence of words separated by spaces or tabs.

To avoid this, replace string with a pattern that matches the whole word sequence using a non-capturing group. Here's an updated version of your code:

import re

# A sequence of words separated by spaces or tabs (non-capturing repetition)
word_sequence = re.compile(r"\w+(?:[ \t]+\w+)*")
nbr = re.compile(r"\d+")
p1 = re.compile(rf"{word_sequence.pattern}\s+{nbr.pattern}\s+{word_sequence.pattern}")
p2 = re.compile(rf"{nbr.pattern}\s+{word_sequence.pattern}")
# Group the alternation so that the \n appended in p applies to both options
p1orp2 = re.compile(rf"(?:{p1.pattern}|{p2.pattern})")
p = re.compile(rf"({p1orp2.pattern}\n)+")

lines = (
    f"aaaa 1284 aaaa\n"
    f"aaaa 3650 bbbb\n"
    f"75071 cccc\n"
    f"111872214054 cccc\n"
    f"999 dddd\n"
    f"999 eeee\n"
    f"295255 ffff\n"
)

match = p.search(lines)
print(match)
print(match.group(0))
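
To illustrate the point about repeated groups, here is a minimal sketch (the sample text and variable names are made up for illustration). It shows that a repeated capturing group reports only its last repetition, while the non-capturing form matches the same text without creating any groups. Note that this only affects what the groups report; it does not by itself change how | binds when the pattern is embedded in a larger one.

```python
import re

text = "alpha beta gamma"  # hypothetical sample input

# Capturing version: group 1 is overwritten on every repetition,
# so only the last "word + whitespace" chunk survives.
capturing = re.compile(r"(\w+[ \t]+)*(\w+)")
m = capturing.fullmatch(text)
print(repr(m.group(0)))  # 'alpha beta gamma'
print(repr(m.group(1)))  # 'beta '
print(repr(m.group(2)))  # 'gamma'

# Non-capturing version: same text matched, no groups created.
non_capturing = re.compile(r"\w+(?:[ \t]+\w+)*")
m2 = non_capturing.fullmatch(text)
print(repr(m2.group(0)))     # 'alpha beta gamma'
print(non_capturing.groups)  # 0
```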

Answer 2

Score: 0

The problem is here:

p1orp2 = re.compile(rf"{p1.pattern}|{p2.pattern}")
p = re.compile(rf"({p1orp2.pattern}\n)+")

In p, the \n is appended to p1orp2, but p1orp2 contains a top-level |, so the added \n belongs only to the second option, not to the first. It is the same as if you had attached that \n in the definition of p1orp2:

p1orp2 = re.compile(rf"{p1.pattern}|{p2.pattern}\n")
p = re.compile(rf"({p1orp2.pattern})+")

...while you really want to allow the p1 pattern to be followed by \n as well:

p1orp2 = re.compile(rf"{p1.pattern}\n|{p2.pattern}\n")
p = re.compile(rf"({p1orp2.pattern})+")

To achieve that while keeping the \n where it was, you could use parentheses in the definition of p1orp2 to limit the scope of the | operator:

p1orp2 = re.compile(rf"({p1.pattern}|{p2.pattern})")
p = re.compile(rf"({p1orp2.pattern}\n)+")

With this change it will work as you intended.
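
As a quick check of the grouping fix, the following sketch compares the ungrouped and the grouped alternation. The alt helper and the shortened sample lines are assumptions made for illustration, not part of the original code; the helper uses a non-capturing group (?:...) instead of plain parentheses so that no extra capture group is added.

```python
import re


def alt(*patterns: str) -> str:
    """Hypothetical helper: join pattern strings with | inside a non-capturing group."""
    return "(?:" + "|".join(patterns) + ")"


nbr = r"\d+"
words = r"\w+(?:[ \t]+\w+)*"
p1 = rf"{words}\s+{nbr}\s+{words}"
p2 = rf"{nbr}\s+{words}"

# Ungrouped: the trailing \n attaches only to the p2 branch.
ungrouped = re.compile(rf"({p1}|{p2}\n)+")
# Grouped: the trailing \n applies to whichever branch matched.
grouped = re.compile(rf"({alt(p1, p2)}\n)+")

# Shortened sample input (assumed for this sketch)
lines = (
    "aaaa 1284 aaaa\n"
    "aaaa 3650 bbbb\n"
    "75071 cccc\n"
    "999 dddd\n"
)

print(repr(ungrouped.search(lines).group(0)))  # 'aaaa 1284 aaaa'
print(grouped.fullmatch(lines) is not None)    # True
```

Wrapping each alternation in a group at the point where it is defined keeps the component safe to embed in a larger pattern, whatever gets appended to it later.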

huangapple
  • Posted on February 18, 2023 at 02:33:33
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75488087.html