February 18, 2023, 02:33:33

Python regex pattern building

Question

I'm trying to incrementally build the following regex pattern in Python using reusable pattern components. I'd expect the pattern p to match the text in lines completely, but it ends up matching only the first line.

import re
nbr = re.compile(r'\d+')
string = re.compile(r'(\w+[ \t]+)*(\w+)')
p1 = re.compile(rf"{string.pattern}\s+{nbr.pattern}\s+{string.pattern}")
p2 = re.compile(rf"{nbr.pattern}\s+{string.pattern}")
p1orp2 = re.compile(rf"{p1.pattern}|{p2.pattern}")
p = re.compile(rf"({p1orp2.pattern}\n)+")
lines = (f"aaaa 100284 aaaa\n"
         f"aaaa 365870 bbbb\n"
         f"757166 cccc\n"
         f"111054 cccc\n"
         f"999657 dddd\n"
         f"999 eeee\n"
         f"2955 ffff\n")
match = p.search(lines)
print(match)
print(match.group(0))

Here's what gets printed:
<re.Match object; span=(0, 16), match='aaaa 100284 aaaa'>
aaaa 100284 aaaa

Answer 1

Score: 0

One issue with the regex pattern is that the repeated capturing group in string only retains the last repetition it matched, i.e. the last word-plus-whitespace chunk in a sequence of words separated by spaces or tabs. A second issue is that the | in p1orp2 is not enclosed in a group, so the \n appended in p applies only to the p2 alternative. The p1 alternative is never allowed to be followed by a newline, which is why the match stops after the first line.

To fix this, you can rewrite string as a word followed by a repeated non-capturing group, and wrap the alternation in a group. Here's an updated version of your code:

import re

# A word followed by zero or more space/tab-separated words (non-capturing).
word_sequence = re.compile(r"\w+(?:[ \t]+\w+)*")
nbr = re.compile(r"\d+")
p1 = re.compile(rf"{word_sequence.pattern}\s+{nbr.pattern}\s+{word_sequence.pattern}")
p2 = re.compile(rf"{nbr.pattern}\s+{word_sequence.pattern}")
# Group the alternation so the \n added below applies to both p1 and p2.
p1orp2 = re.compile(rf"(?:{p1.pattern}|{p2.pattern})")
p = re.compile(rf"({p1orp2.pattern}\n)+")

lines = (
    "aaaa 1284 aaaa\n"
    "aaaa 3650 bbbb\n"
    "75071 cccc\n"
    "111872214054 cccc\n"
    "999 dddd\n"
    "999 eeee\n"
    "295255 ffff\n"
)

match = p.search(lines)
print(match)
print(match.group(0))
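
To see how the two ways of writing the word-sequence component differ, a small check along the following lines may help. It is only a sketch: the sample text "a b c" and the variable names are made up for illustration and do not come from the question.

```python
import re

# Original component: a repeated *capturing* group followed by a final word.
capturing = re.compile(r"(\w+[ \t]+)*(\w+)")
# Rewritten component: one word, then a repeated *non-capturing* group.
non_capturing = re.compile(r"\w+(?:[ \t]+\w+)*")

sample = "a b c"  # hypothetical sample text

m1 = capturing.fullmatch(sample)
m2 = non_capturing.fullmatch(sample)

# Both variants match the whole sequence of words...
print(m1.group(0))  # a b c
print(m2.group(0))  # a b c

# ...but a repeated capturing group keeps only its last repetition,
# while the non-capturing variant creates no groups at all.
print(m1.groups())  # ('b ', 'c')
print(m2.groups())  # ()
```

Both components accept the same text, so the choice mainly affects what ends up in the match groups once the component is embedded in a larger pattern.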

Answer 2

Score: 0

The problem is here:

p1orp2 = re.compile(rf"{p1.pattern}|{p2.pattern}")
p = re.compile(rf"({p1orp2.pattern}\n)+")

In p, the \n is appended to p1orp2, but this affects the scope of the | in p1orp2: the added \n belongs to the second option, not to the first. It is the same as if you had already attached that \n in the definition of p1orp2:

p1orp2 = re.compile(rf"{p1.pattern}|{p2.pattern}\n")
p = re.compile(rf"({p1orp2.pattern})+")

...while you really want to allow the p1 pattern to be followed by \n as well:

p1orp2 = re.compile(rf"{p1.pattern}\n|{p2.pattern}\n")
p = re.compile(rf"({p1orp2.pattern})+")
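
The scoping of | can be checked on a toy pattern. In this sketch, a and b are hypothetical stand-ins for the p1 and p2 patterns; they are not taken from the original code.

```python
import re

# "a" and "b" are hypothetical stand-ins for the p1 and p2 patterns.
# Here the trailing \n binds only to the second alternative: a  OR  b\n
unscoped = re.compile(r"a|b\n")
# Here each alternative carries its own \n: a\n  OR  b\n
both = re.compile(r"a\n|b\n")

print(unscoped.fullmatch("a\n"))               # None
print(unscoped.fullmatch("b\n") is not None)   # True
print(both.fullmatch("a\n") is not None)       # True
print(both.fullmatch("b\n") is not None)       # True
```

The first pattern accepts "a" on its own but not "a" followed by a newline, which mirrors what happens to p1 in the question.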

To achieve that while keeping the \n where it was, you could use parentheses in the definition of p1orp2 to limit the scope of the | operator:

p1orp2 = re.compile(rf"({p1.pattern}|{p2.pattern})")
p = re.compile(rf"({p1orp2.pattern}\n)+")

With this change it will work as you intended.
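
More generally, when patterns are assembled from reusable pieces, it can help to wrap each alternation in a non-capturing group at the point where it is built, so that later concatenation cannot change the scope of |. The following is one possible sketch, not the only way to do it. The helper names seq and alt are made up here and are not part of the re module, and string is written with a non-capturing group, which is a deviation from the question's code.

```python
import re

def seq(*parts: str) -> str:
    """Concatenate pattern fragments (hypothetical helper)."""
    return "".join(parts)

def alt(*parts: str) -> str:
    """Join fragments with | inside a non-capturing group (hypothetical helper)."""
    return "(?:" + "|".join(parts) + ")"

nbr = r"\d+"
string = r"(?:\w+[ \t]+)*\w+"  # assumption: non-capturing variant of the question's pattern
p1 = seq(string, r"\s+", nbr, r"\s+", string)
p2 = seq(nbr, r"\s+", string)
line = seq(alt(p1, p2), r"\n")  # the \n now applies to both alternatives
p = re.compile(f"(?:{line})+")

lines = (
    "aaaa 100284 aaaa\n"
    "aaaa 365870 bbbb\n"
    "757166 cccc\n"
    "111054 cccc\n"
    "999657 dddd\n"
    "999 eeee\n"
    "2955 ffff\n"
)

match = p.fullmatch(lines)
print(match is not None)  # True: the whole text is matched
```

Note that \s+ also matches \n, so the p1 alternative can span two lines (for example "757166 cccc\n111054 cccc"). If each alternative should stay on a single line, [ \t]+ may be the safer separator.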
