Regex with m flag in Perl vs. Python

Question

I'm trying to automatically translate some simple Perl code with a regex to Python, and I'm having an issue. Here is the Perl code:

$stamp = '[stamp]';
$message = "message\n";
$message =~ s/^/$stamp/gm;
print "$message";

which prints:

[stamp]message

Here is my Python equivalent:

>>> import re
>>> re.sub(re.compile("^", re.M), "[stamp]", "message\n", count=0)
'[stamp]message\n[stamp]'

Note the answer is different (it has an extra [stamp] at the end). How do I generate code that has the same behavior for the regex?

Answer 1

Score: 2

Perl and Python's regex engines differ slightly on the definition of a "line"; Perl does not consider the empty string following a trailing newline in the input string to be a line, Python does.
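One way to see this from the Python side is to list the offsets where "^" matches under re.M. This is only a small sketch, and the variable names are illustrative.

```python
import re

message = "message\n"

# Offsets at which "^" matches in multiline mode.
positions = [m.start() for m in re.finditer(r"^", message, flags=re.M)]
print(positions)  # [0, 8]
```

Offset 8 is the empty string after the trailing newline, which Python counts as the start of a line.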

Best solution I can come up with is to change "^" to r"^(?=.|\n)" (note r prefix on string to make it a raw literal; all regex should use raw literals). You can also simplify a bit by just calling methods on the compiled regex or call re.sub with the uncompiled pattern, and since count=0 is already the default, you can omit it. Thus, the final code would be either:

re.compile(r"^(?=.|\n)", re.M).sub("[stamp]", "message\n")

or:

re.sub(r"^(?=.|\n)", "[stamp]", "message\n", flags=re.M)

Even better would be:

start_of_line = re.compile(r"^(?=.|\n)", re.M)  # Done once up front

start_of_line.sub("[stamp]", "message\n")  # Done on demand

avoiding recompiling/rechecking compiled regex cache each time, by creating the compiled regex just once and reusing it.
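If the stamp comes from a variable, as in the Perl code, the compiled pattern can be wrapped in a small helper. The sketch below is one possible shape: the names START_OF_LINE and stamp_lines are hypothetical, and the function replacement is a choice made here so that backslashes in the stamp are inserted literally instead of being processed as escapes by re.sub.

```python
import re

# Compiled once at import time and reused by every call.
START_OF_LINE = re.compile(r"^(?=.|\n)", re.M)


def stamp_lines(message, stamp):
    """Prefix each line of message with stamp (hypothetical helper)."""
    # A function replacement inserts stamp literally; a plain string
    # replacement would have its backslash escapes processed by re.sub.
    return START_OF_LINE.sub(lambda match: stamp, message)


print(repr(stamp_lines("message\n", "[stamp]")))  # '[stamp]message\n'
```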

Alternative solutions:

  1. Split up the lines in a way that will match Perl's definition of a line, then use the non-re.MULTILINE version of the regex per line, then shove them back together, e.g.:

    start_of_line = re.compile(r"^")  # Compile once up front without re.M
    
    # Split lines, keeping ends, in a way that matches Perl's definition of a line
    # then substitute on line-by-line basis
    ''.join([start_of_line.sub("[stamp]", line) for line in "message\n".splitlines(keepends=True)])
    
  2. Strip a single trailing newline, if it exists, up front, perform the regex substitution with a re.MULTILINE pattern, then add the newline back (if applicable):

    start_of_line = re.compile(r"^", re.M)  # Plain "^" compiled with re.M this time
    message = '...'
    if message.endswith('\n'):
        result = start_of_line.sub("[stamp]", message[:-1]) + '\n'
    else:
        result = start_of_line.sub("[stamp]", message)
    

Neither option is as succinct/efficient as trying to tweak the regex, but if arbitrary user-supplied regex must be handled, there's always going to be a corner case, and pre-processing to something that removes the Perl/Python incompatibility is a lot safer.
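The two alternatives can be put side by side as functions and checked against each other. This is a rough sketch; the function names and the sample inputs are made up for illustration.

```python
import re

STAMP = "[stamp]"
start_plain = re.compile(r"^")        # no re.M, used per line
start_multi = re.compile(r"^", re.M)  # re.M, used on the whole string


def stamp_by_splitting(message):
    # Alternative 1: substitute line by line, keeping line endings.
    return "".join(start_plain.sub(STAMP, line)
                   for line in message.splitlines(keepends=True))


def stamp_by_stripping(message):
    # Alternative 2: set aside one trailing newline, substitute, restore it.
    if message.endswith("\n"):
        return start_multi.sub(STAMP, message[:-1]) + "\n"
    return start_multi.sub(STAMP, message)


for sample in ["message\n", "message", "a\nb\n", "message\n\n", "\n"]:
    assert stamp_by_splitting(sample) == stamp_by_stripping(sample)

print(repr(stamp_by_splitting("message\n")))  # '[stamp]message\n'
```

The two functions diverge on the empty string: the first returns '' and the second returns '[stamp]'. Note also that str.splitlines splits on "\r" and several other line boundaries, not only "\n", so the first alternative behaves differently from a "\n"-only regex on such input.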

Answer 2

Score: 1

Perl's multiline mode doesn't consider an empty string after the last newline to be a line of its own. That is, it treats A\nB and A\nB\n as both being two lines, and A\nB\nC as being three lines. This differs from Python's multiline mode, which treats every newline as starting a new line:

> re.M: When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline)

You can mimic the behavior of Perl's multiline mode by adding a lookahead assertion for at least one character at the start of the line:

(?=.|\n)

Note that we need to explicitly permit \n with an alternative |'d pattern since by default . does not match \n. Without this, the pattern would fail to match starts of lines that are immediately followed by a \n.
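The effect of the \n alternative shows up on input that contains an empty line. The sample text below is made up for illustration.

```python
import re

text = "a\n\nb\n"  # the second line is empty

without_newline = re.sub(r"^(?=.)", "[stamp]", text, flags=re.M)
with_newline = re.sub(r"^(?=.|\n)", "[stamp]", text, flags=re.M)

print(repr(without_newline))  # '[stamp]a\n\n[stamp]b\n'
print(repr(with_newline))     # '[stamp]a\n[stamp]\n[stamp]b\n'
```

With only (?=.) the empty second line is skipped, because "." cannot match the newline that follows it.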

Here's how this pattern behaves in your example:

>>> re.sub(re.compile(r"^(?=.|\n)", re.M), "[stamp]", "message\n", count=0)
'[stamp]message\n'
>>> re.sub(re.compile(r"^(?=.|\n)", re.M), "[stamp]", "message\nmessage", count=0)
'[stamp]message\n[stamp]message'
>>> re.sub(re.compile(r"^(?=.|\n)", re.M), "[stamp]", "message\nmessage\n", count=0)
'[stamp]message\n[stamp]message\n'
>>> re.sub(re.compile(r"^(?=.|\n)", re.M), "[stamp]", "message\n\n", count=0)
'[stamp]message\n[stamp]\n'

Answer 3

Score: 1

The problem is that Perl doesn't consider the empty string after the final \n to be a line of text, but Python does. So Perl's regex code sees "message\n" as one line of text, while Python's regex code sees it as two.

You can resolve this difference by having the Python code check for a final \n. If it detects one, remove that \n before running the regular expression, and then add it back in after.

You will probably want to check all edge cases, too. For example, how do the Perl and Python code behave if the entire message itself is an empty string? (Will the Perl code do anything in that case? How about the Python code?)

If all your messages are guaranteed to be non-zero-length text ending in a newline, then you can probably get away with just removing the final \n, applying the Python regex code, and then appending that \n back in. But it would still be good form to consider all edge cases.
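The empty-message case is easy to check on the Python side. The sketch below also tries a pattern that spells out the rule as the perlre documentation describes it ("^" under /m matches at the start of the string and after any newline except one that is the last character). That pattern is an assumption of this example, not something proposed in the answers, and the Perl side is not executed here, so compare against real Perl output before relying on it.

```python
import re

# Lookahead pattern r"^(?=.|\n)": no match in an empty string.
print(repr(re.sub(r"^(?=.|\n)", "[stamp]", "", flags=re.M)))  # ''

# Plain "^" with re.M: matches once at offset 0.
print(repr(re.sub(r"^", "[stamp]", "", flags=re.M)))  # '[stamp]'

# Assumed spelling of the documented Perl rule: start of string, or just
# after a newline that is not at the very end of the string.
perl_like = re.compile(r"\A|(?<=\n)(?!\Z)")
for sample in ["", "message\n", "message\n\n"]:
    print(repr(perl_like.sub("[stamp]", sample)))
```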


ADDITION:

Although it's tempting to come up with a regular expression in Python that exactly mimics the one in Perl, I wouldn't necessarily recommend it -- especially if the Python regex is so complicated that it's not easy to tell what it's doing at first or second glance.

Regular expressions don't always handle algorithmic logic very gracefully. So if you can eliminate regexp complexity by introducing some simple algorithmic logic, I would recommend doing that instead.

In other words, you won't win any awards by using overly-complicated regular expressions, so why not use a simple regex paired with simple non-regex algorithmic logic instead? (The future maintainers of your code will thank you!)


Here's a recommendation:

import re
message = "message\n"
message = re.sub(r"(?m:^)", "[stamp]", message)
# Remove the final "[stamp]" if it appears alone at the end:
message = message.removesuffix("[stamp]")

It's simple and easy to follow, and the only part of it that might be confusing is if you've never seen the m flag used inside of a regular expression like (?m:...) before.

Just be sure to test this out on edge cases (like empty messages) before you decide to use it. You don't want to leave your program's logic to behavior you didn't realize existed.
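As one example of such a test, the sketch below runs the recommendation on a message that has no trailing newline but happens to end with the stamp text. The guarded variant is a hypothetical adjustment added here for comparison, not part of the recommendation.

```python
import re

STAMP = "[stamp]"


def stamp_then_trim(message):
    # The recommendation as written (str.removesuffix needs Python 3.9+).
    return re.sub(r"(?m:^)", STAMP, message).removesuffix(STAMP)


def stamp_then_trim_guarded(message):
    # Hypothetical variant: trim only when the input ended with a newline,
    # since that is the only case that produces the extra stamp.
    result = re.sub(r"(?m:^)", STAMP, message)
    if message.endswith("\n"):
        result = result.removesuffix(STAMP)
    return result


print(repr(stamp_then_trim("message\n")))            # '[stamp]message\n'
print(repr(stamp_then_trim("see [stamp]")))          # '[stamp]see '
print(repr(stamp_then_trim_guarded("see [stamp]")))  # '[stamp]see [stamp]'
```

In the second call the unguarded version removes text that belonged to the original message. For an empty message, the unguarded version returns '' while the guarded one returns '[stamp]'.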

huangapple
  • Posted on 2023-02-14 05:52:53
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75441557.html