Regex for terminating string sequences syntax analysis

Question

I am reading a source code file line by line and trying to detect whether a given line contains a non-terminated string (which implies it continues onto the next line). To do this, I want to use a regex to match non-terminated strings, but I am running into an issue: strings can begin with " or ', and it's unclear to me how to reference my prior match within the regex itself so that it doesn't try to match the other token (the one not being used as the delimiter).

It is worth noting that in my test strings I treat doubling a quote character as escaping it, which is why there's an exception for "" or '' baked in.

So far I have this (Python flavored regex, so " is escaped via \"):

(?:\"(?:[^\"]|\"\")+$)|(?:'(?:[^']|'')+$)

This does a good job on most non-terminated strings; however, this test case revealed a flaw:

"this i""s a '  '' nnoying"

This matches characters 13-27, since the regex is unaware that we are already inside another string.
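The flaw can be reproduced with a short check. This is only a sketch, assuming Python's built-in re module; the variable names are illustrative, and the pattern is written as a raw string so the quotes need no backslash escapes.

```python
import re

# The two-branch pattern from above, as a raw string.
pattern = re.compile(r'''(?:"(?:[^"]|"")+$)|(?:'(?:[^']|'')+$)''')

# A fully terminated double-quoted string that happens to contain ' characters.
line = '''"this i""s a '  '' nnoying"'''

match = pattern.search(line)
# The string is terminated, so ideally nothing would match, but the
# single-quote branch fires on the ' in the middle of the line.
print(match.span() if match else None)  # (13, 27)
```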

So I am trying to condense the regex into a single match case like so:

(?:(\"|')(?:[^\"]|\"\")+$)

This almost works, but it fails on this case:

'test string which " is totally invalid  "
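A quick way to see the failure, again as a sketch assuming Python's re module: the body of the pattern only excludes ", no matter which quote opened the string, so the line below (opened with ' and never closed) is not detected at all.

```python
import re

# The condensed single-branch pattern from above, as a raw string.
pattern = re.compile(r'''(?:("|')(?:[^"]|"")+$)''')

# Opens with ' and never closes it, so it should count as non-terminated.
line = "'test string which \" is totally invalid  \""

# The body stops at the first lone ", so no attempt ever reaches the end of the line.
print(pattern.search(line))  # None
```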

I need a way to match either " or ', and then in the context of the rest of the regex, know which one I did match to be able to understand if the string is terminating or not. Otherwise, nesting a fake string will cause the detection to fail.

Answer 1

Score: 1

There are a number of potential gotchas between your first attempt here and what I think you need.

First, when trying to ignore escaped character sequences, I always put the "ignore" items earlier in the list of alternatives, so they are tried first and treated as a single character:

\"(\"\"|[^\"])+\"|\'(\'\'|[^\'])+\'
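As a small illustration of the ordering point (a sketch assuming Python's re module, with the pattern split into two raw strings purely for readability), putting the doubled quote first lets the pattern swallow "" as one escaped character before it looks for the closing quote:

```python
import re

terminated = re.compile(
    r'"(""|[^"])+"'     # double-quoted string; "" is an escaped quote
    r"|'(''|[^'])+'"    # single-quoted string; '' is an escaped quote
)

# The whole line is one terminated string, doubled quote included.
line = '''"this i""s a '  '' nnoying"'''
print(terminated.fullmatch(line) is not None)  # True

# Two terminated strings on one line are found separately.
line2 = "x = 'hi, ' + \"hello\""
found = [m.group(0) for m in terminated.finditer(line2)]
print(found)
```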

Also, this may work for your purposes, but you should consider the case of zero characters between a quote and the end of the line. That is getting away from a better solution, though, as it seems you want to ignore any valid quoted strings that appear on the line before an invalid one:

w = 'hi, ' + "hello
x = 'hi,  + "hello"
y = "hi, " + "hello"
z = "hi, " + "hello 'Bubba'"

This should trigger on w and x, I believe, but y and z should not be matched.

So you need to match the following (the leading part plays the role of a positive look-behind, but it is consumed as part of the match):

a beginning of line, followed by
(   one or more non-quote characters
       OR
    any valid quotations
) -- any number of times
any invalid quotation
end of line

If we define these as follows:

one or more non-quote characters:
   [^\"']+

any valid quotation:
   (\"|')(?:\1\1|\\\1|(?!\1).)+\1

an invalid quotation:
   (\"|')(?:\1\1|$|(?!\1).)*

inside the structure:
   (?>^(?:<non-quote-chars>|<valid quote>)*)<invalid quote>$

And if we renumber the backreferences for the second quote group (\2 instead of \1), we get:

   (?>^(?:[^\"']+|(\"|')(?:\1\1|\\\1|(?!\1).)+\1)*)(\"|')(?:\2\2|$|(?!\2).)*$

The (?> construct specifies an atomic group, which is fairly important for reducing unnecessary (and potentially catastrophic) backtracking. Python supports it from version 3.11 onward; if you make it a non-capturing group, (?:, instead, it can still work for short lines (it might just be slow). You didn't specify your Python version.
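Here is a rough sketch of how the pieces from the outline above might be assembled and tried in Python. It assumes the backreference forms \1 and \2 for "the quote that opened this string"; the names NON_QUOTE, VALID, INVALID and compile_unterminated are illustrative only, and the fallback to a non-capturing group on older Pythons is my own addition.

```python
import re

# Building blocks, following the outline above.
NON_QUOTE = r"""[^"']+"""
VALID = r"""("|')(?:\1\1|\\\1|(?!\1).)+\1"""   # opened and closed by group 1
INVALID = r"""("|')(?:\2\2|$|(?!\2).)*"""      # opened by group 2, never closed


def compile_unterminated(flags=0):
    """Compile the pattern, preferring an atomic group (Python 3.11+)."""
    body = f"^(?:{NON_QUOTE}|{VALID})*"
    try:
        return re.compile(f"(?>{body}){INVALID}$", flags)
    except re.error:
        # Older Pythons reject (?>...); a plain non-capturing group is
        # accepted everywhere but may backtrack more.
        return re.compile(f"(?:{body}){INVALID}$", flags)


pattern = compile_unterminated()

lines = {
    "w": '''w = 'hi, ' + "hello''',
    "x": '''x = 'hi,  + "hello"''',
    "y": '''y = "hi, " + "hello"''',
    "z": '''z = "hi, " + "hello 'Bubba'"''',
}

unterminated = sorted(name for name, line in lines.items() if pattern.match(line))
print(unterminated)  # ['w', 'x']
```

On short lines like these, both variants should give the same answer; the atomic group mainly matters for long lines with many quoted strings, where a failed match would otherwise retry many ways of splitting the prefix.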

I can try to add more detail at some point if you like.

One more potential improvement if you're searching large amounts of text instead of line by line:

(?>^(?:(?:(?!$|\"|').)+|(\"|')(?:\1\1|\\\1|(?!\1).)+\1)*)(\"|')(?:\2\2|$|(?!\2).)*$
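A sketch of how this variant might be used on a whole block of text at once. The assumptions here are mine: the re.MULTILINE flag (so that ^ and $ match at every line boundary), the \1 and \2 backreferences, and the same fallback to (?: on Pythons older than 3.11. The name TAIL is just a helper for this example.

```python
import re

# Everything after the opening "(?>" of the pattern above.
TAIL = r"""^(?:(?:(?!$|"|').)+|("|')(?:\1\1|\\\1|(?!\1).)+\1)*)("|')(?:\2\2|$|(?!\2).)*$"""

try:
    pattern = re.compile("(?>" + TAIL, re.MULTILINE)
except re.error:
    # Atomic groups are unavailable; use a non-capturing group instead.
    pattern = re.compile("(?:" + TAIL, re.MULTILINE)

text = "\n".join([
    '''w = 'hi, ' + "hello''',
    '''x = 'hi,  + "hello"''',
    '''y = "hi, " + "hello"''',
    '''z = "hi, " + "hello 'Bubba'"''',
])

# Report the 1-based line number of every line with an unterminated string.
hits = [text.count("\n", 0, m.start()) + 1 for m in pattern.finditer(text)]
print(hits)  # [1, 2]
```

Because . does not match a newline by default, no part of the pattern can run past the end of the current line, which is what makes scanning the whole text in one pass possible.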

huangapple
  • Posted on June 9, 2023 at 00:06:52
  • Link to this article: https://go.coder-hub.com/76433795.html