Regex to find double-quoted strings, excluding double quotes within single quotes

Question

I'm trying to write a PCRE regular expression to search PHP code to find strings in double-quotes, handling escaped double-quotes, and to exclude situations where double-quoted and single-quoted strings overlap, e.g. when building some HTML, such as these:

$str = '<elem prop="' . $var . '">';
$str = '<div class="my-class ' . $my_var_class . ' my-other-class">';

So far I've been able to come up with a reliable regex that handles escaped double-quotes:

"(.*?)(?<!\\)"

This works for lines of code like these:

$str = "this is something";
$str = "this is {$another}";
$str = "could be {$hello['world']}";
$str = "and $hello[world] another";
$str = "'single quotes in double quotes'";
$str = "building <div style=\"width: 100%\" data-var=\"{$var}\"></div>";

But it doesn't work for lines of code like my first example above; it would match "' . $var . '", but I don't want it to match anything from that example line.
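To make the problem concrete, here is a minimal sketch that runs the regex above against one good line and the first problem line. It uses Python's `re` module rather than PCRE (an assumption made purely for illustration; this particular pattern should behave the same in both), and the names `NAIVE`, `naive_matches`, `good` and `bad` are hypothetical.

```python
import re

# The regex from the question: an opening quote, a lazy run of characters,
# then a closing quote that is not preceded by a backslash.
NAIVE = re.compile(r'"(.*?)(?<!\\)"')


def naive_matches(line):
    """Return every substring of `line` that the naive pattern matches."""
    return [m.group(0) for m in NAIVE.finditer(line)]


# A line the pattern handles correctly (escaped quotes inside the string).
good = r'$str = "building <div style=\"width: 100%\" data-var=\"{$var}\"></div>";'

# The first problem line: the double quotes live inside single-quoted strings.
bad = r"""$str = '<elem prop="' . $var . '">';"""

print(naive_matches(good))  # the whole double-quoted literal, as wanted
print(naive_matches(bad))   # the unwanted match: "' . $var . '"
```

The second call shows the false positive described above: the pattern has no notion of being inside a single-quoted string, so it pairs up the two double quotes that belong to different single-quoted literals.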

I've tried using the principles discussed at https://stackoverflow.com/a/62558215 and https://stackoverflow.com/a/6464500, but a look-ahead isn't sufficient by itself, and I'm having a hard time coming up with a look-behind that doesn't give me a compilation error about "lookbehind assertion is not fixed length". I feel like the answer at https://stackoverflow.com/a/36186925/3404349 might (?) be getting close to what I'm looking for, but it seems to me that it's matching the inverse (of sorts) of my goal.
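The look-behind restriction can be reproduced in a few lines. The sketch below uses Python's `re` module as a stand-in for PCRE (an assumption; Python reports a differently worded error, but the restriction is of the same kind), and the helper `compiles` and both pattern names are hypothetical.

```python
import re


def compiles(pattern):
    """Return True if `pattern` compiles, False if the engine rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# A one-character look-behind is fine: "a quote not preceded by a backslash".
FIXED_WIDTH = r'(?<!\\)"'

# "A quote not preceded by an open single-quoted string" needs a look-behind
# of unknown length, which the engine refuses to compile.
VARIABLE_WIDTH = r"(?<!'[^']*)\""

print(compiles(FIXED_WIDTH))
print(compiles(VARIABLE_WIDTH))
```

This is why a look-behind on its own is a dead end here: whether a quote is "inside" a single-quoted string depends on an arbitrary amount of preceding text.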

Answer 1

Score: 1

Huge thanks to @Michail for the comment on the question that got me on the right track. I used that suggestion and developed it further to also handle inline and block comments (which may contain an "orphaned" single or double quote, thus inverting the desired matching).

\/\*.*?\*\/(*SKIP)^|\/\/.*?$(*SKIP)^|'(?>\\?.)*?'(*SKIP)^|"(?>\\?.)*?"

Note that the m and s flags are pretty important for this to work.

Here's a break-down of how this works as far as I understand it.

In this use case, there are four alternatives separated by the alternation pipe (|). Any start/end pair that we don't want to keep/match should come first in the list of alternations because of how (*SKIP) works.

.*? and (?>\\?.)*?: These match 0 or more of any character in between the respective start and end markers. The second form also includes an optional backslash, so that an escaped character within a string is consumed together with its backslash. The second form uses an atomic group, and I'm not 100% sure why, but I know it prevents backtracking into the group, which seems to be important for this.

(*SKIP)^: This is a clever pair placed after each end marker. (*SKIP) basically says that if something after this point fails and the engine backtracks onto it, the attempt is abandoned and the search resumes from the position where (*SKIP) was reached, not from the next character. The ^ immediately after it is the "start of the line" anchor, which cannot succeed right after the end marker that was just matched, so it forces exactly that failure. The net effect is that once a start and end pair has been found, the whole thing is discarded and matching keeps moving forward from its end.

\/\* and \*\/ match the beginning and end of block comments.

\/\/ and $ match the beginning of an inline comment to the end of the line.

The pairs of ' and " each match the start and end of their respective type of string.

Since the last alternation does not include (*SKIP), it's the only one that gets matched and returned.
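The same "consume what you don't want, keep only the last alternative" idea can be sketched in an engine without (*SKIP). The Python version below is a swapped-in technique, not the PCRE regex above: instead of backtracking verbs it puts a capturing group around the double-quoted alternative and ignores matches where that group is empty, and it uses the more common `(?:\\.|[^"\\])*` string body in place of the atomic group. The names `TOKENS`, `find_double_quoted` and `sample` are hypothetical.

```python
import re

# Alternatives are tried left to right at each position. Only the last one
# (double-quoted strings) has a capturing group; the others merely consume
# text so that quotes inside them are never seen as string delimiters.
TOKENS = re.compile(
    r"""
      /\*.*?\*/              # block comment (needs re.S to span lines)
    | //.*?$                 # line comment up to end of line (needs re.M)
    | '(?:\\.|[^'\\])*'      # single-quoted string with escapes
    | ("(?:\\.|[^"\\])*")    # double-quoted string, captured as group 1
    """,
    re.M | re.S | re.X,
)


def find_double_quoted(source):
    """Return double-quoted literals that are not inside comments or
    single-quoted strings."""
    return [m.group(1) for m in TOKENS.finditer(source) if m.group(1) is not None]


sample = r'''
$a = "this is something";
$b = '<elem prop="' . $var . '">';
// don't match "this"
/* nor "this" */
$c = "building <div style=\"width: 100%\">";
'''

for literal in find_double_quoted(sample):
    print(literal)
```

As with the original regex, the order of the alternatives matters: whichever construct starts first in the text wins, so a quote inside a comment or a single-quoted string is swallowed before the double-quoted alternative can see it. This sketch does not handle heredocs or `#` comments.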

Answer 2

Score: -1

You can do it using Negative Lookbehind (?<!'):

"(?<!')(.*?)(?<!\\)(?<!')"
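A quick check of this pattern against three lines from the question suggests why it may fall short. The sketch below again uses Python's `re` as a stand-in for PCRE (an assumption; the pattern only uses one-character look-behinds, which both engines support), and the names `LOOKBEHIND`, `first_match` and `line_a` to `line_c` are hypothetical.

```python
import re

# The pattern from this answer. The first (?<!') looks at the character just
# before the current position, which is the opening quote itself, so only the
# look-behinds before the closing quote have any effect.
LOOKBEHIND = re.compile(r'"(?<!\')(.*?)(?<!\\)(?<!\')"')


def first_match(line):
    """Return the first match in `line`, or None if there is none."""
    m = LOOKBEHIND.search(line)
    return m.group(0) if m else None


line_a = r"""$str = '<elem prop="' . $var . '">';"""
line_b = r"""$str = '<div class="my-class ' . $my_var_class . ' my-other-class">';"""
line_c = r"""$str = "'single quotes in double quotes'";"""

print(first_match(line_a))  # no match, as wanted
print(first_match(line_b))  # still matches across the two single-quoted strings
print(first_match(line_c))  # no match, although this string should be found
```

So the look-behind happens to reject the first example line, but it still matches the second one and it misses a legitimate double-quoted string whose content ends in a single quote.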

huangapple
  • Posted on February 26, 2023, 19:11:46
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75571580.html