Parsing Lisp-Like String into Tokens and Literal Text

Question

Using PHP 8's preg_match_all() and its matches parameter ($Matches below), I need to split a template into a list of literal text runs and delimited tokens.

$x = preg_match_all($Regex, $Template, $Matches, PREG_OFFSET_CAPTURE); // Parse the template.

The catch is that tokens can be nested, and I need to match only the outermost token of each nest.

Example:

This {is {{Par}m1}} plus {{Par{m3a{{Parm3b}}}} a}nd {{Parm4a||{{Par}m4b||{{Parm4c||{{Parm4d||Parm}}}}}}}}.

Should parse into this:

 Match 1: This {is
 Match 2: {{Par}m1}}
 Match 3:  plus
 Match 4: {{Par{m3a{{Parm3b}}}}
 Match 5:  a}nd
 Match 6: {{Parm4a||{{Par}m4b||{{Parm4c||{{Parm4d||Parm}}}}}}}}
 Match 7: .

Notice above that single curly braces should be allowed in tokens or in text.

Only double curly braces are considered token delimiters.

The regular expression I have so far works only if there are no single curly braces in the text or tokens.

My regex:

(?:(?!(\{\{)).)+|((\{\{)((?>[^{}]+|(?2))*)(\}\}))

I cannot figure out how to allow single curly braces in the text or inside tokens without breaking the list of matches.

Any help greatly appreciated!

UPDATE

I am continuing to work on this problem and came up with this:

\{\{(?R)*\}\}|[^{}]+

It uses the recursion operator (?R), but it has the same problem: single curly braces break the parsing.

The delimiters are meant to be the opening and closing double curly braces "{{" and "}}".
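To make the failure concrete, here is a minimal sketch that runs the UPDATE pattern against a short sample. The sample string and the `~` pattern delimiters are assumptions chosen for illustration; they are not from the original template.

```php
<?php
// The UPDATE pattern: literal text is [^{}]+, so a lone "{" or "}" can never
// be part of any match and is silently skipped by preg_match_all().
$regex = '~\{\{(?R)*\}\}|[^{}]+~';
$template = 'a {b {{c}} d';   // hypothetical sample input

preg_match_all($regex, $template, $matches);
print_r($matches[0]);
// The single "{" splits the text run in two and is lost:
// 'a ', 'b ', '{{c}}', ' d'
```

The wanted result here would be `'a {b '`, `'{{c}}'`, `' d'`, which suggests the text alternative has to exclude the two-character delimiters rather than every brace.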

Answer 1

Score: 1

I think I found a solution. It has passed my tests so far.

The regex is

(\{\{)(?R)*(\}\})|(?:(?!\{\{|\}\}).)+
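For readers who want to see how the parts fit together, the same pattern can be written with PCRE's `x` (extended) modifier so that each piece carries a comment. This is only an annotated restatement; the `~` delimiters and the short sample string are assumptions for illustration.

```php
<?php
// Annotated form of the pattern above. With the x modifier, whitespace and
// "#" comments inside the pattern are ignored by PCRE.
$regex = '~
    (\{\{)                  # group 1: opening delimiter
    (?R)*                   # zero or more nested tokens or text runs (whole pattern, recursively)
    (\}\})                  # group 2: closing delimiter
  |
    (?: (?!\{\{|\}\}) . )+  # literal text: one or more chars, none of which starts a delimiter
~x';

preg_match_all($regex, 'a {b {{c}} d', $matches);   // hypothetical sample input
print_r($matches[0]);
// 'a {b ', '{{c}}', ' d'
```

The key difference from `[^{}]+` is the negative lookahead: a single `{` or `}` is accepted as ordinary text, and only a position where `{{` or `}}` begins stops the run.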

Testing

Parsing this:

{{one}}{}This is {{Pa}rm1}} p{}lus {{P{ar{}m2}} and2 {{Close1}}{{Close2}} {{Par{m3a{{Parm3}b}}}} and {{Par{m4a||{{Parm4b||{{Parm4c||{{Parm4d||Pa}rm}}}}}}}} end {{Par{}m5}}.

Yields this:

{{one}}
{}This is 
{{Pa}rm1}}
 p{}lus 
{{P{ar{}m2}}
 and2 
{{Close1}}
{{Close2}}
 
{{Par{m3a{{Parm3}b}}}}
 and 
{{Par{m4a||{{Parm4b||{{Parm4c||{{Parm4d||Pa}rm}}}}}}}}
 end 
{{Par{}m5}}
.

So far it seems to be working.
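As a sketch of how this could be wired into the `preg_match_all()` call from the question, the helper below labels each piece as a token or as text and keeps its offset. The function name `splitTemplate()`, the returned array shape, and the added `s` modifier are assumptions, not part of the original answer.

```php
<?php
declare(strict_types=1);

/**
 * Split a template into top-level {{...}} tokens and the literal text between them.
 * splitTemplate() and the returned array shape are illustrative assumptions.
 *
 * @return list<array{type: string, text: string, offset: int}>
 */
function splitTemplate(string $template): array
{
    // The answer's pattern, plus the s modifier so "." also matches newlines
    // (an assumption: templates may span several lines).
    $regex = '~(\{\{)(?R)*(\}\})|(?:(?!\{\{|\}\}).)+~s';

    $count = preg_match_all($regex, $template, $matches, PREG_OFFSET_CAPTURE);
    if ($count === false) {
        // For example, a backtrack or recursion limit hit on pathological input.
        throw new RuntimeException(preg_last_error_msg());
    }

    $segments = [];
    foreach ($matches[0] as [$text, $offset]) {
        $segments[] = [
            // A text run can never begin with "{{", so this test identifies tokens.
            'type'   => str_starts_with($text, '{{') ? 'token' : 'text',
            'text'   => $text,
            'offset' => $offset, // byte offset into $template
        ];
    }
    return $segments;
}

// The example template from the question.
$template = 'This {is {{Par}m1}} plus {{Par{m3a{{Parm3b}}}} a}nd '
          . '{{Parm4a||{{Par}m4b||{{Parm4c||{{Parm4d||Parm}}}}}}}}.';

foreach (splitTemplate($template) as $segment) {
    printf("%-5s @%3d: %s\n", $segment['type'], $segment['offset'], $segment['text']);
}
```

One caveat worth checking in your own tests: an unbalanced `{{` or a stray `}}` outside a token is not matched by either alternative, so `preg_match_all()` skips it silently. Comparing the summed length of the segments with `strlen($template)` is one way to detect that.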

Posted by huangapple on April 17, 2023, 08:11:47
Link: https://go.coder-hub.com/76030935.html