Java and regex lexer

Question

I am trying to write a lexer in Java using regex for a custom Markdown-like "language" I'm making. It's my first time working with this, so I'm a little lost on a few things.
An example of a possible syntax is:
Some <#000000>*text* [<#ffffff>Some more](action: Other <#gradient>text) and **finally** some more <#000>text!
I was able to capture a few things. For example, I'm using (?<hex><#\w+>) to capture the "hex" and (?<action>\[[^]]*]\([^]]*\)) to get the entire "action" block.
My problem is capturing it all together, that is, how to combine the patterns. For example, the lexer needs to output something like:

TEXT - Some
HEX - <#000000>
TEXT - *text*
ACTION - [<#ffffff>Some more](action: Other <#gradient>text)
TEXT - and **finally** some more
HEX - <#000>
TEXT - text!

I'll handle the bold and italic later.
Would love just some suggestions on how to combine all of them!

Answer 1

Score: 2

One option is an alternation that matches each of the separate parts, with a character class such as [\w!* ]+ for the text part.

In Java, you can then check which named capturing group took part in the match.

(?<hex><#\w+>)|(?<action>\[[^]]*]\([^]]*\))|(?<text>[\w!* ]+)

Explanation

  • (?<hex><#\w+>) Capture group hex, match <# followed by 1+ word chars and a closing >
  • | Or
  • (?<action> Capture group action
    • \[[^]]*]\([^]]*\) Match [...] followed by (...)
  • ) Close group
  • | Or
  • (?<text>[\w!* ]+) Capture group text, match 1+ times any char listed in the character class
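If more token types get added later, it may help to keep each sub-pattern next to its token name and build the alternation from that. The following is only a sketch under that assumption: TokenType and EnumLexer are hypothetical names, not part of the answer, and the three sub-patterns are the ones explained above. The enum constant names double as the regex group names, so the lookup cannot drift out of sync with the pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: one enum constant per token type, in priority order.
enum TokenType {
    HEX("<#\\w+>"),
    ACTION("\\[[^]]*]\\([^]]*\\)"),
    TEXT("[\\w!* ]+");

    final String regex;

    TokenType(String regex) {
        this.regex = regex;
    }
}

final class EnumLexer {
    // Builds "(?<HEX>...)|(?<ACTION>...)|(?<TEXT>...)" in declaration order.
    static final Pattern COMBINED = buildPattern();

    private static Pattern buildPattern() {
        StringBuilder sb = new StringBuilder();
        for (TokenType type : TokenType.values()) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append("(?<").append(type.name()).append('>')
              .append(type.regex).append(')');
        }
        return Pattern.compile(sb.toString());
    }

    // Returns one "TYPE - value" string per match, in input order.
    static List<String> lex(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = COMBINED.matcher(input);
        while (m.find()) {
            for (TokenType type : TokenType.values()) {
                String value = m.group(type.name());
                if (value != null) { // exactly one alternative matched
                    out.add(type + " - " + value);
                    break;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "Some <#000000>*text* [<#ffffff>Some more]"
                + "(action: Other <#gradient>text) and **finally** some more <#000>text!";
        for (String token : lex(input)) {
            System.out.println(token);
        }
    }
}
```

Declaration order matters here: the alternatives are tried left to right at each position, so the more specific HEX and ACTION patterns should come before the catch-all TEXT.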


Example code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
	public static void main(String[] args) {
		// One alternative per token type; each match fills exactly one named group.
		String regex = "(?<hex><#\\w+>)|(?<action>\\[[^]]*]\\([^]]*\\))|(?<text>[\\w!* ]+)";
		String string = "Some <#000000>*text* [<#ffffff>Some more](action: Other <#gradient>text) and **finally** some more <#000>text!";

		Pattern pattern = Pattern.compile(regex);
		Matcher matcher = pattern.matcher(string);

		while (matcher.find()) {
			// Check which named group took part in this match.
			if (matcher.group("hex") != null) {
				System.out.println("HEX - " + matcher.group("hex"));
			} else if (matcher.group("text") != null) {
				System.out.println("TEXT - " + matcher.group("text"));
			} else if (matcher.group("action") != null) {
				System.out.println("ACTION - " + matcher.group("action"));
			}
		}
	}
}

Output

TEXT - Some 
HEX - <#000000>
TEXT - *text* 
ACTION - [<#ffffff>Some more](action: Other <#gradient>text)
TEXT -  and **finally** some more 
HEX - <#000>
TEXT - text!
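One caveat with a character class for the text part: find() silently skips any character that none of the alternatives accept, so a comma or colon in plain text would simply disappear from the token stream. A possible variation, sketched here under my own assumptions rather than taken from the answer, is to match only the hex and action tokens and treat whatever lies between two matches as TEXT. GapLexer and addText are hypothetical names, and trimming the text (to mirror the output asked for in the question) is a choice that may not suit a lexer where whitespace matters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GapLexer {
    // Only the "special" tokens are matched; TEXT is whatever lies between them.
    private static final Pattern SPECIAL =
            Pattern.compile("(?<hex><#\\w+>)|(?<action>\\[[^]]*]\\([^]]*\\))");

    static List<String> lex(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = SPECIAL.matcher(input);
        int last = 0; // end index of the previous special token
        while (m.find()) {
            // Everything between the previous token and this one is plain text.
            addText(tokens, input.substring(last, m.start()));
            String kind = m.group("hex") != null ? "HEX" : "ACTION";
            tokens.add(kind + " - " + m.group());
            last = m.end();
        }
        // Trailing text after the last special token.
        addText(tokens, input.substring(last));
        return tokens;
    }

    // Assumption: surrounding whitespace is not significant, so it is trimmed
    // and whitespace-only gaps are dropped.
    private static void addText(List<String> tokens, String gap) {
        String text = gap.trim();
        if (!text.isEmpty()) {
            tokens.add("TEXT - " + text);
        }
    }

    public static void main(String[] args) {
        String input = "Some <#000000>*text* [<#ffffff>Some more]"
                + "(action: Other <#gradient>text) and **finally** some more <#000>text!";
        for (String token : lex(input)) {
            System.out.println(token);
        }
    }
}
```

With this approach the text token no longer needs its own pattern, so punctuation such as "a, b" survives as a single TEXT token.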

Answer 2

Score: 0

You can achieve this with regex capturing groups, like this:
^(.*?) (?<hex1><#\w+>)(\*[^*]*\*) (?<action>\[[^]]*]\([^]]*\)) (.*?) (?<hex2><#\w+>)(.*)$
Note that this pattern is anchored with ^ and $ and mirrors the exact layout of the sample input, so it only matches lines with that same structure.
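To see how this single-pattern approach behaves, here is a small sketch that applies it with Matcher.matches(). FixedShapeDemo and fits are hypothetical names used only for illustration; the pattern itself is the one from this answer. The second call uses a made-up input with a different layout to show that the whole match fails rather than producing partial tokens.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FixedShapeDemo {
    // The pattern from this answer, escaped for a Java string literal.
    static final Pattern WHOLE_LINE = Pattern.compile(
            "^(.*?) (?<hex1><#\\w+>)(\\*[^*]*\\*) "
            + "(?<action>\\[[^]]*]\\([^]]*\\)) (.*?) (?<hex2><#\\w+>)(.*)$");

    // True only when the entire input has the layout the pattern expects.
    static boolean fits(String input) {
        return WHOLE_LINE.matcher(input).matches();
    }

    public static void main(String[] args) {
        String sample = "Some <#000000>*text* [<#ffffff>Some more]"
                + "(action: Other <#gradient>text) and **finally** some more <#000>text!";
        Matcher m = WHOLE_LINE.matcher(sample);
        if (m.matches()) {
            System.out.println("HEX - " + m.group("hex1"));
            System.out.println("ACTION - " + m.group("action"));
            System.out.println("HEX - " + m.group("hex2"));
        }
        // A differently shaped line does not match at all.
        System.out.println(fits("Just <#fff>plain text"));
    }
}
```

This works for input that always has the same shape, but for free-form text a find() loop over an alternation is likely the more flexible choice.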

huangapple
  • Posted on July 30, 2020, 00:32:23
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63158293.html