Java and regex lexer
Question
I am trying to make some sort of lexer in Java using regex for a custom markdown "language" I'm making. It's my first time working with this stuff, so I'm a little lost on a few things.
An example of a possible syntax in it is:
Some <#000000>*text* [<#ffffff>Some more](action: Other <#gradient>text) and **finally** some more <#000>text!
I was able to capture a few things. For example, I'm using (?<hex><#\w+>) to capture the "hex" and (?<action>\[[^]]*]\([^]]*\)) to get the entire "action" block.
My problem is capturing it all together, that is, how to combine these patterns. For example, the lexer needs to output something like:
TEXT - Some
HEX - <#000000>
TEXT - *text*
ACTION - [<#ffffff>Some more](action: Other <#gradient>text)
TEXT - and **finally** some more
HEX - <#000>
TEXT - text!
I'll handle the bold and italic later.
Would love just some suggestions on how to combine all of them!
Answer 1
Score: 2
One option is to use an alternation that matches each of the separate parts and, for the text part, a character class such as [\w!* ]+
In Java, you could check for the name of the capturing group.
(?<hex><#\w+>)|(?<action>\[[^]]*]\([^]]*\))|(?<text>[\w!* ]+)
Explanation
(?<hex><#\w+>) Capture group hex: match <#, then 1+ word chars, then >
| Or
(?<action> Capture group action
\[[^]]*]\([^]]*\) Match [...] followed by (...)
) Close group
| Or
(?<text>[\w!* ]+) Capture group text: match 1+ times any char listed in the character class
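To see what the alternation buys, it can help to run the sub-patterns on their own first. The sketch below is only an illustration (the class name and the findAll helper are hypothetical, not part of the answer). On its own, the hex pattern also matches the colour tags nested inside the action block. In the combined pattern, the scan reaches the opening [ before those tags, so the action alternative consumes the whole block as a single match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationCheck {
    static final String HEX = "(?<hex><#\\w+>)";
    static final String ACTION = "(?<action>\\[[^]]*]\\([^]]*\\))";
    static final String TEXT = "(?<text>[\\w!* ]+)";
    static final String COMBINED = HEX + "|" + ACTION + "|" + TEXT;

    // Hypothetical helper: collects every match of regex in input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "Some <#000000>*text* [<#ffffff>Some more](action: Other <#gradient>text) and **finally** some more <#000>text!";
        // Hex alone: 4 matches, two of them inside the action block.
        System.out.println(findAll(HEX, input));
        // Combined: 7 matches, the action block is kept in one piece.
        System.out.println(findAll(COMBINED, input));
    }
}
```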
Example code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        // One alternative per token type; the named group that is non-null tells which one matched.
        String regex = "(?<hex><#\\w+>)|(?<action>\\[[^]]*]\\([^]]*\\))|(?<text>[\\w!* ]+)";
        String string = "Some <#000000>*text* [<#ffffff>Some more](action: Other <#gradient>text) and **finally** some more <#000>text!";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(string);
        // Each call to find() advances to the next match in the input.
        while (matcher.find()) {
            if (matcher.group("hex") != null) {
                System.out.println("HEX - " + matcher.group("hex"));
            }
            if (matcher.group("text") != null) {
                System.out.println("TEXT - " + matcher.group("text"));
            }
            if (matcher.group("action") != null) {
                System.out.println("ACTION - " + matcher.group("action"));
            }
        }
    }
}
Output
TEXT - Some
HEX - <#000000>
TEXT - *text*
ACTION - [<#ffffff>Some more](action: Other <#gradient>text)
TEXT - and **finally** some more
HEX - <#000>
TEXT - text!
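One thing to be aware of: Matcher.find() silently skips any characters that none of the alternatives match (for instance a comma, which is not in the text character class). If the tokens are to be processed later, a possible variation is to collect them in a list and record skipped input explicitly. The following is a minimal sketch under that assumption. The LexerSketch class, the Token class, and the UNKNOWN type are hypothetical names introduced here, not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LexerSketch {
    enum Type { HEX, ACTION, TEXT, UNKNOWN }

    // Hypothetical token holder: a type plus the matched text.
    static final class Token {
        final Type type;
        final String value;
        Token(Type type, String value) { this.type = type; this.value = value; }
        @Override public String toString() { return type + " - " + value; }
    }

    private static final Pattern TOKEN = Pattern.compile(
        "(?<hex><#\\w+>)|(?<action>\\[[^]]*]\\([^]]*\\))|(?<text>[\\w!* ]+)");

    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0; // end of the previous match
        while (m.find()) {
            if (m.start() > pos) {
                // Input that no alternative matched: keep it instead of dropping it.
                tokens.add(new Token(Type.UNKNOWN, input.substring(pos, m.start())));
            }
            if (m.group("hex") != null) {
                tokens.add(new Token(Type.HEX, m.group("hex")));
            } else if (m.group("action") != null) {
                tokens.add(new Token(Type.ACTION, m.group("action")));
            } else {
                tokens.add(new Token(Type.TEXT, m.group("text")));
            }
            pos = m.end();
        }
        if (pos < input.length()) {
            tokens.add(new Token(Type.UNKNOWN, input.substring(pos)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "Some <#000000>*text* [<#ffffff>Some more](action: Other <#gradient>text) and **finally** some more <#000>text!";
        for (Token t : tokenize(input)) {
            System.out.println(t);
        }
    }
}
```

Whether unmatched input should become an UNKNOWN token, be folded into TEXT, or raise an error is a design choice that the answer leaves open.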
Answer 2
Score: 0
You can achieve this using regex capturing groups, like this:
^(.*?) (?<hex1><#\w+>)(\*[^*]*\*) (?<action>\[[^]]*]\([^]]*\)) (.*?) (?<hex2><#\w+>)(.*)$
For a better understanding, refer to this link.
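Because this pattern is anchored with ^ and $ and spells out the whole line, it only matches input with exactly the same sequence of parts as the sample. A minimal Java sketch of using it is shown below. The FixedShapeMatch class and its split helper are hypothetical names used for illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FixedShapeMatch {
    private static final Pattern LINE = Pattern.compile(
        "^(.*?) (?<hex1><#\\w+>)(\\*[^*]*\\*) (?<action>\\[[^]]*]\\([^]]*\\)) (.*?) (?<hex2><#\\w+>)(.*)$");

    // Hypothetical helper: returns the captured parts in order,
    // or null when the line does not have the expected shape.
    static String[] split(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] parts = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            parts[i - 1] = m.group(i); // named groups are numbered as well
        }
        return parts;
    }

    public static void main(String[] args) {
        String input = "Some <#000000>*text* [<#ffffff>Some more](action: Other <#gradient>text) and **finally** some more <#000>text!";
        System.out.println(Arrays.toString(split(input)));
        // A line with a different layout is rejected as a whole.
        System.out.println(Arrays.toString(split("plain <#fff>text")));
    }
}
```

For input whose parts can appear in any order or number, the alternation with find() from the first answer is the more flexible fit.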