Java/Groovy regex parse Key-Value pairs without delimiters
Question
I am having trouble extracting key-value pairs with my regex.
Code so far:
String raw = '''
MA1
D. Mueller Gießer
MA2 Peter
Mustermann 2. Mann
MA3 Ulrike Mastorius Schmelzer
MA4 Heiner Becker
s 3.Mann
MA5 Rudolf Peters
Gießer
'''
Map map = [:]
ArrayList<String> split = raw.findAll("(MA\\d)+(.*)"){ full, name, value -> map[name] = value }
println map
Output is:
[MA1:, MA2: Peter, MA3: Ulrike Mastorius Schmelzer, MA4: Heiner Becker, MA5: Rudolf Peters]
In my case the keys are:
MA1, MA2, MA3, and so on, i.e. MA\d (MA followed by a single digit).
The value is everything up to the next key (including line breaks, tabs, spaces, etc.).
Does anybody have a clue how to do this?
Thanks in advance,
Sebastian
Answer 1
Score: 3
You can capture in the second group everything that follows the key on its line, plus all subsequent lines that do not start with a key:
^(MA\d+)(.*(?:\R(?!MA\d).*)*)
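The pattern matches:
- `^` Start of the string (or start of each line when Pattern.MULTILINE is enabled, which is needed here to find every key)
- `(MA\d+)` Capture group 1, matching MA and 1+ digits
- `(` Open capture group 2
- `.*` Match the rest of the line
- `(?:\R(?!MA\d).*)*` Match all following lines that do not start with MA followed by a digit, where `\R` matches any Unicode newline sequence
- `)` Close group 2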
In Java, with doubled backslashes:
final String regex = "^(MA\\d+)(.*(?:\\R(?!MA\\d).*)*)";
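As a minimal sketch of how this pattern could be used to fill a map, the following compiles it with Pattern.MULTILINE so that `^` anchors at every line start. The class name `KeyValueParser`, the `parse` helper, and the `trim()` call on each value are assumptions added for illustration; they are not part of the answer.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeyValueParser {
    // Pattern from the answer; MULTILINE makes ^ match at the start of every line.
    private static final Pattern PAIR =
            Pattern.compile("^(MA\\d+)(.*(?:\\R(?!MA\\d).*)*)", Pattern.MULTILINE);

    // Hypothetical helper: collects each key with its (possibly multi-line) value.
    public static Map<String, String> parse(String raw) {
        Map<String, String> map = new LinkedHashMap<>(); // keeps keys in input order
        Matcher m = PAIR.matcher(raw);
        while (m.find()) {
            // trim() is an assumption: it drops the whitespace or line break
            // that separates the key from its value.
            map.put(m.group(1), m.group(2).trim());
        }
        return map;
    }

    public static void main(String[] args) {
        // Sample input from the question ("\u00df" is the escaped form of the sharp s).
        String raw = "\nMA1\nD. Mueller Gie\u00dfer\nMA2 Peter\nMustermann 2. Mann\n"
                + "MA3 Ulrike Mastorius Schmelzer\nMA4 Heiner Becker\ns 3.Mann\n"
                + "MA5 Rudolf Peters\nGie\u00dfer\n";
        // Line breaks inside a value are shown as " | " to keep one pair per output line.
        parse(raw).forEach((k, v) -> System.out.println(k + " -> " + v.replace("\n", " | ")));
    }
}
```

Note that a value line which itself begins with MA and a digit would be treated as a new key; that limitation follows from the pattern, not from this wrapper.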
Answer 2
Score: 0
[1]: https://regex101.com/r/NOkli9/1
Use
(?ms)^(MA\d+)(.*?)(?=\nMA\d|\z)
See [proof][1].
Explanation
EXPLANATION
--------------------------------------------------------------------------------
(?ms) set flags for this block (with ^ and $
matching start and end of line) (with .
matching \n) (case-sensitive) (matching
whitespace and # normally)
--------------------------------------------------------------------------------
^ the beginning of a "line"
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
MA 'MA'
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
.*? any character (0 or more times (matching
the least amount possible))
--------------------------------------------------------------------------------
) end of \2
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
\n '\n' (newline)
--------------------------------------------------------------------------------
MA 'MA'
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\z the end of the string
--------------------------------------------------------------------------------
) end of look-ahead
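A minimal sketch of the same idea in Java, assuming the values should be whitespace-trimmed before being stored. The class name `LazyKeyValueParser` and the `parse` helper are hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyKeyValueParser {
    // (?m): ^ matches at each line start; (?s): . also matches line terminators.
    // The lazy (.*?) stops as soon as the look-ahead sees "\nMA<digit>" or the end of input.
    private static final Pattern PAIR =
            Pattern.compile("(?ms)^(MA\\d+)(.*?)(?=\\nMA\\d|\\z)");

    // Hypothetical helper: returns keys mapped to their trimmed values, in input order.
    public static Map<String, String> parse(String raw) {
        Map<String, String> map = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(raw);
        while (m.find()) {
            // trim() is an assumption, not part of the answer; it removes the
            // separator after the key and any trailing line break.
            map.put(m.group(1), m.group(2).trim());
        }
        return map;
    }

    public static void main(String[] args) {
        String raw = "\nMA1\nD. Mueller\nMA2 Peter\nMustermann 2. Mann\nMA3 Ulrike\n";
        parse(raw).forEach((k, v) -> System.out.println(k + " -> " + v.replace("\n", " | ")));
    }
}
```

Because the look-ahead uses a literal `\n`, input with `\r\n` line endings would leave a trailing `\r` in each captured group; in this sketch the `trim()` call happens to remove it.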