Java/Groovy regex parse Key-Value pairs without delimiters

Question

I'm having trouble extracting key-value pairs with my regex.

Code so far:

    String raw = '''
    MA1

    D. Mueller Gießer
    MA2 Peter
    Mustermann 2. Mann
    MA3 Ulrike Mastorius Schmelzer
    MA4 Heiner Becker
    s 3.Mann

    MA5 Rudolf Peters
    Gießer
    '''
    Map map = [:]
    ArrayList<String> split = raw.findAll("(MA\\d)+(.*)"){ full, name, value -> map[name] = value }
    println map

Output is:
[MA1:, MA2: Peter, MA3: Ulrike Mastorius Schmelzer, MA4: Heiner Becker, MA5: Rudolf Peters]

In my case the keys are:
MA1, MA2, MA3, and in general MA\d (MA followed by a single digit).

The value is everything up to the next key, including line breaks, tabs, spaces, and so on.

Does anybody have a clue how to do this?

Thanks in advance,
Sebastian

Answer 1

Score: 3

You can capture in the second group everything that follows the key on the same line, plus all following lines that do not start with a key:

    ^(MA\d+)(.*(?:\R(?!MA\d).*)*)

The pattern matches:

  • ^ Start of a line. This needs multiline mode ((?m) or Pattern.MULTILINE); without it, ^ matches only at the very start of the input
  • (MA\d+) Capture group 1, matching MA and 1 or more digits
  • ( Capture group 2
    • .* Match the rest of the line
    • (?:\R(?!MA\d).*)* Match all following lines that do not start with MA followed by a digit, where \R matches any Unicode newline sequence
  • ) Close group 2

In Java, double the backslashes and compile the pattern with Pattern.MULTILINE:

    final String regex = "^(MA\\d+)(.*(?:\\R(?!MA\\d).*)*)";
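To make the pattern concrete, here is a minimal Java sketch that collects the matches into a map. The class name LineLookaheadParser, the parse method, and the decision to trim each value are illustrative assumptions, not part of the original answer; only the regex itself comes from the answer.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not from the original answer.
final class LineLookaheadParser {

    // MULTILINE lets ^ match at the start of every line, not only at index 0.
    private static final Pattern ENTRY =
            Pattern.compile("^(MA\\d+)(.*(?:\\R(?!MA\\d).*)*)", Pattern.MULTILINE);

    static Map<String, String> parse(String raw) {
        // LinkedHashMap keeps the keys in the order they appear in the input.
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(raw);
        while (m.find()) {
            // Trimming is an assumption; remove trim() to keep the raw whitespace.
            result.put(m.group(1), m.group(2).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String raw = "\nMA1\n\nD. Mueller Gießer\nMA2 Peter\nMustermann 2. Mann\n"
                + "MA3 Ulrike Mastorius Schmelzer\nMA4 Heiner Becker\ns 3.Mann\n\n"
                + "MA5 Rudolf Peters\nGießer\n";
        // Print line breaks inside values as a literal \n so each entry stays on one line.
        parse(raw).forEach((k, v) -> System.out.println(k + " -> " + v.replace("\n", "\\n")));
    }
}
```

With the sample input from the question, MA1 should map to "D. Mueller Gießer" even though the value starts two lines below the key, because group 2 keeps consuming lines until one starts with MA and a digit.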

Answer 2

Score: 0

Use

    (?ms)^(MA\d+)(.*?)(?=\nMA\d|\z)

See proof: https://regex101.com/r/NOkli9/1

Explanation

  • (?ms) Set flags for the pattern: m makes ^ and $ match at the start and end of each line, s makes . match \n as well
  • ^ The beginning of a line
  • (MA\d+) Capture group 1, matching MA and 1 or more digits (as many as possible)
  • (.*?) Capture group 2, matching any characters, 0 or more times (as few as possible)
  • (?= Look ahead to see if there is:
    • \nMA\d A newline followed by MA and a digit
    • | OR
    • \z The end of the string
  • ) End of look-ahead
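The lazy-match variant can be tried in Java in the same way. The following is a rough sketch under the same assumptions as before: the class name LazyLookaheadParser, the parse method, and the trimming of values are hypothetical additions; the regex is the one from this answer with the backslashes doubled.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not from the original answer.
final class LazyLookaheadParser {

    // (?m): ^ matches at every line start. (?s): . also matches line breaks.
    private static final Pattern ENTRY =
            Pattern.compile("(?ms)^(MA\\d+)(.*?)(?=\\nMA\\d|\\z)");

    static Map<String, String> parse(String raw) {
        // LinkedHashMap keeps the keys in the order they appear in the input.
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(raw);
        while (m.find()) {
            // Trimming is an assumption; remove trim() to keep the raw whitespace.
            result.put(m.group(1), m.group(2).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String raw = "\nMA1\n\nD. Mueller Gießer\nMA2 Peter\nMustermann 2. Mann\n"
                + "MA3 Ulrike Mastorius Schmelzer\nMA4 Heiner Becker\ns 3.Mann\n\n"
                + "MA5 Rudolf Peters\nGießer\n";
        // Print line breaks inside values as a literal \n so each entry stays on one line.
        parse(raw).forEach((k, v) -> System.out.println(k + " -> " + v.replace("\n", "\\n")));
    }
}
```

One difference from the first answer is worth noting: the look-ahead here only recognizes \n before the next key, so input with \r\n line endings would leave a trailing \r in each untrimmed value, whereas \R in the first pattern covers both.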

huangapple
  • Posted on September 17, 2020, 16:54:29
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63934578.html