Regex alphanumeric dynamic block grouping
Question
I need help: I have tried several ways to write a regular expression that splits each line into two groups, but there is one variation I don't know how to handle.
Words:
589011 T CANNELLE FLAVOR 8
538160 TP0597C ORANGE FLAVOR
557137 APL0397C STRAWBERRY FLAVOR
556137 APL0397C STRAWBERRY FLAVOR
545129 1APL0397C MANGO FLAVOR
984320 TRANS DECENAL
533160 TP0597C ORANGE FLAVOR
530373 A PINEAPPLE FLAVOR
059311 A STRAWBERRY FLAVOR
508142 T LEMON FLAVOR
547261 A BLACKBERRY FLAVOR
536564 2T BLUEBERRY STRAWBERRY
055406 A MILK FLAVOR
054945 A BLACKBERRY FLAVOR
983347 SUSTANE
902040 ACETIC ACID
Groups 1:
589011 T
538160 TP0597C
557137 APL0397C
556137 APL0397C
545129 1APL0397C
984320
533160 TP0597C
530373 A
059311 A
508142 T
547261 A
536564 2T
055406 A
054945 A
983347
902040
Groups 2:
CANNELLE FLAVOR
ORANGE FLAVOR
STRAWBERRY FLAVOR
STRAWBERRY FLAVOR
MANGO FLAVOR
TRANS DECENAL
ORANGE FLAVOR
PINEAPPLE FLAVOR
STRAWBERRY FLAVOR
LEMON FLAVOR
BLACKBERRY FLAVOR
BLUEBERRY STRAWBERRY
MILK FLAVOR
BLACKBERRY FLAVOR
SUSTANE
ACETIC ACID
Regex Test:
(\d+\s?[A-Z0-9]*)\s+([A-Z\s]+)\s
^([0-9\s]+(?:[A-Z0-9]+)?)\s(?:[A-Z\s]+)
My difficulty is with the following two lines, where the groups should come out like this:
988320 TRANS DECENAL 1 KGM 12164500 JC4 20.0000 KILO 263.98/KG 5,279.60
903040 ACETIC ACID 1 KGM 12164500 DJ4 100.0000 KILO 3.17/KG 317.00
Group 1:
988320
903040
Group 2:
TRANS DECENAL
ACETIC ACID
and not like that:
Group 1:
988320 TRANS
903040 ACETIC
Group 2:
DECENAL
ACID
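To make the failure concrete, here is a minimal Python sketch that runs the first pattern from "Regex Test" against the two problem lines. The question does not say which regex engine is used, so Python's `re` module and the helper name `split_line` are assumptions made only for illustration.

```python
import re

# First pattern from the question, unchanged.
pattern = re.compile(r"(\d+\s?[A-Z0-9]*)\s+([A-Z\s]+)\s")

lines = [
    "988320 TRANS DECENAL 1 KGM 12164500 JC4 20.0000 KILO 263.98/KG 5,279.60",
    "903040 ACETIC ACID 1 KGM 12164500 DJ4 100.0000 KILO 3.17/KG 317.00",
]

def split_line(line):
    """Return (group 1, group 2) for a line, or None if nothing matches."""
    m = pattern.search(line)
    return m.groups() if m else None

for line in lines:
    # [A-Z0-9]* cannot tell a product code from the first word of the name,
    # so it swallows TRANS and ACETIC into group 1.
    print(split_line(line))
```

The optional `[A-Z0-9]*` accepts any run of letters, so the first word of the name is treated as if it were a code. The answers below add a condition on that token (a single letter, or at least one digit).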
Answer 1
Score: 1
^(\d+\s?\b(?:[A-Z]|[A-Z]*\d[A-Z\d]*)?\b)\s*([A-Z ]*\b)
Group 1 matches the digits, optionally followed by either a single-letter word or a word that mixes letters and digits and contains at least one digit. Group 2 captures the letter-only words that come after that, together with the spaces between those words; any whitespace matched by \s* between the two groups is not captured.
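As a quick check of this pattern, the sketch below applies it to a few lines from the question. Python's `re` module and the helper name `split_answer1` are assumptions for illustration; the answer itself does not name an engine. Both groups are passed through `strip()`, because `\s?` sits inside group 1 and `[A-Z ]*` can end on a space when a digit follows, so the raw captures may carry a trailing space.

```python
import re

# Pattern from Answer 1, unchanged.
ANSWER1 = re.compile(r"^(\d+\s?\b(?:[A-Z]|[A-Z]*\d[A-Z\d]*)?\b)\s*([A-Z ]*\b)")

def split_answer1(line):
    """Return (group 1, group 2) with surrounding spaces removed, or None."""
    m = ANSWER1.match(line)
    if m is None:
        return None
    # strip() is an addition of this sketch, not part of the answer.
    return m.group(1).strip(), m.group(2).strip()

samples = [
    "589011 T CANNELLE FLAVOR 8",
    "545129 1APL0397C MANGO FLAVOR",
    "536564 2T BLUEBERRY STRAWBERRY",
    "988320 TRANS DECENAL 1 KGM 12164500 JC4 20.0000 KILO 263.98/KG 5,279.60",
]

for line in samples:
    print(split_answer1(line))
```

On the code-less line, the optional token fails both alternatives (TRANS is longer than one letter and has no digit), so group 1 falls back to the number alone.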
Answer 2
Score: 0
You can try (regex101):
(\d{6}\s(?:[A-Z]\b|(?=\S*\d)[A-Z\d]+)?)\s*(.*)
This will match:
- GROUP 1: six digits and a space, optionally followed by a single letter OR by a token of letters and digits that contains at least one digit
- GROUP 2: the rest of the line
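The following sketch tries this pattern on sample lines from the question. Python's `re` module and the helper name `split_answer2` are assumptions for illustration. Since the pattern has no `^`, the sketch uses `re.match` to anchor it at the start of the line, and it strips both groups because the `\s` after the six digits is inside group 1.

```python
import re

# Pattern from Answer 2, unchanged.
ANSWER2 = re.compile(r"(\d{6}\s(?:[A-Z]\b|(?=\S*\d)[A-Z\d]+)?)\s*(.*)")

def split_answer2(line):
    """Return (group 1, group 2) with surrounding spaces removed, or None."""
    m = ANSWER2.match(line)  # match() anchors at the start of the line
    if m is None:
        return None
    # strip() is an addition of this sketch, not part of the answer.
    return m.group(1).strip(), m.group(2).strip()

samples = [
    "538160 TP0597C ORANGE FLAVOR",
    "055406 A MILK FLAVOR",
    "984320 TRANS DECENAL",
    "903040 ACETIC ACID 1 KGM 12164500 DJ4 100.0000 KILO 3.17/KG 317.00",
]

for line in samples:
    print(split_answer2(line))
```

Because group 2 is `(.*)`, it takes everything that remains. On the long invoice-style lines that includes the quantity and price columns, so an extra step would be needed if only the product name is wanted.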
Answer 3
Score: 0
I have assumed Group 1 and Group 2 are formed using the following decision rule.
- The string must begin with six digits (string "1"), followed by a space, followed by a string made up of one or more capital letters and digits (string "2");
- Group 1 consists of string 1, a space, and string 2 if string 2 is followed by two instances of a space followed by a word made up of capital letters, those instances being followed by a space or by the end of the string; otherwise Group 1 consists of string 1 only; and
- Group 2 consists of the word of capital letters that follows the space after Group 1, plus zero or more instances of a space followed by a word of capital letters, as many such instances as possible.
One can do that using the following regular expression, where capture groups 1 and 2 will contain Groups 1 and 2, respectively (provided there is a match, of course).
^(\d{6}(?: [A-Z\d]+(?=(?: [A-Z][A-Z\d]*){2}))?) ([A-Z]+(?: [A-Z]+)*)
This regular expression can be broken down as follows.
^
(                    # begin capture group 1
  \d{6}              # match 6 digits
  (?:                # begin a non-capture group
    [ ]              # match a space
    [A-Z\d]+         # match one or more (+) uppercase letters or digits
    (?=              # begin a positive lookahead
      (?:            # begin a non-capture group
        [ ][A-Z]     # match a space followed by an uppercase letter
        [A-Z\d]*     # match zero or more (*) uppercase letters or digits
      ){2}           # end non-capture group and execute it twice
    )                # end positive lookahead
  )?                 # end non-capture group and make it optional
)                    # end capture group 1
[ ]                  # match a space
(                    # begin capture group 2
  [A-Z]+             # match one or more (+) uppercase letters
  (?:                # begin a non-capture group
    [ ]              # match a space
    [A-Z]+           # match one or more (+) uppercase letters
  )*                 # end non-capture group and execute it zero or more times (*)
)                    # end capture group 2
In the above I have represented each space character as a character class containing a space merely to make the space visible.
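The free-spacing layout above carries over almost directly to engines that support it. The sketch below uses Python's `re.VERBOSE` flag, which is an assumption since the answer does not name an engine; the helper name `split_answer3` is likewise only illustrative. In verbose mode unescaped whitespace is ignored, which is why each literal space stays written as `[ ]`.

```python
import re

# Pattern from Answer 3 in free-spacing (verbose) form.
ANSWER3 = re.compile(
    r"""
    ^
    (                                  # capture group 1
      \d{6}                            # six digits
      (?:
        [ ][A-Z\d]+                    # candidate code token
        (?=(?:[ ][A-Z][A-Z\d]*){2})    # kept only if two more words follow
      )?
    )
    [ ]                                # the space between the groups
    (                                  # capture group 2
      [A-Z]+(?:[ ][A-Z]+)*             # one or more letter-only words
    )
    """,
    re.VERBOSE,
)

def split_answer3(line):
    """Return (group 1, group 2), or None if the line does not match."""
    m = ANSWER3.match(line)
    return (m.group(1), m.group(2)) if m else None

samples = [
    "538160 TP0597C ORANGE FLAVOR",
    "983347 SUSTANE",
    "988320 TRANS DECENAL 1 KGM 12164500 JC4 20.0000 KILO 263.98/KG 5,279.60",
]

for line in samples:
    print(split_answer3(line))
```

One consequence of the two-word lookahead is worth keeping in mind: a line without a code whose name has three or more words (a hypothetical input such as "984320 TRANS DECENAL EXTRA", which does not appear in the question) would have its first word pulled into Group 1.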