Extracting words broken up by white space and some specific characters

Question

I'm trying to extract 'words' from a string, specifically 'date' string components.

oct 12:30
2023 09:05 04
%yyyy %hh:%ii %mm
mar 2, 1945 * matches "2," instead of "2"
mar 2,1945  * matches "2,1945" instead of "2" "1945"
mar2,1945   * ideally, "mar2" should be "mar" "2" 
01-02-03
04:05:06

I think I'm pretty close:
((^|%|[0-9]).+?(?=[,:]|\W|$))

But this extracts "2,1945" as one item.
I tried ((^|%|[0-9]).+?(?=[[^,]:]|\W|$)), but that didn't help at all.

Basically, I need every word that is separated by white space or by non-alphanumeric characters (e.g. : / - etc.). Words should also be split wherever the alpha/numeric pattern changes (e.g. mar2 should match mar and 2 separately).

Answer 1

Score: 0

(\d{1,4}|\w{1,10}|%\w{1,4})

\d{1,4} matches 1 to 4 digits (covers all the numbers)
or
\w{1,10} matches 1 to 10 word characters (covers all the month names)
or
%\w{1,4} matches a % followed by 1 to 4 word characters

mar 2,1945 -> mar 2 1945

Note that \w also matches digits and underscores, so as written "mar2" stays in one piece and %5 is matched as well. If you don't want that, change \w to [a-zA-Z]; then:

mar2,1945 -> mar 2 1945
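
To compare the pattern as written with the [a-zA-Z] variant, here is a minimal sketch. Python and the helper name `tokens` are my own assumptions, since the answer does not name a language; the two patterns are the one from the answer and the same one with \w swapped for [a-zA-Z].

```python
import re

# The pattern exactly as given in the answer.
AS_WRITTEN = re.compile(r"(\d{1,4}|\w{1,10}|%\w{1,4})")
# Same pattern with \w replaced by [a-zA-Z], as the answer suggests.
LETTERS_ONLY = re.compile(r"(\d{1,4}|[a-zA-Z]{1,10}|%[a-zA-Z]{1,4})")


def tokens(pattern, text):
    """Return every non-overlapping match of `pattern` in `text` (hypothetical helper)."""
    return pattern.findall(text)


for sample in ["mar 2,1945", "mar2,1945", "%yyyy %hh:%ii %mm", "%5"]:
    print(sample)
    print("  as written  :", tokens(AS_WRITTEN, sample))
    print("  letters only:", tokens(LETTERS_ONLY, sample))
```

Because \w includes digits, the as-written pattern keeps "mar2" together, while the letters-only variant splits it into "mar" and "2". The {1,4} and {1,10} limits also mean longer runs of digits or letters would be cut into several pieces, which may or may not be what you want.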

Answer 2

Score: 0

It is not entirely clear what input could be provided, so I'm partially guessing here.

Based on the combination of provided examples, I would suggest using this:

`%?[a-zA-Z]+|%?\d+[a-zA-Z]*`

It will match an optional `%` followed by letters, or an optional `%` followed by digits and optional letters.

Example:
```none
oct 12:30 : ['oct', '12', '30']
2023 09:05 04 : ['2023', '09', '05', '04']
%yyyy %hh:%ii %mm : ['%yyyy', '%hh', '%ii', '%mm']
mar 2, 1945 : ['mar', '2', '1945']
mar 2,1945 : ['mar', '2', '1945']
mar2,1945 : ['mar', '2', '1945']
01-02-03 : ['01', '02', '03']
04:05:06 : ['04', '05', '06']
10th of April, 2023 : ['10th', 'of', 'April', '2023']
%d%Od of %MM, %yyyy : ['%d', '%Od', 'of', '%MM', '%yyyy']
```

Demo here.
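
For anyone who wants to reproduce the list above locally, a minimal sketch follows. It assumes Python's `re` module (the output format above looks like Python lists, but the language is my assumption), and `split_date_parts` is a hypothetical helper name.

```python
import re

# Optional % followed by letters, or optional % followed by digits and optional letters.
TOKEN = re.compile(r"%?[a-zA-Z]+|%?\d+[a-zA-Z]*")


def split_date_parts(text):
    """Return the date-like components found in `text` (hypothetical helper)."""
    return TOKEN.findall(text)


samples = [
    "oct 12:30",
    "2023 09:05 04",
    "%yyyy %hh:%ii %mm",
    "mar 2, 1945",
    "mar 2,1945",
    "mar2,1945",
    "01-02-03",
    "04:05:06",
    "10th of April, 2023",
    "%d%Od of %MM, %yyyy",
]

for sample in samples:
    print(f"{sample} : {split_date_parts(sample)}")
```

Since the pattern has no capturing groups, `findall` returns the full matches, so separators such as spaces, commas, colons and hyphens are simply skipped.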

Answer 3

Score: 0

You can try this regex, which has 3 capturing groups:

([a-zA-Z]+)[ ,]*(\d+)\,\s*(\d{4})

Demo here
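
A minimal sketch of how the three groups could be read out, assuming Python's `re` module; `parse_month_day_year` is a hypothetical helper name. Note that this pattern only fits input shaped like "month day, year", so the other samples from the question return no match.

```python
import re

# Three capturing groups: month name, day, four-digit year.
MONTH_DAY_YEAR = re.compile(r"([a-zA-Z]+)[ ,]*(\d+)\,\s*(\d{4})")


def parse_month_day_year(text):
    """Return (month, day, year) as strings, or None if the pattern does not match."""
    match = MONTH_DAY_YEAR.search(text)
    return match.groups() if match else None


for sample in ["mar 2, 1945", "mar 2,1945", "mar2,1945", "oct 12:30", "01-02-03"]:
    print(sample, "->", parse_month_day_year(sample))
```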

huangapple
  • Posted on April 11, 2023 at 01:09:39
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75979123.html