Python regex to match every word in a sentence up to the last word that contains an underscore


Question

I am trying to find a regex that matches every word in a sentence up to, but not including, the last word that contains an underscore.

For example:

13wfe + 123dg Text ldf_dfdlj_dfldjf_dfs test 123

In this example, I want to get only

13wfe + 123dg Text

I have tried something along these lines:

^.*?(?=_)

but it returns

13wfe + 123dg Text ldf

You can find the regex here. Kindly guide me in this scenario.
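
A minimal Python sketch to reproduce the result above, assuming the pattern is applied with re.search (the question does not say which function was used). It suggests why the attempt stops too late: the lazy .*? expands only until the lookahead (?=_) first succeeds, which is just before the first underscore in the string, not before the word that contains it.

```python
import re

sentence = "13wfe + 123dg Text ldf_dfdlj_dfldjf_dfs test 123"

# .*? is lazy: it grows one character at a time and stops at the first
# position where the lookahead (?=_) succeeds, i.e. right before the
# first underscore in the whole string.
attempt = re.search(r"^.*?(?=_)", sentence)

print(repr(attempt.group(0)))
```

The match therefore includes the start of the underscore word ("ldf") instead of ending at the preceding word.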

Update: using the regex provided by @liginity, I am able to find the substring, but it still fails in some cases.

For example, with this input:

13wfe + 123dg Tetest_xt ldf_dfdlj_dfldjf_dfs test 123

it should match only this much:

13wfe + 123dg Tetest_xt

but it finds multiple matches:

13wfe + 123dg and _xt

Answer 1

Score: 1

If you want to treat any run of non-whitespace chars as a "word", you can use \S:

^.*\S(?=[^\S\n]+[^\s_]+_)
  • ^ Start of string
  • .* Match the whole line (greedy), backtracking as needed
  • \S Match a non-whitespace char
  • (?= Positive lookahead, assert that what follows is
    • [^\S\n]+ Match 1+ whitespace chars without newlines
    • [^\s_]+_ Match 1+ non-whitespace chars other than _ and then match the _
  • ) Close the lookahead

Note that if the _ can also be at the beginning of the word, you can use [^\s_]*_ where * repeats zero or more times.

See a regex demo.

To match everything up to the last word that contains an underscore, where a word consists only of word chars \w and the underscore is preceded by at least one non-underscore word char and followed by at least one word char, you can use (?<!\S) and (?!\S) as left-hand and right-hand whitespace boundaries:

^.*(?<!\S)\S+(?=[^\S\n]+[^\W_]+_\w+(?!\S))
  • ^ Start of string
  • .* Match the whole line (greedy), backtracking as needed
  • (?<!\S) Negative lookbehind, assert that there is no non-whitespace char directly to the left
  • \S+ Match 1+ non-whitespace chars
  • (?= Positive lookahead, assert that what follows is
    • [^\S\n]+ Match 1+ whitespace chars without newlines
    • [^\W_]+_\w+ Match 1+ word chars other than _, then match _ and 1+ word chars
    • (?!\S) Negative lookahead, assert that there is no non-whitespace char directly to the right
  • ) Close the lookahead

See another regex demo.
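
The two patterns above can be tried out in Python with a short sketch like the following. The helper name text_before_last_underscore_word and the constant names are hypothetical, chosen only for illustration, and the use of re.search on a single-line string is an assumption.

```python
import re

# Pattern 1 from the answer: a "word" is any run of non-whitespace chars.
ANY_NON_WHITESPACE = re.compile(r"^.*\S(?=[^\S\n]+[^\s_]+_)")

# Pattern 2 from the answer: the underscore word consists of word chars
# only and is delimited by whitespace boundaries on both sides.
WORD_CHARS_ONLY = re.compile(r"^.*(?<!\S)\S+(?=[^\S\n]+[^\W_]+_\w+(?!\S))")


def text_before_last_underscore_word(text, pattern=ANY_NON_WHITESPACE):
    """Return the part of text before the last underscore word, or None.

    Hypothetical helper for illustration; not part of the original answer.
    """
    match = pattern.search(text)
    return match.group(0) if match else None


samples = [
    "13wfe + 123dg Text ldf_dfdlj_dfldjf_dfs test 123",
    "13wfe + 123dg Tetest_xt ldf_dfdlj_dfldjf_dfs test 123",
    # The underscore word contains "-", which is not a word char,
    # so the two patterns disagree on this input.
    "foo a-b_c bar",
]

for sample in samples:
    print(repr(text_before_last_underscore_word(sample, ANY_NON_WHITESPACE)))
    print(repr(text_before_last_underscore_word(sample, WORD_CHARS_ONLY)))
```

Both patterns rely on the greedy .* backtracking from the end of the line, so the lookahead is satisfied at the last qualifying word rather than the first. The third sample shows where the stricter second pattern declines to match.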

huangapple
  • Posted on July 3, 2023 at 19:49:51
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76604458.html