Python regex: find a number of a specific length following at some point after a search string

Question

EDIT: I may have solved my problem using a negative lookbehind, but I am a) worried I've just stumbled across a special case where this works, or b) have it ~correct but extremely stilted / inefficient. Would VERY much appreciate any feedback, and I'd really like to understand why the original pattern wasn't working. Here's the regex pattern that now seems to be working:

qtD = re.compile('qt[^cy].*?(?<!\<|\>|\d)(\d{3})(?:\D|$)', re.I)

Below is the original message:

I am trying to accomplish the following in Python (3.2) using the standard regular expression package, re. This doesn't seem like it should be complicated, but I can't figure out what's going wrong.

Here's an example string:
s = 'EKG done this AM with QT 560 and QTC wnl at 535ms, higher than on exam performed 5/3/2015 which showed...'

What I'm trying to have returned by re.findall() in this case is simply the number '560' - that is, the first number appearing at some point after 'QT' and being in the form of a 3 digit integer.

Here's the regex pattern I'm currently using:

qtD = re.compile('qt[^cy].*?[^\<\>\d](\d{3})(?:\D|$)', re.I)

...so, basically:

  • find QT (but not QTC or QTY)
  • ...possibly followed by any number of characters
  • ...and return the first 3 digit integer you find (\d{3})
  • ...but only if that 3 digit integer is not immediately preceded by "<", ">", or another digit
  • ...and is immediately followed by either the end of the line, $, or a non-digit, \D

I'm searching like this:
re.findall(qtD, s)

The above works fine but ONLY in the case that there is a string (characters or whitespace) that has a length of at least 2 between the 'QT' and the number. In other words, "QT<double space>560" returns 560. "QT interval normal at 560" returns 560. "QT: 560" returns 560.

BUT, if the string is as shown above, "...QT 560...", then the regex will keep reading and return the next 3 digit number, 535.
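The behaviour described above can be reproduced with a short sketch. It assumes the sample string ends where the question truncates it (the trailing ... is kept literally); the names original, one_space and two_spaces are illustrative and not from the post.

```python
import re

# The original pattern from the question, written as a raw string.
original = re.compile(r'qt[^cy].*?[^<>\d](\d{3})(?:\D|$)', re.I)

one_space = ('EKG done this AM with QT 560 and QTC wnl at 535ms, '
             'higher than on exam performed 5/3/2015 which showed...')
# Same text with a second space inserted between QT and the number.
two_spaces = one_space.replace('QT 560', 'QT  560')

print(original.findall(one_space))   # ['535']
print(original.findall(two_spaces))  # ['560']
```

With a single space, [^cy] consumes that space, so the next consuming item [^<>\d] is forced onto a digit and fails; the lazy .*? then keeps growing until the pattern fits at 535.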

I have tried other things, like making the [^\<\>\d] lazy, i.e., [^\<\>\d]? or repeating 0 or 1 times [^\<\>\d]{0,1} but then it will start doing things like returning the "015" from 5/3/2015 if it doesn't find any 3 digit numbers earlier than that, i.e., if 560 and 535 weren't there, in which case I'd want an empty list returned.
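A small sketch of the "015" symptom, using a hypothetical note that contains no 3 digit QT value (the string and the variable names are made up for illustration). When the optional class matches zero characters, nothing stops the lazy .*? from ending in the middle of a longer number, whereas a lookbehind still checks the preceding character without consuming it.

```python
import re

# Hypothetical note: the only digits are in the date.
note = 'QT wnl, higher than on exam performed 5/3/2015 which showed...'

# Attempt from the question: the "not <, > or digit" class made optional.
optional_class = re.compile(r'qt[^cy].*?[^<>\d]?(\d{3})(?:\D|$)', re.I)
# The edited pattern from the question, with the class as a lookbehind.
lookbehind = re.compile(r'qt[^cy].*?(?<![<>\d])(\d{3})(?:\D|$)', re.I)

print(optional_class.findall(note))  # ['015']
print(lookbehind.findall(note))      # []
```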

Thank you for any help.

Answer 1

Score: 2

In your pattern, [^cy] and [^\<\>\d] each consume one character, so you are matching at least 2 characters between qt and the digits.

Instead you might use 2 lookaround assertions:

\bqt(?![cy]).*?(?<![\d<>])(\d{3})(?:\D|$)

The pattern matches:

  • \bqt A word boundary to prevent a partial word match, then match qt
  • (?![cy]) Negative lookahead, assert not c or y directly to the right
  • .*? Match any character, as few as possible
  • (?<![\d<>]) Negative lookbehind, assert not a digit or < or > to the left
  • (\d{3}) Capture 3 digits in group 1 (returned by re.findall)
  • (?:\D|$) Match either a non digit or assert the end of the string
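The breakdown above can be written out as a commented pattern using re.VERBOSE. This is only a sketch: the name qt_value and the short test strings are assumptions for illustration, and re.I is carried over from the question.

```python
import re

qt_value = re.compile(r'''
    \bqt            # word boundary, then qt
    (?![cy])        # not directly followed by c or y
    .*?             # any characters, as few as possible
    (?<![\d<>])     # not directly preceded by a digit, < or >
    (\d{3})         # group 1: exactly three digits
    (?:\D|$)        # then a non-digit or the end of the string
''', re.I | re.X)

s = ('EKG done this AM with QT 560 and QTC wnl at 535ms, '
     'higher than on exam performed 5/3/2015 which showed...')

print(qt_value.findall(s))          # ['560']
print(qt_value.findall('QT560'))    # ['560']
print(qt_value.findall('QTC 560'))  # []
```

Because both checks around the gap are zero-width, the gap between qt and the number may be empty, which is why QT560 also matches.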

This will then also match 560 in QT560 as it can be

> possibly followed by any number of characters

Or matching the 3 digits first, and then the lookbehind assertion:

\bqt(?![cy]).*?(\d{3})(?<![\d<>]\d{3})(?:\D|$)
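A quick way to check that the two variants agree is to run both over a few inputs. The sample strings and the names before and after below are illustrative assumptions. Python's re module requires fixed-width lookbehinds, and [\d<>]\d{3} is always four characters wide, so the second form compiles.

```python
import re

# Lookbehind placed before the digits.
before = re.compile(r'\bqt(?![cy]).*?(?<![\d<>])(\d{3})(?:\D|$)', re.I)
# Lookbehind placed after the digits, looking back over them.
after = re.compile(r'\bqt(?![cy]).*?(\d{3})(?<![\d<>]\d{3})(?:\D|$)', re.I)

samples = ['QT 560 and QTC wnl at 535ms', 'QT560', 'QT: 560',
           'QTC 560', 'QT 5601', 'QT <560']

for text in samples:
    # Both columns should show the same list for every sample.
    print(repr(text), before.findall(text), after.findall(text))
```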

Answer 2

Score: 0

Generally, the problem is that this section

[^cy]    # requires 1 character
.*? 
[^<>\d]  # requires 1 character

requires a minimum of 2 characters between qt and \d{3},

whereas this

[^cy]        # requires 1 character
.*?
(?<![<>\d])  # requires 0 characters

requires a minimum of 1 character between qt and \d{3}.

Since there is only 1 character between QT and 560, this regex
qt[^cy].*?[^<>\d](\d{3})(?:\D|$) will only match QT 560 and QTC wnl at 535m
in your sample string, where 535 is in capture group 1.
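The minimum-gap argument can be checked directly on short inputs. In this sketch the names two_consumers and one_consumer are illustrative; the patterns are the original one and the edited one from the question.

```python
import re

# Original pattern: [^cy] and [^<>\d] each consume one character.
two_consumers = re.compile(r'qt[^cy].*?[^<>\d](\d{3})(?:\D|$)', re.I)
# Edited pattern: only [^cy] consumes; the lookbehind consumes nothing.
one_consumer = re.compile(r'qt[^cy].*?(?<![<>\d])(\d{3})(?:\D|$)', re.I)

for text in ('QT560', 'QT 560', 'QT  560'):
    print(repr(text), two_consumers.findall(text), one_consumer.findall(text))
# 'QT560'   [] []
# 'QT 560'  [] ['560']
# 'QT  560' ['560'] ['560']
```

Note that even the edited pattern finds nothing in QT560, because [^cy] still has to consume one character and that character is the 5.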

You can see this if you add another space between QT and 560
https://regex101.com/r/yMMjtD/1

This, or something similar, might solve the problem better
qt(?![cy]).+?(?<![<>\d])(\d{3})(?:\D|$)
https://regex101.com/r/JjB7kL/1
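One detail worth checking in the suggested pattern: .+? needs at least one character between qt and the digits, so QT560 is not matched. The sketch below compares it with a .*? variant, which is my own assumption for illustration and not part of this answer.

```python
import re

# Pattern suggested in this answer: at least one character after qt.
plus = re.compile(r'qt(?![cy]).+?(?<![<>\d])(\d{3})(?:\D|$)', re.I)
# Hypothetical variant allowing an empty gap.
star = re.compile(r'qt(?![cy]).*?(?<![<>\d])(\d{3})(?:\D|$)', re.I)

print(plus.findall('QT 560'))  # ['560']
print(plus.findall('QT560'))   # []
print(star.findall('QT560'))   # ['560']
```

Which of the two is preferable depends on whether QT560 without a separator should count as a hit.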

Answer 3

Score: 0

> "find QT (but not QTC or QTY)"

QT(?![CY])

> "possibly followed by any number of characters"

QT(?![CY])[^\d]*

> "and return the first 3 digit integer you find (\d{3})"

QT(?![CY])[^\d]*(\d{3})

> "but only if that 3 digit integer is not immediately preceded by "<", ">", or another digit"

I'm not sure I understand "or another digit", as the first digit would already have been matched by \d{3}.

QT(?![CY])[^\d]*(?<!<|>)(\d{3})

> "and is immediately followed by either the end of the line, $, or a non-digit, \D"

QT(?![CY])[^\d]*(?<!<|>)(\d{3})(?:$|\D)

For example, the following string values will contain a match, and capture the value "560".

EKG done this AM with QT 560 and QTC wnl at 535ms, higher than on exam performed 5/3/2015 which showed...
EKG done this AM with QT560 and QTC wnl at 535ms, higher than on exam performed 5/3/2015 which showed...
EKG done this AM with QT: 560 and QTC wnl at 535ms, higher than on exam performed 5/3/2015 which showed...
EKG done this AM with QT abc 560 and QTC wnl at 535ms, higher than on exam performed 5/3/2015 which showed...
EKG done this AM with QT560
EKG done this AM with QT560a

And, these will not.

EKG done this AM with QTC 560 and QTC wnl at 535ms, higher than on exam performed 5/3/2015 which showed...
EKG done this AM with QTY 560 and QTC wnl at 535ms, higher than on exam performed 5/3/2015 which showed...
EKG done this AM with QT >560 and QTC wnl at 535ms, higher than on exam performed 5/3/2015 which showed...
EKG done this AM with QT <560 and QTC wnl at 535ms, higher than on exam performed 5/3/2015 which showed...
EKG done this AM with QT 5601 and QTC wnl at 535ms, higher than on exam performed 5/3/2015 which showed...
EKG done this AM with QT 56 and QTC wnl at 535ms, higher than on exam performed 5/3/2015 which showed...
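The two lists above can be checked with a short script. This is a sketch that uses shortened versions of the sample strings (the tail variable and list names are illustrative); the pattern is the final one from this answer and is case-sensitive as written.

```python
import re

pattern = re.compile(r'QT(?![CY])[^\d]*(?<!<|>)(\d{3})(?:$|\D)')

tail = ' and QTC wnl at 535ms'
should_match = ['QT 560' + tail, 'QT560' + tail, 'QT: 560' + tail,
                'QT abc 560' + tail, 'QT560', 'QT560a']
should_not_match = ['QTC 560' + tail, 'QTY 560' + tail, 'QT >560' + tail,
                    'QT <560' + tail, 'QT 5601' + tail, 'QT 56' + tail]

for text in should_match:
    # Each of these captures 560 in group 1.
    print(repr(text), pattern.search(text).group(1))
for text in should_not_match:
    # Each of these prints None.
    print(repr(text), pattern.search(text))
```

Because [^\d]* cannot cross a digit, this pattern only ever considers the first number after QT. For an input such as QT >560 and QTC wnl at 535ms it gives no match, whereas the .*? patterns in the other answers scan ahead and return 535.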

huangapple
  • Posted on June 2, 2023, 02:16:31
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76384655.html