python regex find number of a specific length following at some point after search string
Question
EDIT: I may have solved my problem using a negative lookbehind, but I am a) worried I've just stumbled across a special case where this works, or b) have it ~correct but extremely stilted / inefficient. Would VERY much appreciate any feedback, and I'd really like to understand why the original pattern wasn't working. Here's the regex pattern that now seems to be working:
qtD = re.compile('qt[^cy].*?(?<!\<|\>|\d)(\d{3})(?:\D|$)', re.I)
Below is the original message:
I am trying to accomplish the following in Python (3.2) using the standard regular expression package, re. This doesn't seem like it should be complicated, but I can't figure out what's going wrong.
Here's an example string:
s = 'EKG done this AM with QT 560 and QTC wnl at 535ms, higher than on exam performed 5/3/2015 which showed...'
What I'm trying to have returned by re.findall() in this case is simply the number '560' - that is, the first number appearing at some point after 'QT' and being in the form of a 3 digit integer.
Here's the regex pattern I'm currently using:
qtD = re.compile('qt[^cy].*?[^\<\>\d](\d{3})(?:\D|$)', re.I)
...so, basically:
- find QT (but not QTC or QTY)
- ...possibly followed by any number of characters
- ...and return the first 3 digit integer you find (\d{3})
- ...but only if that 3 digit integer is not immediately preceded by "<", ">", or another digit
- ...and is immediately followed by either the end of the line, $, or a non-digit, \D
I'm searching like this:
re.findall(qtD, s)
The above works fine but ONLY in the case that there is a string (characters or whitespace) that has a length of at least 2 between the 'QT' and the number. In other words, "QT<double space>560" returns 560. "QT interval normal at 560" returns 560. "QT: 560" returns 560.
BUT, if the string is as shown above, "...QT 560...", then the regex will keep reading and return the next 3 digit number, 535.
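
The behaviour described above can be reproduced with a short script. This is only a sketch: the pattern is the one from the question rewritten as a raw string, and the two shortened sample strings are made up for illustration.

```python
import re

# Pattern from the question, as a raw string so the backslashes reach the regex engine unchanged.
original = re.compile(r'qt[^cy].*?[^<>\d](\d{3})(?:\D|$)', re.I)

# Hypothetical shortened samples: one space vs. two spaces between "QT" and the number.
one_space = 'QT 560 and QTC wnl at 535ms'
two_spaces = 'QT  560 and QTC wnl at 535ms'

print(original.findall(one_space))   # ['535']
print(original.findall(two_spaces))  # ['560']
```

With one space, [^cy] consumes the only character between "QT" and "560", so [^<>\d] has nothing left to match in front of the number and the search moves on to 535.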
I have tried other things, like making the [^\<\>\d] optional, i.e., [^\<\>\d]? or, equivalently, repeating it 0 or 1 times, [^\<\>\d]{0,1}, but then it will start doing things like returning the "015" from 5/3/2015 if it doesn't find any 3 digit numbers earlier than that, i.e., if 560 and 535 weren't there, in which case I'd want an empty list returned.
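
A minimal sketch of that side effect, assuming a made-up variant of the sample string with 560 and 535 removed (the name no_value is hypothetical):

```python
import re

# Question's pattern, and the same pattern with the character class made optional.
strict = re.compile(r'qt[^cy].*?[^<>\d](\d{3})(?:\D|$)', re.I)
optional = re.compile(r'qt[^cy].*?[^<>\d]?(\d{3})(?:\D|$)', re.I)

# Hypothetical sample: no 3 digit value after "QT", only the date.
no_value = 'EKG done this AM with QT wnl, higher than on exam performed 5/3/2015 which showed...'

print(strict.findall(no_value))    # []
print(optional.findall(no_value))  # ['015']
```

Once the class may match nothing, it no longer guarantees that the character before the three digits is a non-digit, so the tail of 2015 qualifies.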
Thank you for any help.
Answer 1
Score: 2
In your pattern you are matching at least 2 characters with [^cy] and [^\<\>\d].
Instead you might use 2 lookaround assertions:
\bqt(?![cy]).*?(?<![\d<>])(\d{3})(?:\D|$)
The pattern matches:
- \bqt : a word boundary to prevent a partial word match, then match qt
- (?![cy]) : negative lookahead, assert not c or y directly to the right
- .*? : match any character, as few as possible
- (?<![\d<>]) : negative lookbehind, assert not a digit or < or > to the left
- (\d{3}) : capture 3 digits in group 1 (returned by re.findall)
- (?:\D|$) : match either a non-digit or assert the end of the string
This will then also match 560 in QT560, since QT can be, in the words of the question, "possibly followed by any number of characters".
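
As a quick check, the lookaround pattern can be run against the question's sample. This is a sketch that assumes the re.I flag from the question is kept; the no_value string is a made-up variant without any 3 digit value.

```python
import re

# Answer's pattern with two lookarounds; re.I carried over from the question.
qt_value = re.compile(r'\bqt(?![cy]).*?(?<![\d<>])(\d{3})(?:\D|$)', re.I)

s = 'EKG done this AM with QT 560 and QTC wnl at 535ms, higher than on exam performed 5/3/2015 which showed...'
# Hypothetical variant with no 3 digit value after "QT".
no_value = 'EKG done this AM with QT wnl, higher than on exam performed 5/3/2015 which showed...'

print(qt_value.findall(s))         # ['560']
print(qt_value.findall('QT560'))   # ['560']
print(qt_value.findall(no_value))  # []
```

Because both lookarounds are zero-width, nothing has to sit between "QT" and the digits, and no part of 2015 is returned.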
Or matching the 3 digits first, and then the lookbehind assertion:
\bqt(?![cy]).*?(\d{3})(?<![\d<>]\d{3})(?:\D|$)
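
To see that the two variants agree, here is a small comparison sketch. The short sample strings and the variable names are made up for illustration, and re.I is assumed as in the question.

```python
import re

# Variant 1: lookbehind before the digits. Variant 2: digits first, then a 4-character lookbehind.
lookbehind_first = re.compile(r'\bqt(?![cy]).*?(?<![\d<>])(\d{3})(?:\D|$)', re.I)
digits_first = re.compile(r'\bqt(?![cy]).*?(\d{3})(?<![\d<>]\d{3})(?:\D|$)', re.I)

# Hypothetical samples and the result both variants should give.
expected = {
    'QT 560 and QTC wnl at 535ms': ['560'],
    'QT560': ['560'],
    'QT >560 and QTC wnl at 535ms': ['535'],  # ">560" is skipped, the search continues to 535
    'QT 5601 and QTC wnl': [],
}

for text, result in expected.items():
    print(text, lookbehind_first.findall(text), digits_first.findall(text))
```

Note that with ">560" both variants keep scanning and return the later 535, because .*? is allowed to run past the rejected number.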
Answer 2
Score: 0
Generally, the problem is this section
[^cy] # requires 1 character
.*?
[^<>\d] # requires 1 character
requires a minimum of 2 characters between qt and \d{3},
whereas this
[^cy] # requires 1 character
.*?
(?<![<>\d]) # requires 0 character
requires a minimum of 1 character between qt and \d{3}.
Since there is only 1 character between QT and 560, this regex
qt[^cy].*?[^<>\d](\d{3})(?:\D|$)
will only match QT 560 and QTC wnl at 535m in your sample string, where 535 is in capture group 1.
You can see this if you add another space in between QT and 560:
https://regex101.com/r/yMMjtD/1
This, or something similar, might better solve the problem:
qt(?![cy]).+?(?<![<>\d])(\d{3})(?:\D|$)
https://regex101.com/r/JjB7kL/1
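
A short sketch of how this last pattern behaves in Python. The re.I flag is an assumption (the answer does not state flags, but the sample uses upper-case "QT"), and the second input is a made-up string.

```python
import re

# Assumption: case-insensitive matching, as in the question.
at_least_one = re.compile(r'qt(?![cy]).+?(?<![<>\d])(\d{3})(?:\D|$)', re.I)

s = 'EKG done this AM with QT 560 and QTC wnl at 535ms, higher than on exam performed 5/3/2015 which showed...'

print(at_least_one.findall(s))                    # ['560']
# .+? needs at least one character after "QT", so a number glued to "QT" is not captured.
print(at_least_one.findall('QT560 and QTC wnl'))  # []
```

The choice between .+? and .*? therefore decides whether "QT560" without a separator should count.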
Answer 3
Score: 0
> "find QT (but not QTC or QTY)"
QT(?![CY])
> "possibly followed by any number of characters"
QT(?![CY])[^\d]*
> "and return the first 3 digit integer you find (\d{3})"
QT(?![CY])[^\d]*(\d{3})
> "but only if that 3 digit integer is not immediately preceded by "<", ">", or another digit"
I'm not sure I understand "or another digit", as the first digit would already have been matched by \d{3}.
QT(?![CY])[^\d]*(?<!<|>)(\d{3})
> "and is immediately followed by either the end of the line, $, or a non-digit, \D"
QT(?![CY])[^\d]*(?<!<|>)(\d{3})(?:$|\D)
For example, the following string values will contain a match, and capture the value "560".
EKG done this AM with QT 560 and QTC wnl at 535ms, higher than on exam performed 5/3/2015 which showed...
EKG done this AM with QT560 and QTC wnl at 535ms, higher than on exam performed 5/3/2015 which showed...
EKG done this AM with QT: 560 and QTC wnl at 535ms, higher than on exam performed 5/3/2015 which showed...
EKG done this AM with QT abc 560 and QTC wnl at 535ms, higher than on exam performed 5/3/2015 which showed...
EKG done this AM with QT560
EKG done this AM with QT560a
And, these will not.
EKG done this AM with QTC 560 and QTC wnl at 535ms, higher than on exam performed 5/3/2015 which showed...
EKG done this AM with QTY 560 and QTC wnl at 535ms, higher than on exam performed 5/3/2015 which showed...
EKG done this AM with QT >560 and QTC wnl at 535ms, higher than on exam performed 5/3/2015 which showed...
EKG done this AM with QT <560 and QTC wnl at 535ms, higher than on exam performed 5/3/2015 which showed...
EKG done this AM with QT 5601 and QTC wnl at 535ms, higher than on exam performed 5/3/2015 which showed...
EKG done this AM with QT 56 and QTC wnl at 535ms, higher than on exam performed 5/3/2015 which showed...
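
The two lists above can be checked mechanically. The sketch below uses the final pattern from this answer without flags (it is written with upper-case QT); the TAIL, matching and non_matching names are made up for illustration.

```python
import re

# Final pattern from this answer; case-sensitive, as written.
pattern = re.compile(r'QT(?![CY])[^\d]*(?<!<|>)(\d{3})(?:$|\D)')

# Shared remainder of the sample sentence (hypothetical helper constant).
TAIL = ' and QTC wnl at 535ms, higher than on exam performed 5/3/2015 which showed...'
HEAD = 'EKG done this AM with '

matching = [
    HEAD + 'QT 560' + TAIL,
    HEAD + 'QT560' + TAIL,
    HEAD + 'QT: 560' + TAIL,
    HEAD + 'QT abc 560' + TAIL,
    HEAD + 'QT560',
    HEAD + 'QT560a',
]
non_matching = [
    HEAD + 'QTC 560' + TAIL,
    HEAD + 'QTY 560' + TAIL,
    HEAD + 'QT >560' + TAIL,
    HEAD + 'QT <560' + TAIL,
    HEAD + 'QT 5601' + TAIL,
    HEAD + 'QT 56' + TAIL,
]

for text in matching:
    print(pattern.findall(text))  # ['560'] each time
for text in non_matching:
    print(pattern.findall(text))  # [] each time
```

Because [^\d]* cannot step over a digit, this pattern only ever considers the first run of digits after QT, which is why 535 is never returned here.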