Is there a way to anchor a negative lookahead to a particular group?

Question

I'm trying to write a regular expression that picks out valid email addresses from anywhere in a string of text. My current regex works fine for most cases, but the overall length limit of 254 characters (applied using a negative lookahead) stops working when the email is enclosed in brackets or followed by other characters, e.g. an ellipsis.

Is there a way to anchor/limit the lookahead so that it only counts characters captured by a specific group? Or is there some other solution?

My current regex is:

\b((?!\S{255,})[\w\.'#%+-]{1,64}@(?:(?=.{1,63}\.)[a-z0-9](?:[a-zA-Z\d\.-]*[a-z0-9])?\.)+[a-zA-Z]{2,})

Example below, using an email that hits the maximum chars (254 in my case). The first email (without brackets) gives a match, but the next email (with the brackets) does not match (since the closing bracket is included in the char count). I'd like this example string to result in three matches.

My email is: averylongaddresspartthatalmostwillreachthelimitofcharsperaddress@nowwejustneedaverylongdomainpartthatwill.reachthetotallengthlimitforthewholeemailaddress.whichis254charsaccordingtothePHPvalidate-email-filter.extendingthetestlongeruntilwereachtheright.com

You can contact me by email (averylongaddresspartthatalmostwillreachthelimitofcharsperaddress@nowwejustneedaverylongdomainpartthatwill.reachthetotallengthlimitforthewholeemailaddress.whichis254charsaccordingtothePHPvalidate-email-filter.extendingthetestlongeruntilwereachtheright.com)

This also won't match: averylongaddresspartthatalmostwillreachthelimitofcharsperaddress@nowwejustneedaverylongdomainpartthatwill.reachthetotallengthlimitforthewholeemailaddress.whichis254charsaccordingtothePHPvalidate-email-filter.extendingthetestlongeruntilwereachtheright.com...

This email is too long averylongaddresspartthatalmostwillreachthelimitofcharsperaddress@nowwejustneedaverylongdomainpartthatwill.reachthetotallengthlimitforthewholeemailaddress.whichis254charsaccordingtothePHPvalidate-email-filter.extendingthetestlongeruntilwereachthewronglength.com (so it should not result in a match)
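To make the failure concrete, here is a minimal sketch in Python. That choice is an assumption: the question does not name a language, and Python's `re` module happens to accept the pattern unchanged. The variable names and the way the four sample lines are assembled are illustrative only.

```python
import re

# The 254-character address from the example, split only for readability.
LOCAL = "averylongaddresspartthatalmostwillreachthelimitofcharsperaddress"
DOMAIN = (
    "nowwejustneedaverylongdomainpartthatwill."
    "reachthetotallengthlimitforthewholeemailaddress."
    "whichis254charsaccordingtothePHPvalidate-email-filter."
    "extendingthetestlongeruntilwereachtheright.com"
)
EMAIL = f"{LOCAL}@{DOMAIN}"
# The over-long variant used in the last example line.
TOO_LONG = EMAIL.replace("theright.com", "thewronglength.com")

TEXT = "\n".join([
    f"My email is: {EMAIL}",
    f"You can contact me by email ({EMAIL})",
    f"This also won't match: {EMAIL}...",
    f"This email is too long {TOO_LONG} (so it should not result in a match)",
])

# Pattern copied from the question, split across two raw strings.
QUESTION_RE = re.compile(
    r"\b((?!\S{255,})[\w\.'#%+-]{1,64}@"
    r"(?:(?=.{1,63}\.)[a-z0-9](?:[a-zA-Z\d\.-]*[a-z0-9])?\.)+[a-zA-Z]{2,})"
)

# Collect every full match found in the sample text.
matches = [m.group(0) for m in QUESTION_RE.finditer(TEXT)]
print(len(EMAIL), len(matches))
```

Only the first line produces a match here. In the second and third lines the run of non-whitespace characters that starts at the address is 255 and 257 characters long, so `(?!\S{255,})` rejects the start position before the rest of the pattern is tried.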

Answer 1

Score: 2

The trick is to:

  • remove the negative lookahead that checks the length.
  • put the full pattern in a lookahead (without the leading word boundary).
  • in the same lookahead, at the end, add a capture group that captures everything up to the end of the line.
  • after the lookahead, match the allowed length, for example \S{3,254}, and then use a lookahead with a backreference to check that the rest of the line is identical to what you captured.

Result:

/\b(?=\w[\w.'#%+-]{0,63}@(?:(?=[^.\s]{1,63}\.)[a-z0-9](?:[a-zA-Z\d.-]*[a-z0-9])?\.)+[a-zA-Z]{2,}(.*))\S{3,254}(?=\1$)/gm

demo
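As a rough check of the steps above, the sketch below runs the pattern through Python's `re` module. The engine is an assumed choice: the `/gm` flags correspond here to `finditer` plus `re.MULTILINE`. The final lookahead is written as `(?=\1$)`, that is, with the backreference to the captured remainder of the line that the last step describes. Variable names and the assembled sample text are illustrative.

```python
import re

# The 254-character address from the question, split only for readability.
LOCAL = "averylongaddresspartthatalmostwillreachthelimitofcharsperaddress"
DOMAIN = (
    "nowwejustneedaverylongdomainpartthatwill."
    "reachthetotallengthlimitforthewholeemailaddress."
    "whichis254charsaccordingtothePHPvalidate-email-filter."
    "extendingthetestlongeruntilwereachtheright.com"
)
EMAIL = f"{LOCAL}@{DOMAIN}"
TOO_LONG = EMAIL.replace("theright.com", "thewronglength.com")

TEXT = "\n".join([
    f"My email is: {EMAIL}",
    f"You can contact me by email ({EMAIL})",
    f"This also won't match: {EMAIL}...",
    f"This email is too long {TOO_LONG} (so it should not result in a match)",
])

ANSWER_RE = re.compile(
    # Lookahead: the whole email pattern, then group 1 = rest of the line.
    r"\b(?=\w[\w.'#%+-]{0,63}@"
    r"(?:(?=[^.\s]{1,63}\.)[a-z0-9](?:[a-zA-Z\d.-]*[a-z0-9])?\.)+"
    r"[a-zA-Z]{2,}(.*))"
    # Consume 3..254 characters, which must end exactly where group 1 begins.
    r"\S{3,254}(?=\1$)",
    re.MULTILINE,
)

# group(0) is the address itself; group(1) is whatever follows it on the line.
matches = [m.group(0) for m in ANSWER_RE.finditer(TEXT)]
print(len(matches))
```

Under these assumptions the first three lines each yield the 254-character address. The over-long address in the last line yields nothing, because `\S{3,254}` cannot reach the position where the captured remainder starts.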

This works because lookaheads are atomic: for a given starting position, once the closing parenthesis of a lookahead is reached, the engine can no longer backtrack into it, so the contents of the capture groups inside it can't change.
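The atomic behaviour can be seen in isolation with a much smaller pattern. This is a sketch in Python's `re` (an assumed engine), and the toy input `1234` is made up for illustration.

```python
import re

# Ordinary group: \d+ first grabs "1234", then gives back one digit
# so that the trailing \d can match.
plain = re.search(r"(\d+)\d", "1234")

# Group inside a lookahead: once the lookahead has succeeded, group 1 is
# fixed for that start position. \1 then consumes the whole captured run,
# nothing is left for the trailing \d, and the engine cannot go back into
# the lookahead to try a shorter capture.
atomic = re.search(r"(?=(\d+))\1\d", "1234")

print(plain.group(1), atomic)
```

The first search succeeds with group 1 shortened to `123`, while the second finds no match at any start position. The same property is what keeps the captured rest of the line stable in the email pattern.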

Answer 2

Score: 1

>Is there a way to anchor a negative lookahead to a particular group?

No. A lookahead is independent of groups. The best you can do is limit it through its own inner pattern.

That said, the question as asked is only loosely related to the problem you describe.

Your lookahead overmatches because it also counts the closing parenthesis. It shouldn't use \S, which matches many more characters than your pattern allows.

Use (?![\w\.@'#%+-]{255,}) instead, so that the length check only counts characters that the pattern itself allows.

Demo can be seen here.
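A small sketch of this suggestion, again assuming Python's `re` and using illustrative names, with the question's pattern changed only in the negative lookahead. It also surfaces one thing worth checking against your own data: the dot is itself in the allowed character class, so under these assumptions an address followed by an ellipsis is still counted as too long.

```python
import re

# The 254-character address from the question, split only for readability.
LOCAL = "averylongaddresspartthatalmostwillreachthelimitofcharsperaddress"
DOMAIN = (
    "nowwejustneedaverylongdomainpartthatwill."
    "reachthetotallengthlimitforthewholeemailaddress."
    "whichis254charsaccordingtothePHPvalidate-email-filter."
    "extendingthetestlongeruntilwereachtheright.com"
)
EMAIL = f"{LOCAL}@{DOMAIN}"
TOO_LONG = EMAIL.replace("theright.com", "thewronglength.com")

# Question's pattern with \S replaced by the characters the pattern allows.
ANSWER2_RE = re.compile(
    r"\b((?![\w\.@'#%+-]{255,})[\w\.'#%+-]{1,64}@"
    r"(?:(?=.{1,63}\.)[a-z0-9](?:[a-zA-Z\d\.-]*[a-z0-9])?\.)+[a-zA-Z]{2,})"
)

samples = {
    "plain": EMAIL,
    "bracketed": f"({EMAIL})",
    "ellipsis": f"{EMAIL}...",   # trailing dots are in the allowed class
    "too_long": f"{TOO_LONG} (so it should not result in a match)",
}

# True where the pattern finds an address in the sample.
results = {name: bool(ANSWER2_RE.search(text)) for name, text in samples.items()}
print(results)
```

In this sketch the plain and bracketed samples match and the over-long one does not. The ellipsis sample is rejected as well, since the three trailing dots extend the counted run to 257 characters.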

huangapple
  • Published on June 19, 2023 at 21:34:57
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76507165.html