Regular expression for grep to extract lines with exact occurrences of a symbol

Question

I have a text data file containing [SMILES](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system) strings like these:

    CN1CCC2OC(C)(CO)C3=C(NN=C3)N=C12
    BrC1=CC(=O)N2C=NC(CC(=O)C#N)=CC2=C1

What regular expression would extract the lines containing exactly 4 carbon atoms, meaning 4 Cs and no other capital letter, while numbers, parentheses, `=` and `#` are allowed?

Update:

1. A lowercase c is also allowed, so 4 Cs or cs in total
2. `grep -E '([^C]*C){4}' filename` extracts lines with at least 4 Cs
3. Square brackets, `-` and `@` must also be discarded
4. See some examples [here](https://regex101.com/r/m2Kkso/1)

Answer 1

Score: 1


First thoughts

I first considered the suggestion I made in the comments, which
relies on negative lookaheads and lookbehinds, but not every regex
engine supports them.

So we'll drop `C(?![laroudsnemf])|(?<![STMA])c`, which matches a
carbon atom, including the lowercase form used in rings. It would
have worked with `grep -P`, which enables the PCRE engine, but I
prefer to avoid it.
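
For engines that do support lookarounds, the pattern above can be tried directly. The following is only a sketch in Python's `re` module (an assumption, since the answer targets grep), and `count_carbons` is a hypothetical helper name. It counts carbon atoms rather than validating the whole line.

```python
import re

# Carbon is an uppercase C that does not start Cl, Ca, Cr, ..., or a
# lowercase c that does not end Sc, Tc, Mc or Ac.
CARBON = re.compile(r"C(?![laroudsnemf])|(?<![STMA])c")


def count_carbons(smiles: str) -> int:
    """Count the carbon atoms the lookaround pattern finds in one line."""
    return len(CARBON.findall(smiles))


for line in ("CC(=O)CC", "ClCCl", "c1ccccc1", "[Sc]"):
    print(line, count_carbons(line))
```

A line would then qualify when `count_carbons(line) == 4`.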

But this can be replaced by a pattern that consumes the other
elements at the same time as the ignored characters:

  • According to
    Wikipedia's page about SMILES,
    the list of characters to ignore is quite long: "-", "+", ".", "=",
    "#", "$", "@", "(", ")", "[", "]", "/", "\", ":", "%". Adding the
    digits, we get this pattern: [-+.=#$@()\[\]\/\\:%0-9]
  • All two-character elements starting with "C": C[laroudsnemf]
  • All two-character elements ending with "c": [STMA]c
  • Any single-character element other than "C": [HBNOFPSKVYIWU]
  • Any other two-character element without "C" or "c": [ABD-Z][abd-z]
    This one is clearly imprecise, since many combinations of an
    uppercase and a lowercase letter are not real elements, but
    that should not be a problem.

Edit: thanks to @David542's comment. He noticed that "Sc" wasn't
detected as the element scandium: "S" was taken for sulphur and
then "c" for a carbon (in a ring). To solve that, I made the
quantifier possessive by replacing `*` with `*+`. A plain `*` is
already greedy; the `+` suffix additionally stops it from giving
back anything it has consumed.
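
The effect of the possessive quantifier can be illustrated without PCRE. The sketch below is not the answer's regex: it swaps in a left-to-right tokenizer in Python, where each position is consumed by the first alternative that matches and is never revisited, which is the behaviour `*+` enforces. The names `TOKEN` and `has_four_carbons` are made up for this illustration, and the two-letter alternative is deliberately listed before the single-letter one so that a symbol such as "Br" is consumed whole.

```python
import re

# One token per match; alternatives are tried in order and the first hit wins.
TOKEN = re.compile(
    r"""
      [-+.=\#$@()\[\]/\\:%0-9]   # SMILES syntax characters and digits
    | C[laroudsnemf]             # Cl, Ca, Cr, Co, Cu, ...
    | [STMA]c                    # Sc, Tc, Mc, Ac
    | [ABD-Z][abd-z]             # other two-letter symbols (imprecise)
    | [HBNOFPSKVYIWU]            # single-letter elements other than C
    | (?P<carbon>[cC])           # whatever is left is a carbon atom
    """,
    re.VERBOSE,
)


def has_four_carbons(smiles: str, wanted: int = 4) -> bool:
    """Return True when the line tokenizes fully and holds `wanted` carbons."""
    pos, carbons = 0, 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:          # unknown character: reject the line
            return False
        if match.lastgroup == "carbon":
            carbons += 1
        pos = match.end()          # consumed text is never re-examined
    return carbons == wanted
```

Because "Sc" is consumed as one token, its "c" can never be counted as a carbon, so `has_four_carbons("[Sc]CCC")` is false while `has_four_carbons("CC(=O)CC")` is true.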

Setting up the pattern

  • I'll use the x flag so that I can add comments to the pattern.
  • I don't want to copy-paste or repeat patterns, so I'll use
    named sub-patterns to keep things clear. This can be done with
    named capturing groups, (?<group_name>...), where ... is your
    pattern.
  • In PCRE, you can use the (?(DEFINE)...) construct to declare
    sub-patterns without using them. It's like creating functions
    for later use.
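
Python's built-in `re` module has neither `(?(DEFINE)...)` nor `\g<name>` subroutine calls, so the sketch below swaps in a different technique to get the same readability: sub-patterns are kept in ordinary strings and interpolated into a verbose pattern. It also uses the lookaround form of the carbon pattern instead of a possessive quantifier. The names `CARBON`, `IGNORED` and `FOUR_CARBONS` are assumptions made for this example.

```python
import re

# Sub-patterns as plain strings, standing in for PCRE's (?(DEFINE)...).
CARBON = r"(?: C(?![laroudsnemf]) | (?<![STMA])c )"
IGNORED = r"""(?:
      [-+.=\#$@()\[\]/\\:%0-9]   # syntax characters and digits
    | C[laroudsnemf]             # Cl, Ca, Cr, Co, Cu, ...
    | [STMA]c                    # Sc, Tc, Mc, Ac
    | [ABD-Z][abd-z]             # other two-letter symbols (imprecise)
    | [HBNOFPSKVYIWU]            # single-letter elements other than C
)"""

# Ignored text, then exactly four carbons, each followed by ignored text.
FOUR_CARBONS = re.compile(
    rf"^ {IGNORED}* (?: {CARBON} {IGNORED}* ){{4}} $",
    re.VERBOSE,
)

for line in ("CC(=O)CC", "[Sc]CCC", "ClCCCC", "CCCCC"):
    print(line, bool(FOUR_CARBONS.search(line)))
```

The `re.VERBOSE` flag plays the role of the x flag: whitespace outside character classes is ignored and `#` starts a comment.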

Putting it all together:

    /
    (?(DEFINE)

      # Match a carbon atom only. It can be an uppercase "C" or a lowercase
      # "c" (for rings). We avoid negative lookahead and lookbehind and
      # instead let the "ignored" pattern consume the other elements.
      (?<carbon>[cC])

      # Other elements and syntax chars, to be ignored.
      # This pattern must be possessive, with *+ instead of *. It then
      # consumes "Sc" or "Tc" for good, so the "c" cannot be matched as a
      # carbon atom afterwards. This avoids negative lookbehind and
      # lookahead inside the carbon sub-pattern just above.
      (?<ignored>
        (?:                  # Non-capturing, just for the "or" operator.
          [-+.=\#$@()\[\]\/\\:%0-9] # Ignored chars.
          |
          C[laroudsnemf]     # Cx elements: Cl, Ca, Cr, Co, Cu, etc.
          |
          [STMA]c            # Xc elements: Sc, Tc, Mc and Ac.
          |
          [ABD-Z][abd-z]     # 2-char elements (not very precise but short).
                             # Tried before the single letters: a possessive
                             # group cannot backtrack from "B" to "Br".
          |
          [HBNOFPSKVYIWU]    # Any single-letter element, but not C.
        )*+                  # The ignored items can occur 0 to n times, possessively.
      )

    )

    # The pattern is a carbon surrounded by ignored elements and chars, 4 times.
    ^(?:\g<ignored>\g<carbon>\g<ignored>){4}$

    /gmx

In action: https://regex101.com/r/88zReK/4

If you need to remove the comments to use it with grep, here's the
compressed version. It is wrapped in single quotes so that the shell
passes the backslashes and "$@" through unchanged:

    grep -P '(?(DEFINE)(?<carbon>[cC])(?<ignored>(?:[-+.=#$@()\[\]/\\:%0-9]|C[laroudsnemf]|[STMA]c|[ABD-Z][abd-z]|[HBNOFPSKVYIWU])*+))^(?:\g<ignored>\g<carbon>\g<ignored>){4}$' filename

If you need to reproduce the compact pattern again, you can
use a tool I made for my personal use on codepen.io.

EDIT: wrong final thoughts and POC to also handle the meaning of digits

<del>Personally, I would not ignore numbers, as "CH2" means two
hydrogen atoms, so why not try to handle something like
"C4"? I don't even know if this is possible or not, but I
think it's worth trying to handle it.</del>

This was a wrong assumption! A number after "C" is not a count of
atoms, as it is in "H2". It's the label of a ring. So the PHP code
below, which replaces "C4" with "CCCC", is totally worthless. I'll
just leave it here because it might be helpful for other users who
have to process data with a bit more power than plain grep offers:

    <?php

    const INPUT_FILE = 'smiles_input.txt';

    // Read the whole file in one go into an array of lines. If the file is too large,
    // you'll have to open it and read it line by line to avoid running out of memory.
    // We also get rid of the new line chars at the end of each line.
    $input_lines = file(INPUT_FILE, FILE_IGNORE_NEW_LINES);

    // A pattern to match only carbon elements, in upper or lowercase, followed by
    // a number. We use negative lookbehind and lookahead to avoid matching other
    // elements containing the letter "C" or "c".
    const PATTERN_C_AND_NUMBER = '/(?<carbon>C(?![laroudsnemf])|(?<![STMA])c)(?<number>\d+)/';

    // The pattern to match a molecule containing 4 carbon elements.
    // I use the Nowdoc string format to avoid having to escape everything.
    const PATTERN_4_C_MOLECULE = <<<'END_OF_STRING'
    /
    (?(DEFINE)

      # Match a carbon atom only. It can be an uppercase "C" or a lowercase
      # "c" (for rings). We avoid negative lookahead and lookbehind and
      # instead let the "ignored" pattern consume the other elements.
      (?<carbon>[cC])

      # Other elements and syntax chars, to be ignored.
      # This pattern must be possessive, with *+ instead of *. It then
      # consumes "Sc" or "Tc" for good, so the "c" cannot be matched as a
      # carbon atom afterwards. This avoids negative lookbehind and
      # lookahead inside the carbon sub-pattern just above.
      (?<ignored>
        (?:                  # Non-capturing, just for the "or" operator.
          [-+.=\#$@()\[\]\/\\:%0-9] # Ignored chars.
          |
          C[laroudsnemf]     # Cx elements: Cl, Ca, Cr, Co, Cu, etc.
          |
          [STMA]c            # Xc elements: Sc, Tc, Mc and Ac.
          |
          [ABD-Z][abd-z]     # 2-char elements, tried before single letters ("Br").
          |
          [HBNOFPSKVYIWU]    # Any single-letter element, but not C.
        )*+                  # The ignored items can occur 0 to n times, possessively.
      )

    )

    # The pattern is a carbon surrounded by ignored elements and chars, 4 times.
    ^(?:\g<ignored>\g<carbon>\g<ignored>){4}$

    /x
    END_OF_STRING;

    foreach ($input_lines as $line_nbr => $molecula) {
        // Replace all occurrences of C followed by a number by the C char repeated
        // that number of times. Ex: "C4" will be replaced by "CCCC".
        $changed_molecula = preg_replace_callback(
            PATTERN_C_AND_NUMBER,
            function ($matches) {
                return str_repeat($matches['carbon'], (int)$matches['number']);
            },
            $molecula
        );

        // Check if the molecula has only 4 carbon elements.
        if (preg_match(PATTERN_4_C_MOLECULE, $changed_molecula)) {
            print ($line_nbr + 1) . ": " . $molecula . PHP_EOL;
        }
    }

Run it here: https://onlinephp.io/c/6f062

Answer 2

Score: 0


There is a discrepancy within your constraints.

> "4 Cs and no other capital letter ..."

Yet the provided regex101.com examples all include additional capital letters, so none of those will match.

The following will match when there are no other capital letters. With (?i), [^C] excludes both "C" and "c", so each pattern simply counts those characters.

If you're looking to match exactly 4:

(?i)^(?:[^C]*C){4}[^C]*$

If you're looking to match at least 4, use the following:

(?i)^(?:[^C]*C){4,}[^C]*$

And if you're looking to match at most 4, use this:

(?i)^(?:[^C]*C){1,4}[^C]*$
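
The three patterns can be compared side by side. This is a small sketch using Python's `re` module rather than grep (an assumption made for convenience); the constant names are invented for the example.

```python
import re

EXACTLY = re.compile(r"(?i)^(?:[^C]*C){4}[^C]*$")
AT_LEAST = re.compile(r"(?i)^(?:[^C]*C){4,}[^C]*$")
AT_MOST = re.compile(r"(?i)^(?:[^C]*C){1,4}[^C]*$")

# Print, for each sample line, whether each pattern matches it.
for line in ("CCCC", "c1ccc1", "CCCCC", "CCO", "ClCCC"):
    print(
        line,
        bool(EXACTLY.search(line)),
        bool(AT_LEAST.search(line)),
        bool(AT_MOST.search(line)),
    )
```

Note that "ClCCC" is accepted by the exact pattern, because the "C" of "Cl" is counted like any other. With grep, the inline `(?i)` flag is PCRE syntax, so these patterns presumably need `grep -P`.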

Answer 3

Score: -1


This is such a wonderful question and reminds me of taking chemistry classes in high school. This is trickier than it seems. Conceptually, here is what we want to do:

  • We want to capture: a C that is not followed by a lowercase letter. This must be captured exactly 4 times.
  • And we want to skip any (a) digits, (b) the punctuation #, =, (, and ), (c) any uppercase letter except C (such as F), (d) any uppercase letter followed by a lowercase letter (such as Ca or Fe).

If we convert it into a regex we have:

  • Capture: (C(?![a-z]))
  • Skip: [#=()\dABD-Z]|[A-Z][a-z]

Putting it all together with anchors and repetitions, we get the following.

[Image: screenshot of the resulting regex, not preserved]

Obviously there's a lot of copy-paste here, and other than using \1 to repeat the C, I don't see a non-trivial way to simplify this regex much further.
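
The exact regex from this answer was posted as an image and is not available here. As a rough illustration only, the Capture and Skip pieces listed above could be assembled as follows in Python; the way they are combined, and the names `CAPTURE`, `SKIP` and `FOUR_C`, are assumptions rather than the author's original expression.

```python
import re

CAPTURE = r"C(?![a-z])"                  # a C not followed by a lowercase letter
SKIP = r"(?:[#=()\dABD-Z]|[A-Z][a-z])"   # the Skip alternatives listed above

# Skippable text, then exactly four captures, each followed by skippable text.
FOUR_C = re.compile(rf"^{SKIP}*(?:{CAPTURE}{SKIP}*){{4}}$")

for line in ("CC(=O)CC", "BrCCCC", "CCCCC", "c1ccc1"):
    print(line, bool(FOUR_C.search(line)))
```

Building the pattern from two strings avoids the copy-paste, and it makes the limits visible: a lowercase "c" is neither captured nor skipped, so aromatic rings are rejected.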


Updated with the change you asked for in the Skip condition: [, ], -, and @ are now allowed.

[Image: screenshot of the updated regex, not preserved]

And updated with the simplification suggested by @Patrick (thank you!), which uses \g to reuse the captured pattern. Note that this pattern does not address (a) a lowercase c for carbon, or (b) multipliers in atoms, such as C4 to mean CCCC:

[Image: screenshot of the simplified regex, not preserved]

huangapple
  • Posted on June 8, 2023, 06:33:00
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76427487.html