How to remove consecutively repeated strings only if these strings are between "((VERB)" and ")"?

Question

import re

input_text = "((VERB) saltar a nosotros a nosotros) a nosotros a nosotros a nosotros ((VERB)correr a nosotros) sdsdsd ((VERB) saltar a nosotros a nosotros)"

input_text = re.sub(r"\(\(VERB\)" + r"((?:\w\s*)+)" + r"\)", 
                    lambda x: re.sub(r"(a nosotros)\s*+", r"", x.group()), 
                    input_text)

print(input_text) # --> output

In this code I was trying to remove consecutively repeated "a nosotros" strings only if they are between "((VERB)" and ")", that is, only inside the text matched by the pattern r"\(\(VERB\)" + r"((?:\w\s*)+)" + r"\)"

This is the output I expect to get when running this script:

"((VERB) saltar a nosotros) a nosotros a nosotros a nosotros ((VERB)correr a nosotros) sdsdsd ((VERB) saltar a nosotros)"

The code above does edit the input string, but it does not give this output. What should I modify?

Answer 1

Score: 1

You can use

input_text = re.sub(r"\(\(VERB\)[\w\s]*\)",  lambda x: re.sub(r"\ba nosotros(?:\s+a nosotros)*\b", "a nosotros", x.group()), input_text)

The main pattern, \(\(VERB\)[\w\s]*\), matches ((VERB), then zero or more word or whitespace characters, and then a ) character.

The re.sub(r"\ba nosotros(?:\s+a nosotros)*\b", "a nosotros", x.group()) part replaces each run of consecutive whole-word a nosotros occurrences inside the match with a single a nosotros.
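For reference, here is the same approach as a complete script, run against the input from the question. The helper name collapse_repeats and the variable result are only illustrative; the two patterns are the ones given above.

```python
import re

input_text = (
    "((VERB) saltar a nosotros a nosotros) a nosotros a nosotros a nosotros "
    "((VERB)correr a nosotros) sdsdsd ((VERB) saltar a nosotros a nosotros)"
)


def collapse_repeats(match: re.Match) -> str:
    # Called once per "((VERB) ... )" group; text outside the groups is never seen here.
    # A run of "a nosotros a nosotros ..." is replaced by a single "a nosotros".
    return re.sub(r"\ba nosotros(?:\s+a nosotros)*\b", "a nosotros", match.group())


# Outer pattern: "((VERB)", then word/whitespace characters, then ")".
result = re.sub(r"\(\(VERB\)[\w\s]*\)", collapse_repeats, input_text)
print(result)
```

This should print the string the question asks for: the repeats inside the two "((VERB) saltar ...)" groups are collapsed, and the three a nosotros between the groups stay as they are.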

Answer 2

Score: 1

The alternative regex module for Python (developed by Matthew Barnett and installed separately from PyPI) supports the \K directive, which resets the starting point of the reported match to the current position in the string and discards any previously consumed characters from the final match. With that directive one can simply replace the matches with empty strings.

The code for doing that is as follows.

import regex

rgx = r"\(\(VERB\)(?:(?!\ba nosotros\b|\)).)*\K\ba nosotros\b(?=[^)]*\ba nosotros\b)"

txt_in = "((VERB) saltar a nosotros a nosotros) a nosotros a nosotros a nosotros ((VERB)correr a nosotros) sdsdsd ((VERB) saltar a nosotros a nosotros)"

txt_out = regex.sub(rgx, '', txt_in)

print(txt_out)
# -> ((VERB) saltar  a nosotros) a nosotros a nosotros a nosotros ((VERB)correr a nosotros) sdsdsd ((VERB) saltar  a nosotros)

The regular expression can be broken down as follows.

\(\(VERB\)          # match literal
(?:                 # begin non-capture group
  (?!               # begin negative lookahead
    \ba nosotros\b  # match literal surrounded by word boundaries
    |               # or 
    \)              # match literal 
  )                 # end of negative lookahead
  .                 # match any character other than a line terminator
)*                  # end non-capture group and execute zero or more times
\K                  # see the first paragraph of this answer
\ba nosotros\b      # match literal surrounded by word boundaries
(?=                 # begin positive lookahead
  [^)]*             # match any characters other than ')' zero or more times
  \ba nosotros\b    # match literal surrounded by word boundaries
)                   # end positive lookahead

The technique of matching one character at a time with a negative lookahead (here (?:(?!\ba nosotros\b|\)).)) is called the tempered greedy token solution.
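If installing the regex module is not an option, the same tempered greedy token can be used with the standard re module, which has no \K. The sketch below is an adaptation, not the code from this answer: the prefix that \K would discard is captured in group 1 and written back by the replacement, a trailing \s* is added so no double space is left behind, and the substitution is repeated until nothing changes, because each pass can remove only one occurrence per "((VERB) ...)" group. The names PATTERN and drop_repeats are made up for the example.

```python
import re

PATTERN = re.compile(
    r"(\(\(VERB\)(?:(?!\ba nosotros\b|\)).)*)"  # group 1: prefix up to the first "a nosotros"
    r"\ba nosotros\b\s*"                        # the occurrence to drop, plus trailing spaces
    r"(?=[^)]*\ba nosotros\b)"                  # only if another one follows before ")"
)


def drop_repeats(text: str) -> str:
    # Every match starts at "((VERB)", so one pass removes at most one
    # occurrence per group; loop until the text stops changing.
    while True:
        new_text = PATTERN.sub(r"\1", text)
        if new_text == text:
            return new_text
        text = new_text


txt_in = (
    "((VERB) saltar a nosotros a nosotros) a nosotros a nosotros a nosotros "
    "((VERB)correr a nosotros) sdsdsd ((VERB) saltar a nosotros a nosotros)"
)
txt_out = drop_repeats(txt_in)
print(txt_out)
```

Like the pattern it mirrors, the lookahead allows other text between the two occurrences, so this version is not limited to strictly adjacent repeats.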

Posted by huangapple on 2023-02-27 02:19:55
Link to this article: https://go.coder-hub.com/75574109.html