How to find nested patterns within a string, and merge them into one using a regex reordering of the string?


Question

```python
import re

# Example input string:
input_text = "here ((PERS)the ((PERS)Andys) ) ((PERS)assása asas) ((VERB)asas (asas)) ((PERS)saasas ((PERS)Asasas ((PERS)bbbg gg)))"

def remove_nested_pers(match):
    # match is a re.Match object representing the nested pattern, and I want to remove it
    nested_text = match.group(1)

    # recursively remove nested patterns
    nested_text = re.sub(r"\(\(PERS\)(.*?)\)", lambda m: m.group(1), nested_text)
    #nested_text = re.sub(r"\(\(PERS\)((?:\w\s*)+)\)", lambda m: m.group(1), nested_text)

    # replace nested pattern with cleaned text
    return nested_text


# recursively remove nested PERS patterns
input_text = re.sub(r"\(\(PERS\)(.*?)\)", remove_nested_pers, input_text)

print(input_text) # --> output
```

I need to remove every ((PERS) something_1) that sits inside another ((PERS) something_2). For example, ((PERS)something_1 ((PERS)something_2)) should become ((PERS)something_1 something_2)

Or, for example, ((PERS)something_1 ((PERS)something_2 ((PERS)something_3)) ((PERS)something_4)) should become ((PERS)something_1 something_2 something_3 something_4)

In this way, encapsulations within other encapsulations are avoided.

I've used the `(.*?)` capturing group, which lazily matches anything between the previous pattern and the next one (newlines only if `re.DOTALL` is set). Perhaps a pattern like `((?:\w\s*)+)` would be better, since it avoids capturing elements of the ((PERS) ) sequence. Either way, this code fails to join the content of the nested patterns correctly and eliminates necessary parts.

This is the output the script should produce:

"here ((PERS)the Andys ) ((PERS)assása asas) ((VERB)asas (asas)) ((PERS)saasas Asasas bbbg gg)"

So the nested ((PERS) ) patterns should be removed from the input text, while the remaining patterns are left unmodified.
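
A minimal sketch of why the attempt above loses text: the lazy group `(.*?)` stops at the first `)` it meets, and on nested input that is the `)` closing the inner `(PERS)` label, not the end of the group. The `sample` string is a shortened, hypothetical stand-in for the real input.

```python
import re

# Shortened, hypothetical sample with one level of nesting.
sample = "((PERS)a ((PERS)b))"

# Lazy matching ends at the first ")" after the opening "((PERS)",
# which here is the ")" that closes the inner "(PERS" label.
m = re.search(r"\(\(PERS\)(.*?)\)", sample)

print(m.group(0))  # ((PERS)a ((PERS)
print(m.group(1))  # a ((PERS
```

Because the match ends mid-structure, replacing it with `group(1)` removes the outer opening `((PERS)` and leaves unbalanced fragments behind.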

Answer 1

Score: 1

In the absence of recursion support in the native `re` module, you can do this iteratively, from the inside out. Since the nesting is not expected to be very deep (say, 100 levels), this is a pragmatic solution:

```python
import re

input_text = "here ((PERS)the ((PERS)Andys) ) ((PERS)assása asas) ((VERB)asas (asas)) ((PERS)saasas ((PERS)Asasas ((PERS)bbbg gg)))"

# Repeat until a pass no longer shortens the string, i.e. nothing is nested anymore.
size = len(input_text) + 1
while size > len(input_text):
    size = len(input_text)
    # Group 1: an outer "((PERS)..." prefix; group 2: the content of the innermost
    # "((PERS)...)" that follows it. Keeping both drops only the inner wrapper.
    input_text = re.sub(r"(\(\(PERS\)(?:(?!\(\()[^)])*)\(\(PERS\)((?:(?!\(\()[^)])*)\)", r"\1\2", input_text)

print(input_text)
```

Output:

```none
here ((PERS)the Andys ) ((PERS)assása asas) ((VERB)asas (asas)) ((PERS)saasas Asasas bbbg gg)
```
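
The same loop can be wrapped in a reusable function. This is only a sketch: the names `flatten_pers` and `NESTED_PERS` are hypothetical and not part of the answer. It uses `re.subn`, which also returns the number of replacements, as an alternative stop condition to comparing string lengths.

```python
import re

# Same pattern as in the answer: an outer "((PERS)..." prefix (group 1)
# followed by an innermost "((PERS)...)" whose content is group 2.
NESTED_PERS = re.compile(
    r"(\(\(PERS\)(?:(?!\(\()[^)])*)\(\(PERS\)((?:(?!\(\()[^)])*)\)"
)

def flatten_pers(text: str) -> str:
    """Merge nested ((PERS)...) groups into their enclosing group (hypothetical helper)."""
    while True:
        # subn reports how many replacements this pass made.
        text, count = NESTED_PERS.subn(r"\1\2", text)
        if count == 0:
            return text

print(flatten_pers("((PERS)something_1 ((PERS)something_2))"))
# ((PERS)something_1 something_2)
```

Each pass removes only the innermost wrappers, so an input nested n levels deep needs about n passes plus one final pass that finds nothing to replace.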

huangapple
  • Posted on February 26, 2023, 19:36:55
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75571703.html