How to split and reorder the content inside the ((PERS)) tag by ' y ' or ' y)' using Python regular expressions?

Question

import re

input_text = "((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds" #example 1
input_text = "ashsahghgsa ((PERS) María y Rosa ds) son alumnas de esa escuela y juegan juntas" #example 2

input_text = re.sub(
                    r"\(\(PERS\)" + r"((?:\w\s*)+(?:\sy\s(?:\w\s*)+)+)(?=\s*y\s*(?:\)|\())",
                    #lambda m: (f"((PERS)){m[1]}) y"),
                    lambda m: (f"((PERS)){m[1].replace(' y', ') y ((PERS)')}"),
                    input_text, re.IGNORECASE)

print(input_text) # --> output

I need to split the content inside a ((PERS) ) tag whenever there is a " y " or a " y)" in it.
That is, move the " y" or the " y " out of the ((PERS) ) tag and put the rest of the content, if there is any (as in example 2), into another ((PERS) ) tag. I tried with \s+y\s+? and with \s+y\s+

To achieve the desired output, I tried a regex that matches all the names inside the ((PERS) ) tag that are separated by " y " or " y)". For that I used a positive lookahead to check for " y " or " y)" after each name, and then grouped all the names together. But this lookahead does not work well.

This is the output I expect for each of the examples, respectively:

"((PERS) Marcos Sy) y ((PERS) Lucy) estuvieron ((VERB) jugando) sdds" #for example 1

"ashsahghgsa ((PERS) María) y ((PERS)Rosa ds) son alumnas de esa escuela y juegan juntas" #for example 2

This regex is for content that does or does not have to start with a capital letter: r"([A-Z][\wí]+\s*)". However, I think that in this case it would be better to simply use r"((?:\w\s*)+)", since the content is already encapsulated.

Answer 1

Score: 1

You could just use two regexes, which simplifies it a lot. First:

input_text = re.sub(
  r"\(\(PERS\)\s+([\w\s]+)\s+y\)\s+\(\(PERS\)\s+([\w\s]+)\)",
  lambda m: (f"((PERS) {m[1]}) y ((PERS) {m[2]})"),
  input_text,
  flags=re.IGNORECASE)  # pass flags by keyword; the 4th positional argument of re.sub is count

This one covers your first use case and matches:

  • ((PERS)
  • followed by some whitespace \s+
  • a mix of word characters and whitespace, captured by ([\w\s]+); as I understand it, there are no other characters such as -
  • some more whitespace up to y)
  • then the same again, except without the y: \(\(PERS\)\s+([\w\s]+)\)

Then we format both matched groups as ((PERS) {m[1]}) y ((PERS) {m[2]}).
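
To make the bullet points above concrete, here is a small sketch that only inspects what the two capture groups hold for example 1. The variable names are illustrative, not part of the original answer.

```python
import re

# First pattern from the answer: a tag ending in " y)" followed by a second tag.
first_pattern = re.compile(
    r"\(\(PERS\)\s+([\w\s]+)\s+y\)\s+\(\(PERS\)\s+([\w\s]+)\)"
)

example_1 = "((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds"
m = first_pattern.search(example_1)

# Group 1 is the name before the trailing " y", group 2 the name in the next tag.
print(m.group(1))  # Marcos Sy
print(m.group(2))  # Lucy
```

Note that ([\w\s]+) is greedy, so the regex engine first swallows "Marcos Sy y" and then backtracks until the remaining \s+y\) can match.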

The second part of the solution is very similar, except that it matches the second group inside the first pair of parentheses:

input_text = re.sub(
  r"\(\(PERS\)\s+([\w\s]+)\s+y\s+([\w\s]+)\)",
  lambda m: (f"((PERS) {m[1]}) y ((PERS) {m[2]})"),
  input_text,
  flags=re.IGNORECASE)  # pass flags by keyword; the 4th positional argument of re.sub is count
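
One detail worth checking in calls like these: the signature is re.sub(pattern, repl, string, count=0, flags=0), so a flag given as the fourth positional argument is read as count rather than as a flag. A minimal sketch of the difference, using a made-up sample string:

```python
import re

text = "a A a A a"

# re.IGNORECASE has the integer value 2. Passing it as the fourth positional
# argument of re.sub is therefore interpreted like count=2, written out here:
as_count = re.sub(r"a", "x", text, count=int(re.IGNORECASE))

# Passing it by keyword actually enables case-insensitive matching.
as_flag = re.sub(r"a", "x", text, flags=re.IGNORECASE)

print(as_count)  # x A x A a
print(as_flag)   # x x x x x
```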

You could of course do it with a much more convoluted regex and replacement lambda, but I see no point. This regex would work, for instance:
\(\(PERS\)\s+([\w\s]+)\s+(y|y\s+([\w\s]+))\)(\s+\(\(PERS\)\s+([\w\s]+)\))?
but then you'd need to handle the case where groups 1 and 5 are present, and otherwise use groups 1 and 3.
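
For comparison, this is roughly what the replacement logic for that single regex might look like. It is only a sketch: the helper name split_pers and the fallback for a lone " y)" with no following tag are assumptions, not part of the answer.

```python
import re

# Single pattern from the answer: group 1 is the first name, group 3 an optional
# second name inside the same tag, group 4 an optional following ((PERS) ...) tag
# and group 5 the name inside that following tag.
single_pattern = re.compile(
    r"\(\(PERS\)\s+([\w\s]+)\s+(y|y\s+([\w\s]+))\)(\s+\(\(PERS\)\s+([\w\s]+)\))?"
)


def split_pers(m):  # hypothetical helper name
    first, inner, next_tag, next_name = m.group(1), m.group(3), m.group(4), m.group(5)
    if inner is not None:
        # "A y B" inside one tag: split it and keep any following tag unchanged.
        return f"((PERS) {first}) y ((PERS) {inner})" + (next_tag or "")
    if next_name is not None:
        # "A y" followed by another tag: move the " y" between the two tags.
        return f"((PERS) {first}) y ((PERS) {next_name})"
    # "A y" with nothing following (assumed behaviour): just move the " y" out.
    return f"((PERS) {first}) y"


example_1 = "((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds"
example_2 = "ashsahghgsa ((PERS) María y Rosa ds) son alumnas de esa escuela y juegan juntas"

result_1 = single_pattern.sub(split_pers, example_1)
result_2 = single_pattern.sub(split_pers, example_2)
print(result_1)
print(result_2)
```

The branching is the extra bookkeeping mentioned above, which is why the two separate substitutions are easier to read.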

Answer 2

Score: 1

In my opinion, two separate regexes would be simpler and clearer. Tests: simple (https://regex101.com/r/SFi2A7/1), then expanded (https://regex101.com/r/aA60Uz/1), with partial (https://regex101.com/r/EBWKNr/1).
Example 1 seems to be a bug, while example 2 needs to be split:

input_text = ''
input_text += "((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds" #example 1
input_text += "\n"
input_text += "ashsahghgsa ((PERS) María y Rosa ds) son alumnas de esa escuela y juegan juntas" #example 2

input_text += "\n\n" \
    + "((PERS) Marcos Sy y) ((PERS) Lucy) ((PERS) Marcos Sy y Ana) estuvieron ((VERB) jugando) sdds\n\
ashsahghgsa ((PERS) María y Isabel y Ana y Rosa ds) son alumnas de esa escuela y juegan juntas"
    # example 1+2 expanded


import re

# first: for example 2

# # for example 2 expanded
input_text = re.sub(
    r"\(\(PERS\)    (?P<multiple>    (?:  (?: \s [A-Zí][\wí]+(?:\s[a-zí]+)? )* \sy  )+    (?:\s[^\)]+)    )    \)",
    lambda m: (f"((PERS){m['multiple'].replace(' y', ') y ((PERS)')})"),
    input_text, flags = re.IGNORECASE | re.VERBOSE # re.VERBOSE == re.X # extended (ignore white space)
)

# # for example 2 (simple)
# input_text = re.sub(
#     r'(\(\(PERS\)(?:\s(?!y)(?:[\wí]+))*)\sy(\s[A-Zí][\wí]+(?:\s[a-zí]+)?\))',
#     r'\g<1>) y ((PERS)\g<2>',
#     input_text,
#     flags = re.MULTILINE)

# second: for example 1

input_text = re.sub(
    r'(\(\(PERS\)(?:\s(?:[\wí]+))*)\sy\)',
    r'\g<1>) y',  # keep the captured tag content and move " y" outside the tag
    input_text,
    flags = re.MULTILINE)

print(input_text)

result (original examples 1+2):

"((PERS) Marcos Sy) y ((PERS) Lucy) estuvieron ((VERB) jugando) sdds"
"ashsahghgsa ((PERS) María) y ((PERS) Rosa ds) son alumnas de esa escuela y juegan juntas"

result (expanded example 1+2):

"((PERS) Marcos Sy) y ((PERS) Lucy) ((PERS) Marcos Sy) y ((PERS) Ana) estuvieron ((VERB) jugando) sdds"
"ashsahghgsa ((PERS) María) y ((PERS) Isabel) y ((PERS) Ana) y ((PERS) Rosa ds) son alumnas de esa escuela y juegan juntas"

(Are you sure you expect ((PERS)Rosa ds), without the space? It's also not clear whether you need "ds" after "Rosa". I don't speak Spanish, so maybe that's why, but I dealt with it.)
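
A note on the re.VERBOSE flag used in the first substitution: in verbose mode, unescaped whitespace in the pattern is ignored, which is why the pattern can be spaced out for readability and why every real space has to be written as \s. A minimal sketch with a made-up sample string:

```python
import re

text = "María y Rosa"

# Without re.VERBOSE the space in the pattern is a literal space.
plain = bool(re.search(r"María y", text))

# With re.VERBOSE the same pattern is read as "Maríay", which does not occur.
verbose_literal_space = bool(re.search(r"María y", text, flags=re.VERBOSE))

# Writing the space as \s makes the verbose pattern match again.
verbose_escaped = bool(re.search(r"María \s y", text, flags=re.VERBOSE))

print(plain, verbose_literal_space, verbose_escaped)  # True False True
```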

Answer 3

Score: 1

If there cannot be any other parentheses, you could use a pattern with 2 capture groups and then split the second group to get the separate parts between the y's, so that multiple names are handled as well.

Pattern to get the ((PERS)...) parts with y:

(\(\(PERS\)\s*)((?:(?![()]|\sy\b).)* y\b[^()]*?)\s*\)
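
To illustrate what the two capture groups hold and how the split works, here is a small sketch. The split expression \sy\b and the shortened sample sentence are assumptions chosen to mirror the pattern above.

```python
import re

pattern = re.compile(r"(\(\(PERS\)\s*)((?:(?![()]|\sy\b).)* y\b[^()]*?)\s*\)")

sample = "ashsahghgsa ((PERS) María y Rosa ds y Test Person 1) son alumnas"
m = pattern.search(sample)

prefix = m.group(1)  # the opening "((PERS) " including trailing whitespace
names = m.group(2)   # everything between the opening tag and the closing ")"

# Split on a standalone "y" preceded by whitespace and drop empty pieces.
parts = [p.strip() for p in re.split(r"\sy\b", names) if p.strip()]

print(repr(prefix))  # '((PERS) '
print(names)         # María y Rosa ds y Test Person 1
print(parts)         # ['María', 'Rosa ds', 'Test Person 1']
```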

After these replacements, you can put " y " between all the remaining consecutive ((PERS)...) parts with another pattern:

(\(\(PERS\)[^()]*\))\s*(?=\(\(PERS\)[^()]*\))
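
A minimal sketch of this second step on its own, assuming the replacement keeps the first tag via the \1 backreference and applies it to an intermediate string like the one the first step yields for example 1:

```python
import re

# A ((PERS)...) part followed, after optional whitespace, by another one.
consecutive = re.compile(r"(\(\(PERS\)[^()]*\))\s*(?=\(\(PERS\)[^()]*\))")

intermediate = "((PERS) Marcos Sy) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds"

# The lookahead does not consume the second tag, so only the whitespace
# between the two tags is replaced by " y ".
joined = consecutive.sub(r"\1 y ", intermediate)
print(joined)  # ((PERS) Marcos Sy) y ((PERS) Lucy) estuvieron ((VERB) jugando) sdds
```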

import re

pattern = r"(\(\(PERS\)\s*)((?:(?![()]|\sy\b).)* y\b[^()]*?)\s*\)"
s = ("((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds\n"
     "ashsahghgsa ((PERS) María y Rosa ds) son alumnas de esa escuela y juegan juntas\n"
     "ashsahghgsa ((PERS) María y Rosa ds y Test Person 1 y test person 2) son alumnas de esa escuela y juegan juntas")


def custom_replacement(m):
    # The argument of join() is missing from the source. The list below is a
    # reconstruction that matches the description and the output shown:
    # split group 2 on " y", drop empty parts, and close each name's tag.
    return m.group(1) + " y ((PERS) ".join(
        [name.strip() + ")" for name in re.split(r"\sy\b", m.group(2)) if name.strip()]
    )


replaced_names = re.sub(pattern, custom_replacement, s)
# Put " y " between any remaining consecutive ((PERS)...) parts
replaced_pers = re.sub(r"(\(\(PERS\)[^()]*\))\s*(?=\(\(PERS\)[^()]*\))", r"\1 y ", replaced_names)
print(replaced_pers)

Output

((PERS) Marcos Sy) y ((PERS) Lucy) estuvieron ((VERB) jugando) sdds
ashsahghgsa ((PERS) María) y ((PERS) Rosa ds) son alumnas de esa escuela y juegan juntas
ashsahghgsa ((PERS) María) y ((PERS) Rosa ds) y ((PERS) Test Person 1) y ((PERS) test person 2) son alumnas de esa escuela y juegan juntas

huangapple
  • Published on March 7, 2023, 06:02:46
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75656240.html