March 7, 2023 06:02:46

How to split and reorder the content inside the ((PERS)) tag by ' y ' or ' y)' using Python regular expressions?

Question

import re

input_text = "((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds"  # example 1
input_text = "ashsahghgsa ((PERS) María y Rosa ds) son alumnas de esa escuela y juegan juntas"  # example 2

input_text = re.sub(
    r"\(\(PERS\)" + r"((?:\w\s*)+(?:\sy\s(?:\w\s*)+)+)(?=\s*y\s*(?:\)|\())",
    # lambda m: (f"((PERS)){m[1]}) y"),
    lambda m: (f"((PERS)){m[1].replace(' y', ') y ((PERS)')}"),
    input_text, re.IGNORECASE)

print(input_text)  # --> output

I need to split the content inside a ((PERS) ) tag if there is a " y " or a " y)" in it.
That is, move the " y" or " y " out of the ((PERS) ) tag and put the rest of the content, if there is any (as in example 2), in another ((PERS) ) tag. I tried with \s+y\s+? and with \s+y\s+.

To achieve the desired output, I tried a regex that matches all the names inside the ((PERS) ) tag that are separated by " y " or " y)". For that, I tried to use a positive lookahead to check for " y " or " y)" after each name, and then group all the names together. But this lookahead doesn't work well.

The expected output for each of the examples, respectively:

"((PERS) Marcos Sy) y ((PERS) Lucy) estuvieron ((VERB) jugando) sdds"  # for example 1

"ashsahghgsa ((PERS) María) y ((PERS)Rosa ds) son alumnas de esa escuela y juegan juntas"  # for example 2

To require that the content start with a capital letter, the pattern would be r"([A-Z][\wí]+\s*)", although I think that in this case it would be better to simply use r"((?:\w\s*)+)", since the content is already encapsulated in the tag.

Answer 1

Score: 1

You could just use two regexes, which simplifies it a lot. First:

input_text = re.sub(
  r"\(\(PERS\)\s+([\w\s]+)\s+y\)\s+\(\(PERS\)\s+([\w\s]+)\)",
  lambda m: (f"((PERS) {m[1]}) y ((PERS) {m[2]})"),
  input_text,
  flags=re.IGNORECASE)

This one covers your first use case and matches:

((PERS)
followed by some whitespace: \s+
some mixed word characters and whitespace, which get captured: ([\w\s]+) (as I understand it, there are no other characters such as -)
some more whitespace up to y)
then the same again, except without the y: \(\(PERS\)\s+([\w\s]+)\)
Then we format both matched groups as ((PERS) {m[1]}) y ((PERS) {m[2]}).
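
One detail about the re.sub call itself, independent of the pattern: its fourth positional parameter is count, not flags, so flags are safer passed by keyword. A minimal sketch of the difference (the sample string is made up for illustration):

```python
import re
import warnings

text = "a A a A a A"

with warnings.catch_warnings():
    # Newer Python versions warn when count is passed positionally.
    warnings.simplefilter("ignore", DeprecationWarning)
    # re.IGNORECASE has the integer value 2, so this is read as count=2
    # and the match stays case-sensitive.
    limited = re.sub(r"a", "x", text, re.IGNORECASE)

# Passed by keyword, the flag is applied and every "a"/"A" is replaced.
by_flag = re.sub(r"a", "x", text, flags=re.IGNORECASE)

print(limited)  # x A x A a A
print(by_flag)  # x x x x x x
```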

The second part of the solution is very similar, except that it matches the second group inside the first pair of parentheses:

input_text = re.sub(
  r"\(\(PERS\)\s+([\w\s]+)\s+y\s+([\w\s]+)\)",
  lambda m: (f"((PERS) {m[1]}) y ((PERS) {m[2]})"),
  input_text,
  flags=re.IGNORECASE)
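
To check the two passes together, they can be wrapped in one helper and run on both examples from the question. This is only a sketch: the function name split_pers is made up here, and the patterns are the two shown above.

```python
import re


def split_pers(text):
    # Pass 1: "((PERS) A y) ((PERS) B)" -> "((PERS) A) y ((PERS) B)"
    text = re.sub(
        r"\(\(PERS\)\s+([\w\s]+)\s+y\)\s+\(\(PERS\)\s+([\w\s]+)\)",
        lambda m: f"((PERS) {m[1]}) y ((PERS) {m[2]})",
        text,
        flags=re.IGNORECASE)
    # Pass 2: "((PERS) A y B)" -> "((PERS) A) y ((PERS) B)"
    return re.sub(
        r"\(\(PERS\)\s+([\w\s]+)\s+y\s+([\w\s]+)\)",
        lambda m: f"((PERS) {m[1]}) y ((PERS) {m[2]})",
        text,
        flags=re.IGNORECASE)


example_1 = "((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds"
example_2 = "ashsahghgsa ((PERS) María y Rosa ds) son alumnas de esa escuela y juegan juntas"

print(split_pers(example_1))
print(split_pers(example_2))
```

Because the first capture group is greedy, a tag with three or more names is split only at its last " y " in a single pass, so this sketch covers exactly the two cases in the question.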

You could of course do it with a much more convoluted regex and replacement lambda, but I see no point. This regex would work, for instance:
\(\(PERS\)\s+([\w\s]+)\s+(y|y\s+([\w\s]+))\)(\s+\(\(PERS\)\s+([\w\s]+)\))?
but then you'd need to handle the case where groups 1 and 5 are set, and otherwise use groups 1 and 3.
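
For comparison, here is a rough sketch of what that single-regex variant could look like. The names COMBINED and repl are invented for this illustration, and the handling of a dangling " y" with no following tag (leave the text unchanged) is an assumption, since the question does not cover that case.

```python
import re

COMBINED = re.compile(
    r"\(\(PERS\)\s+([\w\s]+)\s+(y|y\s+([\w\s]+))\)(\s+\(\(PERS\)\s+([\w\s]+)\))?",
    flags=re.IGNORECASE)


def repl(m):
    first = f"((PERS) {m[1]}) y "
    if m[3] is not None:
        # "((PERS) A y B)": the second name sits inside the same tag;
        # any following tag (group 4) is kept as it was.
        return first + f"((PERS) {m[3]})" + (m[4] or "")
    if m[5] is not None:
        # "((PERS) A y) ((PERS) B)": the second name comes from the next tag.
        return first + f"((PERS) {m[5]})"
    # Dangling " y" with nothing after it: leave the match unchanged (assumption).
    return m[0]


example_1 = "((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds"
example_2 = "ashsahghgsa ((PERS) María y Rosa ds) son alumnas de esa escuela y juegan juntas"

print(COMBINED.sub(repl, example_1))
print(COMBINED.sub(repl, example_2))
```

The extra branching in repl is the cost mentioned above, which is why the two-pass version is arguably easier to read.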

Answer 2

Score: 1

In my opinion, two separate regexes would be simpler and clearer. Tests: [simple](https://regex101.com/r/SFi2A7/1), then [expanded](https://regex101.com/r/aA60Uz/1) (with [partial](https://regex101.com/r/EBWKNr/1)).
Example 1 seems to be a bug, while example 2 needs to be split:

import re

input_text = ''
input_text += "((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds"  # example 1
input_text += "\n"
input_text += "ashsahghgsa ((PERS) María y Rosa ds) son alumnas de esa escuela y juegan juntas"  # example 2

# examples 1+2 expanded
input_text += "\n\n" \
    + "((PERS) Marcos Sy y) ((PERS) Lucy) ((PERS) Marcos Sy y Ana) estuvieron ((VERB) jugando) sdds\n" \
    + "ashsahghgsa ((PERS) María y Isabel y Ana y Rosa ds) son alumnas de esa escuela y juegan juntas"

# first: for example 2

# # for example 2 expanded
input_text = re.sub(
    r"\(\(PERS\)    (?P<multiple>    (?:  (?: \s [A-Zí][\wí]+(?:\s[a-zí]+)? )* \sy  )+    (?:\s[^\)]+)    )    \)",
    lambda m: f"((PERS){m['multiple'].replace(' y', ') y ((PERS)')})",
    input_text,
    flags=re.IGNORECASE | re.VERBOSE)  # re.VERBOSE == re.X: whitespace in the pattern is ignored

# # for example 2 (simple)
# input_text = re.sub(
#     r'(\(\(PERS\)(?:\s(?!y)(?:[\wí]+))*)\sy(\s[A-Zí][\wí]+(?:\s[a-zí]+)?\))',
#     r'\g<1>) y ((PERS)\g<2>',
#     input_text,
#     flags=re.MULTILINE)

# second: for example 1

# \1 keeps the "((PERS) name" part and moves the dangling " y" outside the tag
input_text = re.sub(
    r'(\(\(PERS\)(?:\s(?:[\wí]+))*)\sy\)',
    r'\1) y',
    input_text,
    flags=re.MULTILINE)

print(input_text)

result (original examples 1+2):

"((PERS) Marcos Sy) y ((PERS) Lucy) estuvieron ((VERB) jugando) sdds"
"ashsahghgsa ((PERS) María) y ((PERS) Rosa ds) son alumnas de esa escuela y juegan juntas"

result (expanded examples 1+2):

"((PERS) Marcos Sy) y ((PERS) Lucy) ((PERS) Marcos Sy) y ((PERS) Ana) estuvieron ((VERB) jugando) sdds"
"ashsahghgsa ((PERS) María) y ((PERS) Isabel) y ((PERS) Ana) y ((PERS) Rosa ds) son alumnas de esa escuela y juegan juntas"

(Are you sure you expect ((PERS)Rosa ds), without a space? It is also not clear whether you need "ds" after "Rosa". I don't speak Spanish, so maybe it belongs there; either way, the code handles it.)
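
Since this answer relies on re.VERBOSE, here is a small sketch of how that flag lets a pattern be laid out with whitespace and comments. It restates the "example 1" pattern from above; the name trailing_y is made up for this illustration.

```python
import re

# With re.VERBOSE, whitespace and "#" comments inside the pattern are ignored,
# so literal whitespace has to be written as \s (or escaped).
trailing_y = re.compile(r"""
    ( \(\(PERS\)          # literal opening "((PERS)"
      (?: \s [\wí]+ )*    # words of the name, each preceded by one whitespace char
    )
    \s y \)               # a dangling " y" right before the closing ")"
""", re.VERBOSE)

# \1 re-inserts the captured "((PERS) name" part; the " y" moves outside the tag.
fixed = trailing_y.sub(r"\1) y", "((PERS) Marcos Sy y) ((PERS) Lucy)")
print(fixed)  # ((PERS) Marcos Sy) y ((PERS) Lucy)
```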

Answer 3

Score: 1


If there cannot be any other occurrence of a parenthesis, you might use a pattern with 2 capture groups, and then split the second group on y to get the separate parts, so that multiple names are handled as well.

Pattern to get the ((PERS)...) parts with y:

(\(\(PERS\)\s*)((?:(?![()]|\sy\b).)* y\b[^()]*?)\s*\)
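
To see what the two capture groups hold before any replacement is done, a short sketch like the following can help. The sample sentence is taken from the question; the split expression is just one way to separate the names and is an assumption of this sketch.

```python
import re

pattern = re.compile(r"(\(\(PERS\)\s*)((?:(?![()]|\sy\b).)* y\b[^()]*?)\s*\)")

m = pattern.search("ashsahghgsa ((PERS) María y Rosa ds) son alumnas")

opening = m.group(1)   # the tag opener, including the trailing space
content = m.group(2)   # everything inside the tag, " y" separators included
names = re.split(r"\sy\b\s*", content)

print(repr(opening))   # '((PERS) '
print(repr(content))   # 'María y Rosa ds'
print(names)           # ['María', 'Rosa ds']
```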

After these replacements, you can put " y " between all the remaining consecutive ((PERS)...) parts with another pattern:

(\(\(PERS\)[^()]*\))\s*(?=\(\(PERS\)[^()]*\))

import re

pattern = r"(\(\(PERS\)\s*)((?:(?![()]|\sy\b).)* y\b[^()]*?)\s*\)"
s = ("((PERS) Marcos Sy y) ((PERS) Lucy) estuvieron ((VERB) jugando) sdds\n"
     "ashsahghgsa ((PERS) María y Rosa ds) son alumnas de esa escuela y juegan juntas\n"
     "ashsahghgsa ((PERS) María y Rosa ds y Test Person 1 y test person 2) son alumnas de esa escuela y juegan juntas")


def custom_replacement(m):
    # Split the tag content on " y", drop empty pieces (a dangling " y" before ")"),
    # and close each name with ")" so the joiner can open the next ((PERS) tag
    names = [name.strip() + ")" for name in re.split(r"\sy\b", m.group(2)) if name.strip()]
    return m.group(1) + " y ((PERS) ".join(names)


# Step 1: split every ((PERS) ...) tag whose content contains " y"
replaced_names = re.sub(pattern, custom_replacement, s)
# Step 2: put " y " between consecutive ((PERS) ...) tags separated only by whitespace
replaced_pers = re.sub(r"(\(\(PERS\)[^()]*\))\s*(?=\(\(PERS\)[^()]*\))", r"\1 y ", replaced_names)
print(replaced_pers)

Output

((PERS) Marcos Sy) y ((PERS) Lucy) estuvieron ((VERB) jugando) sdds
ashsahghgsa ((PERS) María) y ((PERS) Rosa ds) son alumnas de esa escuela y juegan juntas
ashsahghgsa ((PERS) María) y ((PERS) Rosa ds) y ((PERS) Test Person 1) y ((PERS) test person 2) son alumnas de esa escuela y juegan juntas
