Why are parts of the original string lost when it is split into a list of substrings without removing the separators?

Question

In example 2, the function drops the ((PERS) part from some elements of substrings_with_nouns_and_their_modifiers_list, and re.compile() then fails with an "unbalanced parenthesis" error. The cause is that the capture group of the extraction pattern starts at the article (lo, la, los, las), so a ((PERS) tag written directly in front of the article is left out of the match while its closing ) is kept. One way to fix this is to let the capture group also take any non-whitespace characters that immediately precede the article, so that ((PERS) stays attached to the element instead of being cut off.

To apply this to identification_of_nominal_complements(), the extraction pattern can be changed as follows:

import re
from itertools import chain

def identification_of_nominal_complements(input_text):

    # [^\s]* keeps a leading tag such as ((PERS) attached to the article
    pat_identifier_noun_with_modifiers = r"([^\s]*(?:l[oa]s|l[oa])\s+.+?)\s*(?=\(\(VERB\))"
    substrings_with_nouns_and_their_modifiers_list = re.findall(pat_identifier_noun_with_modifiers, input_text)
    separator_elements = r"\s*(?:,|(,|)\s*y)\s*"

    substrings_with_nouns_and_their_modifiers_list = [re.split(separator_elements, s) for s in substrings_with_nouns_and_their_modifiers_list]
    substrings_with_nouns_and_their_modifiers_list = list(chain.from_iterable(substrings_with_nouns_and_their_modifiers_list))
    substrings_with_nouns_and_their_modifiers_list = list(filter(lambda x: x is not None and x.strip() != '', substrings_with_nouns_and_their_modifiers_list))
    print(substrings_with_nouns_and_their_modifiers_list) # --> list output

    pat = re.compile(rf"(?<!\(PERS\))({'|'.join(substrings_with_nouns_and_their_modifiers_list)})(?![\w)-])")
    input_text = re.sub(pat, r'((PERS)\1)', input_text)

    return input_text

# example 2, it works correctly now:
input_text = "((VERB)Creo) que ((PERS)los viejos gabinetes) ((VERB)estan) en desuso, hay que ((PERS)los viejos gabinetes) ((VERB)hacer) algo con ((PERS)los viejos gabinetes), ya que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes) ((VERB)quedaron) en el deposito. ((PERS)los candelabros) son brillantes los candelabros ((VERB)brillan). ((PERS)los candelabros) ((VERB)estan) ahi"

input_text = identification_of_nominal_complements(input_text)
print(input_text) # --> string output

With this change, the ((PERS) part is no longer cut off from the elements, which avoids the "unbalanced parenthesis" error and gives the expected output.

English original:
import re
from itertools import chain

def identification_of_nominal_complements(input_text):

    pat_identifier_noun_with_modifiers = r"((?:l[oa]s|l[oa])\s+.+?)\s*(?=\(\(VERB\))"
    substrings_with_nouns_and_their_modifiers_list = re.findall(pat_identifier_noun_with_modifiers, input_text)
    separator_elements = r"\s*(?:,|(,|)\s*y)\s*"

    substrings_with_nouns_and_their_modifiers_list = [re.split(separator_elements, s) for s in substrings_with_nouns_and_their_modifiers_list]
    substrings_with_nouns_and_their_modifiers_list = list(chain.from_iterable(substrings_with_nouns_and_their_modifiers_list))
    substrings_with_nouns_and_their_modifiers_list = list(filter(lambda x: x is not None and x.strip() != '', substrings_with_nouns_and_their_modifiers_list))
    print(substrings_with_nouns_and_their_modifiers_list) # --> list output

    pat = re.compile(rf"(?<!\(PERS\))({'|'.join(substrings_with_nouns_and_their_modifiers_list)})(?!['\w)-])")
    input_text = re.sub(pat, r'((PERS)\1)', input_text)

    return input_text

# example 1, it works well:
input_text = "He ((VERB)visto) la maceta de la señora de rojo ((VERB)es) grande. He ((VERB)visto) que la maceta de la señora de rojo y a ((PERS)Lucila) ((VERB)es) grande."

# example 2, it works wrong and gives error:
input_text = "((VERB)Creo) que ((PERS)los viejos gabinetes) ((VERB)estan) en desuso, hay que ((PERS)los viejos gabinetes) ((VERB)hacer) algo con ((PERS)los viejos gabinetes), ya que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes) ((VERB)quedaron) en el deposito. ((PERS)los candelabros) son brillantes los candelabros ((VERB)brillan). ((PERS)los candelabros) ((VERB)estan) ahi"


input_text = identification_of_nominal_complements(input_text)
print(input_text) # --> string output

Why does this function cut off the ((PERS) part of some elements of the substrings_with_nouns_and_their_modifiers_list list in example 2, but not in example 1?
Because of this, some elements end up with unbalanced parentheses, which later raises re.error: unbalanced parenthesis on the line that calls re.compile().

For example 1, the output is correct: no ((PERS) is removed unnecessarily, so the unbalanced parenthesis error does not occur:

['la maceta de la señora de rojo', 'la maceta de la señora de rojo', 'a ((PERS)Lucila)']

'He ((VERB)visto) ((PERS)la maceta de la señora de rojo) ((VERB)es) grande. He ((VERB)visto) que ((PERS)la maceta de la señora de rojo) y a ((PERS)Lucila) ((VERB)es) grande.'

The problem is in example 2. Although the string is processed by the same function, the substring ((PERS) is removed from some elements of the substrings_with_nouns_and_their_modifiers_list list. This triggers an unbalanced parenthesis error in re.compile(), because some substrings then contain ) without a matching (:

['los viejos gabinetes)', 'los viejos gabinetes)', 'los viejos gabinetes)', 'a que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes)', 'los candelabros) son brillantes los candelabros', 'los candelabros)']

Traceback (most recent call last):
pat = re.compile(rf"(?<!\(PERS\))({'|'.join(substrings_with_nouns_and_their_modifiers_list)})(?!['\w)-])")
raise source.error("unbalanced parenthesis")
re.error: unbalanced parenthesis at position 56
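The behaviour described above can be reproduced on a much shorter input. The sketch below is only an illustration: the two sample strings are shortened fragments of the examples, and the variable names are made up for this demonstration. It shows that the capture group starts at the article, so a ((PERS) tag directly in front of it is left out, and that joining such a fragment into a new pattern is what fails.

```python
import re

# Extraction pattern from the question: the capture group begins at the article.
pattern = r"((?:l[oa]s|l[oa])\s+.+?)\s*(?=\(\(VERB\))"

# Shortened fragments of example 2 (tagged) and example 1 (untagged).
tagged = "que ((PERS)los viejos gabinetes) ((VERB)estan) en desuso"
untagged = "He ((VERB)visto) la maceta de la señora de rojo ((VERB)es) grande."

tagged_matches = re.findall(pattern, tagged)
untagged_matches = re.findall(pattern, untagged)
print(tagged_matches)    # the "((PERS)" prefix is missing, the closing ")" is kept
print(untagged_matches)  # nothing precedes the article, so nothing is lost

# Joining a fragment that has ")" but no "(" into a pattern cannot compile.
try:
    re.compile("(" + "|".join(tagged_matches) + ")")
    compile_error = None
except re.error as exc:
    compile_error = exc
print(compile_error)
```

In example 1 the articles are never preceded by a tag, which would explain why the same function does not show the problem there.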

If the identification_of_nominal_complements() function worked correctly, these are the outputs I would expect for the string from example 2. Keeping every ((PERS) avoids the unbalanced parenthesis error in re.compile(). This is the correct output for the example 2 string:

['((PERS)los viejos gabinetes)', '((PERS)los viejos gabinetes)', '((PERS)los viejos gabinetes)', 'a que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes)', '((PERS)los candelabros) son brillantes los candelabros', '((PERS)los candelabros)']

'((VERB)Creo) que ((PERS)los viejos gabinetes) ((VERB)estan) en desuso, hay que ((PERS)los viejos gabinetes) ((VERB)hacer) algo con ((PERS)los viejos gabinetes), ya que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes) ((VERB)quedaron) en el deposito. ((PERS)los candelabros) son brillantes los candelabros ((VERB)brillan). ((PERS)los candelabros) ((VERB)estan) ahi'

What should I modify in the identification_of_nominal_complements() function so that the string from example 2 no longer raises the unbalanced parenthesis error and produces this correct output?

Answer 1

Score: 1

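"Why does this function with example 2 cut off the ((PERS) part of some of the elements..." Because the capture group starts at the article, so a ((PERS) written directly before it is never captured. Adding the pattern [^\s]* at the beginning of the group lets it include those preceding non-whitespace characters: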

pat_identifier_noun_with_modifiers = r"([^\s]*(?:l[oa]s|l[oa])\s+.+?)\s*(?=\(\(VERB\))"

And now the result is:

['((PERS)los viejos gabinetes)', '((PERS)los viejos gabinetes)', '((PERS)los viejos gabinetes)', 'a que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes)', '((PERS)los candelabros) son brillantes los candelabros', '((PERS)los candelabros)']

'((VERB)Creo) que ((PERS)los viejos gabinetes) ((VERB)estan) en desuso, hay que ((PERS)los viejos gabinetes) ((VERB)hacer) algo con ((PERS)los viejos gabinetes), ya que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes) ((VERB)quedaron) en el deposito. ((PERS)los candelabros) son brillantes los candelabros ((VERB)brillan). ((PERS)los candelabros) ((VERB)estan) ahi'
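As a quick check of the [^\s]* prefix, here is a small sketch on a shortened fragment of example 2. The sample string and variable names are chosen for illustration only, and the re.escape step at the end is an assumption that goes beyond this answer. It also shows a side effect worth knowing: once the elements are balanced they compile, but their parentheses are read as regex groups rather than as literal text.

```python
import re

# Extraction pattern from the answer: [^\s]* lets the group start before the article.
fixed_pattern = r"([^\s]*(?:l[oa]s|l[oa])\s+.+?)\s*(?=\(\(VERB\))"
sample = "que ((PERS)los viejos gabinetes) ((VERB)estan) en desuso"

elements = re.findall(fixed_pattern, sample)
print(elements)  # each element keeps its "((PERS)" prefix

# Balanced parentheses compile, but they act as groups, not as literal characters,
# so this pattern looks for "PERSlos viejos gabinetes" and finds nothing here.
as_regex = re.compile("|".join(elements))
print(as_regex.search(sample))

# Assumption beyond the answer: escaping makes the elements match literally.
as_literal = re.compile("|".join(re.escape(e) for e in elements))
print(as_literal.search(sample).group(0))
```

For example 2 this side effect is harmless, since the expected output is the unchanged input. Switching to re.escape is not a drop-in change, though: already tagged text would then match literally, so the substitution step would probably need to skip such elements to avoid wrapping them a second time.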

huangapple
  • Posted on March 3, 2023 at 23:35:29
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75629071.html