March 3, 2023, 23:35:29

Why are parts of the original string lost when it is split into a list of substrings without removing the separators?

Question

In example 2 the function seems to strip the ((PERS) part from some elements of substrings_with_nouns_and_their_modifiers_list, and re.compile() then fails with "unbalanced parenthesis". Nothing is actually deleted: pat_identifier_noun_with_modifiers can only start a match at the article (los, las, lo, la), so a ((PERS) that sits directly before the article is never captured, while its closing ) is. The elements are then joined into a regular expression as they are, and the stray ) unbalances the pattern. The fix is to let the match reach back over the non-whitespace characters in front of the article, so that ((PERS) stays inside the element instead of being left out.

To modify identification_of_nominal_complements() accordingly, change the first pattern as follows:

import re
from itertools import chain
def identification_of_nominal_complements(input_text):
    # [^\s]* lets each match reach back to the start of the token, so a leading "((PERS)" stays in the element
    pat_identifier_noun_with_modifiers = r"([^\s]*(?:l[oa]s|l[oa])\s+.+?)\s*(?=\(\(VERB\))"
    substrings_with_nouns_and_their_modifiers_list = re.findall(pat_identifier_noun_with_modifiers, input_text)
    separator_elements = r"\s*(?:,|(,|)\s*y)\s*"
    substrings_with_nouns_and_their_modifiers_list = [re.split(separator_elements, s) for s in substrings_with_nouns_and_their_modifiers_list]
    substrings_with_nouns_and_their_modifiers_list = list(chain.from_iterable(substrings_with_nouns_and_their_modifiers_list))
    # re.split() also returns the capture group (None or ''), so drop those entries
    substrings_with_nouns_and_their_modifiers_list = list(filter(lambda x: x is not None and x.strip() != '', substrings_with_nouns_and_their_modifiers_list))
    print(substrings_with_nouns_and_their_modifiers_list) # --> list output
    # The elements are inserted as regex syntax, so their parentheses must be balanced
    pat = re.compile(rf"(?<!\(PERS\))({'|'.join(substrings_with_nouns_and_their_modifiers_list)})(?![\w)-])")
    input_text = re.sub(pat, r'((PERS)\1)', input_text)
    return input_text
# example 2, it no longer raises an error:
input_text = "((VERB)Creo) que ((PERS)los viejos gabinetes) ((VERB)estan) en desuso, hay que ((PERS)los viejos gabinetes) ((VERB)hacer) algo con ((PERS)los viejos gabinetes), ya que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes) ((VERB)quedaron) en el deposito. ((PERS)los candelabros) son brillantes los candelabros ((VERB)brillan). ((PERS)los candelabros) ((VERB)estan) ahi"
input_text = identification_of_nominal_complements(input_text)
print(input_text) # --> string output

With this change the ((PERS) part is kept in every element, the parentheses stay balanced, re.compile() no longer raises "unbalanced parenthesis", and the expected output is produced.

Original question:

import re
from itertools import chain
def identification_of_nominal_complements(input_text):
    pat_identifier_noun_with_modifiers = r"((?:l[oa]s|l[oa])\s+.+?)\s*(?=\(\(VERB\))"
    substrings_with_nouns_and_their_modifiers_list = re.findall(pat_identifier_noun_with_modifiers, input_text)
    separator_elements = r"\s*(?:,|(,|)\s*y)\s*"
    substrings_with_nouns_and_their_modifiers_list = [re.split(separator_elements, s) for s in substrings_with_nouns_and_their_modifiers_list]
    substrings_with_nouns_and_their_modifiers_list = list(chain.from_iterable(substrings_with_nouns_and_their_modifiers_list))
    substrings_with_nouns_and_their_modifiers_list = list(filter(lambda x: x is not None and x.strip() != '', substrings_with_nouns_and_their_modifiers_list))
    print(substrings_with_nouns_and_their_modifiers_list) # --> list output
    pat = re.compile(rf"(?<!\(PERS\))({'|'.join(substrings_with_nouns_and_their_modifiers_list)})(?!['\w)-])")
    # \1 re-inserts the matched phrase inside the ((PERS)...) tag, as in the example 1 output
    input_text = re.sub(pat, r'((PERS)\1)', input_text)
    return input_text
#example 1, it works well:
input_text = "He ((VERB)visto) la maceta de la señora de rojo ((VERB)es) grande. He ((VERB)visto) que la maceta de la señora de rojo y a ((PERS)Lucila) ((VERB)es) grande."
#example 2, it works wrong and gives error:
input_text = "((VERB)Creo) que ((PERS)los viejos gabinetes) ((VERB)estan) en desuso, hay que ((PERS)los viejos gabinetes) ((VERB)hacer) algo con ((PERS)los viejos gabinetes), ya que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes) ((VERB)quedaron) en el deposito. ((PERS)los candelabros) son brillantes los candelabros ((VERB)brillan). ((PERS)los candelabros) ((VERB)estan) ahi"
input_text = identification_of_nominal_complements(input_text)
print(input_text) # --> string output

Why does this function cut off the ((PERS) part of some elements of the substrings_with_nouns_and_their_modifiers_list list with example 2, but not with example 1?
As a result, some elements have unbalanced parentheses, which later raises re.error: unbalanced parenthesis, specifically on the line that calls re.compile().
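
For context, here is a minimal sketch (not part of the code in the question) of why parentheses inside the list elements matter. The elements are pasted into the pattern as regex syntax, so ( and ) open and close groups instead of matching literal characters. The element and text values are shortened from example 1, and re.escape() appears only for comparison; using it inside the function would change what gets matched.

```python
import re

element = "a ((PERS)Lucila)"
text = "y a ((PERS)Lucila) ((VERB)es) grande"

# Unescaped: "((PERS)Lucila)" is read as two nested groups, so the
# pattern looks for the literal text "a PERSLucila".
unescaped = re.compile(f"({element})")
print(unescaped.search(text))            # None
print(unescaped.search("a PERSLucila"))  # a match object

# Escaped: every parenthesis is matched literally.
escaped = re.compile(f"({re.escape(element)})")
print(escaped.search(text).group(1))     # a ((PERS)Lucila)
```

This is why an element with balanced parentheses compiles quietly without matching the tagged text, while an element with a lone ) makes re.compile() fail.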

For example 1 the output is correct: no ((PERS) is removed unnecessarily, so the unbalanced parentheses error does not occur:

['la maceta de la señora de rojo', 'la maceta de la señora de rojo', 'a ((PERS)Lucila)']
'He ((VERB)visto) ((PERS)la maceta de la señora de rojo) ((VERB)es) grande. He ((VERB)visto) que ((PERS)la maceta de la señora de rojo) y a ((PERS)Lucila) ((VERB)es) grande.'
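
A small sketch of what the re.split() step does with separator_elements may help when reading these lists. It assumes only the separator pattern from the question; the two input strings are shortened fragments of the examples. Because the pattern contains the capturing group (,|), re.split() also returns that group's value after each separator, which is why the function filters out None and empty strings.

```python
import re

# Separator pattern from the question; (,|) is a capturing group.
separator_elements = r"\s*(?:,|(,|)\s*y)\s*"

# re.split() returns the group's value after every separator it finds:
# None when the plain "," branch matched, '' when the "y" branch matched.
pieces = re.split(separator_elements, "la maceta de la señora de rojo y a ((PERS)Lucila)")
print(pieces)  # ['la maceta de la señora de rojo', '', 'a ((PERS)Lucila)']

# A bare "y" is a separator even inside a word, so "ya" loses its first letter.
pieces_ya = re.split(separator_elements, "los viejos gabinetes), ya que")
print(pieces_ya)  # ['los viejos gabinetes)', None, '', '', 'a que']

# Same filter as in the function: drop None and blank entries.
cleaned = [p for p in pieces_ya if p is not None and p.strip() != ""]
print(cleaned)  # ['los viejos gabinetes)', 'a que']
```

The second case shows where an element beginning with "a que" can come from: the y of "ya" is consumed as a separator.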

Example 2 is where the problem is. Although the string is processed by the same function, for some reason the substring ((PERS) is missing from some elements of the substrings_with_nouns_and_their_modifiers_list list. This triggers an unbalanced parenthesis error in re.compile(), because in this case some substrings contain ) but no matching (, since the ((PERS) was removed:

['los viejos gabinetes)', 'los viejos gabinetes)', 'los viejos gabinetes)', 'a que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes)', 'los candelabros) son brillantes los candelabros', 'los candelabros)']
Traceback (most recent call last):
pat = re.compile(rf"(?<!\(PERS\))({'|'.join(substrings_with_nouns_and_their_modifiers_list)})(?!['\w)-])")
raise source.error("unbalanced parenthesis")
re.error: unbalanced parenthesis at position 56
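
The failure can be reproduced on a much shorter input. The sketch below uses the first pattern from the question unchanged; the text fragment and the simplified join at the end are assumptions made to keep the example small.

```python
import re

# Pattern from the question: a match can only start at the article itself.
pat = r"((?:l[oa]s|l[oa])\s+.+?)\s*(?=\(\(VERB\))"

text = "que ((PERS)los viejos gabinetes) ((VERB)estan) en desuso"
found = re.findall(pat, text)
print(found)  # ['los viejos gabinetes)']

# The opening "((PERS)" lies before the article, so it is outside the match,
# while the closing ")" is inside it. Joining such elements into a new
# pattern leaves a ")" with no partner.
try:
    re.compile("(" + "|".join(found) + ")")
    compiled = True
except re.error as exc:
    compiled = False
    print("re.error:", exc)
```

So the split step does not remove anything here; the ((PERS) was never part of what re.findall() captured.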

If the identification_of_nominal_complements() function worked correctly, these are the outputs I would get for the string from example 2. Keeping every ((PERS) avoids the unbalanced parenthesis error in re.compile(). This is the correct output for the example 2 string:

['((PERS)los viejos gabinetes)', '((PERS)los viejos gabinetes)', '((PERS)los viejos gabinetes)', 'a que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes)', '((PERS)los candelabros) son brillantes los candelabros', '((PERS)los candelabros)']
'((VERB)Creo) que ((PERS)los viejos gabinetes) ((VERB)estan) en desuso, hay que ((PERS)los viejos gabinetes) ((VERB)hacer) algo con ((PERS)los viejos gabinetes), ya que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes) ((VERB)quedaron) en el deposito. ((PERS)los candelabros) son brillantes los candelabros ((VERB)brillan). ((PERS)los candelabros) ((VERB)estan) ahi'

What should I modify in the identification_of_nominal_complements() function so that the string from example 2 no longer raises the unbalanced parentheses error and I get this correct output?

Answer 1

Score: 1

"Why does this function with example 2 cut off the ((PERS) part of some of the elements..."

Because there is no [^\s]* at the beginning of the pattern, a match can only start at the article, and a ((PERS) directly in front of it is left out. Add it:

pat_identifier_noun_with_modifiers = r"([^\s]*(?:l[oa]s|l[oa])\s+.+?)\s*(?=\(\(VERB\))"

And now the result is:

['((PERS)los viejos gabinetes)', '((PERS)los viejos gabinetes)', '((PERS)los viejos gabinetes)', 'a que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes)', '((PERS)los candelabros) son brillantes los candelabros', '((PERS)los candelabros)']
'((VERB)Creo) que ((PERS)los viejos gabinetes) ((VERB)estan) en desuso, hay que ((PERS)los viejos gabinetes) ((VERB)hacer) algo con ((PERS)los viejos gabinetes), ya que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes) ((VERB)quedaron) en el deposito. ((PERS)los candelabros) son brillantes los candelabros ((VERB)brillan). ((PERS)los candelabros) ((VERB)estan) ahi'
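
To see the effect of the added [^\s]* in isolation, the two patterns can be compared on short inputs. This is only a sketch: the variable names and the two sample strings are made up for illustration, and the final join is simpler than the one in the function.

```python
import re

original = r"((?:l[oa]s|l[oa])\s+.+?)\s*(?=\(\(VERB\))"
fixed = r"([^\s]*(?:l[oa]s|l[oa])\s+.+?)\s*(?=\(\(VERB\))"

tagged = "hay que ((PERS)los viejos gabinetes) ((VERB)hacer) algo"
untagged = "He ((VERB)visto) la maceta ((VERB)es) grande"

print(re.findall(original, tagged))  # ['los viejos gabinetes)']
print(re.findall(fixed, tagged))     # ['((PERS)los viejos gabinetes)']
print(re.findall(fixed, untagged))   # ['la maceta']

# Elements found by the fixed pattern have balanced parentheses,
# so joining them into a larger pattern compiles without error.
re.compile("(" + "|".join(re.findall(fixed, tagged)) + ")")
```

The prefix only extends a match backwards to the previous whitespace, so phrases that are not tagged yet are captured the same way as before.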
