Why, when splitting a string into a list of substrings without removing the separators, are parts of the original string lost in the process?
Question
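With example 2, the function drops the ((PERS) part from some elements of substrings_with_nouns_and_their_modifiers_list, and re.compile() then raises an "unbalanced parenthesis" error, because the regular expression is built from those elements without accounting for the parentheses they may contain. To fix this, make sure the elements of substrings_with_nouns_and_their_modifiers_list keep balanced ((PERS)...) parentheses when the pattern is built, instead of losing their ((PERS) prefix.
To change identification_of_nominal_complements() accordingly, you can use the following code: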
import re
from itertools import chain

def identification_of_nominal_complements(input_text):
    pat_identifier_noun_with_modifiers = r"((?:l[oa]s|l[oa])\s+.+?)\s*(?=\(\(VERB\))"
    substrings_with_nouns_and_their_modifiers_list = re.findall(pat_identifier_noun_with_modifiers, input_text)
    separator_elements = r"\s*(?:,|(,|)\s*y)\s*"
    substrings_with_nouns_and_their_modifiers_list = [re.split(separator_elements, s) for s in substrings_with_nouns_and_their_modifiers_list]
    substrings_with_nouns_and_their_modifiers_list = list(chain.from_iterable(substrings_with_nouns_and_their_modifiers_list))
    substrings_with_nouns_and_their_modifiers_list = list(filter(lambda x: x is not None and x.strip() != '', substrings_with_nouns_and_their_modifiers_list))
    # Restore the "((PERS)" prefix only on elements that lost it, i.e. those with more ")" than "(".
    # (Heuristic: wrapping every element in "((PERS)...)" would leave an extra ")" and still fail to compile.)
    substrings_with_nouns_and_their_modifiers_list = [s if s.count("(") >= s.count(")") else f"((PERS){s}" for s in substrings_with_nouns_and_their_modifiers_list]
    print(substrings_with_nouns_and_their_modifiers_list)  # --> list output
    pat = re.compile(rf"(?<!\(PERS\))({'|'.join(substrings_with_nouns_and_their_modifiers_list)})(?![\w)-])")
    input_text = re.sub(pat, r'((PERS)\1)', input_text)
    return input_text

# example 2, it works correctly now:
input_text = "((VERB)Creo) que ((PERS)los viejos gabinetes) ((VERB)estan) en desuso, hay que ((PERS)los viejos gabinetes) ((VERB)hacer) algo con ((PERS)los viejos gabinetes), ya que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes) ((VERB)quedaron) en el deposito. ((PERS)los candelabros) son brillantes los candelabros ((VERB)brillan). ((PERS)los candelabros) ((VERB)estan) ahi"
input_text = identification_of_nominal_complements(input_text)
print(input_text)  # --> string output
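With this change, the ((PERS) part is no longer dropped but stays correctly wrapped in each element, which avoids the "unbalanced parenthesis" error and produces the correct output.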
Original English text:
import re
from itertools import chain

def identification_of_nominal_complements(input_text):
    pat_identifier_noun_with_modifiers = r"((?:l[oa]s|l[oa])\s+.+?)\s*(?=\(\(VERB\))"
    substrings_with_nouns_and_their_modifiers_list = re.findall(pat_identifier_noun_with_modifiers, input_text)
    separator_elements = r"\s*(?:,|(,|)\s*y)\s*"
    substrings_with_nouns_and_their_modifiers_list = [re.split(separator_elements, s) for s in substrings_with_nouns_and_their_modifiers_list]
    substrings_with_nouns_and_their_modifiers_list = list(chain.from_iterable(substrings_with_nouns_and_their_modifiers_list))
    substrings_with_nouns_and_their_modifiers_list = list(filter(lambda x: x is not None and x.strip() != '', substrings_with_nouns_and_their_modifiers_list))
    print(substrings_with_nouns_and_their_modifiers_list)  # --> list output
    pat = re.compile(rf"(?<!\(PERS\))({'|'.join(substrings_with_nouns_and_their_modifiers_list)})(?!['\w)-])")
    # \1 re-inserts the matched text inside the ((PERS)...) tag, as the example 1 output shows
    input_text = re.sub(pat, r'((PERS)\1)', input_text)
    return input_text

# example 1, it works well:
input_text = "He ((VERB)visto) la maceta de la señora de rojo ((VERB)es) grande. He ((VERB)visto) que la maceta de la señora de rojo y a ((PERS)Lucila) ((VERB)es) grande."

# example 2, it works wrong and gives an error:
input_text = "((VERB)Creo) que ((PERS)los viejos gabinetes) ((VERB)estan) en desuso, hay que ((PERS)los viejos gabinetes) ((VERB)hacer) algo con ((PERS)los viejos gabinetes), ya que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes) ((VERB)quedaron) en el deposito. ((PERS)los candelabros) son brillantes los candelabros ((VERB)brillan). ((PERS)los candelabros) ((VERB)estan) ahi"

input_text = identification_of_nominal_complements(input_text)
print(input_text)  # --> string output
Why does this function cut off the ((PERS) part of some elements of the substrings_with_nouns_and_their_modifiers_list list with example 2, but not with example 1?
Because of this, elements with unbalanced parentheses are generated, which later raises re.error: unbalanced parenthesis, specifically on the line that calls re.compile().
For example 1 the output is correct: no ((PERS) is removed, so the unbalanced parenthesis error does not occur:
['la maceta de la señora de rojo', 'la maceta de la señora de rojo', 'a ((PERS)Lucila)']
'He ((VERB)visto) ((PERS)la maceta de la señora de rojo) ((VERB)es) grande. He ((VERB)visto) que ((PERS)la maceta de la señora de rojo) y a ((PERS)Lucila) ((VERB)es) grande.'
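As a side note on how the third element above, 'a ((PERS)Lucila)', comes about: the separator pattern contains a capturing group, and re.split() includes captured groups in its result (None when the group did not take part in the match), which is why the function filters out None and empty strings. A minimal sketch using the same separator on made-up inputs (the sample strings are illustrative, not taken from the question):

```python
import re

# Same separator pattern as in the function above.
separator_elements = r"\s*(?:,|(,|)\s*y)\s*"

# re.split() also returns what group 1 captured:
# None when the "," branch matched, '' when the "y" branch matched.
parts = re.split(separator_elements, "a, b y c")
print(parts)  # ['a', None, 'b', '', 'c']

# The filter used in the function drops those placeholders.
cleaned = [p for p in parts if p is not None and p.strip() != ""]
print(cleaned)  # ['a', 'b', 'c']

# The "y" branch is not anchored to word boundaries, so a word that
# merely starts with "y" is split as well.
word_split = re.split(separator_elements, "ya que")
print(word_split)  # ['', '', 'a que']
```

The last case is consistent with the 'a que ...' element that shows up in the list for example 2.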
Example 2 is where the problem shows up. Although the string is processed by the same function, for some reason the substring ((PERS) is removed from some elements of the substrings_with_nouns_and_their_modifiers_list list. This triggers an unbalanced parenthesis error in re.compile(), because some substrings now contain ) without a matching (, since the ((PERS) was removed:
['los viejos gabinetes)', 'los viejos gabinetes)', 'los viejos gabinetes)', 'a que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes)', 'los candelabros) son brillantes los candelabros', 'los candelabros)']
Traceback (most recent call last):
pat = re.compile(rf"(?<!\(PERS\))({'|'.join(substrings_with_nouns_and_their_modifiers_list)})(?!['\w)-])")
raise source.error("unbalanced parenthesis")
re.error: unbalanced parenthesis at position 56
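The error can be reproduced in isolation. In this sketch (the helper name compiles is hypothetical), the elements are joined into an alternation the same way the function does it, so any parenthesis inside an element is parsed as regex syntax; an element with a stray ) therefore breaks compilation, while a balanced one compiles:

```python
import re

def compiles(elements):
    """Hypothetical helper: True if the joined alternation compiles."""
    try:
        # Elements are joined unescaped, as in the function above.
        re.compile(rf"(?<!\(PERS\))({'|'.join(elements)})(?!['\w)-])")
        return True
    except re.error as exc:
        print("re.error:", exc)
        return False

# An element that lost its "((PERS)" prefix carries an unmatched ")".
truncated_ok = compiles(["los viejos gabinetes)"])
print(truncated_ok)  # False

# With the prefix kept, the parentheses balance and the pattern compiles.
balanced_ok = compiles(["((PERS)los viejos gabinetes)"])
print(balanced_ok)  # True
```

Escaping each element with re.escape() would also make the pattern compile, but it turns the parentheses into literal characters and so changes what the pattern matches; it is not a drop-in fix here.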
If the identification_of_nominal_complements() function worked correctly, these are the outputs I should get for the example 2 string; keeping every ((PERS) avoids the unbalanced parenthesis error in re.compile(). This is the correct output for the example 2 string:
['((PERS)los viejos gabinetes)', '((PERS)los viejos gabinetes)', '((PERS)los viejos gabinetes)', 'a que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes)', '((PERS)los candelabros) son brillantes los candelabros', '((PERS)los candelabros)']
'((VERB)Creo) que ((PERS)los viejos gabinetes) ((VERB)estan) en desuso, hay que ((PERS)los viejos gabinetes) ((VERB)hacer) algo con ((PERS)los viejos gabinetes), ya que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes) ((VERB)quedaron) en el deposito. ((PERS)los candelabros) son brillantes los candelabros ((VERB)brillan). ((PERS)los candelabros) ((VERB)estan) ahi'
What should I modify in the identification_of_nominal_complements() function so that the example 2 string no longer raises the unbalanced parenthesis error and I get this correct output?
Answer 1
Score: 1
"Why does this function with example 2 cut off the ((PERS) part of some of the elements..." Because the pattern has no [^\s]* at the beginning, so each match starts at the article and leaves out the ((PERS) immediately before it:
pat_identifier_noun_with_modifiers = r"([^\s]*(?:l[oa]s|l[oa])\s+.+?)\s*(?=\(\(VERB\))"
Now the result is:
['((PERS)los viejos gabinetes)', '((PERS)los viejos gabinetes)', '((PERS)los viejos gabinetes)', 'a que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes)', '((PERS)los candelabros) son brillantes los candelabros', '((PERS)los candelabros)']
'((VERB)Creo) que ((PERS)los viejos gabinetes) ((VERB)estan) en desuso, hay que ((PERS)los viejos gabinetes) ((VERB)hacer) algo con ((PERS)los viejos gabinetes), ya que ((PERS)los viejos gabinetes) son importantes. ((PERS)los viejos gabinetes) ((VERB)quedaron) en el deposito. ((PERS)los candelabros) son brillantes los candelabros ((VERB)brillan). ((PERS)los candelabros) ((VERB)estan) ahi'
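To see the effect of the [^\s]* prefix in isolation, here is a small sketch on a shortened fragment of the example 2 string (the fragment and the variable names are chosen for illustration):

```python
import re

# Shortened fragment of the example 2 input.
text = "que ((PERS)los viejos gabinetes) ((VERB)estan) en desuso"

# Pattern from the question: the match can only start at the article.
original = r"((?:l[oa]s|l[oa])\s+.+?)\s*(?=\(\(VERB\))"
# Pattern from the answer: also takes the non-space run glued to the article.
with_prefix = r"([^\s]*(?:l[oa]s|l[oa])\s+.+?)\s*(?=\(\(VERB\))"

found_original = re.findall(original, text)
found_with_prefix = re.findall(with_prefix, text)

print(found_original)     # ['los viejos gabinetes)']
print(found_with_prefix)  # ['((PERS)los viejos gabinetes)']
```

Because [^\s]* takes any run of non-whitespace characters immediately before the article, it would also pull in other text attached to the article, so this approach relies on the ((PERS) tag being the only thing glued to it.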