Python NLP processing if statement not in stop words list
Question
I'm working with the NLP spacy library and I created a function to return a list of tokens from a text.
import spacy

def preprocess_text_spacy(text):
    stop_words = ["a", "the", "is", "are"]
    nlp = spacy.load('en_core_web_sm')
    tokens = set()
    doc = nlp(text)
    for word in doc:
        if word.is_currency:
            tokens.add(word.lower_)
        elif len(word.lower_) == 1:
            if word.is_digit and float(word.text) == 0:
                tokens.add(word.text)
        elif not word.is_punct and not word.is_space and not word.is_quote and not word.is_bracket and not in stop_words:
            tokens.add(word.lower_)
    return list(tokens)
This function is not correct because removing stop words is not working. Everything is OK only if I delete the last condition "and not in stop_words".

How can I upgrade this function so that it removes stop words according to a predefined list, in addition to all the other condition statements?
Answer 1

Score: 1
You are writing your condition incorrectly. Your last elif is equivalent to this:
condC = not in stop_words
elif condA and condB and not in condC:
    ...
If you try to execute this code you will get a syntax error. To check whether some element is in some iterable, you need to provide that element on the left side of the keyword in. You just have to write the word on the left:
elif condA and condB and ... and str(word) not in stop_words:
    ...
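
For illustration, here is a minimal standalone sketch of the membership test; the variable names below are just examples, not from the original code:

stop_words = ["a", "the", "is", "are"]
word = "the"

print(word in stop_words)      # True: the element to look for goes on the left of "in"
print(word not in stop_words)  # False: "not in" needs a left operand as well

Writing only "and not in stop_words", as in the question, leaves "not in" without a left operand, which is why Python rejects the whole function with a SyntaxError.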
Answer 2

Score: 1
Your code looks fine to me; there is just one small change: at the end of the elif, put and str(word) not in stop_words.
import spacy

def preprocess_text_spacy(text):
    stop_words = ["a", "the", "is", "are"]
    nlp = spacy.load('en_core_web_sm')
    tokens = set()
    doc = nlp(text)
    print(doc)
    for word in doc:
        if word.is_currency:
            tokens.add(word.lower_)
        elif len(word.lower_) == 1:
            if word.is_digit and float(word.text) == 0:
                tokens.add(word.text)
        # str(word) gives the token's text, so the membership test now has a left operand
        elif not word.is_punct and not word.is_space and not word.is_quote and not word.is_bracket and str(word) not in stop_words:
            tokens.add(word.lower_)
    return list(tokens)
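
As a quick sanity check, you could call the fixed function on a short sentence; the sample text below is just an illustration and assumes the en_core_web_sm model is installed:

text = "The price is high and the quality is great."
print(preprocess_text_spacy(text))
# lower-case stop words such as "is" and "the" are filtered out

One thing to be aware of: str(word) keeps the token's original casing, so a capitalized "The" still passes the check and ends up in the result as "the". If the stop-word check should ignore case, compare word.lower_ instead of str(word), as answer 3 below does.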
Answer 3

Score: 0
You need to add stop_words as a parameter to the function, so that it takes a list of stop words as input, and then modify the condition for adding words to the token list to check whether the word is in the stop_words list or not.
import spacy

def preprocess_text_spacy(text, stop_words):
    nlp = spacy.load('en_core_web_sm')
    tokens = []
    doc = nlp(text)
    for word in doc:
        if word.is_currency:
            tokens.append(word.lower_)
        elif len(word.lower_) == 1:
            if word.is_digit and float(word.text) == 0:
                tokens.append(word.text)
        # word.lower_ is the lower-cased token text, so the stop-word check is case-insensitive
        elif not word.is_punct and not word.is_space and not word.is_quote and not word.is_bracket and word.lower_ not in stop_words:
            tokens.append(word.lower_)
    return tokens
Sample:
text = "This is a sample text to demonstrate the function."
stop_words = ["a", "the", "is", "are"]
tokens = preprocess_text_spacy(text, stop_words)
print(tokens)
Output:
['this', 'sample', 'text', 'to', 'demonstrate', 'function']