March 9, 2023, 17:03:19

Python NLP processing if statement not in stop words list

Question

I'm working with the spaCy NLP library and wrote a function that returns a list of tokens from a text.

import spacy

def preprocess_text_spacy(text):
    stop_words = ["a", "the", "is", "are"]
    nlp = spacy.load('en_core_web_sm')
    tokens = set()
    doc = nlp(text)
    for word in doc:
        if word.is_currency:
            tokens.add(word.lower_)
        elif len(word.lower_) == 1:
            if word.is_digit and float(word.text) == 0:
                tokens.add(word.text)
        elif not word.is_punct and not word.is_space and not word.is_quote and not word.is_bracket and not in stop_words:
            tokens.add(word.lower_)
    return list(tokens)

The function does not work: the stop-word removal fails.
Everything works only if I delete the last condition, and not in stop_words.

How can I change this function so that it removes stop words according to a predefined list while keeping all the other conditions?

Answer 1

Score: 1

Your condition is malformed. The last clause of your final elif, not in stop_words, has no left operand, so it is equivalent to writing:

condC = not in stop_words  # invalid: nothing to the left of "not in"
elif condA and condB and condC:
    ...

Executing this code raises a SyntaxError. To test whether an element is in an iterable, the element must appear on the left side of the in keyword. Here you only need to add word:

elif condA and condB and ... and str(word) not in stop_words:
    ...
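As a small check of the point above, the following sketch uses Python's built-in compile to show that the parser rejects the bare clause, while the same test with a left operand is valid. It is not part of the original answer, and the names broken_source, broken_compiles and keep are made up for illustration.

```python
stop_words = ["a", "the", "is", "are"]

# The clause from the question, reduced to its essence:
# "not in" with no left operand.
broken_source = "keep = not in stop_words"

try:
    compile(broken_source, "<example>", "exec")
    broken_compiles = True
except SyntaxError:
    # The parser rejects the statement before any code runs.
    broken_compiles = False

# With an element on the left side, the membership test is valid.
word = "the"
keep = word not in stop_words

print(broken_compiles)  # False
print(keep)             # False, because "the" is a stop word
```

Because the error is raised at parse time, the whole function fails to load, which would explain why nothing works until the clause is removed.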

Answer 2

Score: 1

Your code is almost right and needs only one small change: end the last elif with and str(word) not in stop_words.

import spacy

def preprocess_text_spacy(text):
    stop_words = ["a", "the", "is", "are"]
    nlp = spacy.load('en_core_web_sm')
    tokens = set()
    doc = nlp(text)
    print(doc)  # debug output: show the parsed text
    for word in doc:
        if word.is_currency:
            tokens.add(word.lower_)
        elif len(word.lower_) == 1:
            if word.is_digit and float(word.text) == 0:
                tokens.add(word.text)
        # the token text is now the left operand of "not in"
        elif not word.is_punct and not word.is_space and not word.is_quote and not word.is_bracket and str(word) not in stop_words:
            tokens.add(word.lower_)
    return list(tokens)
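One caveat, offered as an assumption and not something stated in the answers: str(word) compares the token text with its original capitalization, so a capitalized "The" would probably not match the lowercase entries in stop_words. The sketch below uses plain Python strings as stand-ins for token texts (no spaCy involved) to contrast the case-sensitive test with a lowercased one. The names STOP_WORDS, words, case_sensitive and case_insensitive are illustrative only.

```python
# A frozenset gives constant-time membership tests; a list works too.
STOP_WORDS = frozenset({"a", "the", "is", "are"})

# Plain strings standing in for the text of each token.
words = ["The", "cat", "is", "on", "the", "mat"]

# Case-sensitive test: "The" does not equal "the", so it is kept.
case_sensitive = [w for w in words if w not in STOP_WORDS]

# Lowercasing first filters both "The" and "the".
case_insensitive = [w.lower() for w in words if w.lower() not in STOP_WORDS]

print(case_sensitive)    # ['The', 'cat', 'on', 'mat']
print(case_insensitive)  # ['cat', 'on', 'mat']
```

In the spaCy function, the lowercased comparison corresponds to testing word.lower_ instead of str(word).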

Answer 3

Score: 0

Pass stop_words to the function as a parameter (a list of stop words), then change the condition that adds words to the token list so that it checks whether the lowercased word is in stop_words.

import spacy

def preprocess_text_spacy(text, stop_words):
    nlp = spacy.load('en_core_web_sm')
    tokens = []
    doc = nlp(text)
    for word in doc:
        if word.is_currency:
            tokens.append(word.lower_)
        elif len(word.lower_) == 1:
            if word.is_digit and float(word.text) == 0:
                tokens.append(word.text)
        elif not word.is_punct and not word.is_space and not word.is_quote and not word.is_bracket and word.lower_ not in stop_words:
            tokens.append(word.lower_)
    return tokens

Sample:

text = "This is a sample text to demonstrate the function."
stop_words = ["a", "the", "is", "are"]
tokens = preprocess_text_spacy(text, stop_words)
print(tokens)

Output:

['this', 'sample', 'text', 'to', 'demonstrate', 'function']
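Note that this version collects tokens in a list, whereas the question used a set, so repeated words are no longer collapsed. If both the original order and uniqueness matter, one possible approach is to deduplicate with dict.fromkeys, which keeps the first occurrence of each key in insertion order. This is a sketch, not part of the answer, and unique_in_order is a hypothetical helper name.

```python
def unique_in_order(tokens):
    """Return the tokens without duplicates, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(tokens))


sample_tokens = ["this", "text", "this", "sample", "text"]
print(unique_in_order(sample_tokens))  # ['this', 'text', 'sample']
```

Applied to the function above, this would mean returning unique_in_order(tokens) instead of tokens.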
