March 4, 2023 03:48:20

Using Natural Language Processing, how can we add our own Stop Words to a list?

Question

I am testing the library below, based on this code sample:

import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from collections import Counter

df_new = pd.DataFrame(['okay', 'yeah', 'thank', 'im'])
stop_words = text.ENGLISH_STOP_WORDS.union(df_new)
#stop_words

w_counts = Counter(w for w in ' '.join(df['text_without_stopwords']).split() if w.lower() not in stop_words)

df_words = pd.DataFrame.from_dict(w_counts, orient='index').reset_index()
df_words.columns = ['word', 'count']

import seaborn as sns
# selecting the 25 most frequent words
d = df_words.nlargest(columns="count", n=25)
plt.figure(figsize=(20, 5))
ax = sns.barplot(data=d, x="word", y="count")
ax.set(ylabel='Count')
plt.show()

I'm seeing this chart.

I'm trying to add these words to stop words: 'okay', 'yeah', 'thank', 'im'

But...they are all coming through!! What's wrong here??

Answer 1

Score: 1

Instead of joining all the filtered words into an io.StringIO buffer and loading it into a DataFrame, a much more straightforward and quicker way is to use collections.Counter with its most_common method to get the word counts right away:

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from collections import Counter

# sample dataframe
df = pd.DataFrame({'text_without_stopwords': ['my stop text hex words',
                                              'with some stop boards words', 'stop text']})
# count every word that is not an English stop word
w_counts = Counter(w for w in ' '.join(df['text_without_stopwords']).split()
                   if w.lower() not in ENGLISH_STOP_WORDS)
plt.bar(*zip(*w_counts.most_common(25)))
plt.xticks(rotation=60)
plt.show()

Sample plot: [image not available]
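
For readers unsure what the plt.bar(*zip(*w_counts.most_common(25))) line receives, here is a minimal sketch without the plotting step. The texts list and the custom_stop_words set are made-up stand-ins for the DataFrame column and for ENGLISH_STOP_WORDS, so the snippet runs with the standard library alone.

```python
from collections import Counter

# Hypothetical sample text and stop-word set, used only for illustration.
texts = ["my stop text hex words", "with some stop boards words", "stop text"]
custom_stop_words = frozenset({"my", "with", "some"})

# Count every word that is not in the stop-word set.
w_counts = Counter(
    w for w in " ".join(texts).split() if w.lower() not in custom_stop_words
)

# most_common(n) returns a list of (word, count) tuples, highest count first.
top = w_counts.most_common(3)

# zip(*top) transposes those pairs into one tuple of words and one tuple of
# counts, which are the two positional arguments (x, height) passed to plt.bar.
words, counts = zip(*top)
print(words)
print(counts)
```

Unpacking with a leading * in the plt.bar call does the same thing in one step: the first tuple becomes the bar labels and the second the bar heights.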

Answer 2

Score: 1

Try building w_counts so that it actually excludes the words in df_new. I think the issue with your code is that df_new is a DataFrame: union() iterates over it and gets its column labels rather than the words, so the words you want to add never reach the stop-word set and are not removed. Pass the words as a plain list instead:

stop_words = ENGLISH_STOP_WORDS.union(['okay', 'yeah', 'thank', 'im'])
w_counts = Counter(w for w in ' '.join(df['text_without_stopwords']).split() if w.lower() not in stop_words)
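
To see why the original union(df_new) call had no effect, the following sketch compares the two calls side by side. It assumes pandas is installed; base_stop_words is a small made-up stand-in for sklearn's ENGLISH_STOP_WORDS (which is also a frozenset), used so the snippet does not need scikit-learn.

```python
import pandas as pd

# Hypothetical stand-in for sklearn's ENGLISH_STOP_WORDS (also a frozenset).
base_stop_words = frozenset({"the", "and", "of"})

df_new = pd.DataFrame(["okay", "yeah", "thank", "im"])

# Iterating over a DataFrame yields its column labels (here only 0),
# so union() adds the label 0 and none of the words.
broken = base_stop_words.union(df_new)
print("okay" in broken)

# Passing the column's values as a list adds the words themselves.
fixed = base_stop_words.union(df_new[0].tolist())
print("okay" in fixed)
```

If the extra words already live in a DataFrame, converting the column to a list as above is one option; if they are known up front, a plain list literal as in the answer is simpler.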
