Using Natural Language Processing, how can we add our own Stop Words to a list?
Question
I am testing the library below, based on this code sample:
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from collections import Counter
df_new = pd.DataFrame(['okay', 'yeah', 'thank', 'im'])
stop_words = text.ENGLISH_STOP_WORDS.union(df_new)
#stop_words
w_counts = Counter(w for w in ' '.join(df['text_without_stopwords']).split() if w.lower() not in stop_words)
df_words = pd.DataFrame.from_dict(w_counts, orient='index').reset_index()
df_words.columns = ['word','count']
import seaborn as sns
# selecting top 20 most frequent words
d = df_words.nlargest(columns="count", n = 25)
plt.figure(figsize=(20,5))
ax = sns.barplot(data=d, x= "word", y = "count")
ax.set(ylabel = 'Count')
plt.show()
I'm seeing this chart.
I'm trying to add these words to stop words: 'okay', 'yeah', 'thank', 'im'
But...they are all coming through!! What's wrong here??
Answer 1
Score: 1
Instead of joining all the filtered words into an io.StringIO buffer and loading it into a dataframe, a much more straightforward and quicker way is to use collections.Counter with its most_common method to get the word counts right away:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from collections import Counter

# sample dataframe
df = pd.DataFrame({'text_without_stopwords': ['my stop text hex words',
                                              'with some stop boards words',
                                              'stop text']})
# count every word that is not an English stop word
w_counts = Counter(w for w in ' '.join(df['text_without_stopwords']).split()
                   if w.lower() not in ENGLISH_STOP_WORDS)
# most_common(25) yields (word, count) pairs; zip(*...) splits them into x and y values
plt.bar(*zip(*w_counts.most_common(25)))
plt.xticks(rotation=60)
plt.show()
[Sample plot: bar chart of the word counts]
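The `plt.bar(*zip(*w_counts.most_common(25)))` call packs two steps into one expression, so it may help to see them separately. The following is a minimal sketch on a made-up word list (the input and the names `labels` and `heights` are illustrative assumptions, not part of the original answer):

```python
from collections import Counter

# Hypothetical input, used only to illustrate the idiom.
words = "stop text stop words text stop".split()
w_counts = Counter(words)

# most_common(n) returns a list of (word, count) pairs, highest count first.
top = w_counts.most_common(2)

# zip(*pairs) transposes the pairs into one tuple of words and one tuple of counts,
# which is the (x, height) shape that plt.bar expects as positional arguments.
labels, heights = zip(*top)

print(top)      # [('stop', 3), ('text', 2)]
print(labels)   # ('stop', 'text')
print(heights)  # (3, 2)
```

With real data, `plt.bar(labels, heights)` would be equivalent to the one-liner in the answer.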
Answer 2
Score: 1
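Try building w_counts so that it excludes the words in df_new. I think the issue with your code is that you create df_new containing the words you want to add to the stop word list, but those words are never actually removed: union iterates over whatever it is given, and iterating over a DataFrame yields its column labels (here 0) rather than its values, so the four words never end up in stop_words. Passing the words as a plain list works: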
stop_words = ENGLISH_STOP_WORDS.union(['okay', 'yeah', 'thank', 'im'])
w_counts = Counter(w for w in ' '.join(df['text_without_stopwords']).split() if w.lower() not in stop_words)
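To check that the extra words really are filtered out, here is a small self-contained sketch of the same approach. It assumes a tiny hand-picked `BASE_STOP_WORDS` set as a stand-in for `ENGLISH_STOP_WORDS` and a hypothetical `texts` list in place of `df['text_without_stopwords']`, so that it runs without scikit-learn or pandas:

```python
from collections import Counter

# Stand-in for sklearn's ENGLISH_STOP_WORDS (assumption: a small hand-picked subset).
BASE_STOP_WORDS = frozenset({"a", "the", "is", "for", "you", "so", "much"})

# Hypothetical rows standing in for df['text_without_stopwords'].
texts = [
    "Okay thank you so much",
    "yeah the package is late",
    "im waiting for the package",
]

# frozenset.union accepts any iterable of words and returns a new frozenset;
# the original set is left unchanged.
stop_words = BASE_STOP_WORDS.union(["okay", "yeah", "thank", "im"])

# Lower-case each word before the membership test so "Okay" is filtered too.
w_counts = Counter(
    w for w in " ".join(texts).split()
    if w.lower() not in stop_words
)

print(dict(w_counts))  # {'package': 2, 'late': 1, 'waiting': 1}
```

If the extra words already live in a DataFrame column, converting that column to a list first (for example with `.tolist()`) before calling `union` should give the same result.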