Taking too long to count word frequency in pandas dataframe

Question
After researching here on StackOverflow, I came up with the code below to count the relative frequency of words in one of the columns of my dataframe:
df['objeto'] = df['objeto'].apply(unidecode.unidecode)
df['objeto'] = df['objeto'].str.replace(r'[^\w\s]', '', regex=True)
stop_words = nltk.corpus.stopwords.words('portuguese')
stop_words.extend(['12', 'termo', 'aquisicao', 'vinte', 'demandas'])

counter = Counter()
for word in " ".join(df['objeto']).lower().split():
    if word not in stop_words:
        counter[word] += 1

print(counter.most_common(10))
for word, count in counter.most_common(100):
    print(word, count)
The problem is that the code takes approximately 30 seconds to execute. What am I doing wrong? Is there any way to optimize and improve it? I intend to write a similar function to run on other dataframes.
I'm a beginner with pandas and use it sparingly. I did some research here on Stack Overflow. Thank you.
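One likely cost in the code above, independent of pandas, is that `word not in stop_words` scans a Python list on every iteration. A minimal sketch of the set-based fix, using made-up sample words rather than the asker's data:

```python
from collections import Counter

# Hypothetical sample words standing in for the dataframe column.
words = "o termo de aquisicao do termo vinte e doze".split()

# A set gives O(1) membership tests; a list gives O(n) per lookup.
stop_words = set(['termo', 'aquisicao', 'vinte'])

counter = Counter(w for w in words if w not in stop_words)
print(counter.most_common(3))
```

With a stop-word list of a few hundred entries and a column of many thousands of words, this change alone usually removes most of the runtime.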
Answer 1
Score: 1
It helps if you provide some sort of runnable example:
import pandas as pd

df = pd.DataFrame(dict(
    id = ['a', 'b', 'c', 'd'],
    objeto = ['Foo bar', 'hello Hi FOO', 'Yes hi Hello', 'Pythons PaNdas yeS']
))
stop_words = ['foo', 'bar']
The main issue here is not using pandas to do the counting: pandas has .value_counts().
In this case, you want to get all the words into a single column, which you can do with .explode():
df['objeto'].str.casefold().str.split().explode()
0 foo
0 bar
1 hello
1 hi
1 foo
2 yes
2 hi
2 hello
3 pythons
3 pandas
3 yes
Name: objeto, dtype: object
You can use .mask() to remove the words that are .isin(stop_words), then apply .value_counts():
df['objeto'].str.casefold().str.split().explode().mask(lambda word: word.isin(stop_words)).value_counts()
objeto
hello 2
hi 2
yes 2
pythons 1
pandas 1
Name: count, dtype: int64
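Since the asker intends to reuse this on other dataframes, the chain above could be wrapped into a function. A sketch under my own naming (the function name and signature are not from the answer); it filters with boolean indexing instead of .mask(), which skips the intermediate NaN rows but produces the same counts:

```python
import pandas as pd

def word_counts(series: pd.Series, stop_words) -> pd.Series:
    """Count case-folded words in a Series of strings, excluding stop words."""
    # One row per word: lowercase, split on whitespace, then explode.
    words = series.str.casefold().str.split().explode()
    # Keep only words outside the stop list; a set makes isin cheap.
    return words[~words.isin(set(stop_words))].value_counts()

df = pd.DataFrame({'objeto': ['Foo bar', 'hello Hi FOO', 'Yes hi Hello']})
counts = word_counts(df['objeto'], ['foo', 'bar'])
print(counts)
```

The same call then works unchanged on any other dataframe column of strings.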