June 18, 2023, 21:33:06

Taking too long to count word frequency in pandas dataframe

Question

After researching here on StackOverflow, I came up with the code below to count the relative frequency of words in one of the columns of my dataframe:

import nltk
import unidecode
from collections import Counter

df['objeto'] = df['objeto'].apply(unidecode.unidecode)
# regex=True is required in pandas 2.0+, where str.replace is literal by default
df['objeto'] = df['objeto'].str.replace(r'[^\w\s]', '', regex=True)
stop_words = nltk.corpus.stopwords.words('portuguese')
stop_words.extend(['12', 'termo', 'aquisicao', 'vinte', 'demandas'])
counter = Counter()
for word in " ".join(df['objeto']).lower().split():
    if word not in stop_words:
        counter[word] += 1
print(counter.most_common(10))
for word, count in counter.most_common(100):
    print(word, count)

The problem is that the code is taking approximately 30 seconds to execute. What did I do wrong? Is there any way to optimize and improve my code? I intend to create a function like this to do it on other dataframes.

I'm a beginner in pandas and use it sparingly. I did some research here on StackOverflow. Thank you.

Answer 1

Score: 1

It helps if you provide some sort of runnable example:

import pandas as pd

df = pd.DataFrame(dict(
    id = ['a', 'b', 'c', 'd'],
    objeto = ['Foo bar', 'hello Hi FOO', 'Yes hi Hello', 'Pythons PaNdas yeS']
))
stop_words = ['foo', 'bar']

The main issue here is not using pandas to do the counting.

pandas has .value_counts()

In this case, you want to get all the words into a single column which you can do with .explode()

df['objeto'].str.casefold().str.split().explode()

0        foo
0        bar
1      hello
1         hi
1        foo
2        yes
2         hi
2      hello
3    pythons
3     pandas
3        yes
Name: objeto, dtype: object

You can use .mask() to blank out the words that are .isin(stop_words) (they become NaN), then call .value_counts(), which skips NaN by default.

df['objeto'].str.casefold().str.split().explode().mask(lambda word: word.isin(stop_words)).value_counts()

objeto
hello      2
hi         2
yes        2
pythons    1
pandas     1
Name: count, dtype: int64
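
Since the question mentions wanting a function that can be reused on other dataframes, the same steps could be wrapped up roughly as follows. This is only a sketch: the name word_frequencies and its parameters (column, stop_words, top_n, relative) are made up for illustration and are not part of pandas. It filters with boolean indexing instead of .mask(), which should give the same counts, and it stores the stop words in a set so duplicates are dropped.

```python
import pandas as pd


def word_frequencies(df, column, stop_words=(), top_n=None, relative=False):
    """Count words in a text column, skipping stop words (hypothetical helper)."""
    # Casefold the stop words too, so they match the casefolded text.
    stop_words = {word.casefold() for word in stop_words}
    words = (
        df[column]
        .dropna()
        .str.casefold()   # case-insensitive comparison
        .str.split()      # one list of words per row
        .explode()        # one word per row
    )
    # Drop stop words, then count; normalize=True returns proportions instead.
    counts = words[~words.isin(stop_words)].value_counts(normalize=relative)
    return counts if top_n is None else counts.head(top_n)


df = pd.DataFrame({
    "objeto": ["Foo bar", "hello Hi FOO", "Yes hi Hello", "Pythons PaNdas yeS"],
})
counts = word_frequencies(df, "objeto", stop_words=["foo", "bar"])
shares = word_frequencies(df, "objeto", stop_words=["foo", "bar"], relative=True)
print(counts)
print(shares)
```

Passing relative=True relies on value_counts(normalize=True), which returns each word's share of the remaining words, which seems closer to the "relative frequency" the question asks about. Whether this is actually faster than the original loop on a given dataset is something to measure rather than assume.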
