Counting word frequency in a pandas dataframe takes too long


Question


After researching here on StackOverflow, I came up with the code below to count the relative frequency of words in one of the columns of my dataframe:

```python
import unidecode
import nltk
from collections import Counter

df['objeto'] = df['objeto'].apply(unidecode.unidecode)
df['objeto'] = df['objeto'].str.replace(r'[^\w\s]', '', regex=True)
stop_words = nltk.corpus.stopwords.words('portuguese')
stop_words.extend(['12', 'termo', 'aquisicao', 'vinte', 'demandas'])
counter = Counter()
for word in " ".join(df['objeto']).lower().split():
    if word not in stop_words:
        counter[word] += 1
print(counter.most_common(10))
for word, count in counter.most_common(100):
    print(word, count)
```

The problem is that the code is taking approximately 30 seconds to execute. What did I do wrong? Is there any way to optimize and improve my code? I intend to create a function like this to do it on other dataframes.

I'm a beginner with pandas and use it only occasionally. I did some research here on StackOverflow. Thank you.
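[Editor's note] One likely bottleneck in the loop above, independent of pandas: `stop_words` is a list, so every `word not in stop_words` test scans the whole list. Converting it to a set makes each lookup O(1). A minimal sketch with a hypothetical stand-in corpus (not the asker's data):

```python
from collections import Counter

# Hypothetical text standing in for " ".join(df['objeto'])
text = "termo de aquisicao foo bar foo demandas bar foo"
stop_words = ['termo', 'aquisicao', 'demandas', 'de']

# Set membership is O(1) per lookup, versus O(len(list)) for a list.
stop_set = set(stop_words)

counter = Counter(w for w in text.lower().split() if w not in stop_set)
print(counter.most_common(2))  # [('foo', 3), ('bar', 2)]
```

With a large corpus and a stop-word list of a few hundred entries, this one-line change alone often removes most of the runtime.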

Answer 1

Score: 1


It helps if you provide some sort of runnable example:

```python
import pandas as pd

df = pd.DataFrame(dict(
    id=['a', 'b', 'c', 'd'],
    objeto=['Foo bar', 'hello Hi FOO', 'Yes hi Hello', 'Pythons PaNdas yeS'],
))
stop_words = ['foo', 'bar']
```

The main issue here is that pandas isn't being used to do the counting.

pandas has `.value_counts()` for exactly this.

In this case, you want to get all the words into a single column, which you can do with `.explode()`:

```python
df['objeto'].str.casefold().str.split().explode()
```

```
0        foo
0        bar
1      hello
1         hi
1        foo
2        yes
2         hi
2      hello
3    pythons
3     pandas
3        yes
Name: objeto, dtype: object
```

You can use `.mask()` to drop the words that are `.isin(stop_words)`, then `.value_counts()`:

```python
df['objeto'].str.casefold().str.split().explode().mask(lambda word: word.isin(stop_words)).value_counts()
```

```
objeto
hello      2
hi         2
yes        2
pythons    1
pandas     1
Name: count, dtype: int64
```
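[Editor's note] Since the question mentions wanting to reuse this on other dataframes, the answer's pipeline can be wrapped in a small function. This is a sketch; the function name, signature, and defaults are my own, not from the answer:

```python
import pandas as pd

def word_counts(series: pd.Series, stop_words=(), top=10) -> pd.Series:
    """Count word frequencies in a Series of strings, ignoring stop words."""
    words = series.str.casefold().str.split().explode()
    # Boolean indexing drops stop words; a set makes .isin() lookups cheap.
    return words[~words.isin(set(stop_words))].value_counts().head(top)

df = pd.DataFrame(dict(
    id=['a', 'b', 'c', 'd'],
    objeto=['Foo bar', 'hello Hi FOO', 'Yes hi Hello', 'Pythons PaNdas yeS'],
))
print(word_counts(df['objeto'], stop_words=['foo', 'bar']))
```

Boolean indexing (`words[~words.isin(...)]`) is used here instead of `.mask()`: it removes the stop-word rows outright rather than turning them into NaN, which amounts to the same result after `.value_counts()`.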

huangapple
  • Posted on 2023-06-18 21:33:06
  • Please keep this link when reposting: https://go.coder-hub.com/76500795.html