Changing the output of the text_tokens function in R

Question

I have a question regarding text mining with the corpus package and its text_tokens() function. I want to use the function for stemming and for removing stop words. I have a large amount of data (almost 1,000,000 comments) to process, but I'm having trouble with the output that text_tokens() produces. Here is a basic example of my data and code:

    library(tidyverse)
    library(corpus)
    library(stopwords)
    text <- data.frame(comment_id = 1:2,
                       comment_content = c("Hallo mein Name ist aaron", "Vielen Lieben Dank für das Video"))
    tmp <- text_tokens(text$comment_content,
                       text_filter(stemmer = "de", drop = stopwords("german")))

My problem is that I want a data.frame as output, with the comment_id in the first column and the word token in the second column. The output I would like to have looks as follows:

    df <- data.frame(comment_id = c(1, 1, 1, 2, 2, 2),
                     comment_tokens = c("hallo", "nam", "aaron", "lieb", "dank", "video"))

I tried various do.call() combinations with cbind and rbind, but none of them gave me the result I need. What function am I looking for? Is it map() from the tidyverse?

Thank you in advance.

Cheers,

Aaron

Answer 1

Score: 1

Here's an option using imap_dfr from purrr:

    library(corpus)
    library(dplyr)
    library(purrr)
    text <- data.frame(comment_id = 1:2,
                       comment_content = c("Hallo mein Name ist aaron", "Vielen Lieben Dank für das Video"))
    tmp <- text_tokens(text$comment_content,
                       text_filter(stemmer = "de", drop = corpus::stopwords_de)) %>%
      purrr::imap_dfr(function(x, y) {
        tibble(
          comment_id = y,
          comment_tokens = x
        )
      })
    tmp
    #> # A tibble: 6 × 2
    #>   comment_id comment_tokens
    #>        <int> <chr>
    #> 1          1 hallo
    #> 2          1 nam
    #> 3          1 aaron
    #> 4          2 lieb
    #> 5          2 dank
    #> 6          2 video
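
One thing to watch: for an unnamed list, imap_dfr passes the position of each element as y, so comment_id above matches the real ids only because they happen to be 1:2. A minimal sketch that carries the actual ids through, assuming purrr >= 1.0.0 (where the `_dfr` helpers are superseded in favour of list_rbind()). The ids 101 and 205 are made up for illustration, and the tokens list is a hand-written stand-in for what text_tokens() returns for the two example comments:

```r
library(purrr)   # list_rbind() assumes purrr >= 1.0.0
library(tibble)

# Hypothetical ids that are not simply 1, 2, ... to show the difference.
comment_id <- c(101L, 205L)

# Stand-in for the list that text_tokens() returns for the two example comments.
tokens <- list(c("hallo", "nam", "aaron"), c("lieb", "dank", "video"))

# Pair each id with its token vector, build one tibble per comment,
# then stack the per-comment tibbles into a single one.
tmp <- list_rbind(
  map2(comment_id, tokens, function(id, tok) {
    tibble(comment_id = id, comment_tokens = tok)
  })
)
tmp
```

With real data, comment_id would be text$comment_id and tokens would be the result of the text_tokens() call.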

The same imap_dfr call can also be written with purrr's formula shorthand instead of function(x, y):

    tmp <- text_tokens(text$comment_content, text_filter(stemmer = "de", drop = corpus::stopwords_de)) %>%
      purrr::imap_dfr(~ tibble(comment_id = .y, comment_tokens = .x))
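
With close to 1,000,000 comments, building one tibble per comment may become slow. That is an assumption, not something measured here. A vectorised base R alternative is sketched below: repeat each id once per token with rep() and lengths(), then flatten the list with unlist(). As before, the ids are hypothetical and the tokens list is a hand-written stand-in for the text_tokens() result:

```r
# Hypothetical, non-sequential ids plus the example comments from the question.
text <- data.frame(
  comment_id = c(101L, 205L),
  comment_content = c("Hallo mein Name ist aaron",
                      "Vielen Lieben Dank für das Video"),
  stringsAsFactors = FALSE
)

# Stand-in for text_tokens(text$comment_content, ...), so the sketch
# runs without the corpus package installed.
tokens <- list(c("hallo", "nam", "aaron"), c("lieb", "dank", "video"))

# lengths(tokens) gives the number of tokens per comment; rep() repeats
# each id that many times, and unlist() flattens the tokens in the same order.
df <- data.frame(
  comment_id = rep(text$comment_id, lengths(tokens)),
  comment_tokens = unlist(tokens, use.names = FALSE),
  stringsAsFactors = FALSE
)
df
```

A comment whose tokens are all dropped contributes a zero-length vector, so it simply produces no rows in df.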

huangapple
  • Posted on March 9, 2023, 22:58:22
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75686316.html