March 9, 2023, 22:58:22

Changing the output of the text_tokens function in R

Question


I have a question regarding text mining with the corpus package and its text_tokens() function. I want to use the function for stemming and removing stop words, and I have a large amount of data (almost 1,000,000 comments) to process. However, I am having trouble with the output that text_tokens() produces. Here is a basic example of my data and code:

library(tidyverse)
library(corpus)
library(stopwords)
text <- data.frame(comment_id = 1:2,
                   comment_content = c("Hallo mein Name ist aaron", "Vielen Lieben Dank für das Video"))
tmp <- text_tokens(text$comment_content,
                   text_filter(stemmer = "de", drop = stopwords("german")))

My problem is that I want a data.frame as output, with comment_id in the first column and the word token in the second column. The output I would like looks as follows:

df <- data.frame(comment_id = c(1, 1, 1, 2, 2, 2),
                 comment_tokens = c("hallo", "nam", "aaron", "lieb", "dank", "video"))

I tried different do.call() variants (cbind/rbind), but they don't give me the result I need. What function am I looking for? Is it map() from the tidyverse?

Thank you in advance.

Cheers,

Aaron

Answer 1

Score: 1


Here's an option using imap_dfr from purrr:

library(corpus)
library(dplyr)
library(purrr)
text <- data.frame(comment_id = 1:2,
                   comment_content = c("Hallo mein Name ist aaron", "Vielen Lieben Dank für das Video"))
tmp <- text_tokens(text$comment_content,
                   text_filter(stemmer = "de", drop = corpus::stopwords_de)) %>%
  # x: the tokens of one comment; y: that comment's position in the list
  purrr::imap_dfr(function(x, y) {
    tibble(
      comment_id = y,
      comment_tokens = x
    )
  })
tmp
#> # A tibble: 6 × 2
#>   comment_id comment_tokens
#>        <int> <chr>
#> 1          1 hallo
#> 2          1 nam
#> 3          1 aaron
#> 4          2 lieb
#> 5          2 dank
#> 6          2 video

Or, if you prefer purrr's formula shorthand for the anonymous function:

tmp <- text_tokens(text$comment_content,
                   text_filter(stemmer = "de", drop = corpus::stopwords_de)) %>%
  purrr::imap_dfr(~ tibble(comment_id = .y, comment_tokens = .x))
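One thing to watch: in the output above, comment_id is the integer position of each comment in the token list, which matches the example only because the ids happen to be 1:2. If your real ids are not consecutive, a base R sketch along the following lines may help, since it repeats each actual comment_id once per token. It also avoids building one small tibble per comment, which could matter at around a million comments, though I have not benchmarked it. The `tokens` list here is a hand-written stand-in for what text_tokens() returned in the example, and the ids 101 and 205 are made up for illustration.

```r
# Example data with non-consecutive ids (101 and 205 are made-up values)
text <- data.frame(
  comment_id = c(101L, 205L),
  comment_content = c("Hallo mein Name ist aaron",
                      "Vielen Lieben Dank für das Video"),
  stringsAsFactors = FALSE
)

# Stand-in for the tokenizer output shown above: one character vector per comment.
# In real use this would come from the call in the answer, e.g.
# tokens <- corpus::text_tokens(text$comment_content,
#                               corpus::text_filter(stemmer = "de",
#                                                   drop = corpus::stopwords_de))
tokens <- list(
  c("hallo", "nam", "aaron"),
  c("lieb", "dank", "video")
)

# Repeat each id once per token, then flatten the list into a single vector.
# A comment with zero tokens contributes no rows.
df <- data.frame(
  comment_id = rep(text$comment_id, times = lengths(tokens)),
  comment_tokens = unlist(tokens, use.names = FALSE),
  stringsAsFactors = FALSE
)

print(df)
```

Both columns are built in a single vectorized step, so the result has one row per token and keeps the original ids rather than list positions.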
