Changing the output of the text_tokens() function in R
Question
I have a question regarding text mining with the corpus package and its text_tokens() function. I want to use the function for stemming and removing stop words. I have a large amount of data (almost 1,000,000 comments) to process, but I'm having trouble with the output that text_tokens() produces. Here is a basic example of my data and code:
library(tidyverse)
library(corpus)
library(stopwords)
text <- data.frame(comment_id = 1:2,
comment_content = c("Hallo mein Name ist aaron","Vielen Lieben Dank für das Video"))
tmp <- text_tokens(text$comment_content,
text_filter(stemmer = "de",drop = stopwords("german")))
My problem is that I want a data.frame as output, with comment_id in the first column and the word tokens in the second column. The output I would like looks as follows:
df <- data.frame(comment_id = c(1,1,1,2,2,2),
comment_tokens = c("hallo","nam","aaron","lieb","dank","video"))
I tried different do.call() variants (cbind/rbind), but they don't give me the result I need. What function am I looking for? Is it map() from the tidyverse?
Thank you in advance.
Cheers,
Aaron
Answer 1
Score: 1
Here's an option using imap_dfr() from purrr:
library(corpus)
library(dplyr)
library(purrr)
text <- data.frame(comment_id = 1:2,
comment_content = c("Hallo mein Name ist aaron","Vielen Lieben Dank für das Video"))
tmp <- text_tokens(text$comment_content,
text_filter(stemmer = "de",drop = corpus::stopwords_de)) %>%
purrr::imap_dfr(function(x, y) {
tibble(
comment_id = y,
comment_tokens = x
)
})
tmp
#> # A tibble: 6 × 2
#> comment_id comment_tokens
#> <int> <chr>
#> 1 1 hallo
#> 2 1 nam
#> 3 1 aaron
#> 4 2 lieb
#> 5 2 dank
#> 6 2 video
Or, if you prefer purrr's formula shorthand for the anonymous function:
tmp <- text_tokens(text$comment_content, text_filter(stemmer = "de",drop = corpus::stopwords_de)) %>%
purrr::imap_dfr(~ tibble(comment_id = .y, comment_tokens = .x))
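With close to a million comments, building one tibble per comment may become slow. A possible alternative, sketched here in base R, repeats each id once per token and flattens the token list in a single step. The `tokens` list below is a hard-coded stand-in for what text_tokens() returns (an assumption, so the sketch runs without the corpus package), and `comment_id` stands in for `text$comment_id`:

```r
# Stand-in for the list returned by text_tokens(): one character vector per comment.
# Hard-coded here (assumption) so the sketch does not depend on the corpus package.
comment_id <- c(1L, 2L)
tokens <- list(
  c("hallo", "nam", "aaron"),
  c("lieb", "dank", "video")
)

# Repeat each id once per token in its comment, then flatten the token list.
df <- data.frame(
  comment_id = rep(comment_id, lengths(tokens)),
  comment_tokens = unlist(tokens, use.names = FALSE),
  stringsAsFactors = FALSE
)

print(df)
```

Unlike the positional index that imap_dfr() supplies, this uses the actual id values, which matters if comment_id is not simply 1, 2, 3, and so on.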