Text mining in R: delete first sentence of each document
Question
I have several documents and want to remove the first sentence of each one.
I have not found a solution so far.
Here is an example. The data is structured like this:
| case_number | text |
|---|---|
| 1 | Today is a good day. It is sunny. |
| 2 | Today is a bad day. It is rainy. |
The result should look like this:
| case_number | text |
|---|---|
| 1 | It is sunny. |
| 2 | It is rainy. |
Here is the example dataset:
case_number <- c(1, 2)
text <- c("Today is a good day. It is sunny.",
          "Today is a bad day. It is rainy.")
data <- data.frame(case_number, text)
Answer 1
Score: 1
If sentences might contain punctuation that does not end a sentence (e.g. abbreviations or decimal numbers), and you are using a text mining library anyway, it makes sense to let that library handle sentence tokenization.
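To illustrate why such punctuation matters, here is a minimal base R sketch of a naive approach. The regular expression is an assumption made for illustration, not something from the question: it deletes everything up to the first period that is followed by whitespace, so it cuts at the abbreviation "avg." instead of at the real sentence end.

```r
# Naive approach (illustrative assumption, not a recommended solution):
# remove everything up to and including the first ". " in each string.
text <- c("Today is a good day, above avg. for sure, by 5.1 points. It is sunny.",
          "Today is a bad day. It is rainy.")

# Lazy match from the start up to the first period followed by whitespace
naive <- sub("^.*?\\.\\s+", "", text, perl = TRUE)
naive
#> [1] "for sure, by 5.1 points. It is sunny."
#> [2] "It is rainy."
```

The second document comes out as intended, but the first one is cut in the middle of its first sentence.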
With {tidytext}:
library(dplyr)
library(tidytext)
# example with punctuation in the 1st sentence
data <- data.frame(case_number = c(1, 2),
                   text = c("Today is a good day, above avg. for sure, by 5.1 points. It is sunny.",
                            "Today is a bad day. It is rainy."))
# tokenize into sentences; lowercasing is the default but optional
# (to_lower = FALSE keeps the original case)
data %>%
  unnest_sentences(s, text)
#> case_number s
#> 1 1 today is a good day, above avg. for sure, by 5.1 points.
#> 2 1 it is sunny.
#> 3 2 today is a bad day.
#> 4 2 it is rainy.
# drop 1st record of every case_number group
data %>%
  unnest_sentences(s, text) %>%
  filter(row_number() > 1, .by = case_number)
#> case_number s
#> 1 1 it is sunny.
#> 2 2 it is rainy.
<sup>Created on 2023-08-10 with reprex v2.0.2</sup>
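The output above is lowercased and has one row per sentence, whereas the question asks for one `text` value per case in its original casing. A possible variation, sketched under the assumption that a recent {tidytext} (with the `to_lower` argument of `unnest_sentences()`) and dplyr >= 1.1.0 (for `.by`) are installed, keeps the case and pastes the remaining sentences back together. The name `result` is only illustrative.

```r
library(dplyr)
library(tidytext)

data <- data.frame(case_number = c(1, 2),
                   text = c("Today is a good day, above avg. for sure, by 5.1 points. It is sunny.",
                            "Today is a bad day. It is rainy."))

result <- data %>%
  # split into sentences without lowercasing
  unnest_sentences(s, text, to_lower = FALSE) %>%
  # drop the first sentence of every case
  filter(row_number() > 1, .by = case_number) %>%
  # paste the remaining sentences back into one string per case
  summarise(text = paste(s, collapse = " "), .by = case_number)

result
```

Note that a document consisting of a single sentence loses all of its rows in the `filter()` step, so it will not appear in `result` at all; if such cases need to be kept with an empty text, they would have to be joined back in separately.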