Posted 2023-08-10 15:05:52

Text mining in R: delete first sentence of each document

Question

I have several documents and do not need the first sentence of each document.
I could not find a solution so far.

Here is an example. The structure of the data looks like this

case_number	text
1	Today is a good day. It is sunny.
2	Today is a bad day. It is rainy.

So the results should look like this

case_number	text
1	It is sunny.
2	It is rainy.

Here is the example dataset:

case_number <- c(1, 2)
text <- c("Today is a good day. It is sunny.",
          "Today is a bad day. It is rainy.")
data <- data.frame(case_number, text)

Answer 1

Score: 1

If there's a chance that a sentence contains internal punctuation (e.g. abbreviations or decimal numbers), and you are using a text mining library anyway, it makes perfect sense to let that library handle sentence tokenization.
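To illustrate why that matters, here is a minimal base R sketch of the naive alternative: stripping everything up to the first sentence-ending character with a regular expression. The helper name `drop_first_naive` and the pattern are assumptions made for illustration, not part of the original question or answer. It works on the simple example but cuts too early as soon as the first sentence contains an abbreviation.

```r
# Two inputs: the simple case from the question, and one whose first
# sentence contains an abbreviation ("avg.") and a decimal number ("5.1").
texts <- c("Today is a good day. It is sunny.",
           "Today is a good day, above avg. for sure, by 5.1 points. It is sunny.")

# Hypothetical helper: remove everything up to and including the first
# '.', '!' or '?', plus any whitespace that follows it.
drop_first_naive <- function(x) {
  sub("^[^.!?]*[.!?][[:space:]]*", "", x)
}

naive <- drop_first_naive(texts)
print(naive)
#> [1] "It is sunny."
#> [2] "for sure, by 5.1 points. It is sunny."
```

The second result still contains the tail of the first sentence, because the pattern treats the period after "avg" as a sentence boundary.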

With {tidytext}:

library(dplyr)
library(tidytext)
# example with punctuation in 1st sentence
data <- data.frame(case_number = c(1, 2),
                   text = c("Today is a good day, above avg. for sure, by 5.1 points. It is sunny.",
                            "Today is a bad day. It is rainy."))
# tokenize to sentences; tokens are lowercased by default, which is optional
data %>% 
  unnest_sentences(s, text)
#>   case_number                                                        s
#> 1           1 today is a good day, above avg. for sure, by 5.1 points.
#> 2           1                                             it is sunny.
#> 3           2                                      today is a bad day.
#> 4           2                                             it is rainy.
# drop 1st record of every case_number group
# (the .by argument of filter() requires dplyr >= 1.1.0)
data %>% 
  unnest_sentences(s, text) %>% 
  filter(row_number() > 1, .by = case_number)
#>   case_number            s
#> 1           1 it is sunny.
#> 2           2 it is rainy.

Created on 2023-08-10 with reprex v2.0.2
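The output above is lowercased and has one row per sentence, while the question asks for one row per document in the original casing. A possible variation, sketched here under the assumption that dplyr >= 1.1.0 is installed (for `.by`), keeps the case with `to_lower = FALSE` and pastes the remaining sentences back together. The three-sentence input and the `result` name are made up for illustration.

```r
library(dplyr)
library(tidytext)

# Illustrative data: the first document has two sentences left after
# the first one is dropped.
data <- data.frame(case_number = c(1, 2),
                   text = c("Today is a good day. It is sunny. It is warm.",
                            "Today is a bad day. It is rainy."))

result <- data %>%
  # split into sentences without lowercasing them
  unnest_tokens(s, text, token = "sentences", to_lower = FALSE) %>%
  # drop the first sentence of every document
  filter(row_number() > 1, .by = case_number) %>%
  # collapse the remaining sentences back into one string per document
  summarise(text = paste(s, collapse = " "), .by = case_number)

print(result)
```

Note that a document consisting of a single sentence has no rows left after the filter step, so it disappears from the result; if such documents must be kept with an empty text, join `result` back to the original `case_number` column.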
