Text mining in R: delete first sentence of each document

Question

I have several documents and do not need the first sentence of each document.
I could not find a solution so far.

Here is an example. The structure of the data looks like this:

    case_number  text
    1            Today is a good day. It is sunny.
    2            Today is a bad day. It is rainy.

So the results should look like this:

    case_number  text
    1            It is sunny.
    2            It is rainy.

Here is the example dataset:

    case_number <- c(1, 2)
    text <- c("Today is a good day. It is sunny.",
              "Today is a bad day. It is rainy.")
    data <- data.frame(case_number, text)

Answer 1

Score: 1

If there's a chance that sentences contain punctuation that does not mark a sentence boundary (e.g. abbreviations or decimal numbers), and you are using a text mining library anyway, it makes perfect sense to let it handle tokenization.
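
To illustrate why this matters, here is a minimal base R sketch of a naive approach that simply strips everything up to the first period followed by whitespace. The helper name `drop_first_naive` is hypothetical and not part of the original answer; it only demonstrates the pitfall.

```r
# Naive approach (illustration only): remove everything up to and including
# the first period that is followed by whitespace.
drop_first_naive <- function(x) sub("^.*?\\.\\s+", "", x, perl = TRUE)

simple <- "Today is a bad day. It is rainy."
tricky <- "Today is a good day, above avg. for sure, by 5.1 points. It is sunny."

drop_first_naive(simple)
#> [1] "It is rainy."

# The abbreviation "avg." is mistaken for the end of the first sentence
drop_first_naive(tricky)
#> [1] "for sure, by 5.1 points. It is sunny."
```

The simple case works, but the abbreviation cuts the first sentence short, which is the kind of case a proper sentence tokenizer is meant to handle.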

With {tidytext}:

    library(dplyr)
    library(tidytext)

    # example with punctuation in 1st sentence
    data <- data.frame(case_number = c(1, 2),
                       text = c("Today is a good day, above avg. for sure, by 5.1 points. It is sunny.",
                                "Today is a bad day. It is rainy."))

    # tokenize to sentences; lowercasing is the default and is optional
    # (pass to_lower = FALSE to keep the original case)
    data %>%
      unnest_sentences(s, text)
    #>   case_number                                                        s
    #> 1           1 today is a good day, above avg. for sure, by 5.1 points.
    #> 2           1                                             it is sunny.
    #> 3           2                                      today is a bad day.
    #> 4           2                                             it is rainy.

    # drop 1st record of every case_number group (.by needs dplyr >= 1.1.0)
    data %>%
      unnest_sentences(s, text) %>%
      filter(row_number() > 1, .by = case_number)
    #>   case_number            s
    #> 1           1 it is sunny.
    #> 2           2 it is rainy.

Created on 2023-08-10 with reprex v2.0.2
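
If the goal is to get back one row per document with the original capitalization, as in the question's expected output, the same pipeline could be extended roughly as follows. This is a sketch under a few assumptions: `to_lower = FALSE` is used to keep the case, the remaining sentences are re-joined with a single space, and the `result` name and the `summarise()` step are additions that are not part of the original answer.

```r
library(dplyr)
library(tidytext)

data <- data.frame(
  case_number = c(1, 2),
  text = c("Today is a good day, above avg. for sure, by 5.1 points. It is sunny.",
           "Today is a bad day. It is rainy.")
)

result <- data %>%
  # split into sentences, keeping the original capitalization
  unnest_sentences(s, text, to_lower = FALSE) %>%
  # drop the first sentence of every document
  filter(row_number() > 1, .by = case_number) %>%
  # paste the remaining sentences back into one string per document
  summarise(text = paste(s, collapse = " "), .by = case_number)

result
```

Note that a document consisting of a single sentence has no rows left after the `filter()` step, so it drops out of `result` entirely; if such documents need to be kept with an empty text, they would have to be joined back in from the original data.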

huangapple
  • Posted on 2023-08-10 15:05:52
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76873337.html