提取并组织文本文件到数据框架 (dataframe)。

huangapple go评论94阅读模式
英文:

extract and organise textfile to dataframe

问题

我有一个包含以下结构的大型文本文件:

  1. AA<-tibble::tribble(
  2. ~`-------------------------------------------------`,
  3. "ABCD 2002201234 09-06-2015 10:34",
  4. "-------------------------------------------------",
  5. "Lorem ipsum",
  6. "Lorem ipsum",
  7. "Lorem ipsum Lorem ipsum",
  8. "Lorem ipsum: Lorem ipsum",
  9. "123456",
  10. "AB",
  11. "AB",
  12. "Lorem ipsum",
  13. "-------------------------------------------------",
  14. "ABCDEF 1001101234 05-03-2011 09:15",
  15. "-------------------------------------------------",
  16. "TEST",
  17. "TEST"
  18. )

我想将上述内容组织成一个数据框(DF),其中包含变量:ID、DATE和TEXT。ID应该是10位数字(例如,2002201234和1001101234),DATE是自明的,TEXT应该是在底部线("-------------")和下一篇文章的上部线之间的所有文本。

如何以最简单的方式执行这个操作?

英文:

I have a huge text file with the following structure:

  1. AA<-tibble::tribble(
  2. ~`-------------------------------------------------`,
  3. "ABCD 2002201234 09-06-2015 10:34",
  4. "-------------------------------------------------",
  5. "Lorem ipsum",
  6. "Lorem ipsum",
  7. "Lorem ipsum Lorem ipsum",
  8. "Lorem ipsum: Lorem ipsum",
  9. "123456",
  10. "AB",
  11. "AB",
  12. "Lorem ipsum",
  13. "-------------------------------------------------",
  14. "ABCDEF 1001101234 05-03-2011 09:15",
  15. "-------------------------------------------------",
  16. "TEST",
  17. "TEST"
  18. )

I want to organise the above into a DF with variables: ID, DATE and TEXT. ID should be the 10-digit number (in the example 2002201234 and 1001101234) DATE is self explanatory and TEXT should be all text between the bottom line ("-------------") to the upper line of next post.

Which is the easiest way to perform this?

Regards, H

答案1

得分: 3

  1. 在基本的R中:
  2. x <- paste(AA[[1]], collapse = '\n')
  3. y <- regmatches(x, gregexec("(\\d{10}) *(.*?)\n-+([^-]+)", x, perl = TRUE))[[1]]
  4. setNames(data.frame(t(y[2:4,])), c('ID', 'Date', 'Text'))
  5. ID Date Text
  6. <chr> <chr> <chr>
  7. 1 2002201234 09-06-2015 10:34 "\nLorem ipsum\nLorem ipsum\nLorem ipsum Lo…
  8. 2 1001101234 05-03-2011 09:15 "\nTEST\nTEST"
英文:

in base R:

  1. x &lt;- paste(AA[[1]], collapse = &#39;\n&#39;)
  2. y &lt;- regmatches(x, gregexec(&quot;(\\d{10}) *(.*?)\n-+([^-]+)&quot;, x, perl = TRUE))[[1]]
  3. setNames(data.frame(t(y[2:4,])), c(&#39;ID&#39;, &#39;Date&#39;, &#39;Text&#39;))
  4. ID Date Text
  5. &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
  6. 1 2002201234 09-06-2015 10:34 &quot;\nLorem ipsum\nLorem ipsum\nLorem ipsum Lo
  7. 2 1001101234 05-03-2011 09:15 &quot;\nTEST\nTEST&quot;

答案2

得分: 2

以下是代码的翻译部分:

这里提供了使用 pmap 的解决方案,具体效率和速度取决于您的文件大小。

您需要调整以下内容:

  • 正确的日期格式(它不是明确的)
  • 文本如何折叠,目前是使用换行符
  1. library(stringr)
  2. library(purrr)
  3. library(dplyr)
  4. AA <- tibble::tribble(
  5. ~X1,
  6. "-------------------------------------------------",
  7. "ABCD 2002201234 09-06-2015 10:34",
  8. "-------------------------------------------------",
  9. "Lorem ipsum",
  10. "Lorem ipsum",
  11. "Lorem ipsum Lorem ipsum",
  12. "Lorem ipsum: Lorem ipsum",
  13. "123456",
  14. "AB",
  15. "AB",
  16. "Lorem ipsum",
  17. "-------------------------------------------------",
  18. "ABCDEF 1001101234 05-03-2011 09:15",
  19. "-------------------------------------------------",
  20. "TEST",
  21. "TEST"
  22. )
  23. line_positions <- which(str_detect(AA$X1, "-------------------------------------------------"))
  24. id_positions <- line_positions[seq(from = 1, to = length(line_positions), by = 2)] + 1
  25. text_start_positions <- line_positions[seq(from = 2, to = length(line_positions), by = 2)] + 1
  26. text_stop_positions <- c(line_positions[seq(from = 3, to = length(line_positions), by = 2)] - 1, nrow(AA))
  27. clean_AA <- pmap_dfr(list(id_positions, text_start_positions, text_stop_positions),
  28. function(id, start, stop) {
  29. entry_info <- AA %>%
  30. slice(id) %>%
  31. pull(X1) %>%
  32. str_split(., pattern = " ")
  33. text_info <- AA %>%
  34. slice(seq(from = start, to = stop)) %>%
  35. pull(X1)
  36. data.frame(
  37. ID = entry_info[[1]][2],
  38. DATE = as.Date(entry_info[[1]][3], format = "%d-%m-%Y"),
  39. TEXT = paste0(text_info, collapse = "\n")
  40. )
  41. })
  42. clean_AA
  43. #> ID DATE
  44. #> 1 2002201234 2015-06-09
  45. #> 2 1001101234 2011-03-05
  46. #> TEXT
  47. #> 1 Lorem ipsum\nLorem ipsum\nLorem ipsum Lorem ipsum\nLorem ipsum: Lorem ipsum\n123456\nAB\nAB\nLorem ipsum
  48. #> 2 TEST\nTEST

创建于2023年02月06日,使用 reprex package(版本1.0.0)

英文:

Here is a solution using pmap which might a bit overkill or slow depending how big your file is.

You need to adjust:

  • the correct date format (it's not unambiguous)
  • how the text should be collapsed, right now it is with a line break
  1. library(stringr)
  2. library(purrr)
  3. library(dplyr)
  4. AA &lt;- tibble::tribble(
  5. ~X1,
  6. &quot;-------------------------------------------------&quot;,
  7. &quot;ABCD 2002201234 09-06-2015 10:34&quot;,
  8. &quot;-------------------------------------------------&quot;,
  9. &quot;Lorem ipsum&quot;,
  10. &quot;Lorem ipsum&quot;,
  11. &quot;Lorem ipsum Lorem ipsum&quot;,
  12. &quot;Lorem ipsum: Lorem ipsum&quot;,
  13. &quot;123456&quot;,
  14. &quot;AB&quot;,
  15. &quot;AB&quot;,
  16. &quot;Lorem ipsum&quot;,
  17. &quot;-------------------------------------------------&quot;,
  18. &quot;ABCDEF 1001101234 05-03-2011 09:15&quot;,
  19. &quot;-------------------------------------------------&quot;,
  20. &quot;TEST&quot;,
  21. &quot;TEST&quot;
  22. )
  23. line_positions &lt;- which(str_detect(AA$X1, &quot;-------------------------------------------------&quot;))
  24. id_positions &lt;- line_positions[seq(from = 1, to = length(line_positions), by = 2)] + 1
  25. text_start_positions &lt;- line_positions[seq(from = 2, to = length(line_positions), by = 2)] + 1
  26. text_stop_positions &lt;- c(line_positions[seq(from = 3, to = length(line_positions), by = 2)] - 1, nrow(AA))
  27. clean_AA &lt;- pmap_dfr(list(id_positions, text_start_positions, text_stop_positions),
  28. function(id, start, stop) {
  29. entry_info &lt;- AA %&gt;%
  30. slice(id) %&gt;%
  31. pull(X1) %&gt;%
  32. str_split(., pattern = &quot; &quot;)
  33. text_info &lt;- AA %&gt;%
  34. slice(seq(from = start, to = stop)) %&gt;%
  35. pull(X1)
  36. data.frame(
  37. ID = entry_info[[1]][2],
  38. DATE = as.Date(entry_info[[1]][3], format = &quot;%d-%m-%Y&quot;),
  39. TEXT = paste0(text_info, collapse = &quot;\n&quot;)
  40. )
  41. })
  42. clean_AA
  43. #&gt; ID DATE
  44. #&gt; 1 2002201234 2015-06-09
  45. #&gt; 2 1001101234 2011-03-05
  46. #&gt; TEXT
  47. #&gt; 1 Lorem ipsum\nLorem ipsum\nLorem ipsum Lorem ipsum\nLorem ipsum: Lorem ipsum\n123456\nAB\nAB\nLorem ipsum
  48. #&gt; 2 TEST\nTEST

<sup>Created on 2023-02-06 by the reprex package (v1.0.0)</sup>

答案3

得分: 1

使用基本的 tidyverse 包的解决方案。请查看代码中的注释以获取有关步骤的详细解释。

  1. library(tidyverse)
  2. library(lubridate)
  3. separator <- "-------------------------------------------------"
  4. tibble(
  5. tx = c(names(AA), AA[[1]]) # 从名称到数据向量中获取第一行,这应该在导入时完成
  6. ) |>
  7. mutate(
  8. grp = (tx == separator) %>% # 检测分隔行
  9. {. & lead(., 2)} |> # 分组以分隔行开始,之后再有两行
  10. cumsum()
  11. ) |>
  12. filter(tx != separator) |> # 删除分隔行
  13. nest(text = tx) |> # 嵌套以将文档作为观察单位
  14. mutate(
  15. fst = map_chr(text, \(x) x |> # 提取包含元信息的第一行
  16. pull(1) |>
  17. first()),
  18. id = str_extract(fst, "\\d{10}"), # 10位数字id的正则表达式
  19. date = str_extract(fst, "\\d{2}-\\d{2}-\\d{4}") |> # 日期的正则表达式
  20. lubridate::dmy(),
  21. text = map_chr(text, \(x) x |> # 将文本正文合并为单个字符串
  22. slice(-1) |>
  23. pull(1) |>
  24. str_c(collapse = "\n")),
  25. .before = text
  26. ) |>
  27. select(-fst)
  28. #> # A tibble: 2 × 4
  29. #> grp id date text
  30. #> <int> <chr> <date> <chr>
  31. #> 1 1 2002201234 2015-06-09 "Lorem ipsum\nLorem ipsum\nLorem ipsum Lorem ipsu…
  32. #> 2 2 1001101234 2011-03-05 "TEST\nTEST"
英文:

A solution using basic tidyverse packages. Look to the comments in the code for detailed explanations of the steps.

  1. library(tidyverse)
  2. library(lubridate)
  3. separator &lt;- &quot;-------------------------------------------------&quot;
  4. tibble(
  5. tx = c(names(AA), AA[[1]]) # take first line from name to data vector, this should be done during import
  6. ) |&gt;
  7. mutate(
  8. grp = (tx == separator) %&gt;% # detect separator lines
  9. {. &amp; lead(., 2)} |&gt; # group begins with with a separator line followed by another after 2 lines
  10. cumsum()
  11. ) |&gt;
  12. filter(tx != separator) |&gt; # remove separator lines
  13. nest(text = tx) |&gt; # nest to make document the unit of observation
  14. mutate(
  15. fst = map_chr(text, \(x) x |&gt; # extract first line containing meta info
  16. pull(1) |&gt;
  17. first()),
  18. id = str_extract(fst, &quot;\\d{10}&quot;), # Regex for 10 digit id string
  19. date = str_extract(fst, &quot;\\d{2}-\\d{2}-\\d{4}&quot;) |&gt; # regex for date
  20. lubridate::dmy(),
  21. text = map_chr(text, \(x) x |&gt; # collapse text body to single string
  22. slice(-1) |&gt;
  23. pull(1) |&gt;
  24. str_c(collapse = &quot;\n&quot;)),
  25. .before = text
  26. ) |&gt;
  27. select(-fst)
  28. #&gt; # A tibble: 2 &#215; 4
  29. #&gt; grp id date text
  30. #&gt; &lt;int&gt; &lt;chr&gt; &lt;date&gt; &lt;chr&gt;
  31. #&gt; 1 1 2002201234 2015-06-09 &quot;Lorem ipsum\nLorem ipsum\nLorem ipsum Lorem ipsu…
  32. #&gt; 2 2 1001101234 2011-03-05 &quot;TEST\nTEST&quot;

答案4

得分: 1

我会在tidyverse中使用一些简单的连续步骤,主要使用dplyrtidyrstringr

  1. library(dplyr)
  2. library(tidyr)
  3. library(stringr)
  4. AA %>%
  5. rename_with(~ "text") %>%
  6. filter(!str_detect(text, "---+")) %>% # 移除"-----"行
  7. mutate(index = cumsum(str_detect(text, ".*\\d{10}.*"))) %>% # 创建id索引列
  8. group_by(index) %>%
  9. mutate(temp = first(text)) %>% # 将id+日期信息分离到临时列中
  10. extract(col = temp,
  11. into = c("ID", "date"),
  12. regex = ".*(\\d{10}).*(\\d{2}-\\d{2}-\\d{4}).*",
  13. remove = TRUE) %>% # 从临时id创建"ID"和"date"列
  14. mutate(date = lubridate::dmy(date)) %>% # 将日期转换为适当的日期类
  15. slice(-1) %>% # 移除案例标题/ID行
  16. nest(text = text) %>% # 每行一个案例,带有嵌套的文本变量
  17. ungroup()

这将为我们提供所需的输出,其中文本列是一个包含所有文本数据的tibble列表。之后处理这些tibble非常容易:

  1. pull(AA, text)
  2. [[1]]
  3. # 一个tibble: 8 × 1
  4. text
  5. <chr>
  6. 1 Lorem ipsum
  7. 2 Lorem ipsum
  8. 3 Lorem ipsum Lorem ipsum
  9. 4 Lorem ipsum: Lorem ipsum
  10. 5 123456
  11. 6 AB
  12. 7 AB
  13. 8 Lorem ipsum
  14. [[2]]
  15. # 一个tibble: 2 × 1
  16. text
  17. <chr>
  18. 1 TEST
  19. 2 TEST

或者

  1. mutate(AA, text = map(text, pull))
  2. # 一个tibble: 2 × 4
  3. index ID date text
  4. <int> <chr> <chr> <list>
  5. 1 1 2002201234 09-06-2015 <chr [8]>
  6. 2 2 1001101234 05-03-2011 <chr [2]>

希望这有帮助!

英文:

I would use some simple sequential steps within the tidyverse. I would mainly use dplyr, tidyr and stringr.

  1. library(dplyr)
  2. library(tidyr)
  3. library(stringr)
  4. AA %&gt;%
  5. rename_with(~ &quot;text&quot;) %&gt;%
  6. filter(!str_detect(text, &quot;-{3,}&quot;)) %&gt;% #remove &quot;-----&quot; lines
  7. mutate(index = cumsum(str_detect(text, &quot;.*\\d{10}.*&quot;))) %&gt;% #create id index column
  8. group_by(index) %&gt;%
  9. mutate(temp = first(text)) %&gt;% #separate id+date info into temporary column
  10. extract(col = temp,
  11. into = c(&quot;ID&quot;, &quot;date&quot;),
  12. regex = &quot;.*(\\d{10}).*(\\d{2}-\\d{2}-\\d{4}).*&quot;,
  13. remove = TRUE) %&gt;% #create &quot;ID&quot; and &quot;date&quot; columns from temp id
  14. mutate(date = lubridate::dmy(date)) %&gt;% #convert dates into proper date class
  15. slice(-1) %&gt;% #remove case headers/id rows
  16. nest(text = text) %&gt;% #one case per line, with a nested text variable
  17. ungroup()
  18. # A tibble: 2 &#215; 4
  19. index ID date text
  20. &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;list&gt;
  21. 1 1 2002201234 09-06-2015 &lt;tibble [8 &#215; 1]&gt;
  22. 2 2 1001101234 05-03-2011 &lt;tibble [2 &#215; 1]&gt;

This would give us the desired output, with the text column as a list of tibbles with all the text data. It is fairly easy to handle these tibbles after that:

  1. pull(AA,text)
  2. [[1]]
  3. # A tibble: 8 &#215; 1
  4. text
  5. &lt;chr&gt;
  6. 1 Lorem ipsum
  7. 2 Lorem ipsum
  8. 3 Lorem ipsum Lorem ipsum
  9. 4 Lorem ipsum: Lorem ipsum
  10. 5 123456
  11. 6 AB
  12. 7 AB
  13. 8 Lorem ipsum
  14. [[2]]
  15. # A tibble: 2 &#215; 1
  16. text
  17. &lt;chr&gt;
  18. 1 TEST
  19. 2 TEST

OR

  1. mutate(AA, text = map(text, pull))
  2. # A tibble: 2 &#215; 4
  3. index ID date text
  4. &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;list&gt;
  5. 1 1 2002201234 09-06-2015 &lt;chr [8]&gt;
  6. 2 2 1001101234 05-03-2011 &lt;chr [2]&gt;

huangapple
  • 本文由 发表于 2023年2月7日 03:16:23
  • 转载请务必保留本文链接:https://go.coder-hub.com/75365647.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定