Extract and organise a text file into a dataframe

Question

I have a huge text file with the following structure:

AA<-tibble::tribble(
  ~`-------------------------------------------------`,
  "ABCD 2002201234 09-06-2015 10:34",
  "-------------------------------------------------",
  "Lorem ipsum",
  "Lorem ipsum",
  "Lorem ipsum Lorem ipsum",
  "Lorem ipsum: Lorem ipsum",
  "123456",
  "AB",
  "AB",
  "Lorem ipsum",
  "-------------------------------------------------",
  "ABCDEF 1001101234 05-03-2011 09:15",
  "-------------------------------------------------",
  "TEST",
  "TEST"
)

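I want to organise the above into a data frame with the variables ID, DATE and TEXT. ID should be the 10-digit number (2002201234 and 1001101234 in the example), DATE is self-explanatory, and TEXT should be all the text between the lower separator line ("-------------") and the upper separator line of the next post.

What is the easiest way to do this?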

Answer 1

Score: 3

In base R:

x <- paste(AA[[1]], collapse = '\n')
y <- regmatches(x, gregexec("(\\d{10}) *(.*?)\n-+([^-]+)", x, perl = TRUE))[[1]]
setNames(data.frame(t(y[2:4,])), c('ID', 'Date', 'Text'))

      ID         Date             Text                                      
      <chr>      <chr>            <chr>                                     
    1 2002201234 09-06-2015 10:34 "\nLorem ipsum\nLorem ipsum\nLorem ipsum Lo…
    2 1001101234 05-03-2011 09:15 "\nTEST\nTEST"
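
One caveat with the pattern above: `[^-]+` stops at the first hyphen, so a text body that contains a hyphen would be cut short there, and the captured text keeps its leading newline. The following is a minimal line-based sketch in base R that avoids both, not a drop-in replacement. It assumes a header is a line holding a 10-digit number that sits directly between two separator lines; the hyphenated sample line and the names `lines`, `is_sep`, `is_head` and `out` are made up for illustration.

```r
# Sample lines as a plain character vector; the hyphenated line is hypothetical
lines <- c(
  "-------------------------------------------------",
  "ABCD 2002201234 09-06-2015 10:34",
  "-------------------------------------------------",
  "Lorem ipsum",
  "Lorem-ipsum: see A-B",
  "-------------------------------------------------",
  "ABCDEF 1001101234 05-03-2011 09:15",
  "-------------------------------------------------",
  "TEST",
  "TEST"
)

n <- length(lines)
is_sep <- grepl("^-+$", lines)                    # lines made of dashes only

# Header: not a separator, has a 10-digit number, separator directly above and below
is_head <- !is_sep &
  grepl("[0-9]{10}", lines) &
  c(FALSE, is_sep[-n]) &
  c(is_sep[-1], FALSE)

grp  <- cumsum(is_head)                           # record number for every line
keep <- !is_sep & !is_head & grp > 0              # body lines only
head_lines <- lines[is_head]

# Join the body lines of each record with line breaks
text <- tapply(
  lines[keep],
  factor(grp[keep], levels = seq_along(head_lines)),
  paste, collapse = "\n"
)

out <- data.frame(
  ID   = regmatches(head_lines, regexpr("[0-9]{10}", head_lines)),
  Date = regmatches(head_lines,
                    regexpr("[0-9]{2}-[0-9]{2}-[0-9]{4} [0-9]{2}:[0-9]{2}", head_lines)),
  Text = unname(as.vector(text)),
  stringsAsFactors = FALSE
)
out
```

A record without any body lines would get `NA` as its text here, and a header that lacks the date pattern would make the column lengths differ, so real data may need extra checks.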

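Answer 2

Score: 2

Here is a solution using pmap, which might be a bit overkill or slow depending on how big your file is.

You need to adjust:

  • the date format, which is ambiguous in the sample (the code assumes day-month-year)
  • how the text should be collapsed; right now it is joined with line breaks
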
library(stringr)
library(purrr)
library(dplyr)

AA <- tibble::tribble(
  ~X1,
  "-------------------------------------------------",
  "ABCD 2002201234 09-06-2015 10:34",
  "-------------------------------------------------",
  "Lorem ipsum",
  "Lorem ipsum",
  "Lorem ipsum Lorem ipsum",
  "Lorem ipsum: Lorem ipsum",
  "123456",
  "AB",
  "AB",
  "Lorem ipsum",
  "-------------------------------------------------",
  "ABCDEF 1001101234 05-03-2011 09:15",
  "-------------------------------------------------",
  "TEST",
  "TEST"
)

line_positions <- which(str_detect(AA$X1, "-------------------------------------------------"))
id_positions <- line_positions[seq(from = 1, to = length(line_positions), by = 2)] + 1
text_start_positions <- line_positions[seq(from = 2, to = length(line_positions), by = 2)] + 1
text_stop_positions <- c(line_positions[seq(from = 3, to = length(line_positions), by = 2)] - 1, nrow(AA))

clean_AA <- pmap_dfr(list(id_positions, text_start_positions, text_stop_positions),
                     function(id, start, stop) {
  entry_info <- AA %>%
    slice(id) %>%
    pull(X1) %>%
    str_split(., pattern = " ")
  text_info <- AA %>%
    slice(seq(from = start, to = stop)) %>%
    pull(X1)
  
  data.frame(
    ID = entry_info[[1]][2],
    DATE = as.Date(entry_info[[1]][3], format = "%d-%m-%Y"),
    TEXT = paste0(text_info, collapse = "\n")
  )
})

clean_AA
#>           ID       DATE
#> 1 2002201234 2015-06-09
#> 2 1001101234 2011-03-05
#>                                                                                                       TEXT
#> 1 Lorem ipsum\nLorem ipsum\nLorem ipsum Lorem ipsum\nLorem ipsum: Lorem ipsum\n123456\nAB\nAB\nLorem ipsum
#> 2                                                                                               TEST\nTEST

Created on 2023-02-06 by the reprex package (v1.0.0)
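
The two adjustments mentioned above could look roughly like this in base R. This is only a sketch: it assumes the dates are day-month-year, that the time of day should be kept, and that UTC is an acceptable time zone; `header`, `parts`, `stamp` and `flat` are illustrative names.

```r
header <- "ABCD 2002201234 09-06-2015 10:34"
parts  <- strsplit(header, " ", fixed = TRUE)[[1]]

# Keep the time of day by parsing date and time together (assumed day-month-year, UTC)
stamp <- as.POSIXct(paste(parts[3], parts[4]),
                    format = "%d-%m-%Y %H:%M", tz = "UTC")
format(stamp, "%Y-%m-%d %H:%M")

# Collapse the body with spaces instead of line breaks
flat <- paste(c("TEST", "TEST"), collapse = " ")
flat
```

If the file actually uses month-day-year, the format string would become `"%m-%d-%Y %H:%M"`.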

Answer 3

Score: 1

A solution using core tidyverse packages. See the comments in the code for a detailed explanation of each step.

library(tidyverse)
library(lubridate)

separator <- "-------------------------------------------------"

tibble(
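  tx = c(names(AA), AA[[1]]) # move the first line from the column name into the data vector; ideally this is fixed at import
  ) |>
  mutate(
    grp = (tx == separator) %>% # detect separator lines
      {. & lead(., 2)} |> # a group starts at a separator line that is followed by another separator two lines later
      cumsum()
    ) |>
  filter(tx != separator) |> # remove separator lines
  nest(text = tx) |> # nest so that each document is one observation
  mutate(
    fst = map_chr(text, \(x) x |> # extract the first line, which holds the meta information
                    pull(1) |>
                    first()),
    id = str_extract(fst, "\\d{10}"), # regex for the 10-digit ID
    date = str_extract(fst, "\\d{2}-\\d{2}-\\d{4}") |> # regex for the date
      lubridate::dmy(),
    text = map_chr(text, \(x) x |> # collapse the text body into a single string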
                 slice(-1) |>
                 pull(1) |>
                 str_c(collapse = "\n")),
    .before = text
  ) |>
  select(-fst) 
#> # A tibble: 2 × 4
#>     grp id         date       text                                              
#>   <int> <chr>      <date>     <chr>                                             
#> 1     1 2002201234 2015-06-09 "Lorem ipsum\nLorem ipsum\nLorem ipsum Lorem ipsu…
#> 2     2 1001101234 2011-03-05 "TEST\nTEST"
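
The code comment about fixing the first line "at import" can be sketched as follows. Reading the file with `readLines()` keeps every line as data, so the first separator never ends up as a column name. The temporary file stands in for the real text file, and `path`, `tx` and the column name `X1` are assumptions made for this example.

```r
separator <- "-------------------------------------------------"

# Write a small stand-in file; in practice `path` would point to the real file
path <- tempfile(fileext = ".txt")
writeLines(c(
  separator,
  "ABCD 2002201234 09-06-2015 10:34",
  separator,
  "Lorem ipsum",
  "TEST",
  "TEST"
), path)

# Every line, including the first separator, becomes an element of the vector
tx <- readLines(path)

# One-column data frame with a neutral column name instead of a separator as header
AA <- data.frame(X1 = tx, stringsAsFactors = FALSE)
head(AA, 3)
```

With the data read this way, the `c(names(AA), AA[[1]])` step is no longer needed and `AA[[1]]` can be used directly.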

Answer 4

Score: 1

I would use some simple sequential steps within the tidyverse, mainly with dplyr, tidyr and stringr.

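library(dplyr)
library(tidyr)
library(stringr)
library(purrr)    # for map(), used further down

result <- AA %>%
    rename_with(~ "text") %>%
    filter(!str_detect(text, "---+")) %>%     # remove the "-----" lines
    mutate(index = cumsum(str_detect(text, ".*\\d{10}.*"))) %>%    # create an index column from the ID lines
    group_by(index) %>%
    mutate(temp = first(text)) %>%     # copy the ID + date line into a temporary column
    extract(col = temp,
            into = c("ID", "date"),
            regex = ".*(\\d{10}).*(\\d{2}-\\d{2}-\\d{4}).*",
            remove = TRUE) %>%     # create the "ID" and "date" columns from the temporary column
    mutate(date = lubridate::dmy(date)) %>% # convert the dates to a proper Date class
    slice(-1) %>%     # remove the header/ID row of each case
    nest(text = text) %>%     # one case per row, with a nested text variable
    ungroup()

result
# A tibble: 2 × 4
  index ID         date       text            
  <int> <chr>      <date>     <list>          
1     1 2002201234 2015-06-09 <tibble [8 × 1]>
2     2 1001101234 2011-03-05 <tibble [2 × 1]>

This gives the desired output, with the text column as a list of tibbles holding all the text data. These tibbles are easy to work with afterwards:

pull(result, text)

[[1]]
# A tibble: 8 × 1
  text                    
  <chr>                   
1 Lorem ipsum             
2 Lorem ipsum             
3 Lorem ipsum Lorem ipsum 
4 Lorem ipsum: Lorem ipsum
5 123456                  
6 AB                      
7 AB                      
8 Lorem ipsum             

[[2]]
# A tibble: 2 × 1
  text 
  <chr>
1 TEST 
2 TEST 

Or

mutate(result, text = map(text, pull))

# A tibble: 2 × 4
  index ID         date       text     
  <int> <chr>      <date>     <list>   
1     1 2002201234 2015-06-09 <chr [8]>
2     2 1001101234 2011-03-05 <chr [2]>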
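
If a single TEXT string per record is wanted, as the question asks, a list column of character vectors like the one above can be collapsed. Here is a small base R sketch on a hand-built list; `texts` and `res` are stand-ins for illustration, not objects from the answer.

```r
# Stand-in for a list column of character vectors, one element per record
texts <- list(
  c("Lorem ipsum", "123456", "AB"),
  c("TEST", "TEST")
)

res <- data.frame(
  ID = c("2002201234", "1001101234"),
  stringsAsFactors = FALSE
)

# vapply() guarantees exactly one string per record
res$TEXT <- vapply(texts, paste, character(1), collapse = "\n")
res
```

Inside a dplyr pipeline the same idea can be written with `purrr::map_chr()` in place of `vapply()`.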

huangapple
  • Posted on 2023-02-07 03:16:23
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75365647.html