Extract and organise a text file into a data frame
Question
I have a huge text file with the following structure:
AA<-tibble::tribble(
~`-------------------------------------------------`,
"ABCD 2002201234 09-06-2015 10:34",
"-------------------------------------------------",
"Lorem ipsum",
"Lorem ipsum",
"Lorem ipsum Lorem ipsum",
"Lorem ipsum: Lorem ipsum",
"123456",
"AB",
"AB",
"Lorem ipsum",
"-------------------------------------------------",
"ABCDEF 1001101234 05-03-2011 09:15",
"-------------------------------------------------",
"TEST",
"TEST"
)
I want to organise the above into a data frame with the variables ID, DATE and TEXT. ID should be the 10-digit number (2002201234 and 1001101234 in the example), DATE is self-explanatory, and TEXT should be all the text between the lower separator line ("-------------") of one post and the upper separator line of the next post.
What is the easiest way to do this?
Regards, H
Answer 1
Score: 3
In base R (gregexec() requires R 4.1 or later):
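# Collapse all lines into one string so that a single regex can span line breaks
x <- paste(AA[[1]], collapse = '\n')
# Capture groups: 10-digit ID, rest of the header line, body text up to the next "-"
y <- regmatches(x, gregexec("(\\d{10}) *(.*?)\n-+([^-]+)", x, perl = TRUE))[[1]]
# Rows 2:4 of the match matrix hold the three groups; transpose to one row per post
setNames(data.frame(t(y[2:4,])), c('ID', 'Date', 'Text'))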
ID Date Text
<chr> <chr> <chr>
1 2002201234 09-06-2015 10:34 "\nLorem ipsum\nLorem ipsum\nLorem ipsum Lo…
2 1001101234 05-03-2011 09:15 "\nTEST\nTEST"
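As a possible follow-up, the Text values in this result keep their surrounding newlines and Date is still a character string. The following is a minimal sketch of cleaning both, assuming the timestamps are day-month-year and using a shortened, hypothetical sample vector (lines) in place of AA[[1]]. One caveat worth knowing: the [^-]+ part of the pattern stops at the first hyphen, so a body that itself contains a hyphen would be cut short.

```r
# Hypothetical, shortened sample; the first separator line is omitted,
# just as it is in AA[[1]] where it became the column name.
lines <- c(
  "ABCD 2002201234 09-06-2015 10:34",
  "-------------------------------------------------",
  "Lorem ipsum",
  "123456",
  "-------------------------------------------------",
  "ABCDEF 1001101234 05-03-2011 09:15",
  "-------------------------------------------------",
  "TEST",
  "TEST"
)

x <- paste(lines, collapse = "\n")
m <- regmatches(x, gregexec("(\\d{10}) *(.*?)\n-+([^-]+)", x, perl = TRUE))[[1]]
res <- setNames(data.frame(t(m[2:4, , drop = FALSE])), c("ID", "Date", "Text"))

# Strip the leading/trailing newlines left over from the separator lines
res$Text <- trimws(res$Text)

# Assumption: day-month-year followed by hours:minutes; UTC is an arbitrary choice
res$Date <- as.POSIXct(res$Date, format = "%d-%m-%Y %H:%M", tz = "UTC")

str(res)
```

If only the calendar date is needed, as.Date(res$Date) can be applied afterwards.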
Answer 2
Score: 2
Here is a solution using pmap, which might be a bit overkill or slow depending on how big your file is.
You may need to adjust:
- the date format (09-06-2015 is ambiguous: it could be day-month-year or month-day-year; the code below assumes day-month-year)
- how the text is collapsed; right now the lines are joined with a line break
library(stringr)
library(purrr)
library(dplyr)
AA <- tibble::tribble(
~X1,
"-------------------------------------------------",
"ABCD 2002201234 09-06-2015 10:34",
"-------------------------------------------------",
"Lorem ipsum",
"Lorem ipsum",
"Lorem ipsum Lorem ipsum",
"Lorem ipsum: Lorem ipsum",
"123456",
"AB",
"AB",
"Lorem ipsum",
"-------------------------------------------------",
"ABCDEF 1001101234 05-03-2011 09:15",
"-------------------------------------------------",
"TEST",
"TEST"
)
line_positions <- which(str_detect(AA$X1, "-------------------------------------------------"))
id_positions <- line_positions[seq(from = 1, to = length(line_positions), by = 2)] + 1
text_start_positions <- line_positions[seq(from = 2, to = length(line_positions), by = 2)] + 1
text_stop_positions <- c(line_positions[seq(from = 3, to = length(line_positions), by = 2)] - 1, nrow(AA))
clean_AA <- pmap_dfr(list(id_positions, text_start_positions, text_stop_positions),
  function(id, start, stop) {
    entry_info <- AA %>%
      slice(id) %>%
      pull(X1) %>%
      str_split(., pattern = " ")
    text_info <- AA %>%
      slice(seq(from = start, to = stop)) %>%
      pull(X1)
    data.frame(
      ID = entry_info[[1]][2],
      DATE = as.Date(entry_info[[1]][3], format = "%d-%m-%Y"),
      TEXT = paste0(text_info, collapse = "\n")
    )
  })
clean_AA
#> ID DATE
#> 1 2002201234 2015-06-09
#> 2 1001101234 2011-03-05
#> TEXT
#> 1 Lorem ipsum\nLorem ipsum\nLorem ipsum Lorem ipsum\nLorem ipsum: Lorem ipsum\n123456\nAB\nAB\nLorem ipsum
#> 2 TEST\nTEST
Created on 2023-02-06 by the reprex package (v1.0.0)
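The two adjustment points mentioned in this answer can be illustrated in isolation. This is only a sketch in base R with made-up variable names (stamp_date, stamp_full, body_lines); which date interpretation is right depends on the real file.

```r
stamp_date <- "09-06-2015"        # date part of a header line
stamp_full <- "09-06-2015 10:34"  # date and time together

# The same string read as day-month-year and as month-day-year
dmy_date <- as.Date(stamp_date, format = "%d-%m-%Y")
mdy_date <- as.Date(stamp_date, format = "%m-%d-%Y")
format(dmy_date)  # "2015-06-09"
format(mdy_date)  # "2015-09-06"

# Keeping the time as well (the UTC time zone is an assumption)
dmy_time <- as.POSIXct(stamp_full, format = "%d-%m-%Y %H:%M", tz = "UTC")

# Collapsing the body with a space instead of a line break
body_lines <- c("Lorem ipsum", "AB")
collapsed <- paste(body_lines, collapse = " ")
collapsed  # "Lorem ipsum AB"
```

In the pmap solution, the format string and the collapse argument are the two places where these choices would be changed.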
Answer 3
Score: 1
A solution using core tidyverse packages. See the comments in the code for detailed explanations of the steps.
library(tidyverse)
library(lubridate)
separator <- "-------------------------------------------------"
tibble(
  tx = c(names(AA), AA[[1]]) # move the first line from the column name into the data vector; this should be done during import
) |>
  mutate(
    grp = (tx == separator) %>% # detect separator lines
      {. & lead(., 2)} |> # a group begins with a separator line that is followed by another separator two lines later
      cumsum()
  ) |>
  filter(tx != separator) |> # remove separator lines
  nest(text = tx) |> # nest to make the document the unit of observation
  mutate(
    fst = map_chr(text, \(x) x |> # extract the first line, which contains the meta info
                    pull(1) |>
                    first()),
    id = str_extract(fst, "\\d{10}"), # regex for the 10-digit id string
    date = str_extract(fst, "\\d{2}-\\d{2}-\\d{4}") |> # regex for the date
      lubridate::dmy(),
    text = map_chr(text, \(x) x |> # collapse the text body into a single string
                     slice(-1) |>
                     pull(1) |>
                     str_c(collapse = "\n")),
    .before = text
  ) |>
  select(-fst)
#> # A tibble: 2 × 4
#> grp id date text
#> <int> <chr> <date> <chr>
#> 1 1 2002201234 2015-06-09 "Lorem ipsum\nLorem ipsum\nLorem ipsum Lorem ipsu…
#> 2 2 1001101234 2011-03-05 "TEST\nTEST"
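The comment about handling the first line "during import" could look roughly like the sketch below: reading the file with readLines() keeps every line as data, so no separator ends up as a column name. The file here is a temporary, hypothetical stand-in for the real one, and the grouping step is a base R rendering of the same "separator followed by another separator two lines later" rule. Note that this rule would presumably miscount if a post body were exactly one line long, because the closing separator would then also be followed by a separator two lines later.

```r
sep <- "-------------------------------------------------"

# Hypothetical stand-in for the real file
path <- tempfile(fileext = ".txt")
writeLines(c(
  sep, "ABCD 2002201234 09-06-2015 10:34", sep,
  "Lorem ipsum", "123456",
  sep, "ABCDEF 1001101234 05-03-2011 09:15", sep,
  "TEST", "TEST"
), path)

# readLines() returns a plain character vector, one element per line
tx <- readLines(path)

# A header block starts at a separator that has another separator two lines later
is_sep <- tx == sep
starts <- is_sep & c(is_sep[-(1:2)], FALSE, FALSE)
grp <- cumsum(starts)

data.frame(grp = grp, tx = tx)
```

From here, tx can be passed to tibble(tx = tx) and the rest of the pipeline applies unchanged.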
Answer 4
Score: 1
I would use some simple sequential steps within the tidyverse, mainly with dplyr, tidyr and stringr.
library(dplyr)
library(tidyr)
library(stringr)
AA %>%
  rename_with(~ "text") %>%
  filter(!str_detect(text, "-{3,}")) %>% # remove "-----" lines
  mutate(index = cumsum(str_detect(text, ".*\\d{10}.*"))) %>% # create id index column
  group_by(index) %>%
  mutate(temp = first(text)) %>% # copy the id + date header into a temporary column
  extract(col = temp,
          into = c("ID", "date"),
          regex = ".*(\\d{10}).*(\\d{2}-\\d{2}-\\d{4}).*",
          remove = TRUE) %>% # create "ID" and "date" columns from the temporary column
  mutate(date = lubridate::dmy(date)) %>% # convert dates into a proper Date class
  slice(-1) %>% # remove case header/id rows
  nest(text = text) %>% # one case per row, with a nested text variable
  ungroup()
# A tibble: 2 × 4
index ID date text
<int> <chr> <date> <list>
1 1 2002201234 2015-06-09 <tibble [8 × 1]>
2 2 1001101234 2011-03-05 <tibble [2 × 1]>
This gives the desired output, with the text column as a list of tibbles holding the text lines of each post. If the result of the pipeline is assigned back to AA, these tibbles are fairly easy to handle afterwards:
pull(AA, text)
[[1]]
# A tibble: 8 × 1
text
<chr>
1 Lorem ipsum
2 Lorem ipsum
3 Lorem ipsum Lorem ipsum
4 Lorem ipsum: Lorem ipsum
5 123456
6 AB
7 AB
8 Lorem ipsum
[[2]]
# A tibble: 2 × 1
text
<chr>
1 TEST
2 TEST
Or, to turn each nested tibble into a plain character vector (map() comes from purrr):
mutate(AA, text = purrr::map(text, pull))
# A tibble: 2 × 4
index ID date text
<int> <chr> <date> <list>
1 1 2002201234 2015-06-09 <chr [8]>
2 2 1001101234 2011-03-05 <chr [2]>
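Since the question asks for TEXT as a single value per post, the list column could be collapsed in one more step. Below is a minimal base R sketch, using a hand-built list (text_list) as a stand-in for the list of character vectors shown above; the shortened contents are illustrative only.

```r
# Stand-in for the list column of character vectors (contents shortened)
text_list <- list(
  c("Lorem ipsum", "123456", "AB"),
  c("TEST", "TEST")
)

# Join the lines of each post with a line break; vapply() guarantees
# exactly one string per post
TEXT <- vapply(text_list, paste, character(1), collapse = "\n")

res <- data.frame(
  ID   = c("2002201234", "1001101234"),
  DATE = as.Date(c("2015-06-09", "2011-03-05")),
  TEXT = TEXT
)
str(res)
```

Within the tidyverse pipeline, the equivalent would be a map_chr() call over the text column, as in Answer 3.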