英文:
How to combine text that has spilled over multiple rows in R
问题
我使用tabulizer在R中抓取了一个PDF文件,并发现多行文本经常溢出到多行中,如下所示:
我想将跨多行的文本组合成有限数量的行,使其看起来像这样:
以下是重新创建这些tibble的代码:
英文:
I scraped a PDF in R using tabulizer and saw that multi-line text often spilled over into multiple rows, like so:
item quantity price
1 PURCHASE ORDER NA NA
2 ITEM CODE ABC NA NA
3 1 Computer soft 3 10.99
4 ware NA NA
5 2 Mathem 2 8.50
6 atics curriculum NA NA
7 and calculators NA NA
I would like to combine the text across multiple rows into a limited number of rows, so it looks like this:
item quantity price
1 PURCHASE ORDER ITEM CODE ABC NA NA
2 1 Computer software 3 10.99
3 2 Mathematics curriculum and calculators 2 8.5
Here is the code for recreating these tibbles:
#Messy tibble
item <- c("PURCHASE ORDER", "ITEM CODE ABC", "1 Computer soft", "ware", "2 Mathem", "atics curriculum", "and calculators")
quantity <- c(NA, NA, 3, NA, 2, NA, NA)
price <- c(NA, NA, 10.99, NA, 8.50, NA, NA)
df <- data.frame(item, quantity, price)
df
#Clean tibble
item <- c("PURCHASE ORDER ITEM CODE ABC", "1 Computer software", "2 Mathematics curriculum and calculators")
quantity <- c(NA, 3, 2)
price <- c(NA, 10.99, 8.50)
df <- data.frame(item, quantity, price)
df
答案1
得分: 2
你可以通过那些在 quantity
和 price
上有缺失值的组来识别它们。然后,为将要在一起的文本创建唯一的组标识符。最后,你可以将文本粘贴在一起,并获取属于每个组中第一个观察值的 quantity
和 price
的值:
library(dplyr)
item <- c("PURCHASE ORDER", "ITEM CODE ABC", "1 Computer soft", "ware", "2 Mathem", "atics curriculum", "and calculators")
quantity <- c(NA, NA, 3, NA, 2, NA, NA)
price <- c(NA, NA, 10.99, NA, 8.50, NA, NA)
df <- data.frame(item, quantity, price)
df
#> item quantity price
#> 1 PURCHASE ORDER NA NA
#> 2 ITEM CODE ABC NA NA
#> 3 1 Computer soft 3 10.99
#> 4 ware NA NA
#> 5 2 Mathem 2 8.50
#> 6 atics curriculum NA NA
#> 7 and calculators NA NA
df %>%
mutate(grp = ifelse(is.na(quantity) & is.na(price) & !row_number() == 1, 0, 1),
grp = cumsum(grp)) %>%
group_by(grp) %>%
reframe(item = paste(item, collapse=" "),
across(c(quantity, price), first)) %>%
select(-grp)
#> # A tibble: 3 × 3
#> item quantity price
#> <chr> <dbl> <dbl>
#> 1 PURCHASE ORDER ITEM CODE ABC NA NA
#> 2 1 Computer soft ware 3 11.0
#> 3 2 Mathem atics curriculum and calculators 2 8.5
创建于2023-06-21,使用 reprex v2.0.2
英文:
You could identify the groups by those that have missing values on quantity
and price
. Then, make unique group identifiers for the text that will go together. Finally, you can paste the text together and take the values of quantity
and price
that belong to the first observation in each group:
library(dplyr)
item <- c("PURCHASE ORDER", "ITEM CODE ABC", "1 Computer soft", "ware", "2 Mathem", "atics curriculum", "and calculators")
quantity <- c(NA, NA, 3, NA, 2, NA, NA)
price <- c(NA, NA, 10.99, NA, 8.50, NA, NA)
df <- data.frame(item, quantity, price)
df
#> item quantity price
#> 1 PURCHASE ORDER NA NA
#> 2 ITEM CODE ABC NA NA
#> 3 1 Computer soft 3 10.99
#> 4 ware NA NA
#> 5 2 Mathem 2 8.50
#> 6 atics curriculum NA NA
#> 7 and calculators NA NA
df %>%
mutate(grp = ifelse(is.na(quantity) & is.na(price) & !row_number() == 1, 0, 1),
grp = cumsum(grp)) %>%
group_by(grp) %>%
reframe(item = paste(item, collapse=" "),
across(c(quantity, price), first)) %>%
select(-grp)
#> # A tibble: 3 × 3
#> item quantity price
#> <chr> <dbl> <dbl>
#> 1 PURCHASE ORDER ITEM CODE ABC NA NA
#> 2 1 Computer soft ware 3 11.0
#> 3 2 Mathem atics curriculum and calculators 2 8.5
<sup>Created on 2023-06-21 with reprex v2.0.2</sup>
答案2
得分: 1
以下是整理数据的一个选项:
注意:依我个人看法,前两行("PURCHASE ....")反映了列名,应该被删除。
library(dplyr, warn=FALSE)
library(tidyr)
library(stringr)
df |>
mutate(no = as.numeric(str_extract(item, "^\\d+"))) |>
fill(no) |>
replace_na(list(no = 0)) |>
group_by(no) |>
summarise(across(everything(), \(x) paste0(x[!is.na(x)], collapse = " "))) |>
select(-no)
#> # A tibble: 3 × 3
#> item quantity price
#> <chr> <chr> <chr>
#> 1 PURCHASE ORDER ITEM CODE ABC "" ""
#> 2 1 Computer soft ware "3" "10.99"
#> 3 2 Mathem atics curriculum and calculators "2" "8.5"
英文:
Here is one option to tidy up your data:
Note: IMHO the first two rows ("PURCHASE ...."
) reflect the column name and should be dropped.
library(dplyr, warn=FALSE)
library(tidyr)
library(stringr)
df |>
mutate(no = as.numeric(str_extract(item, "^\\d+"))) |>
fill(no) |>
replace_na(list(no = 0)) |>
group_by(no) |>
summarise(across(everything(), \(x) paste0(x[!is.na(x)], collapse = " "))) |>
select(-no)
#> # A tibble: 3 × 3
#> item quantity price
#> <chr> <chr> <chr>
#> 1 PURCHASE ORDER ITEM CODE ABC "" ""
#> 2 1 Computer soft ware "3" "10.99"
#> 3 2 Mathem atics curriculum and calculators "2" "8.5"
答案3
得分: 1
从您的帖子中,不清楚对于相同的行项目,quantity
或 price
是否会有一个值。因此,我只是取了 sum
,您可以决定如何聚合这些行。
library(dplyr)
df |>
mutate(line = cumsum(grepl("^\\d+", item))) |>
summarize(item = paste(item, collapse = " "),
across(c(quantity, price), ~if(any(!is.na(.x))) sum(.x, na.rm = T) else NA),
.by = line) |>
select(-line)
第一个 "line" 没有任何值,也不像其他行以数字开头。如@stefan所提到的,如果这些是您想要排除的某种标题,您可以简化 across 语句为 across(c(quantity, price), na.omit)
并将 summarize
更改为 reframe
。
输出
item quantity price
1 PURCHASE ORDER ITEM CODE ABC NA NA
2 1 Computer soft ware 3 10.99
3 2 Mathem atics curriculum and calculators 2 8.50
英文:
From your post, it is not clear if quantity
or price
will have a value for the same line item. So, I simply took the sum
, you can decide how to aggregate these lines.
library(dplyr)
df |>
mutate(line = cumsum(grepl("^\\d+", item))) |>
summarize(item = paste(item, collapse = " "),
across(c(quantity, price), ~if(any(!is.na(.x))) sum(.x, na.rm = T) else NA),
.by = line) |>
select(-line)
The first "line" does not have any values nor does it start with a digit like the others. As mentioned by @stefan if these are some type of header that you want to exclude, you can simplify the across statement to across(c(quantity, price), na.omit)
and change summarize
to reframe
.
Output
item quantity price
1 PURCHASE ORDER ITEM CODE ABC NA NA
2 1 Computer soft ware 3 10.99
3 2 Mathem atics curriculum and calculators 2 8.50
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论