如何在R中合并跨多行的文本

huangapple go评论68阅读模式
英文:

How to combine text that has spilled over multiple rows in R

问题

我使用tabulizer在R中抓取了一个PDF文件,并发现多行文本经常溢出到多行中,如下所示:

我想将跨多行的文本组合成有限数量的行,使其看起来像这样:

以下是重新创建这些tibble的代码:
英文:

I scraped a PDF in R using tabulizer and saw that multi-line text often spilled over into multiple rows, like so:

              item quantity price
1   PURCHASE ORDER       NA    NA
2    ITEM CODE ABC       NA    NA
3  1 Computer soft        3 10.99
4             ware       NA    NA
5         2 Mathem        2  8.50
6 atics curriculum       NA    NA
7  and calculators       NA    NA

I would like to combine the text across multiple rows into a limited number of rows, so it looks like this:

  item                                    quantity price
1 PURCHASE ORDER ITEM CODE ABC                NA    NA  
2 1 Computer software                           3 10.99
3 2 Mathematics curriculum and calculators      2  8.5 

Here is the code for recreating these tibbles:

#Messy tibble
item <- c("PURCHASE ORDER", "ITEM CODE ABC", "1 Computer soft", "ware", "2 Mathem", "atics curriculum", "and calculators")
quantity <- c(NA, NA, 3, NA, 2, NA, NA)
price <- c(NA, NA, 10.99, NA, 8.50, NA, NA)
df <- data.frame(item, quantity, price)
df

#Clean tibble
item <- c("PURCHASE ORDER ITEM CODE ABC", "1 Computer software", "2 Mathematics curriculum and calculators")
quantity <- c(NA, 3, 2)
price <- c(NA, 10.99, 8.50)
df <- data.frame(item, quantity, price)
df

答案1

得分: 2

你可以通过那些在 quantityprice 上有缺失值的组来识别它们。然后,为将要在一起的文本创建唯一的组标识符。最后,你可以将文本粘贴在一起,并获取属于每个组中第一个观察值的 quantityprice 的值:

library(dplyr)
item <- c("PURCHASE ORDER", "ITEM CODE ABC", "1 Computer soft", "ware", "2 Mathem", "atics curriculum", "and calculators")
quantity <- c(NA, NA, 3, NA, 2, NA, NA)
price <- c(NA, NA, 10.99, NA, 8.50, NA, NA)
df <- data.frame(item, quantity, price)
df
#>               item quantity price
#> 1   PURCHASE ORDER       NA    NA
#> 2    ITEM CODE ABC       NA    NA
#> 3  1 Computer soft        3 10.99
#> 4             ware       NA    NA
#> 5         2 Mathem        2  8.50
#> 6 atics curriculum       NA    NA
#> 7  and calculators       NA    NA

df %>%
  mutate(grp = ifelse(is.na(quantity) & is.na(price) & !row_number() == 1, 0, 1),
         grp = cumsum(grp)) %>%
  group_by(grp) %>%
  reframe(item = paste(item, collapse=" "), 
            across(c(quantity, price), first)) %>%
  select(-grp)
#> # A tibble: 3 × 3
#>   item                                      quantity price
#>   <chr>                                        <dbl> <dbl>
#> 1 PURCHASE ORDER ITEM CODE ABC                    NA  NA  
#> 2 1 Computer soft ware                             3  11.0
#> 3 2 Mathem atics curriculum and calculators        2   8.5

创建于2023-06-21,使用 reprex v2.0.2

英文:

You could identify the groups by those that have missing values on quantity and price. Then, make unique group identifiers for the text that will go together. Finally, you can paste the text together and take the values of quantity and price that belong to the first observation in each group:

library(dplyr)
item &lt;- c(&quot;PURCHASE ORDER&quot;, &quot;ITEM CODE ABC&quot;, &quot;1 Computer soft&quot;, &quot;ware&quot;, &quot;2 Mathem&quot;, &quot;atics curriculum&quot;, &quot;and calculators&quot;)
quantity &lt;- c(NA, NA, 3, NA, 2, NA, NA)
price &lt;- c(NA, NA, 10.99, NA, 8.50, NA, NA)
df &lt;- data.frame(item, quantity, price)
df
#&gt;               item quantity price
#&gt; 1   PURCHASE ORDER       NA    NA
#&gt; 2    ITEM CODE ABC       NA    NA
#&gt; 3  1 Computer soft        3 10.99
#&gt; 4             ware       NA    NA
#&gt; 5         2 Mathem        2  8.50
#&gt; 6 atics curriculum       NA    NA
#&gt; 7  and calculators       NA    NA

df %&gt;% 
  mutate(grp = ifelse(is.na(quantity) &amp; is.na(price) &amp; !row_number() == 1, 0, 1),
         grp = cumsum(grp)) %&gt;%
  group_by(grp) %&gt;% 
  reframe(item = paste(item, collapse=&quot; &quot;), 
            across(c(quantity, price), first)) %&gt;%
  select(-grp)
#&gt; # A tibble: 3 &#215; 3
#&gt;   item                                      quantity price
#&gt;   &lt;chr&gt;                                        &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 PURCHASE ORDER ITEM CODE ABC                    NA  NA  
#&gt; 2 1 Computer soft ware                             3  11.0
#&gt; 3 2 Mathem atics curriculum and calculators        2   8.5

<sup>Created on 2023-06-21 with reprex v2.0.2</sup>

答案2

得分: 1

以下是整理数据的一个选项:

注意:依我个人看法,前两行("PURCHASE ....")反映了列名,应该被删除。

library(dplyr, warn=FALSE)
library(tidyr)
library(stringr)

df |&gt; 
  mutate(no = as.numeric(str_extract(item, "^\\d+"))) |&gt; 
  fill(no) |&gt; 
  replace_na(list(no = 0)) |&gt; 
  group_by(no) |&gt; 
  summarise(across(everything(), \(x) paste0(x[!is.na(x)], collapse = " "))) |&gt; 
  select(-no)
#&gt; # A tibble: 3 &#215; 3
#&gt;   item                                      quantity price  
#&gt;   &lt;chr&gt;                                     &lt;chr&gt;    &lt;chr&gt;  
#&gt; 1 PURCHASE ORDER ITEM CODE ABC              ""       ""     
#&gt; 2 1 Computer soft ware                      "3"      "10.99"
#&gt; 3 2 Mathem atics curriculum and calculators "2"      "8.5"
英文:

Here is one option to tidy up your data:

Note: IMHO the first two rows (&quot;PURCHASE ....&quot;) reflect the column name and should be dropped.

library(dplyr, warn=FALSE)
library(tidyr)
library(stringr)

df |&gt; 
  mutate(no = as.numeric(str_extract(item, &quot;^\\d+&quot;))) |&gt; 
  fill(no) |&gt; 
  replace_na(list(no = 0)) |&gt; 
  group_by(no) |&gt; 
  summarise(across(everything(), \(x) paste0(x[!is.na(x)], collapse = &quot; &quot;))) |&gt; 
  select(-no)
#&gt; # A tibble: 3 &#215; 3
#&gt;   item                                      quantity price  
#&gt;   &lt;chr&gt;                                     &lt;chr&gt;    &lt;chr&gt;  
#&gt; 1 PURCHASE ORDER ITEM CODE ABC              &quot;&quot;       &quot;&quot;     
#&gt; 2 1 Computer soft ware                      &quot;3&quot;      &quot;10.99&quot;
#&gt; 3 2 Mathem atics curriculum and calculators &quot;2&quot;      &quot;8.5&quot;

答案3

得分: 1

从您的帖子中,不清楚对于相同的行项目,quantityprice 是否会有一个值。因此,我只是取了 sum,您可以决定如何聚合这些行。

library(dplyr)

df |&gt;
  mutate(line = cumsum(grepl(&quot;^\\d+&quot;, item))) |&gt;
  summarize(item = paste(item, collapse = &quot; &quot;),
            across(c(quantity, price), ~if(any(!is.na(.x))) sum(.x, na.rm = T) else NA),
            .by = line) |&gt;
  select(-line)

第一个 "line" 没有任何值,也不像其他行以数字开头。如@stefan所提到的,如果这些是您想要排除的某种标题,您可以简化 across 语句为 across(c(quantity, price), na.omit) 并将 summarize 更改为 reframe

输出

                                       item quantity price
1              PURCHASE ORDER ITEM CODE ABC       NA    NA
2                      1 Computer soft ware        3 10.99
3 2 Mathem atics curriculum and calculators        2  8.50
英文:

From your post, it is not clear if quantity or price will have a value for the same line item. So, I simply took the sum, you can decide how to aggregate these lines.

library(dplyr)

df |&gt;
  mutate(line = cumsum(grepl(&quot;^\\d+&quot;, item))) |&gt;
  summarize(item = paste(item, collapse = &quot; &quot;),
            across(c(quantity, price), ~if(any(!is.na(.x))) sum(.x, na.rm = T) else NA),
            .by = line) |&gt;
  select(-line)

The first "line" does not have any values nor does it start with a digit like the others. As mentioned by @stefan if these are some type of header that you want to exclude, you can simplify the across statement to across(c(quantity, price), na.omit) and change summarize to reframe.

Output

                                       item quantity price
1              PURCHASE ORDER ITEM CODE ABC       NA    NA
2                      1 Computer soft ware        3 10.99
3 2 Mathem atics curriculum and calculators        2  8.50

huangapple
  • 本文由 发表于 2023年6月22日 05:21:14
  • 转载请务必保留本文链接:https://go.coder-hub.com/76527223.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定