2023年6月22日 05:21:14go评论101阅读模式

英文:

How to combine text that has spilled over multiple rows in R

问题

我使用tabulizer在R中抓取了一个PDF文件，并发现多行文本经常溢出到多行中，如下所示：
我想将跨多行的文本组合成有限数量的行，使其看起来像这样：
以下是重新创建这些tibble的代码：

英文:

I scraped a PDF in R using tabulizer and saw that multi-line text often spilled over into multiple rows, like so:

              item quantity price
1   PURCHASE ORDER       NA    NA
2    ITEM CODE ABC       NA    NA
3  1 Computer soft        3 10.99
4             ware       NA    NA
5         2 Mathem        2  8.50
6 atics curriculum       NA    NA
7  and calculators       NA    NA

I would like to combine the text across multiple rows into a limited number of rows, so it looks like this:

  item                                    quantity price
1 PURCHASE ORDER ITEM CODE ABC                NA    NA  
2 1 Computer software                           3 10.99
3 2 Mathematics curriculum and calculators      2  8.5

Here is the code for recreating these tibbles:

#Messy tibble
item &lt;- c(&quot;PURCHASE ORDER&quot;, &quot;ITEM CODE ABC&quot;, &quot;1 Computer soft&quot;, &quot;ware&quot;, &quot;2 Mathem&quot;, &quot;atics curriculum&quot;, &quot;and calculators&quot;)
quantity &lt;- c(NA, NA, 3, NA, 2, NA, NA)
price &lt;- c(NA, NA, 10.99, NA, 8.50, NA, NA)
df &lt;- data.frame(item, quantity, price)
df
#Clean tibble
item &lt;- c(&quot;PURCHASE ORDER ITEM CODE ABC&quot;, &quot;1 Computer software&quot;, &quot;2 Mathematics curriculum and calculators&quot;)
quantity &lt;- c(NA, 3, 2)
price &lt;- c(NA, 10.99, 8.50)
df &lt;- data.frame(item, quantity, price)
df

答案1

得分: 2

你可以通过那些在 quantity 和 price 上有缺失值的组来识别它们。然后，为将要在一起的文本创建唯一的组标识符。最后，你可以将文本粘贴在一起，并获取属于每个组中第一个观察值的 quantity 和 price 的值：

library(dplyr)
item <- c("PURCHASE ORDER", "ITEM CODE ABC", "1 Computer soft", "ware", "2 Mathem", "atics curriculum", "and calculators")
quantity <- c(NA, NA, 3, NA, 2, NA, NA)
price <- c(NA, NA, 10.99, NA, 8.50, NA, NA)
df <- data.frame(item, quantity, price)
df
#>               item quantity price
#> 1   PURCHASE ORDER       NA    NA
#> 2    ITEM CODE ABC       NA    NA
#> 3  1 Computer soft        3 10.99
#> 4             ware       NA    NA
#> 5         2 Mathem        2  8.50
#> 6 atics curriculum       NA    NA
#> 7  and calculators       NA    NA
df %>%
  mutate(grp = ifelse(is.na(quantity) & is.na(price) & !row_number() == 1, 0, 1),
         grp = cumsum(grp)) %>%
  group_by(grp) %>%
  reframe(item = paste(item, collapse=" "), 
            across(c(quantity, price), first)) %>%
  select(-grp)
#> # A tibble: 3 × 3
#>   item                                      quantity price
#>   <chr>                                        <dbl> <dbl>
#> 1 PURCHASE ORDER ITEM CODE ABC                    NA  NA  
#> 2 1 Computer soft ware                             3  11.0
#> 3 2 Mathem atics curriculum and calculators        2   8.5

^{创建于2023-06-21，使用 reprex v2.0.2}

英文:

You could identify the groups by those that have missing values on quantity and price. Then, make unique group identifiers for the text that will go together. Finally, you can paste the text together and take the values of quantity and price that belong to the first observation in each group:

library(dplyr)
item &lt;- c(&quot;PURCHASE ORDER&quot;, &quot;ITEM CODE ABC&quot;, &quot;1 Computer soft&quot;, &quot;ware&quot;, &quot;2 Mathem&quot;, &quot;atics curriculum&quot;, &quot;and calculators&quot;)
quantity &lt;- c(NA, NA, 3, NA, 2, NA, NA)
price &lt;- c(NA, NA, 10.99, NA, 8.50, NA, NA)
df &lt;- data.frame(item, quantity, price)
df
#&gt;               item quantity price
#&gt; 1   PURCHASE ORDER       NA    NA
#&gt; 2    ITEM CODE ABC       NA    NA
#&gt; 3  1 Computer soft        3 10.99
#&gt; 4             ware       NA    NA
#&gt; 5         2 Mathem        2  8.50
#&gt; 6 atics curriculum       NA    NA
#&gt; 7  and calculators       NA    NA
df %&gt;% 
  mutate(grp = ifelse(is.na(quantity) &amp; is.na(price) &amp; !row_number() == 1, 0, 1),
         grp = cumsum(grp)) %&gt;%
  group_by(grp) %&gt;% 
  reframe(item = paste(item, collapse=&quot; &quot;), 
            across(c(quantity, price), first)) %&gt;%
  select(-grp)
#&gt; # A tibble: 3 &#215; 3
#&gt;   item                                      quantity price
#&gt;   &lt;chr&gt;                                        &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 PURCHASE ORDER ITEM CODE ABC                    NA  NA  
#&gt; 2 1 Computer soft ware                             3  11.0
#&gt; 3 2 Mathem atics curriculum and calculators        2   8.5

<sup>Created on 2023-06-21 with reprex v2.0.2</sup>

答案2

得分: 1

以下是整理数据的一个选项：

注意：依我个人看法，前两行（"PURCHASE ...."）反映了列名，应该被删除。

library(dplyr, warn=FALSE)
library(tidyr)
library(stringr)
df |&gt; 
  mutate(no = as.numeric(str_extract(item, "^\\d+"))) |&gt; 
  fill(no) |&gt; 
  replace_na(list(no = 0)) |&gt; 
  group_by(no) |&gt; 
  summarise(across(everything(), \(x) paste0(x[!is.na(x)], collapse = " "))) |&gt; 
  select(-no)
#&gt; # A tibble: 3 &#215; 3
#&gt;   item                                      quantity price  
#&gt;   &lt;chr&gt;                                     &lt;chr&gt;    &lt;chr&gt;  
#&gt; 1 PURCHASE ORDER ITEM CODE ABC              ""       ""     
#&gt; 2 1 Computer soft ware                      "3"      "10.99"
#&gt; 3 2 Mathem atics curriculum and calculators "2"      "8.5"

英文:

Here is one option to tidy up your data:

Note: IMHO the first two rows ("PURCHASE ....") reflect the column name and should be dropped.

library(dplyr, warn=FALSE)
library(tidyr)
library(stringr)
df |&gt; 
  mutate(no = as.numeric(str_extract(item, &quot;^\\d+&quot;))) |&gt; 
  fill(no) |&gt; 
  replace_na(list(no = 0)) |&gt; 
  group_by(no) |&gt; 
  summarise(across(everything(), \(x) paste0(x[!is.na(x)], collapse = &quot; &quot;))) |&gt; 
  select(-no)
#&gt; # A tibble: 3 &#215; 3
#&gt;   item                                      quantity price  
#&gt;   &lt;chr&gt;                                     &lt;chr&gt;    &lt;chr&gt;  
#&gt; 1 PURCHASE ORDER ITEM CODE ABC              &quot;&quot;       &quot;&quot;     
#&gt; 2 1 Computer soft ware                      &quot;3&quot;      &quot;10.99&quot;
#&gt; 3 2 Mathem atics curriculum and calculators &quot;2&quot;      &quot;8.5&quot;

答案3

得分: 1

从您的帖子中，不清楚对于相同的行项目，quantity 或 price 是否会有一个值。因此，我只是取了 sum，您可以决定如何聚合这些行。

library(dplyr)
df |&gt;
  mutate(line = cumsum(grepl(&quot;^\\d+&quot;, item))) |&gt;
  summarize(item = paste(item, collapse = &quot; &quot;),
            across(c(quantity, price), ~if(any(!is.na(.x))) sum(.x, na.rm = T) else NA),
            .by = line) |&gt;
  select(-line)

第一个 "line" 没有任何值，也不像其他行以数字开头。如@stefan所提到的，如果这些是您想要排除的某种标题，您可以简化 across 语句为 across(c(quantity, price), na.omit) 并将 summarize 更改为 reframe。

输出

                                       item quantity price
1              PURCHASE ORDER ITEM CODE ABC       NA    NA
2                      1 Computer soft ware        3 10.99
3 2 Mathem atics curriculum and calculators        2  8.50

英文:

From your post, it is not clear if quantity or price will have a value for the same line item. So, I simply took the sum, you can decide how to aggregate these lines.

library(dplyr)
df |&gt;
  mutate(line = cumsum(grepl(&quot;^\\d+&quot;, item))) |&gt;
  summarize(item = paste(item, collapse = &quot; &quot;),
            across(c(quantity, price), ~if(any(!is.na(.x))) sum(.x, na.rm = T) else NA),
            .by = line) |&gt;
  select(-line)

The first "line" does not have any values nor does it start with a digit like the others. As mentioned by @stefan if these are some type of header that you want to exclude, you can simplify the across statement to across(c(quantity, price), na.omit) and change summarize to reframe.

Output

                                       item quantity price
1              PURCHASE ORDER ITEM CODE ABC       NA    NA
2                      1 Computer soft ware        3 10.99
3 2 Mathem atics curriculum and calculators        2  8.50

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

如何在R中合并跨多行的文本

问题

答案1

答案2

答案3

提取半结构化 .txt 中的文本和表格。

Polars从字典值的有条件查找中的列

图的阶数 n

如何使用具有相同分组变量的两个数据框创建带有标准差的误差条图？

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。