March 8, 2023, 15:39:57

Creating a new row for each item in a data frame when items are separated by a special character such as "+" in R

Question


I have data in a text file with several columns, and I would like to process it without losing any information. Some columns contain two or more values separated by a special character such as the "+" sign. I would like to put each of these combined values in its own row within the same column. An example is pasted below.
My data frame looks like this:

df <- data.frame(G1=c("GH13_22+CBM4", "GH109+PL7+GH9","GT57", "AA3","",""),
                 G2=c("GH13_22","","GT57+GH15","AA3", "GT41","PL+PL2"),
                 G3=c("GH13", "GH1O9","", "CBM34+GH13+CBM48", "GT41","GH16+CBM4+CBM54+CBM32"))

             G1        G2                    G3
1  GH13_22+CBM4   GH13_22                  GH13
2 GH109+PL7+GH9                           GH1O9
3          GT57 GT57+GH15
4           AA3       AA3      CBM34+GH13+CBM48
5                    GT41                  GT41
6                  PL+PL2 GH16+CBM4+CBM54+CBM32

The expected result should look like this:

df2 <- data.frame(G1=c("GH13_22","CBM4", "GH109","PL7","GH9","GT57", "AA3","","","",""),
                  G2=c("GH13_22","","GT57","GH15","AA3", "GT41","PL","PL2","","",""),
                  G3=c("GH13", "GH1O9","", "CBM34","GH13","CBM48", "GT41","GH16","CBM4","CBM54","CBM32"))

        G1      G2    G3
1  GH13_22 GH13_22  GH13
2     CBM4         GH1O9
3    GH109    GT57
4      PL7    GH15 CBM34
5      GH9     AA3  GH13
6     GT57    GT41 CBM48
7      AA3      PL  GT41
8              PL2  GH16
9                   CBM4
10                 CBM54
11                 CBM32

Any help is appreciated.
Thanks

Answer 1

Score: 2

A base solution:

# Treat empty strings as NA, then split each column on a literal "+"
parts <- lapply(df, \(x) unlist(strsplit(replace(x, x == '', NA_character_), '\\+')))
# Index every column up to the longest length so shorter columns are padded with NA
as.data.frame(lapply(parts, `[`, seq_len(max(lengths(parts)))))
        G1      G2    G3
1  GH13_22 GH13_22  GH13
2     CBM4    <NA> GH1O9
3    GH109    GT57  <NA>
4      PL7    GH15 CBM34
5      GH9     AA3  GH13
6     GT57    GT41 CBM48
7      AA3      PL  GT41
8     <NA>     PL2  GH16
9     <NA>    <NA>  CBM4
10    <NA>    <NA> CBM54
11    <NA>    <NA> CBM32
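
The question asks for empty strings rather than NA. As a minimal sketch (the names parts and res are introduced here only for illustration), the NA values in the result above can be turned back into "" with base R indexing and then compared against the expected df2 from the question. Note that this no longer distinguishes original blanks from padding.

```r
# Input and expected output, copied from the question
df <- data.frame(G1 = c("GH13_22+CBM4", "GH109+PL7+GH9", "GT57", "AA3", "", ""),
                 G2 = c("GH13_22", "", "GT57+GH15", "AA3", "GT41", "PL+PL2"),
                 G3 = c("GH13", "GH1O9", "", "CBM34+GH13+CBM48", "GT41", "GH16+CBM4+CBM54+CBM32"))
df2 <- data.frame(G1 = c("GH13_22", "CBM4", "GH109", "PL7", "GH9", "GT57", "AA3", "", "", "", ""),
                  G2 = c("GH13_22", "", "GT57", "GH15", "AA3", "GT41", "PL", "PL2", "", "", ""),
                  G3 = c("GH13", "GH1O9", "", "CBM34", "GH13", "CBM48", "GT41", "GH16", "CBM4", "CBM54", "CBM32"))

# Same approach as the answer: blanks become NA, then each column is split on a literal "+"
parts <- lapply(df, function(x) unlist(strsplit(replace(x, x == "", NA_character_), "\\+")))
res <- as.data.frame(lapply(parts, `[`, seq_len(max(lengths(parts)))))

# Replace every NA (original blanks and padding alike) with an empty string
res[is.na(res)] <- ""

# Cell-by-cell comparison with the expected result
all(res == df2)
```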


Answer 2

Score: 1


separate_rows() has been superseded in favour of separate_longer_delim() because it has a more consistent API with other separate functions. Superseded functions will not go away, but will only receive critical bug fixes. <https://tidyr.tidyverse.org/reference/separate_rows.html>

First we bring the data into long format.
Then we split the values on "+" and replace blanks with NA using na_if() from dplyr.
The line summarise(cur_data()[seq(max(id)), ]) keeps rows 1 through max(id) of each group.
Finally we pivot the prepared data frame back to wide format:

library(dplyr)
library(tidyr)
df %>% 
  pivot_longer(everything()) %>%            # one row per cell
  separate_longer_delim(value, "+") %>%     # one row per "+"-separated piece
  mutate(value = na_if(value, "")) %>% 
  group_by(name) %>% 
  mutate(id = row_number()) %>% 
  summarise(cur_data()[seq(max(id)), ]) %>% # cur_data() is deprecated as of dplyr 1.1.0
  pivot_wider(names_from = name, values_from = value) 
      id G1      G2      G3   
   <int> <chr>   <chr>   <chr>
 1     1 GH13_22 GH13_22 GH13 
 2     2 CBM4    NA      GH1O9
 3     3 GH109   GT57    NA   
 4     4 PL7     GH15    CBM34
 5     5 GH9     AA3     GH13 
 6     6 GT57    GT41    CBM48
 7     7 AA3     PL      GT41 
 8     8 NA      PL2     GH16 
 9     9 NA      NA      CBM4 
10    10 NA      NA      CBM54
11    11 NA      NA      CBM32
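
In the pipeline above, id is numbered within each group, so the summarise() step appears to return every row of each group unchanged. The NA padding then comes from pivot_wider(), which fills missing id and name combinations with NA. The following is a minimal sketch that drops that step, assuming a tidyr version that provides separate_longer_delim() (1.3.0 or later); res is a name introduced here only for illustration.

```r
library(dplyr)
library(tidyr)

# Input copied from the question
df <- data.frame(G1 = c("GH13_22+CBM4", "GH109+PL7+GH9", "GT57", "AA3", "", ""),
                 G2 = c("GH13_22", "", "GT57+GH15", "AA3", "GT41", "PL+PL2"),
                 G3 = c("GH13", "GH1O9", "", "CBM34+GH13+CBM48", "GT41", "GH16+CBM4+CBM54+CBM32"))

res <- df %>%
  pivot_longer(everything()) %>%         # one row per cell: columns name, value
  separate_longer_delim(value, "+") %>%  # one row per "+"-separated piece
  mutate(value = na_if(value, "")) %>%   # blanks become NA
  group_by(name) %>%
  mutate(id = row_number()) %>%          # position within each original column
  ungroup() %>%
  pivot_wider(names_from = name, values_from = value)  # missing positions become NA

res
```

This version avoids cur_data() and the multi-row summarise(), both of which dplyr 1.1.0 deprecates.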

Answer 3

Score: 1


Another option, inspired by @Peter M in this post:

library(tidyverse) # attaches dplyr and stringr
# finds which vector is the longest and pads the other vectors accordingly
makePaddedDataFrame <- function(l) {
  maxlen <- max(lengths(l))
  data.frame(lapply(l, \(x) x[seq_len(maxlen)])) # pads vectors with NA
}
df %>% 
  mutate(across(everything(), \(x) str_split(x, pattern = "\\+"))) %>% # list-columns of pieces
  lapply(\(x) do.call(c, x)) %>% # flatten each list-column into one vector
  makePaddedDataFrame() %>% 
  replace(is.na(.), "") # if you want empty strings instead of NA
        G1      G2    G3
1  GH13_22 GH13_22  GH13
2     CBM4         GH1O9
3    GH109    GT57      
4      PL7    GH15 CBM34
5      GH9     AA3  GH13
6     GT57    GT41 CBM48
7      AA3      PL  GT41
8              PL2  GH16
9                   CBM4
10                 CBM54
11                 CBM32
