Creating a new row for each item in a data frame column when items are separated by a special character such as "+" in R

Question


I have data in a text file that contains several columns, and I would like to process it without losing any information. Some columns contain two or more items separated by a special character such as the "+" sign. I would like to put each of these combined items in a separate row within the same column. For example, I have pasted the data below.
My data frame looks like the following:

    df <- data.frame(G1=c("GH13_22+CBM4", "GH109+PL7+GH9","GT57", "AA3","",""),
                     G2=c("GH13_22","","GT57+GH15","AA3", "GT41","PL+PL2"),
                     G3=c("GH13", "GH1O9","", "CBM34+GH13+CBM48", "GT41","GH16+CBM4+CBM54+CBM32"))

                 G1        G2                    G3
    1  GH13_22+CBM4   GH13_22                  GH13
    2 GH109+PL7+GH9                           GH1O9
    3          GT57 GT57+GH15
    4           AA3       AA3      CBM34+GH13+CBM48
    5                    GT41                  GT41
    6                  PL+PL2 GH16+CBM4+CBM54+CBM32

Expected Results should look like

    df2 <- data.frame(G1=c("GH13_22","CBM4", "GH109","PL7","GH9","GT57", "AA3","","","",""),
                      G2=c("GH13_22","","GT57","GH15","AA3", "GT41","PL","PL2","","",""),
                      G3=c("GH13", "GH1O9","", "CBM34","GH13","CBM48", "GT41","GH16","CBM4","CBM54","CBM32"))

            G1      G2    G3
    1  GH13_22 GH13_22  GH13
    2     CBM4         GH1O9
    3    GH109    GT57
    4      PL7    GH15 CBM34
    5      GH9     AA3  GH13
    6     GT57    GT41 CBM48
    7      AA3      PL  GT41
    8              PL2  GH16
    9                   CBM4
    10                 CBM54
    11                 CBM32

Any help is appreciated.
Thanks

Answer 1

Score: 2

A base R solution:

    # Turn empty cells into NA, then split each column on a literal "+"
    parts <- lapply(df, \(x) unlist(strsplit(replace(x, x == '', NA_character_), '\\+')))
    # Pad every column with NA up to the length of the longest one
    as.data.frame(lapply(parts, `[`, seq_len(max(lengths(parts)))))

            G1      G2    G3
    1  GH13_22 GH13_22  GH13
    2     CBM4    <NA> GH1O9
    3    GH109    GT57  <NA>
    4      PL7    GH15 CBM34
    5      GH9     AA3  GH13
    6     GT57    GT41 CBM48
    7      AA3      PL  GT41
    8     <NA>     PL2  GH16
    9     <NA>    <NA>  CBM4
    10    <NA>    <NA> CBM54
    11    <NA>    <NA> CBM32
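
The result above pads the shorter columns with NA, whereas the expected output in the question uses empty strings. The following is a minimal base R sketch of one way to get empty strings instead. It assumes the `df` from the question; the names `parts`, `n`, and `out` are illustrative and not part of the answer.

```r
df <- data.frame(
  G1 = c("GH13_22+CBM4", "GH109+PL7+GH9", "GT57", "AA3", "", ""),
  G2 = c("GH13_22", "", "GT57+GH15", "AA3", "GT41", "PL+PL2"),
  G3 = c("GH13", "GH1O9", "", "CBM34+GH13+CBM48", "GT41", "GH16+CBM4+CBM54+CBM32"),
  stringsAsFactors = FALSE
)

# Split every column on a literal "+". Empty cells are turned into NA first,
# because strsplit("") returns a zero-length vector and the cell would vanish.
parts <- lapply(df, function(x) {
  x <- replace(x, x == "", NA_character_)
  unlist(strsplit(x, "+", fixed = TRUE))
})

# Pad each column to the length of the longest one, then turn NA into "".
n <- max(lengths(parts))
out <- as.data.frame(
  lapply(parts, function(p) {
    p <- p[seq_len(n)]   # indexing past the end yields NA
    p[is.na(p)] <- ""
    p
  }),
  stringsAsFactors = FALSE
)

print(out)
```

Using `fixed = TRUE` treats "+" as a plain character, so no regular-expression escaping is needed.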

Answer 2

Score: 1


separate_rows() has been superseded in favour of separate_longer_delim() because the latter has a more consistent API with the other separate functions. Superseded functions will not go away, but will only receive critical bug fixes. <https://tidyr.tidyverse.org/reference/separate_rows.html>

  1. We bring the data into long format.
  2. We split each value on "+" with separate_longer_delim().
  3. We replace blanks with NA using na_if() from dplyr.
  4. With summarise(cur_data()[seq(max(id)), ]) we expand each group to the max of id.
  5. Finally, we pivot the prepared data frame back to wide format:

    library(dplyr)
    library(tidyr)

    df %>%
      pivot_longer(everything()) %>%
      separate_longer_delim(value, "+") %>%
      mutate(value = na_if(value, "")) %>%
      group_by(name) %>%
      mutate(id = row_number()) %>%
      # note: from dplyr 1.1.0, cur_data() is deprecated in favour of pick(),
      # and multi-row summarise() in favour of reframe()
      summarise(cur_data()[seq(max(id)), ]) %>%
      pivot_wider(names_from = name, values_from = value)

          id G1      G2      G3
       <int> <chr>   <chr>   <chr>
     1     1 GH13_22 GH13_22 GH13
     2     2 CBM4    NA      GH1O9
     3     3 GH109   GT57    NA
     4     4 PL7     GH15    CBM34
     5     5 GH9     AA3     GH13
     6     6 GT57    GT41    CBM48
     7     7 AA3     PL      GT41
     8     8 NA      PL2     GH16
     9     9 NA      NA      CBM4
    10    10 NA      NA      CBM54
    11    11 NA      NA      CBM32
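
A shorter variant of the same pipeline can be sketched as follows, assuming dplyr 1.1.0 or later (for the `.by` argument) and tidyr 1.3.0 or later (for `separate_longer_delim()`). It is not the answer author's code. It skips the `na_if()` and `summarise()` steps and relies on `values_fill = ""` in `pivot_wider()` to fill the shorter columns, so the result uses empty strings as in the question's expected output. The name `out` is illustrative.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  G1 = c("GH13_22+CBM4", "GH109+PL7+GH9", "GT57", "AA3", "", ""),
  G2 = c("GH13_22", "", "GT57+GH15", "AA3", "GT41", "PL+PL2"),
  G3 = c("GH13", "GH1O9", "", "CBM34+GH13+CBM48", "GT41", "GH16+CBM4+CBM54+CBM32")
)

out <- df %>%
  # one row per (column name, cell value)
  pivot_longer(everything()) %>%
  # split each cell on a literal "+"; the delimiter is a fixed string by default
  separate_longer_delim(value, delim = "+") %>%
  # number the items within each original column
  mutate(id = row_number(), .by = name) %>%
  # back to wide format; positions missing in shorter columns become ""
  pivot_wider(names_from = name, values_from = value, values_fill = "") %>%
  arrange(id) %>%
  select(-id)

print(out)
```

Because `id` already runs from 1 to the number of items within each column, `pivot_wider()` aligns the columns by position on its own, which is why no separate expansion step appears in this sketch.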

Answer 3

Score: 1


Another option, inspired by @Peter M in this post:

    library(tidyverse)  # loads dplyr and stringr

    # finds which vector is the longest and pads the other vectors accordingly
    makePaddedDataFrame <- function(l) {
      maxlen <- max(lengths(l))
      data.frame(lapply(l, \(x) x[1:maxlen]))  # pads vectors with NA
    }

    df %>%
      # split every cell on "+", giving list-columns
      mutate(across(everything(), \(x) str_split(x, pattern = "\\+"))) %>%
      # flatten each list-column into one character vector
      lapply(function(x) do.call(c, x)) %>%
      makePaddedDataFrame() %>%
      replace(is.na(.), "")  # if you want empty strings instead of NA

            G1      G2    G3
    1  GH13_22 GH13_22  GH13
    2     CBM4         GH1O9
    3    GH109    GT57
    4      PL7    GH15 CBM34
    5      GH9     AA3  GH13
    6     GT57    GT41 CBM48
    7      AA3      PL  GT41
    8              PL2  GH16
    9                   CBM4
    10                 CBM54
    11                 CBM32

huangapple
  • Posted on March 8, 2023, 15:39:57
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75670407.html