Adding data to a new column if separated by a "+" sign using R

Question


Following my previous question, I now have extra information in my data: I have included the gene alongside the results. Because the same gene was predicted as different enzymes, the results were combined with a "+" sign. I would now like to split them as shown below.

My data frame looks like this:

    df <- data.frame(
      Gene = c("A", "B", "C", "D", "E", "F"),
      G1 = c("GH13_22+CBM4", "GH109+PL7+GH9", "GT57", "AA3", "", ""),
      G2 = c("GH13_22", "", "GT57+GH15", "AA3", "GT41", "PL+PL2"),
      G3 = c("GH13", "GH1O9", "", "CBM34+GH13+CBM48", "GT41", "GH16+CBM4+CBM54+CBM32")
    )

and the desired output looks like this:

    df2 <- data.frame(
      Gene = c("A", "A", "B", "B", "B", "C", "C", "D", "D", "D", "E", "F", "F", "F", "F"),
      G1 = c("GH13_22", "CBM4", "GH109", "PL7", "GH9", "GT57", "GT57", "AA3", "AA3", "AA3", "", "", "", "", ""),
      G2 = c("GH13_22", "GH13_22", "", "", "", "GT57", "GH15", "AA3", "AA3", "AA3", "GT41", "PL", "PL2", "", ""),
      G3 = c("GH13", "", "GH1O9", "GH1O9", "GH1O9", "", "", "CBM34", "GH13", "CBM48", "GT41", "GH16", "CBM4", "CBM54", "CBM32")
    )

Kindly help.

Answer 1

Score: 1


It was harder than I thought but here's a way.

The main idea is to use str_split_fixed() to split each string and return a fixed number of pieces, padded with "" if the input has fewer pieces. Note: I chose 4 here, but you can pick a much higher upper bound to accommodate longer strings.

    library(stringr)

    # Split every G column on "+" into exactly 4 pieces per row (padded with ""),
    # then store each row's pieces in one cell of the column
    df[-1] <- lapply(df[-1], \(x) asplit(str_split_fixed(x, "\\+", 4), 1))
    #   Gene  G1                  G2               G3
    # 1 A     GH13_22, CBM4, ,    GH13_22, , ,     GH13, , ,
    # 2 B     GH109, PL7, GH9,    , , ,            GH1O9, , ,
    # 3 C     GT57, , ,           GT57, GH15, ,    , , ,
    # 4 D     AA3, , ,            AA3, , ,         CBM34, GH13, CBM48,
    # 5 E     , , ,               GT41, , ,        GT41, , ,
    # 6 F     , , ,               PL, PL2, ,       GH16, CBM4, CBM54, CBM32
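If hard-coding 4 feels fragile, the upper bound can probably be derived from the data instead. The following is a minimal sketch on a small stand-in vector, not part of the original answer; the names `x`, `n_max`, and `parts` are made up for illustration.

```r
library(stringr)

# Small stand-in for one of the G columns (illustrative data)
x <- c("GH13_22+CBM4", "GH109+PL7+GH9", "")

# Largest number of "+"-separated pieces in any element:
# the number of "+" signs plus one
n_max <- max(str_count(x, fixed("+"))) + 1

# One row per input string, padded with "" up to n_max columns
parts <- str_split_fixed(x, fixed("+"), n_max)
parts
```

For the whole data frame, the same count would need to be taken over all of the G columns before splitting.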

This gives a data.frame in which G1:G3 are list-columns: each cell holds the 4 split pieces for that row. The remaining code unnests those cells into long format (one piece per row), replaces empty strings with NA, removes rows that contain only NAs, and then fills the remaining values down within each gene:

    library(dplyr)
    library(tidyr)

    unnest_longer(df, col = G1:G3) %>%
      mutate(across(G1:G3, ~ na_if(.x, ""))) %>%  # "" -> NA
      filter(if_any(G1:G3, complete.cases)) %>%   # drop rows where G1:G3 are all NA
      group_by(Gene) %>%
      fill(G1:G3)                                 # carry values down within each gene
    #    Gene  G1      G2      G3
    #  1 A     GH13_22 GH13_22 GH13
    #  2 A     CBM4    GH13_22 GH13
    #  3 B     GH109   <NA>    GH1O9
    #  4 B     PL7     <NA>    GH1O9
    #  5 B     GH9     <NA>    GH1O9
    #  6 C     GT57    GT57    <NA>
    #  7 C     GT57    GH15    <NA>
    #  8 D     AA3     AA3     CBM34
    #  9 D     AA3     AA3     GH13
    # 10 D     AA3     AA3     CBM48
    # 11 E     <NA>    GT41    GT41
    # 12 F     <NA>    PL      GH16
    # 13 F     <NA>    PL2     CBM4
    # 14 F     <NA>    PL2     CBM54
    # 15 F     <NA>    PL2     CBM32
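The desired output in the question uses empty strings where this result has NA. A minimal sketch of converting back, shown on a small stand-in tibble rather than the full result; the names `res` and `res_blank` are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for a few rows of the filled result (illustrative only)
res <- tibble(
  Gene = c("B", "B", "E"),
  G1   = c("GH109", "PL7", NA),
  G2   = c(NA, NA, "GT41")
)

# Replace every NA in the G columns with an empty string
res_blank <- res %>%
  mutate(across(G1:G2, ~ replace_na(.x, "")))
```

On the real result the column selection would be G1:G3, and the result should be ungrouped first if the grouping is no longer needed.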

Answer 2

Score: 1


You could also do:

    library(dplyr)  # version >= 1.1.0 (needed for .by)
    library(tidyr)  # pivot_longer(), separate_rows(), pivot_wider()

    df %>%
      pivot_longer(-Gene) %>%                              # one row per Gene/column pair
      filter(nzchar(value)) %>%                            # drop empty cells
      separate_rows(value, sep = "\\+") %>%                # one row per "+"-separated piece
      mutate(Id = row_number(), .by = c(Gene, name)) %>%   # position of each piece
      pivot_wider()
    # A tibble: 15 × 5
    #    Gene     Id G1      G2      G3
    #    <chr> <int> <chr>   <chr>   <chr>
    #  1 A         1 GH13_22 GH13_22 GH13
    #  2 A         2 CBM4    NA      NA
    #  3 B         1 GH109   NA      GH1O9
    #  4 B         2 PL7     NA      NA
    #  5 B         3 GH9     NA      NA
    #  6 C         1 GT57    GT57    NA
    #  7 C         2 NA      GH15    NA
    #  8 D         1 AA3     AA3     CBM34
    #  9 D         2 NA      NA      GH13
    # 10 D         3 NA      NA      CBM48
    # 11 E         1 NA      GT41    GT41
    # 12 F         1 NA      PL      GH16
    # 13 F         2 NA      PL2     CBM4
    # 14 F         3 NA      NA      CBM54
    # 15 F         4 NA      NA      CBM32

You can drop the Id column by appending %>% select(-Id).
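This result leaves NA where a gene has fewer pieces in one column than in another, whereas the desired output in the question repeats the last value within each gene. A minimal sketch of one way to get there, assuming that adding a grouped fill() to the same pipeline is acceptable; the name `filled` is made up for illustration, and `df` is rebuilt from the question so the block runs on its own.

```r
library(dplyr)  # .by in mutate() needs dplyr >= 1.1.0
library(tidyr)

# Input data frame from the question
df <- data.frame(
  Gene = c("A", "B", "C", "D", "E", "F"),
  G1 = c("GH13_22+CBM4", "GH109+PL7+GH9", "GT57", "AA3", "", ""),
  G2 = c("GH13_22", "", "GT57+GH15", "AA3", "GT41", "PL+PL2"),
  G3 = c("GH13", "GH1O9", "", "CBM34+GH13+CBM48", "GT41", "GH16+CBM4+CBM54+CBM32")
)

filled <- df %>%
  pivot_longer(-Gene) %>%
  filter(nzchar(value)) %>%
  separate_rows(value, sep = "\\+") %>%
  mutate(Id = row_number(), .by = c(Gene, name)) %>%
  pivot_wider() %>%
  group_by(Gene) %>%
  fill(G1:G3) %>%    # carry the last seen value down within each gene
  ungroup() %>%
  select(-Id)        # the helper index is no longer needed
```

Columns that are empty for a whole gene (for example G2 for gene B) stay NA, since there is nothing to carry down.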

huangapple
  • Posted on March 8, 2023 at 17:39:20
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75671417.html