Adding data to a new column if separated by a "+" sign, using R

Question

Following on from my previous question, I have extra information in my data: I have now included the gene alongside the results. Because the same gene was predicted as different enzymes, the predictions were combined with a "+" sign, but now I would like to split them as shown below.

My data frame looks like this:

df <- data.frame(Gene = c("A", "B", "C", "D", "E", "F"),
                 G1 = c("GH13_22+CBM4", "GH109+PL7+GH9", "GT57", "AA3", "", ""),
                 G2 = c("GH13_22", "", "GT57+GH15", "AA3", "GT41", "PL+PL2"),
                 G3 = c("GH13", "GH1O9", "", "CBM34+GH13+CBM48", "GT41", "GH16+CBM4+CBM54+CBM32"))

and the desired output looks like this:

df2 <- data.frame(Gene = c("A", "A", "B", "B", "B", "C", "C", "D", "D", "D", "E", "F", "F", "F", "F"),
                  G1 = c("GH13_22", "CBM4", "GH109", "PL7", "GH9", "GT57", "GT57", "AA3", "AA3", "AA3", "", "", "", "", ""),
                  G2 = c("GH13_22", "GH13_22", "", "", "", "GT57", "GH15", "AA3", "AA3", "AA3", "GT41", "PL", "PL2", "", ""),
                  G3 = c("GH13", "", "GH1O9", "GH1O9", "GH1O9", "", "", "CBM34", "GH13", "CBM48", "GT41", "GH16", "CBM4", "CBM54", "CBM32"))

Kindly help.

Answer 1

Score: 1

It was harder than I thought but here's a way.

The main idea is to use str_split_fixed to split each string and return a fixed number of pieces, padded with "" if the input has fewer pieces. Note: I chose 4 here, but you can pick a much higher upper bound to accommodate longer strings.

library(stringr)
df[-1] <- lapply(df[-1], \(x) asplit(str_split_fixed(x, "\\+", 4), 1))

#  Gene                G1             G2                       G3
#1    A GH13_22, CBM4, ,   GH13_22, , ,                GH13, , , 
#2    B GH109, PL7, GH9,          , , ,               GH1O9, , , 
#3    C        GT57, , ,  GT57, GH15, ,                    , , , 
#4    D         AA3, , ,       AA3, , ,      CBM34, GH13, CBM48, 
#5    E            , , ,      GT41, , ,                GT41, , , 
#6    F            , , ,     PL, PL2, ,  GH16, CBM4, CBM54, CBM32
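
To see what the split step produces on its own, here is a minimal sketch on three values taken from the G1 column. The variable names x, m and pieces are only for illustration and are not part of the answer's code.

```r
library(stringr)

# Three values from the question's G1 column (illustrative subset)
x <- c("GH13_22+CBM4", "GH109+PL7+GH9", "")

# "\\+" escapes the plus sign, which would otherwise be a regex quantifier.
# The result is a character matrix with one row per input and 4 columns,
# padded with "" where an input has fewer than 4 pieces.
m <- str_split_fixed(x, "\\+", 4)
m

# asplit(m, 1) cuts the matrix by row, giving one length-4 element per input,
# which is what ends up stored in each cell of the list column.
pieces <- asplit(m, 1)
length(pieces)
```

Choosing a fixed width up front is what keeps every cell the same length, so the later unnesting step lines up across G1, G2 and G3.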

This results in a data.frame with G1:G3 as list columns, i.e. each element holds the 4 split pieces for that row. The remaining code then unnests those elements into long format, replaces empty strings with NA, removes rows that contain only NAs, and fills the remaining values by group:

library(dplyr)
library(tidyr)

unnest_longer(df, col = G1:G3) %>% 
  mutate(across(G1:G3, ~ na_if(.x, ""))) %>% 
  filter(if_any(G1:G3, complete.cases)) %>% 
  group_by(Gene) %>% 
  fill(G1:G3)

   Gene      G1      G2    G3
1     A GH13_22 GH13_22  GH13
2     A    CBM4 GH13_22  GH13
3     B   GH109    <NA> GH1O9
4     B     PL7    <NA> GH1O9
5     B     GH9    <NA> GH1O9
6     C    GT57    GT57  <NA>
7     C    GT57    GH15  <NA>
8     D     AA3     AA3 CBM34
9     D     AA3     AA3  GH13
10    D     AA3     AA3 CBM48
11    E    <NA>    GT41  GT41
12    F    <NA>      PL  GH16
13    F    <NA>     PL2  CBM4
14    F    <NA>     PL2 CBM54
15    F    <NA>     PL2 CBM32
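
The desired output in the question uses empty strings rather than NA. If that matters, one possible follow-up step is to convert the NAs back with tidyr's replace_na. The small res table below is an assumed stand-in for part of the filled result (genes E and F only) and is not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Stand-in for part of the filled result above (assumed for illustration)
res <- tibble(
  Gene = c("E", "F", "F"),
  G1   = rep(NA_character_, 3),
  G2   = c("GT41", "PL", "PL2"),
  G3   = c("GT41", "GH16", "CBM4")
)

# Replace any remaining NA in G1:G3 with an empty string
out <- res %>%
  mutate(across(G1:G3, ~ replace_na(.x, "")))
out
```

On the real data, the same mutate() call could be appended to the end of the pipeline, after fill().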

Answer 2

Score: 1

You could also do:

library(dplyr) # version >= 1.1.0 (needed for .by)
library(tidyr) # pivot_longer(), separate_rows(), pivot_wider()

df %>%
   pivot_longer(-Gene) %>%
   filter(nzchar(value)) %>%
   separate_rows(value, sep = '\\+') %>%
   mutate(Id = row_number(), .by = c(Gene, name)) %>%
   pivot_wider()

# A tibble: 15 × 5
   Gene     Id G1      G2      G3   
   <chr> <int> <chr>   <chr>   <chr>
 1 A         1 GH13_22 GH13_22 GH13 
 2 A         2 CBM4    NA      NA   
 3 B         1 GH109   NA      GH1O9
 4 B         2 PL7     NA      NA   
 5 B         3 GH9     NA      NA   
 6 C         1 GT57    GT57    NA   
 7 C         2 NA      GH15    NA   
 8 D         1 AA3     AA3     CBM34
 9 D         2 NA      NA      GH13 
10 D         3 NA      NA      CBM48
11 E         1 NA      GT41    GT41 
12 F         1 NA      PL      GH16 
13 F         2 NA      PL2     CBM4 
14 F         3 NA      NA      CBM54
15 F         4 NA      NA      CBM32

You can drop the Id column by appending %>% select(-Id) to the pipeline.
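
As a rough illustration of why the Id column is there: it numbers the split pieces within each Gene/column pair, so that pivot_wider() has a unique key for every output row. The sketch below uses a small hand-made long table (the name long and its four rows are assumptions for the example) and drops Id at the end.

```r
library(dplyr)  # .by requires dplyr >= 1.1.0
library(tidyr)

# Hand-made long-format input, shaped like the result of
# pivot_longer() followed by separate_rows() (assumed for illustration)
long <- tibble(
  Gene  = c("A", "A", "A", "B"),
  name  = c("G1", "G1", "G2", "G1"),
  value = c("GH13_22", "CBM4", "GH13_22", "GH109")
)

wide <- long %>%
  # Number the pieces within each Gene/name pair: 1, 2, 1, 1
  mutate(Id = row_number(), .by = c(Gene, name)) %>%
  # Defaults: names_from = name, values_from = value; Gene and Id identify rows
  pivot_wider() %>%
  # Id has done its job once the table is wide again
  select(-Id)
wide
```

Without Id, gene A would have two G1 values competing for a single row, and pivot_wider() would fall back to list columns with a warning.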

huangapple
  • Published on March 8, 2023, 17:39:20
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75671417.html