March 8, 2023 17:39:20

Adding data to a new column if separated by a "+" sign using R

Question


Following up on a previous question, I have extra information in my data: I have now included the gene alongside the results. Since the same gene was predicted as different enzymes, the results were combined with a "+" sign, but now I would like to split them as shown below.
My data frame looks like this:

df <- data.frame(Gene = c("A", "B", "C", "D", "E", "F"),
                 G1 = c("GH13_22+CBM4", "GH109+PL7+GH9", "GT57", "AA3", "", ""),
                 G2 = c("GH13_22", "", "GT57+GH15", "AA3", "GT41", "PL+PL2"),
                 G3 = c("GH13", "GH1O9", "", "CBM34+GH13+CBM48", "GT41", "GH16+CBM4+CBM54+CBM32"))

and the desired output looks like this:

df2 <- data.frame(Gene = c("A", "A", "B", "B", "B", "C", "C", "D", "D", "D", "E", "F", "F", "F", "F"),
                  G1 = c("GH13_22", "CBM4", "GH109", "PL7", "GH9", "GT57", "GT57", "AA3", "AA3", "AA3", "", "", "", "", ""),
                  G2 = c("GH13_22", "GH13_22", "", "", "", "GT57", "GH15", "AA3", "AA3", "AA3", "GT41", "PL", "PL2", "", ""),
                  G3 = c("GH13", "", "GH1O9", "GH1O9", "GH1O9", "", "", "CBM34", "GH13", "CBM48", "GT41", "GH16", "CBM4", "CBM54", "CBM32"))

Kindly help

Answer 1

Score: 1


It was harder than I thought but here's a way.

The main idea is to use the function str_split_fixed to split each string into a fixed number of pieces, padded with "" if the input has fewer pieces than requested. Note: I selected 4 here, but you can choose a much higher upper bound to accommodate longer strings.

library(stringr)
# The \(x) lambda shorthand requires R >= 4.1
df[-1] <- lapply(df[-1], \(x) asplit(str_split_fixed(x, "\\+", 4), 1))
#  Gene                G1             G2                       G3
#1    A GH13_22, CBM4, ,   GH13_22, , ,                GH13, , , 
#2    B GH109, PL7, GH9,          , , ,               GH1O9, , , 
#3    C        GT57, , ,  GT57, GH15, ,                    , , , 
#4    D         AA3, , ,       AA3, , ,      CBM34, GH13, CBM48, 
#5    E            , , ,      GT41, , ,                GT41, , , 
#6    F            , , ,     PL, PL2, ,  GH16, CBM4, CBM54, CBM32

This results in a data.frame with G1:G3 as list-columns, i.e. each element holds the 4 pieces of one split string. Then, the remaining code unnests those elements into multiple rows in long format, replaces empty strings with NAs, removes rows that contain only NAs, and then fills the remaining values by group:

library(dplyr)
library(tidyr)
unnest_longer(df, col = G1:G3) %>% 
  mutate(across(G1:G3, ~ na_if(.x, ""))) %>% 
  filter(if_any(G1:G3, complete.cases)) %>% 
  group_by(Gene) %>% 
  fill(G1:G3)
   Gene      G1      G2    G3
1     A GH13_22 GH13_22  GH13
2     A    CBM4 GH13_22  GH13
3     B   GH109    <NA> GH1O9
4     B     PL7    <NA> GH1O9
5     B     GH9    <NA> GH1O9
6     C    GT57    GT57  <NA>
7     C    GT57    GH15  <NA>
8     D     AA3     AA3 CBM34
9     D     AA3     AA3  GH13
10    D     AA3     AA3 CBM48
11    E    <NA>    GT41  GT41
12    F    <NA>      PL  GH16
13    F    <NA>     PL2  CBM4
14    F    <NA>     PL2 CBM54
15    F    <NA>     PL2 CBM32
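
The upper bound of 4 does not have to be guessed. As a minimal sketch (the vector x and the names n_max, full and short are made up for illustration, not part of the answer), the number of pieces can be counted from the data first, which also shows what str_split_fixed does when the bound is too small:

```r
library(stringr)

# Illustrative input; in the question this would be unlist(df[-1])
x <- c("GH13_22+CBM4", "GT57", "GH16+CBM4+CBM54+CBM32")

# Largest number of "+"-separated pieces found in any string
n_max <- max(str_count(x, fixed("+"))) + 1

# With n = n_max nothing is lost; shorter inputs are padded with ""
full <- str_split_fixed(x, "\\+", n_max)

# With n too small, the last column keeps the unsplit remainder
short <- str_split_fixed(x, "\\+", 2)
```

Passing n_max instead of the hard-coded 4 should keep long entries from being silently left unsplit in the last column.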

Answer 2

Score: 1


You could also do:

library(dplyr) # version >= 1.1.0 (needed for the .by argument)
library(tidyr) # pivot_longer, separate_rows, pivot_wider
df %>%
   pivot_longer(-Gene) %>%
   filter(nzchar(value)) %>%
   separate_rows(value, sep = '\\+') %>%
   mutate(Id = row_number(), .by = c(Gene, name)) %>%
   pivot_wider()
# A tibble: 15 × 5
   Gene     Id G1      G2      G3   
   <chr> <int> <chr>   <chr>   <chr>
 1 A         1 GH13_22 GH13_22 GH13 
 2 A         2 CBM4    NA      NA   
 3 B         1 GH109   NA      GH1O9
 4 B         2 PL7     NA      NA   
 5 B         3 GH9     NA      NA   
 6 C         1 GT57    GT57    NA   
 7 C         2 NA      GH15    NA   
 8 D         1 AA3     AA3     CBM34
 9 D         2 NA      NA      GH13 
10 D         3 NA      NA      CBM48
11 E         1 NA      GT41    GT41 
12 F         1 NA      PL      GH16 
13 F         2 NA      PL2     CBM4 
14 F         3 NA      NA      CBM54
15 F         4 NA      NA      CBM32

You can drop the Id column by appending %>% select(-Id) to the pipeline.
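
Since the desired output in the question uses empty strings rather than NA, one possible follow-up is sketched below. It assumes dplyr >= 1.1.0 and tidyr are installed; the name res and the replace_na step are additions for illustration, not part of the original answer.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for the .by argument
library(tidyr)

df <- data.frame(
  Gene = c("A", "B", "C", "D", "E", "F"),
  G1 = c("GH13_22+CBM4", "GH109+PL7+GH9", "GT57", "AA3", "", ""),
  G2 = c("GH13_22", "", "GT57+GH15", "AA3", "GT41", "PL+PL2"),
  G3 = c("GH13", "GH1O9", "", "CBM34+GH13+CBM48", "GT41", "GH16+CBM4+CBM54+CBM32")
)

res <- df %>%
  pivot_longer(-Gene) %>%                       # long format: Gene, name, value
  filter(nzchar(value)) %>%                     # drop empty cells
  separate_rows(value, sep = "\\+") %>%         # one row per "+"-separated piece
  mutate(Id = row_number(), .by = c(Gene, name)) %>%
  pivot_wider() %>%                             # back to G1, G2, G3 columns
  select(-Id) %>%                               # drop the helper index
  mutate(across(G1:G3, ~ replace_na(.x, "")))   # NA -> "" as in the question
```

Unlike the first answer, this variant does not fill values down within a gene, so cells with no corresponding piece stay empty.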
