Adding data to new columns when values are separated by a "+" sign, using R
Question
Following on from a previous question, I now have extra information in my data: I have included the gene alongside the results. Because the same gene was predicted as several different enzymes, the predictions were joined with a "+" sign, but now I would like to split them as shown below.
My data frame looks like this:
df <- data.frame(Gene = c("A", "B", "C", "D", "E", "F"),
                 G1 = c("GH13_22+CBM4", "GH109+PL7+GH9", "GT57", "AA3", "", ""),
                 G2 = c("GH13_22", "", "GT57+GH15", "AA3", "GT41", "PL+PL2"),
                 G3 = c("GH13", "GH1O9", "", "CBM34+GH13+CBM48", "GT41", "GH16+CBM4+CBM54+CBM32"))
and the desired output looks like this:
df2 <- data.frame(Gene = c("A", "A", "B", "B", "B", "C", "C", "D", "D", "D", "E", "F", "F", "F", "F"),
                  G1 = c("GH13_22", "CBM4", "GH109", "PL7", "GH9", "GT57", "GT57", "AA3", "AA3", "AA3", "", "", "", "", ""),
                  G2 = c("GH13_22", "GH13_22", "", "", "", "GT57", "GH15", "AA3", "AA3", "AA3", "GT41", "PL", "PL2", "", ""),
                  G3 = c("GH13", "", "GH1O9", "GH1O9", "GH1O9", "", "", "CBM34", "GH13", "CBM48", "GT41", "GH16", "CBM4", "CBM54", "CBM32"))
Kindly help
Answer 1
Score: 1
It was harder than I thought, but here's a way.
The main idea is to use the function str_split_fixed to split each string into a fixed number of pieces, padding with "" when the input has fewer pieces than requested. Note: I selected 4 here, but you can choose a much higher upper bound to accommodate longer strings.
library(stringr)
df[-1] <- lapply(df[-1], \(x) asplit(str_split_fixed(x, "\\+", 4), 1))
# Gene G1 G2 G3
#1 A GH13_22, CBM4, , GH13_22, , , GH13, , ,
#2 B GH109, PL7, GH9, , , , GH1O9, , ,
#3 C GT57, , , GT57, GH15, , , , ,
#4 D AA3, , , AA3, , , CBM34, GH13, CBM48,
#5 E , , , GT41, , , GT41, , ,
#6 F , , , PL, PL2, , GH16, CBM4, CBM54, CBM32
This results in a data.frame in which G1:G3 are list columns: each element holds the 4 pieces taken from one row of the split matrix. The remaining code then unnests these elements into multiple rows in long format, replaces empty strings with NA, removes rows that contain only NAs, and finally fills the remaining missing values downwards within each gene:
library(dplyr)
library(tidyr)
unnest_longer(df, col = G1:G3) %>%
mutate(across(G1:G3, ~ na_if(.x, ""))) %>%
filter(if_any(G1:G3, complete.cases)) %>%
group_by(Gene) %>%
fill(G1:G3)
Gene G1 G2 G3
1 A GH13_22 GH13_22 GH13
2 A CBM4 GH13_22 GH13
3 B GH109 <NA> GH1O9
4 B PL7 <NA> GH1O9
5 B GH9 <NA> GH1O9
6 C GT57 GT57 <NA>
7 C GT57 GH15 <NA>
8 D AA3 AA3 CBM34
9 D AA3 AA3 GH13
10 D AA3 AA3 CBM48
11 E <NA> GT41 GT41
12 F <NA> PL GH16
13 F <NA> PL2 CBM4
14 F <NA> PL2 CBM54
15 F <NA> PL2 CBM32
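Rather than hard-coding 4 in the split step, the upper bound could be derived from the data. The following is only a sketch under that assumption; `n_max` is a hypothetical name that does not appear in the original answer, and the input is the data frame from the question:

```r
library(stringr)

# Same input as in the question.
df <- data.frame(
  Gene = c("A", "B", "C", "D", "E", "F"),
  G1 = c("GH13_22+CBM4", "GH109+PL7+GH9", "GT57", "AA3", "", ""),
  G2 = c("GH13_22", "", "GT57+GH15", "AA3", "GT41", "PL+PL2"),
  G3 = c("GH13", "GH1O9", "", "CBM34+GH13+CBM48", "GT41", "GH16+CBM4+CBM54+CBM32")
)

# A cell with k "+" signs contains k + 1 pieces, so the widest cell
# across G1:G3 determines how many pieces str_split_fixed must return.
# n_max is a hypothetical helper name, not part of the original answer.
n_max <- max(sapply(df[-1], function(x) max(str_count(x, fixed("+"))))) + 1

# Splitting a single cell pads the result with "" up to n_max pieces.
str_split_fixed("GH13_22+CBM4", "\\+", n_max)
```

The value of `n_max` can then be passed to str_split_fixed in place of the literal 4, so the split step keeps working if longer "+" chains appear later.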
Answer 2
Score: 1
You could also do:
library(dplyr) # version >= 1.1.0, needed for the .by argument
library(tidyr) # pivot_longer, separate_rows, pivot_wider
df %>%
pivot_longer(-Gene)%>%
filter(nzchar(value)) %>%
separate_rows(value, sep = '\\+') %>%
mutate(Id = row_number(), .by = c(Gene, name))%>%
pivot_wider()
# A tibble: 15 × 5
Gene Id G1 G2 G3
<chr> <int> <chr> <chr> <chr>
1 A 1 GH13_22 GH13_22 GH13
2 A 2 CBM4 NA NA
3 B 1 GH109 NA GH1O9
4 B 2 PL7 NA NA
5 B 3 GH9 NA NA
6 C 1 GT57 GT57 NA
7 C 2 NA GH15 NA
8 D 1 AA3 AA3 CBM34
9 D 2 NA NA GH13
10 D 3 NA NA CBM48
11 E 1 NA GT41 GT41
12 F 1 NA PL GH16
13 F 2 NA PL2 CBM4
14 F 3 NA NA CBM54
15 F 4 NA NA CBM32
You can drop the Id column by appending %>% select(-Id).
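The result above leaves NA where the question's desired output repeats the previous value for the same gene. One possible way to close that gap, sketched here as an assumption rather than as part of the original answer, is to fill downwards within each gene after widening. This version uses group_by() instead of `.by`, so it should also run on dplyr versions older than 1.1.0; `res` is a hypothetical name:

```r
library(dplyr)
library(tidyr)

# Same input as in the question.
df <- data.frame(
  Gene = c("A", "B", "C", "D", "E", "F"),
  G1 = c("GH13_22+CBM4", "GH109+PL7+GH9", "GT57", "AA3", "", ""),
  G2 = c("GH13_22", "", "GT57+GH15", "AA3", "GT41", "PL+PL2"),
  G3 = c("GH13", "GH1O9", "", "CBM34+GH13+CBM48", "GT41", "GH16+CBM4+CBM54+CBM32")
)

res <- df %>%
  pivot_longer(-Gene) %>%                 # one row per Gene/column pair
  filter(nzchar(value)) %>%               # drop empty cells
  separate_rows(value, sep = "\\+") %>%   # one row per "+"-separated piece
  group_by(Gene, name) %>%
  mutate(Id = row_number()) %>%           # position of the piece within its cell
  ungroup() %>%
  pivot_wider() %>%                       # back to G1, G2, G3 columns
  group_by(Gene) %>%
  fill(G1:G3) %>%                         # carry the last value down, per gene only
  ungroup() %>%
  select(-Id)

res
```

Grouping by Gene before fill() keeps values from leaking from one gene into the next; columns that are empty for a whole gene (for example G1 for gene E) stay NA.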