Creating a new row for each item in a data frame when items are separated by a special character such as the "+" sign in R
Question
I have data in a text file with several columns, and I would like to process it without losing any information. Some columns contain two or more items separated by a special character such as the "+" (plus) sign. I would like to put each of these combined items in a separate row within the same column. For example, I have pasted sample data below.
My data frame looks like this:
df <- data.frame(G1 = c("GH13_22+CBM4", "GH109+PL7+GH9", "GT57", "AA3", "", ""),
                 G2 = c("GH13_22", "", "GT57+GH15", "AA3", "GT41", "PL+PL2"),
                 G3 = c("GH13", "GH1O9", "", "CBM34+GH13+CBM48", "GT41", "GH16+CBM4+CBM54+CBM32"))

             G1        G2                    G3
1  GH13_22+CBM4   GH13_22                  GH13
2 GH109+PL7+GH9                           GH1O9
3          GT57 GT57+GH15
4           AA3       AA3      CBM34+GH13+CBM48
5                    GT41                  GT41
6                  PL+PL2 GH16+CBM4+CBM54+CBM32
The expected result should look like this:
df2 <- data.frame(G1 = c("GH13_22", "CBM4", "GH109", "PL7", "GH9", "GT57", "AA3", "", "", "", ""),
                  G2 = c("GH13_22", "", "GT57", "GH15", "AA3", "GT41", "PL", "PL2", "", "", ""),
                  G3 = c("GH13", "GH1O9", "", "CBM34", "GH13", "CBM48", "GT41", "GH16", "CBM4", "CBM54", "CBM32"))

        G1      G2    G3
1  GH13_22 GH13_22  GH13
2     CBM4         GH1O9
3    GH109    GT57
4      PL7    GH15 CBM34
5      GH9     AA3  GH13
6     GT57    GT41 CBM48
7      AA3      PL  GT41
8              PL2  GH16
9                   CBM4
10                 CBM54
11                 CBM32
Any help is appreciated.
Thanks
Answer 1
Score: 2
A base R solution:
split <- lapply(df, \(x) unlist(strsplit(replace(x, x == '', NA_character_), '\\+')))
as.data.frame(lapply(split, `[`, 1:max(lengths(split))))
        G1      G2    G3
1  GH13_22 GH13_22  GH13
2     CBM4    <NA> GH1O9
3    GH109    GT57  <NA>
4      PL7    GH15 CBM34
5      GH9     AA3  GH13
6     GT57    GT41 CBM48
7      AA3      PL  GT41
8     <NA>     PL2  GH16
9     <NA>    <NA>  CBM4
10    <NA>    <NA> CBM54
11    <NA>    <NA> CBM32
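The question's expected result uses empty strings where this answer prints NA. The following is a minimal sketch of the same base R approach with one extra step that turns the padding back into "". It assumes R 4.0 or later, so that character columns are not converted to factors; the names `parts`, `n` and `out` are illustrative only.

```r
df <- data.frame(
  G1 = c("GH13_22+CBM4", "GH109+PL7+GH9", "GT57", "AA3", "", ""),
  G2 = c("GH13_22", "", "GT57+GH15", "AA3", "GT41", "PL+PL2"),
  G3 = c("GH13", "GH1O9", "", "CBM34+GH13+CBM48", "GT41", "GH16+CBM4+CBM54+CBM32")
)

# strsplit("") returns character(0), which would silently drop blank cells,
# so blanks are turned into NA first; strsplit(NA) keeps a single NA.
parts <- lapply(df, function(x) {
  unlist(strsplit(replace(x, x == "", NA_character_), "+", fixed = TRUE))
})

# Indexing past the end of a vector yields NA, which pads every column
# to the length of the longest one.
n <- max(lengths(parts))
out <- as.data.frame(lapply(parts, `[`, seq_len(n)))

# Assumption: empty strings are wanted instead of NA, as in the question.
out[is.na(out)] <- ""
out
```

Using `fixed = TRUE` treats "+" as a literal character, so the regular-expression escape `'\\+'` is not needed.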
Answer 2
Score: 1
separate_rows() has been superseded in favour of separate_longer_delim() because the latter has a more consistent API with the other separate functions. Superseded functions will not go away, but will only receive critical bug fixes. <https://tidyr.tidyverse.org/reference/separate_rows.html>
- We bring the data into long format.
- We replace blanks with NA using na_if() from dplyr.
- With the line summarise(cur_data()[seq(max(id)), ]) we expand each group to the max of id.
- Finally, we pivot the prepared data frame back to wide format:
library(dplyr)
library(tidyr)

df %>%
  pivot_longer(everything()) %>%
  separate_longer_delim(value, "+") %>%
  mutate(value = na_if(value, "")) %>%
  group_by(name) %>%
  mutate(id = row_number()) %>%
  summarise(cur_data()[seq(max(id)), ]) %>%
  pivot_wider(names_from = name, values_from = value)

      id G1      G2      G3
   <int> <chr>   <chr>   <chr>
 1     1 GH13_22 GH13_22 GH13
 2     2 CBM4    NA      GH1O9
 3     3 GH109   GT57    NA
 4     4 PL7     GH15    CBM34
 5     5 GH9     AA3     GH13
 6     6 GT57    GT41    CBM48
 7     7 AA3     PL      GT41
 8     8 NA      PL2     GH16
 9     9 NA      NA      CBM4
10    10 NA      NA      CBM54
11    11 NA      NA      CBM32
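A possible variant of the pipeline above, sketched under the assumption that dplyr 1.1.0 or later and tidyr 1.3.0 or later are installed. In those dplyr versions cur_data() is deprecated, and so is returning several rows per group from summarise(). Because id is row_number() within each group, seq(max(id)) appears to select every row of the group, so this sketch drops that step and relies on pivot_wider() filling missing id/name combinations with NA. The name `out` is illustrative only.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  G1 = c("GH13_22+CBM4", "GH109+PL7+GH9", "GT57", "AA3", "", ""),
  G2 = c("GH13_22", "", "GT57+GH15", "AA3", "GT41", "PL+PL2"),
  G3 = c("GH13", "GH1O9", "", "CBM34+GH13+CBM48", "GT41", "GH16+CBM4+CBM54+CBM32")
)

out <- df %>%
  pivot_longer(everything()) %>%                 # long format: columns name, value
  separate_longer_delim(value, delim = "+") %>%  # one row per "+"-separated item
  mutate(value = na_if(value, "")) %>%           # blanks become NA
  group_by(name) %>%
  mutate(id = row_number()) %>%                  # position of each item within its column
  ungroup() %>%
  pivot_wider(names_from = name, values_from = value)  # missing cells are filled with NA

out
```

If empty strings are preferred over NA, `values_fill = ""` can be passed to pivot_wider() and the na_if() step left out.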
Answer 3
Score: 1
Another option, inspired by @Peter M in this post:
library(tidyverse)  # attaches dplyr and stringr

# Find the longest vector and pad the other vectors with NA to match
makePaddedDataFrame <- function(l) {
  maxlen <- max(lengths(l))
  data.frame(lapply(l, \(x) x[seq_len(maxlen)]))  # indexing past the end yields NA
}

df %>%
  mutate(across(everything(), \(x) str_split(x, pattern = "\\+"))) %>%  # each cell becomes a character vector
  lapply(\(x) do.call(c, x)) %>%                                        # concatenate the pieces per column
  makePaddedDataFrame() %>%
  replace(is.na(.), "")  # if you want empty strings instead of NA

        G1      G2    G3
1  GH13_22 GH13_22  GH13
2     CBM4         GH1O9
3    GH109    GT57
4      PL7    GH15 CBM34
5      GH9     AA3  GH13
6     GT57    GT41 CBM48
7      AA3      PL  GT41
8              PL2  GH16
9                   CBM4
10                 CBM54
11                 CBM32