How to replace NA values in a data frame with the adjacent column's value plus additional text to differentiate them in R
Question
I'm curious whether I can replace the NA values in my data frame with the text from the nearest column to the left that is not NA, with "_unclassified" appended to the end.
Here is an example data frame:
feature <- c("1",
"2",
"3",
"4",
"5" )
phylum <- c("Firmicutes",
"Firmicutes",
"Firmicutes",
"Proteobacteria",
"Firmicutes" )
class <- c(NA,
"Clostridia",
"Clostridia",
"Gammaproteobacteria",
"Bacilli" )
order <- c(NA,
NA,
"Oscillospirales",
"Enterobacterales",
"Staphylococcales" )
family <- c(NA,
NA,
NA,
"Enterobacteriaceae",
"Staphylococcaceae" )
genus <- c(NA,
NA,
NA,
NA,
"Staphylococcus")
df <- data.frame(feature, phylum, class, order, family, genus)
For example,
feature 1 would have Firmicutes_unclassified across class, order, family, genus
feature 2 would have Clostridia_unclassified across order, family, and genus
feature 3 would have Oscillospirales_unclassified across family and genus
feature 4 would have Enterobacteriaceae_unclassified for genus
Answer 1
Score: 1
An option with na.locf0 from zoo:
library(zoo)
df[-1] <- t(apply(df[-1], 1, \(x) ifelse(is.na(x), paste0(na.locf0(x),
'_unclassified'), x)))
-output
> df
feature phylum class order family
1 1 Firmicutes Firmicutes_unclassified Firmicutes_unclassified Firmicutes_unclassified
2 2 Firmicutes Clostridia Clostridia_unclassified Clostridia_unclassified
3 3 Firmicutes Clostridia Oscillospirales Oscillospirales_unclassified
4 4 Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae
5 5 Firmicutes Bacilli Staphylococcales Staphylococcaceae
genus
1 Firmicutes_unclassified
2 Clostridia_unclassified
3 Oscillospirales_unclassified
4 Enterobacteriaceae_unclassified
5 Staphylococcus
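To see what happens inside the row-wise function, here is a small sketch on a single row. The helper fill_forward is hypothetical and not part of the answer: it is a base R stand-in for zoo's na.locf0, so the example runs without extra packages. The row is the one for feature 2 from the question.

```r
# Hypothetical helper: carry the last non-NA value forward (left to right).
# It stands in for zoo::na.locf0 so this sketch needs no extra packages.
fill_forward <- function(x) {
  # Position of the most recent non-NA element seen so far (0 if none yet)
  idx <- cummax(ifelse(is.na(x), 0L, seq_along(x)))
  # Leading NAs have nothing to carry forward, so they stay NA
  idx[idx == 0L] <- NA
  x[idx]
}

# Row for feature 2, without the feature column
x <- c(phylum = "Firmicutes", class = "Clostridia",
       order = NA, family = NA, genus = NA)

filled <- fill_forward(x)

# Same ifelse() logic as in the answer: only the NA positions get the suffix
result <- ifelse(is.na(x), paste0(filled, "_unclassified"), x)
print(result)
```

Because the values are carried forward position by position, each NA takes the nearest non-NA value to its left rather than the last known value in the whole row.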
Answer 2
Score: 1
One-liner, using just base R.
df[-1] <- t(apply(df[-1], MARGIN=1, \(x) replace(x, is.na(x), paste0(tail(na.omit(x), n=1), '_unclassified'))))
df
# feature phylum class order family genus
# 1 1 Firmicutes Firmicutes_unclassified Firmicutes_unclassified Firmicutes_unclassified Firmicutes_unclassified
# 2 2 Firmicutes Clostridia Clostridia_unclassified Clostridia_unclassified Clostridia_unclassified
# 3 3 Firmicutes Clostridia Oscillospirales Oscillospirales_unclassified Oscillospirales_unclassified
# 4 4 Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Enterobacteriaceae_unclassified
# 5 5 Firmicutes Bacilli Staphylococcales Staphylococcaceae Staphylococcus
Explanation:
We apply an anonymous function \(x) row-wise (MARGIN=1) to the data frame, excluding the first column via df[-1]. Within each row x, replace substitutes the elements where is.na(x) is TRUE with the tail of length n=1 of na.omit(x) (i.e. the last element of x once the NAs are dropped), with the suffix '_unclassified' appended by paste0. Because apply returns the per-row results as columns, t() transposes the matrix back before it is assigned to df[-1].
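The inner function can be tried on a single row to see each step. The sketch below uses the row for feature 2 from the question, followed by a row with an NA between two known ranks. That second row is an assumption for illustration and does not occur in the question's data.

```r
# Row for feature 2, without the feature column
x <- c("Firmicutes", "Clostridia", NA, NA, NA)

# Last non-NA value in the row
last_known <- tail(na.omit(x), n = 1)

# Every NA in the row is replaced by that one value plus the suffix
row_result <- replace(x, is.na(x), paste0(last_known, "_unclassified"))
print(row_result)

# Assumed edge case (not in the question's data): an NA between two known ranks
y <- c("Firmicutes", NA, "Oscillospirales", NA)
gap_result <- replace(y, is.na(y),
                      paste0(tail(na.omit(y), n = 1), "_unclassified"))
print(gap_result)
```

In the second row both NAs become Oscillospirales_unclassified, including the one to the left of Oscillospirales, because tail(na.omit(y), n = 1) looks at the whole row. For data like the question's, where NAs only appear after the last known rank, this makes no difference.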
Answer 3
Score: 0
Here is one potential solution which transposes the dataframe, fills the NAs with the most recent non-NA value, then transposes the dataframe back again:
library(tidyverse)
library(vctrs)
feature <- c("1", "2", "3", "4", "5" )
phylum <- c("Firmicutes", "Firmicutes", "Firmicutes", "Proteobacteria", "Firmicutes" )
class <- c(NA, "Clostridia", "Clostridia", "Gammaproteobacteria", "Bacilli" )
order <- c(NA, NA, "Oscillospirales", "Enterobacterales", "Staphylococcales" )
family <- c(NA, NA, NA, "Enterobacteriaceae", "Staphylococcaceae")
genus <- c(NA, NA, NA, NA, "Staphylococcus")
df <- data.frame(feature, phylum, class, order, family, genus)
df2 <- df %>%
t() %>%
as_tibble(.name_repair = ~vec_as_names(..., repair = "unique", quiet = TRUE)) %>%
mutate(across(everything(), ~if_else(
is.na(.x),
paste0(vec_fill_missing(.x, direction = "down"), "_unclassified"),
.x))) %>%
t() %>%
as_tibble(.name_repair = ~colnames(df))
df2
#> # A tibble: 5 × 6
#> feature phylum class order family genus
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 Firmicutes Firmicutes_unclassified Firmicutes_unclas… Firmi… Firm…
#> 2 2 Firmicutes Clostridia Clostridia_unclas… Clost… Clos…
#> 3 3 Firmicutes Clostridia Oscillospirales Oscil… Osci…
#> 4 4 Proteobacteria Gammaproteobacteria Enterobacterales Enter… Ente…
#> 5 5 Firmicutes Bacilli Staphylococcales Staph… Stap…
Created on 2023-02-10 with reprex v2.0.2
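For comparison, the same result can be reached without transposing by walking the rank columns from left to right. This is only a sketch in base R, not taken from any of the answers. It assumes the columns are ordered from highest to lowest rank and that the first rank column (phylum) has no NA; the variable name ranks is introduced here for illustration.

```r
# Sample data from the question
df <- data.frame(
  feature = c("1", "2", "3", "4", "5"),
  phylum  = c("Firmicutes", "Firmicutes", "Firmicutes", "Proteobacteria", "Firmicutes"),
  class   = c(NA, "Clostridia", "Clostridia", "Gammaproteobacteria", "Bacilli"),
  order   = c(NA, NA, "Oscillospirales", "Enterobacterales", "Staphylococcales"),
  family  = c(NA, NA, NA, "Enterobacteriaceae", "Staphylococcaceae"),
  genus   = c(NA, NA, NA, NA, "Staphylococcus"),
  stringsAsFactors = FALSE
)

# Rank columns, ordered from highest to lowest rank (assumption)
ranks <- c("phylum", "class", "order", "family", "genus")

for (i in seq_along(ranks)[-1]) {
  prev <- df[[ranks[i - 1]]]
  miss <- is.na(df[[ranks[i]]])
  # If the column to the left already carries the suffix, copy it as-is;
  # otherwise append the suffix to the known name
  df[[ranks[i]]][miss] <- ifelse(
    endsWith(prev[miss], "_unclassified"),
    prev[miss],
    paste0(prev[miss], "_unclassified")
  )
}

print(df)
```

Working column by column keeps df a data frame throughout, so there is no round trip through a character matrix.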