How to replace NA values in a data frame with the adjacent column's value plus additional text to differentiate them in R

Question


I'm curious if I could replace NA values in my data frame with text from the column to the left (that does not have NA), with an additional "_unclassified" text on the end.

Here is an example data frame:

feature <- c("1", "2", "3", "4", "5")
phylum <- c("Firmicutes", "Firmicutes", "Firmicutes", "Proteobacteria", "Firmicutes")
class <- c(NA, "Clostridia", "Clostridia", "Gammaproteobacteria", "Bacilli")
order <- c(NA, NA, "Oscillospirales", "Enterobacterales", "Staphylococcales")
family <- c(NA, NA, NA, "Enterobacteriaceae", "Staphylococcaceae")
genus <- c(NA, NA, NA, NA, "Staphylococcus")

df <- data.frame(feature, phylum, class, order, family, genus)

For example,

feature 1 would have Firmicutes_unclassified across class, order, family, genus

feature 2 would have Clostridia_unclassified across order, family, and genus

feature 3 would have Oscillospirales_unclassified across family and genus

feature 4 would have Enterobacteriaceae_unclassified for genus

Answer 1

Score: 1


An option with na.locf0 from zoo (the \(x) lambda shorthand requires R 4.1 or later):

library(zoo)
df[-1] <- t(apply(df[-1], 1, \(x) ifelse(is.na(x), paste0(na.locf0(x), '_unclassified'), x)))

Output:

> df
  feature         phylum                   class                   order                       family
1       1     Firmicutes Firmicutes_unclassified Firmicutes_unclassified      Firmicutes_unclassified
2       2     Firmicutes              Clostridia Clostridia_unclassified      Clostridia_unclassified
3       3     Firmicutes              Clostridia         Oscillospirales Oscillospirales_unclassified
4       4 Proteobacteria     Gammaproteobacteria        Enterobacterales           Enterobacteriaceae
5       5     Firmicutes                 Bacilli        Staphylococcales            Staphylococcaceae
                            genus
1         Firmicutes_unclassified
2         Clostridia_unclassified
3    Oscillospirales_unclassified
4 Enterobacteriaceae_unclassified
5                  Staphylococcus
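
To see what the row-wise function does, here is a minimal sketch on row 2 of the example data, using base R only. The `locf` helper is hypothetical: it is not part of zoo and only approximates what `na.locf0` does for this illustration.

```r
# Hypothetical base-R stand-in for zoo::na.locf0 (an assumption for illustration):
# carry the last non-NA value forward, leaving any leading NAs as NA.
locf <- function(x) {
  x <- unname(x)
  idx <- cumsum(!is.na(x))        # position of the most recent non-NA value (0 if none yet)
  c(NA, x[!is.na(x)])[idx + 1]    # index 1 is the NA used for leading gaps
}

# Row 2 of the example data, without the feature column.
row2 <- c(phylum = "Firmicutes", class = "Clostridia",
          order = NA, family = NA, genus = NA)

# Same logic as the answer: where the value is NA, use the carried-forward
# value plus the suffix; otherwise keep the original value.
filled <- ifelse(is.na(row2), paste0(locf(row2), "_unclassified"), row2)
print(filled)
```

Because the value is carried forward position by position, a row with a gap in the middle would be filled from the nearest non-NA value to its left.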

Answer 2

Score: 1


One-liner, using just base R.

df[-1] <- t(apply(df[-1], MARGIN=1, \(x) replace(x, is.na(x), paste0(tail(na.omit(x), n=1), '_unclassified'))))
df
#   feature         phylum                   class                   order                       family                           genus
# 1       1     Firmicutes Firmicutes_unclassified Firmicutes_unclassified      Firmicutes_unclassified         Firmicutes_unclassified
# 2       2     Firmicutes              Clostridia Clostridia_unclassified      Clostridia_unclassified         Clostridia_unclassified
# 3       3     Firmicutes              Clostridia         Oscillospirales Oscillospirales_unclassified    Oscillospirales_unclassified
# 4       4 Proteobacteria     Gammaproteobacteria        Enterobacterales           Enterobacteriaceae Enterobacteriaceae_unclassified
# 5       5     Firmicutes                 Bacilli        Staphylococcales            Staphylococcaceae                  Staphylococcus

Explanation:

We apply an anonymous function \(x) with MARGIN=1 (i.e. row-wise) to the data frame, excluding the first column via df[-1]. Within each row x, the elements where is.na(x) is TRUE are replaced by the tail of length n=1 of na.omit(x) (i.e. the last element of x once the NAs are dropped), with the suffix '_unclassified' paste0ed onto it. apply returns each processed row as a column, so t() transposes the result back before it is assigned to df[-1].
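
As a small sketch of how this behaves on single rows, the anonymous function is given the hypothetical name `fill_last` below. The second row is made up for illustration and does not occur in the example data: it has an NA in the middle, which shows that this approach always uses the last non-NA value of the whole row rather than the nearest one to the left.

```r
# The anonymous function from the answer, named here only for illustration.
fill_last <- function(x) {
  replace(x, is.na(x), paste0(tail(na.omit(x), n = 1), "_unclassified"))
}

# Row 3 of the example data (NAs only at the end of the row).
row3 <- c("Firmicutes", "Clostridia", "Oscillospirales", NA, NA)
res3 <- fill_last(row3)
print(res3)

# Hypothetical row with a gap in the middle (not part of the original data).
gap <- c("Firmicutes", NA, "Oscillospirales", NA, NA)
res_gap <- fill_last(gap)
print(res_gap)   # the gap at position 2 is filled from "Oscillospirales", not "Firmicutes"
```

For the data in the question, where NAs only ever trail the last known rank, both behaviours give the same result.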

Answer 3

Score: 0


Here is one potential solution which transposes the dataframe, fills the NAs with the most recent non-NA value, then transposes the dataframe back again:

library(tidyverse)
library(vctrs)

feature <- c("1", "2", "3", "4", "5")
phylum <- c("Firmicutes", "Firmicutes", "Firmicutes", "Proteobacteria", "Firmicutes")
class <- c(NA, "Clostridia", "Clostridia", "Gammaproteobacteria", "Bacilli")
order <- c(NA, NA, "Oscillospirales", "Enterobacterales", "Staphylococcales")
family <- c(NA, NA, NA, "Enterobacteriaceae", "Staphylococcaceae")
genus <- c(NA, NA, NA, NA, "Staphylococcus")

df <- data.frame(feature, phylum, class, order, family, genus)

df2 <- df %>%
  t() %>%
  as_tibble(.name_repair = ~vec_as_names(..., repair = "unique", quiet = TRUE)) %>%
  mutate(across(everything(), ~if_else(
    is.na(.x),
    paste0(vec_fill_missing(.x, direction = "down"), "_unclassified"),
    .x))) %>%
  t() %>%
  as_tibble(.name_repair = ~colnames(df))
df2
#> # A tibble: 5 × 6
#>   feature phylum         class                   order              family genus
#>   <chr>   <chr>          <chr>                   <chr>              <chr>  <chr>
#> 1 1       Firmicutes     Firmicutes_unclassified Firmicutes_unclas… Firmi… Firm…
#> 2 2       Firmicutes     Clostridia              Clostridia_unclas… Clost… Clos…
#> 3 3       Firmicutes     Clostridia              Oscillospirales    Oscil… Osci…
#> 4 4       Proteobacteria Gammaproteobacteria     Enterobacterales   Enter… Ente…
#> 5 5       Firmicutes     Bacilli                 Staphylococcales   Staph… Stap…

Created on 2023-02-10 with reprex v2.0.2
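
The transposition step can be illustrated with a reduced version of the data in base R. This is only a sketch; `df_small` is a hypothetical subset (features 1 to 3, ranks up to order) introduced for illustration. After `t()`, each feature becomes a column, so filling downwards within a column corresponds to filling to the right within a row of the original data frame.

```r
# Hypothetical subset of the example data, used only for illustration.
df_small <- data.frame(
  feature = c("1", "2", "3"),
  phylum  = c("Firmicutes", "Firmicutes", "Firmicutes"),
  class   = c(NA, "Clostridia", "Clostridia"),
  order   = c(NA, NA, "Oscillospirales")
)

m <- t(df_small)   # character matrix: original columns become rows
print(dim(m))      # 4 rows (original columns) by 3 columns (original rows)
print(m[, 1])      # feature 1 read top to bottom: "1", "Firmicutes", NA, NA
```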

huangapple
  • Posted on 2023-02-10 06:34:28
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75405150.html