Subset rows based on number of occurrence in specific column


Question


The values in the gs_name column repeat, so each value may correspond to one or more rows of the hallmark data frame. I want to keep only the rows whose gs_name corresponds to fewer than 25 rows or more than 500 rows.

for (i in hallmark$gs_name) {
  if (25 <= nrow(hallmark) >= 500) {
    subset.df <- hallmark
  }
}

Input:

> dput(hallmark[c(1:5,300:305),3:4])
structure(list(gs_name = c("adipogenesis", "adipogenesis", "adipogenesis", 
"adipogenesis", "adipogenesis", "bile_acid_metabolism", "bile_acid_metabolism", 
"bile_acid_metabolism", "bile_acid_metabolism", "bile_acid_metabolism", 
"bile_acid_metabolism"), gene_symbol = c("ABCA1", "ABCB8", "ACAA2", 
"ACADL", "ACADM", "HSD17B4", "HSD17B6", "HSD3B1", "HSD3B7", "IDH1", 
"IDH2")), row.names = c(NA, -11L), class = c("tbl_df", "tbl", 
"data.frame"))

Answer 1

Score: 1


An additional option, using data.table:

library(data.table)
# Keep every gene set that has at most 25 or at least 500 rows
setDT(hallmark)[, .SD[.N <= 25 | .N >= 500], by = gs_name]
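For comparison, the same per-group size filter can be sketched in base R without extra packages. This is a minimal sketch rather than part of the original answer: `group_size` is an illustrative name, and the small `hallmark` data frame is rebuilt here from the `dput()` output in the question as a plain data.frame. It also hints at why the loop in the question cannot work: `nrow(hallmark)` counts the whole data frame rather than one gene set, and R does not accept chained comparisons such as `25 <= x >= 500`.

```r
# Toy data rebuilt from the question's dput() output (plain data.frame for simplicity)
hallmark <- data.frame(
  gs_name = rep(c("adipogenesis", "bile_acid_metabolism"), times = c(5, 6)),
  gene_symbol = c("ABCA1", "ABCB8", "ACAA2", "ACADL", "ACADM",
                  "HSD17B4", "HSD17B6", "HSD3B1", "HSD3B7", "IDH1", "IDH2"),
  stringsAsFactors = FALSE
)

# For every row, the number of rows that share its gs_name
group_size <- ave(seq_along(hallmark$gs_name), hallmark$gs_name, FUN = length)

# Keep rows whose gene set has at most 25 or at least 500 rows
subset.df <- hallmark[group_size <= 25 | group_size >= 500, ]
```

Because `group_size` is aligned row by row with `hallmark`, the two conditions are written as separate comparisons joined with `|`, which is the same logic the data.table call expresses with `.N`.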

Answer 2

Score: 0


The easiest way is to group your data by gs_name and use the dplyr function n(), which returns the number of rows in the current group, in both conditions:

library(dplyr)

hallmark <- structure(list(gs_name = c("adipogenesis", "adipogenesis", "adipogenesis", 
                           "adipogenesis", "adipogenesis", "bile_acid_metabolism", "bile_acid_metabolism", 
                           "bile_acid_metabolism", "bile_acid_metabolism", "bile_acid_metabolism", 
                           "bile_acid_metabolism"), gene_symbol = c("ABCA1", "ABCB8", "ACAA2", 
                           "ACADL", "ACADM", "HSD17B4", "HSD17B6", "HSD3B1", "HSD3B7", "IDH1", 
                           "IDH2")), row.names = c(NA, -11L), class = c("tbl_df", "tbl", 
                           "data.frame"))

hallmark %>%
  group_by(gs_name) %>%
  filter(n() <= 25 | n() >= 500) %>%
  ungroup()
#> # A tibble: 11 × 2
#>    gs_name              gene_symbol
#>    <chr>                <chr>      
#>  1 adipogenesis         ABCA1      
#>  2 adipogenesis         ABCB8      
#>  3 adipogenesis         ACAA2      
#>  4 adipogenesis         ACADL      
#>  5 adipogenesis         ACADM      
#>  6 bile_acid_metabolism HSD17B4    
#>  7 bile_acid_metabolism HSD17B6    
#>  8 bile_acid_metabolism HSD3B1     
#>  9 bile_acid_metabolism HSD3B7     
#> 10 bile_acid_metabolism IDH1       
#> 11 bile_acid_metabolism IDH2

Created on 2023-07-18 with reprex v2.0.2
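The sample data keeps all 11 rows because both gene sets have at most 25 rows, so the output does not show the filter dropping anything. A small sketch with synthetic data may make the behaviour easier to check. Everything in it is made up for illustration: the `toy` tibble, the set names `small_set`, `medium_set` and `large_set`, and the placeholder gene symbols are assumptions, not values from the real hallmark data.

```r
library(dplyr)

# Synthetic data (hypothetical names): gene sets with 10, 100 and 600 rows
toy <- tibble(
  gs_name = rep(c("small_set", "medium_set", "large_set"), times = c(10, 100, 600)),
  gene_symbol = paste0("GENE", seq_len(710))
)

# Same grouped filter as above: keep sets with at most 25 or at least 500 rows
kept <- toy %>%
  group_by(gs_name) %>%
  filter(n() <= 25 | n() >= 500) %>%
  ungroup()

# Rows remaining per gene set; medium_set (100 rows) should no longer appear
count(kept, gs_name)
```

Only `medium_set` falls strictly between the two thresholds, so it is the only group removed, and `kept` holds the 610 rows of the other two sets.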

Posted by huangapple on 2023-07-18 06:07:32
Source: https://go.coder-hub.com/76708375.html