Subset rows based on the number of occurrences in a specific column
Question
The values in the gs_name column repeat, so each gs_name may correspond to one or more rows in the hallmark data frame. I want to keep only the rows whose gs_name occurs in fewer than 25 rows or in more than 500 rows. This is what I tried:
for (i in hallmark$gs_name) {
  if (25 <= nrow(hallmark) >= 500) {
    subset.df <- hallmark
  }
}
Input:
> dput(hallmark[c(1:5,300:305),3:4])
structure(list(gs_name = c("adipogenesis", "adipogenesis", "adipogenesis", 
"adipogenesis", "adipogenesis", "bile_acid_metabolism", "bile_acid_metabolism", 
"bile_acid_metabolism", "bile_acid_metabolism", "bile_acid_metabolism", 
"bile_acid_metabolism"), gene_symbol = c("ABCA1", "ABCB8", "ACAA2", 
"ACADL", "ACADM", "HSD17B4", "HSD17B6", "HSD3B1", "HSD3B7", "IDH1", 
"IDH2")), row.names = c(NA, -11L), class = c("tbl_df", "tbl", 
"data.frame"))
Answer 1
Score: 1
An additional option, using data.table:
library(data.table)
# .N is the number of rows in the current gs_name group; .SD holds that group's rows
setDT(hallmark)[, .SD[.N <= 25 | .N >= 500], by = gs_name]
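For comparison, the same per-group count filter can be sketched in base R, assuming no extra packages are wanted. The data frame `toy`, the vector `group_size`, and the small thresholds 2 and 7 are hypothetical stand-ins chosen to fit a tiny example; the question itself uses 25 and 500.

```r
# Toy data (hypothetical): group "a" has 2 rows, "b" has 4 rows, "c" has 7 rows
toy <- data.frame(
  gs_name     = rep(c("a", "b", "c"), times = c(2, 4, 7)),
  gene_symbol = paste0("G", 1:13)
)

# ave() applies length() within each gs_name group and returns one value per
# row, so group_size[i] is the number of rows that share row i's gs_name
group_size <- ave(seq_along(toy$gs_name), toy$gs_name, FUN = length)

# Keep rows from small or large groups; thresholds are scaled down for the toy data
kept <- toy[group_size <= 2 | group_size >= 7, ]
table(kept$gs_name)
```

Here the rows of group "b" are dropped because its size (4) lies between the two thresholds.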
Answer 2
Score: 0
The easiest way is to group your data by gs_name and use the dplyr function n(), which returns the number of rows in the current group, inside filter() for your two conditions:
library(dplyr)
hallmark <- structure(list(gs_name = c("adipogenesis", "adipogenesis", "adipogenesis", 
                           "adipogenesis", "adipogenesis", "bile_acid_metabolism", "bile_acid_metabolism", 
                           "bile_acid_metabolism", "bile_acid_metabolism", "bile_acid_metabolism", 
                           "bile_acid_metabolism"), gene_symbol = c("ABCA1", "ABCB8", "ACAA2", 
                                                                    "ACADL", "ACADM", "HSD17B4", "HSD17B6", "HSD3B1", "HSD3B7", "IDH1", 
                                                                    "IDH2")), row.names = c(NA, -11L), class = c("tbl_df", "tbl", 
                                                                                                                 "data.frame"))
hallmark %>%
  group_by(gs_name) %>%
  filter(n() <= 25 | n() >= 500) %>%
  ungroup()
#> # A tibble: 11 × 2
#>    gs_name              gene_symbol
#>    <chr>                <chr>      
#>  1 adipogenesis         ABCA1      
#>  2 adipogenesis         ABCB8      
#>  3 adipogenesis         ACAA2      
#>  4 adipogenesis         ACADL      
#>  5 adipogenesis         ACADM      
#>  6 bile_acid_metabolism HSD17B4    
#>  7 bile_acid_metabolism HSD17B6    
#>  8 bile_acid_metabolism HSD3B1     
#>  9 bile_acid_metabolism HSD3B7     
#> 10 bile_acid_metabolism IDH1       
#> 11 bile_acid_metabolism IDH2
<sup>Created on 2023-07-18 with reprex v2.0.2</sup>
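Note that the filter above treats the thresholds as inclusive (`<=` and `>=`), while the question's wording ("less than 25", "more than 500") could also be read as strict. The sketch below shows the difference on a toy tibble. The names `toy`, `lower`, `upper`, and `n_rows` are hypothetical, and the thresholds are scaled down so that they sit exactly on two group sizes.

```r
library(dplyr)

# Toy data (hypothetical): group "a" has 2 rows, "b" has 4 rows, "c" has 7 rows
toy <- tibble(
  gs_name     = rep(c("a", "b", "c"), times = c(2, 4, 7)),
  gene_symbol = paste0("G", 1:13)
)

# Stand-ins for the 25 and 500 used in the question
lower <- 2
upper <- 7

# add_count() appends each row's group size as a column without collapsing rows
counted <- toy %>% add_count(gs_name, name = "n_rows")

# Inclusive reading: groups with exactly `lower` or `upper` rows are kept
inclusive <- counted %>% filter(n_rows <= lower | n_rows >= upper)

# Strict reading: groups sitting exactly on a threshold are dropped
strict <- counted %>% filter(n_rows < lower | n_rows > upper)

nrow(inclusive)
nrow(strict)
```

Because add_count() keeps the count as an ordinary column, no group_by()/ungroup() pair is needed, and the group sizes stay visible for checking.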