Subset rows based on the number of occurrences in a specific column
Question
The gs_name column contains repeated values, so each value may correspond to one or more rows of the hallmark data frame. I want to keep only the rows whose gs_name value occurs in fewer than 25 rows or in more than 500 rows.
for (i in hallmark$gs_name) {
  if (25 <= nrow(hallmark) >= 500) {
    subset.df <- hallmark
  }
}
Input:
> dput(hallmark[c(1:5,300:305),3:4])
structure(list(gs_name = c("adipogenesis", "adipogenesis", "adipogenesis",
"adipogenesis", "adipogenesis", "bile_acid_metabolism", "bile_acid_metabolism",
"bile_acid_metabolism", "bile_acid_metabolism", "bile_acid_metabolism",
"bile_acid_metabolism"), gene_symbol = c("ABCA1", "ABCB8", "ACAA2",
"ACADL", "ACADM", "HSD17B4", "HSD17B6", "HSD3B1", "HSD3B7", "IDH1",
"IDH2")), row.names = c(NA, -11L), class = c("tbl_df", "tbl",
"data.frame"))
Answer 1
Score: 1
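An additional option, using data.table:
library(data.table)
# setDT() converts hallmark to a data.table by reference;
# .N is the number of rows in each gs_name group, and .SD is that group's rows.
setDT(hallmark)[, .SD[.N <= 25 | .N >= 500], by = gs_name]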
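To see the group-size filter actually drop something, here is a minimal sketch on made-up data. The table `toy`, its values, and the thresholds `lower` and `upper` are assumptions for illustration only; the question itself uses 25 and 500. It uses the `if (...) .SD` form, which returns nothing for groups that fail the condition.

```r
library(data.table)

# Hypothetical toy data: group "a" has 2 rows, "b" has 4 rows, "c" has 6 rows.
toy <- data.table(
  gs_name     = rep(c("a", "b", "c"), times = c(2, 4, 6)),
  gene_symbol = paste0("G", 1:12)
)

# Illustrative thresholds standing in for 25 and 500.
lower <- 2
upper <- 6

# .N is the row count of the current gs_name group. Returning .SD only when
# the condition holds keeps whole groups; groups that fail return NULL and
# are dropped from the result.
kept <- toy[, if (.N <= lower | .N >= upper) .SD, by = gs_name]

unique(kept$gs_name)
```

Here only groups "a" and "c" survive. The comparisons are inclusive, matching the answer above; if the intent is strictly "fewer than" and "more than", `<` and `>` would be used instead.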
Answer 2
Score: 0
The easiest way is to group your data by gs_name and use the dplyr function n(), which returns the number of rows in the current group, for your two conditions:
library(dplyr)
hallmark <- structure(list(gs_name = c("adipogenesis", "adipogenesis", "adipogenesis",
"adipogenesis", "adipogenesis", "bile_acid_metabolism", "bile_acid_metabolism",
"bile_acid_metabolism", "bile_acid_metabolism", "bile_acid_metabolism",
"bile_acid_metabolism"), gene_symbol = c("ABCA1", "ABCB8", "ACAA2",
"ACADL", "ACADM", "HSD17B4", "HSD17B6", "HSD3B1", "HSD3B7", "IDH1",
"IDH2")), row.names = c(NA, -11L), class = c("tbl_df", "tbl",
"data.frame"))
hallmark %>%
  group_by(gs_name) %>%
  filter(n() <= 25 | n() >= 500) %>%
  ungroup()
#> # A tibble: 11 × 2
#>    gs_name              gene_symbol
#>    <chr>                <chr>
#>  1 adipogenesis         ABCA1
#>  2 adipogenesis         ABCB8
#>  3 adipogenesis         ACAA2
#>  4 adipogenesis         ACADL
#>  5 adipogenesis         ACADM
#>  6 bile_acid_metabolism HSD17B4
#>  7 bile_acid_metabolism HSD17B6
#>  8 bile_acid_metabolism HSD3B1
#>  9 bile_acid_metabolism HSD3B7
#> 10 bile_acid_metabolism IDH1
#> 11 bile_acid_metabolism IDH2
<sup>Created on 2023-07-18 with reprex v2.0.2</sup>
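For readers who would rather not load a package, the same filter can be sketched in base R. This is an alternative and not part of the answer above; the data frame `toy` and the thresholds `lower` and `upper` are made up for illustration, standing in for hallmark, 25 and 500.

```r
# Hypothetical toy data: group "a" has 2 rows, "b" has 4 rows, "c" has 6 rows.
toy <- data.frame(
  gs_name     = rep(c("a", "b", "c"), times = c(2, 4, 6)),
  gene_symbol = paste0("G", 1:12),
  stringsAsFactors = FALSE
)

# Illustrative thresholds standing in for 25 and 500.
lower <- 2
upper <- 6

# ave() applies length() within each gs_name group and returns the result
# aligned with the original rows, so every row carries its group's size.
group_size <- ave(seq_along(toy$gs_name), toy$gs_name, FUN = length)

# Keep rows whose group is small enough or large enough.
kept <- toy[group_size <= lower | group_size >= upper, ]

unique(kept$gs_name)
```

As in the dplyr version, the comparisons are inclusive; switch to `<` and `>` if groups of exactly 25 or 500 rows should be excluded.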