Subset rows based on the number of occurrences in a specific column


Question

The values in the gs_name column repeat, so each value may correspond to one or more rows in the hallmark data frame. I want to keep only the rows whose gs_name value occurs in fewer than 25 rows or in more than 500 rows.

    for (i in hallmark$gs_name) {
      if (25 <= nrow(hallmark) >= 500) {
        subset.df <- hallmark
      }
    }

Input:

    > dput(hallmark[c(1:5,300:305),3:4])
    structure(list(gs_name = c("adipogenesis", "adipogenesis", "adipogenesis",
    "adipogenesis", "adipogenesis", "bile_acid_metabolism", "bile_acid_metabolism",
    "bile_acid_metabolism", "bile_acid_metabolism", "bile_acid_metabolism",
    "bile_acid_metabolism"), gene_symbol = c("ABCA1", "ABCB8", "ACAA2",
    "ACADL", "ACADM", "HSD17B4", "HSD17B6", "HSD3B1", "HSD3B7", "IDH1",
    "IDH2")), row.names = c(NA, -11L), class = c("tbl_df", "tbl",
    "data.frame"))

Answer 1

Score: 1

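An additional option, using data.table:

    library(data.table)
    # setDT() converts hallmark to a data.table by reference.
    # .N is the row count of each gs_name group; .SD[TRUE] keeps the group's rows, .SD[FALSE] drops them.
    setDT(hallmark)[, .SD[.N <= 25 | .N >= 500], by = gs_name]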
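
For comparison, the same per-group count can be sketched in base R without extra packages. The toy data frame below is invented for illustration (the names hallmark_toy, group_size and subset_df, and the group sizes, are assumptions rather than the real hallmark data). It uses strict comparisons to match the "fewer than 25 or more than 500" wording of the question; switch to <= and >= for the inclusive version.

```r
# Toy stand-in for hallmark: three gene sets with 3, 30 and 600 rows (made-up data).
hallmark_toy <- data.frame(
  gs_name = rep(c("small_set", "medium_set", "large_set"), times = c(3, 30, 600)),
  gene_symbol = paste0("GENE", seq_len(633)),
  stringsAsFactors = FALSE
)

# ave() returns, for every row, the number of rows that share its gs_name.
group_size <- ave(seq_along(hallmark_toy$gs_name), hallmark_toy$gs_name, FUN = length)

# Keep rows whose gene set has fewer than 25 or more than 500 rows.
subset_df <- hallmark_toy[group_size < 25 | group_size > 500, ]

# Rows remaining per gene set; medium_set (30 rows) should be gone.
table(subset_df$gs_name)
```

Because the count is computed per row, the original row order is preserved and no grouping object needs to be undone afterwards.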

Answer 2

Score: 0

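The easiest way is to group your data by gs_name and use the dplyr function n(), which returns the size of the current group, for your two conditions. Note that <= and >= also keep groups of exactly 25 or 500 rows; use < and > if you want strictly fewer than 25 or more than 500: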

    library(dplyr)
    hallmark <- structure(list(gs_name = c("adipogenesis", "adipogenesis", "adipogenesis",
    "adipogenesis", "adipogenesis", "bile_acid_metabolism", "bile_acid_metabolism",
    "bile_acid_metabolism", "bile_acid_metabolism", "bile_acid_metabolism",
    "bile_acid_metabolism"), gene_symbol = c("ABCA1", "ABCB8", "ACAA2",
    "ACADL", "ACADM", "HSD17B4", "HSD17B6", "HSD3B1", "HSD3B7", "IDH1",
    "IDH2")), row.names = c(NA, -11L), class = c("tbl_df", "tbl",
    "data.frame"))
    hallmark %>%
      group_by(gs_name) %>%
      filter(n() <= 25 | n() >= 500) %>%
      ungroup()
    #> # A tibble: 11 × 2
    #>    gs_name              gene_symbol
    #>    <chr>                <chr>
    #>  1 adipogenesis         ABCA1
    #>  2 adipogenesis         ABCB8
    #>  3 adipogenesis         ACAA2
    #>  4 adipogenesis         ACADL
    #>  5 adipogenesis         ACADM
    #>  6 bile_acid_metabolism HSD17B4
    #>  7 bile_acid_metabolism HSD17B6
    #>  8 bile_acid_metabolism HSD3B1
    #>  9 bile_acid_metabolism HSD3B7
    #> 10 bile_acid_metabolism IDH1
    #> 11 bile_acid_metabolism IDH2

<sup>Created on 2023-07-18 with reprex v2.0.2</sup>
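
In the 11-row sample every group has fewer than 25 rows, so nothing is dropped and the filter's effect is not visible. A minimal sketch on made-up data shows a group being removed; the tibble hallmark_toy, its group names and its sizes are assumptions for illustration only, and the comparisons are strict to follow the wording of the question.

```r
library(dplyr)

# Made-up data: three gene sets with 3, 30 and 600 rows.
hallmark_toy <- tibble(
  gs_name = rep(c("small_set", "medium_set", "large_set"), times = c(3, 30, 600)),
  gene_symbol = paste0("GENE", seq_len(633))
)

# Keep only gene sets with fewer than 25 or more than 500 rows.
filtered <- hallmark_toy %>%
  group_by(gs_name) %>%
  filter(n() < 25 | n() > 500) %>%
  ungroup()

# Rows remaining per gene set; medium_set (30 rows) should be absent.
count(filtered, gs_name)
```

Calling ungroup() at the end returns a plain tibble, so later operations are not silently applied per gene set.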

huangapple
  • Posted on 2023-07-18 06:07:32
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76708375.html