Subset a data frame based on a specific number of unique values in a column in R

Question

I have a data frame:

    group_eg <- data.frame(
      CID = c(091, 091, 091, 091,
              101, 101, 101,
              102, 102, 102,
              103, 103,
              104, 104, 104),
      PID_A = c(1091, 1091, 1091, 1091,
                1101, 1101, 1101,
                1102, 1102, 1102,
                1103, 1103,
                1104, 1104, 1104),
      PID_B = c(2091, 2091, 2091, 2091,
                2101, 2101, 2101,
                2102, 2102, 2102,
                2103, 2103,
                2104, 2104, 2104),
      text = c("eg1", "eg2", "eg3", "eg3",
               "eg4", "eg5", "eg6",
               "eg7", "eg8", "eg9",
               "eg10", "eg11",
               "eg12", "eg13", "eg14")
    )

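I want to divide this data frame into a list of smaller data frames. Each smaller data frame should contain only the rows that belong to 2 unique CIDs, taken in ascending order. If the number of CIDs is not divisible by 2, the last data frame can contain the rows that belong to just 1 CID. Note that each unique CID has a different number of observations, so I'm a bit stuck.

Here are the example outputs: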

    output1 <- data.frame(
      CID = c(091, 091, 091, 091,
              101, 101, 101),
      PID_A = c(1091, 1091, 1091, 1091,
                1101, 1101, 1101),
      PID_B = c(2091, 2091, 2091, 2091,
                2101, 2101, 2101),
      text = c("eg1", "eg2", "eg3", "eg3",
               "eg4", "eg5", "eg6")
    )

    output2 <- data.frame(
      CID = c(102, 102, 102,
              103, 103),
      PID_A = c(1102, 1102, 1102,
                1103, 1103),
      PID_B = c(2102, 2102, 2102,
                2103, 2103),
      text = c("eg7", "eg8", "eg9",
               "eg10", "eg11")
    )

    output3 <- data.frame(
      CID = c(104, 104, 104),
      PID_A = c(1104, 1104, 1104),
      PID_B = c(2104, 2104, 2104),
      text = c("eg12", "eg13", "eg14")
    )

Does anyone know how to do this? Thank you!

Answer 1

Score: 3

Using `dplyr::consecutive_id` and integer division `%/%`, you could do:

    library(dplyr, warn.conflicts = FALSE)

    group_eg |>
      # consecutive_id() numbers each run of identical CIDs; (id + 1) %/% 2 pairs the runs up
      group_by(group = (consecutive_id(CID) + 1) %/% 2) |>
      group_split()

    #> <list_of<
    #>   tbl_df<
    #>     CID  : double
    #>     PID_A: double
    #>     PID_B: double
    #>     text : character
    #>     group: double
    #>   >
    #> >[3]>
    #> [[1]]
    #> # A tibble: 7 × 5
    #>     CID PID_A PID_B text  group
    #>   <dbl> <dbl> <dbl> <chr> <dbl>
    #> 1    91  1091  2091 eg1       1
    #> 2    91  1091  2091 eg2       1
    #> 3    91  1091  2091 eg3       1
    #> 4    91  1091  2091 eg3       1
    #> 5   101  1101  2101 eg4       1
    #> 6   101  1101  2101 eg5       1
    #> 7   101  1101  2101 eg6       1
    #>
    #> [[2]]
    #> # A tibble: 5 × 5
    #>     CID PID_A PID_B text  group
    #>   <dbl> <dbl> <dbl> <chr> <dbl>
    #> 1   102  1102  2102 eg7       2
    #> 2   102  1102  2102 eg8       2
    #> 3   102  1102  2102 eg9       2
    #> 4   103  1103  2103 eg10      2
    #> 5   103  1103  2103 eg11      2
    #>
    #> [[3]]
    #> # A tibble: 3 × 5
    #>     CID PID_A PID_B text  group
    #>   <dbl> <dbl> <dbl> <chr> <dbl>
    #> 1   104  1104  2104 eg12      3
    #> 2   104  1104  2104 eg13      3
    #> 3   104  1104  2104 eg14      3
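
To see why `(consecutive_id(CID) + 1) %/% 2` pairs up the CIDs, it may help to look at the arithmetic on its own. The sketch below is not part of the answer: it uses base R's `rle()` as a stand-in for `consecutive_id()`, and the vector `cid` is a hand-copied version of the `CID` column, so treat both as assumptions made for illustration.

```r
# CID values with the same run lengths as group_eg (4, 3, 3, 2, 3 rows).
cid <- c(91, 91, 91, 91, 101, 101, 101, 102, 102, 102, 103, 103, 104, 104, 104)

# Base R stand-in for consecutive_id(): give each run of identical values
# its own running number (1 for the first run, 2 for the second, ...).
runs <- rle(cid)
run_id <- rep(seq_along(runs$lengths), times = runs$lengths)

# Shift by one and integer-divide by two, so that runs 1 and 2 fall into
# group 1, runs 3 and 4 into group 2, and a leftover run 5 into group 3.
group <- (run_id + 1) %/% 2

# One row per CID, showing the run number and the resulting group.
unique(data.frame(cid, run_id, group))
```

Because the numbering is based on runs of identical values, this only matches "two unique CIDs per group" when rows with the same CID are adjacent, as they are in the example data.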

Answer 2

Score: 3

With the help of `data.table::rleid`, you can use `split`, which gives a list of three data frames.

    library(data.table)

    # rleid() numbers each run of identical CIDs; ceiling(id / 2) pairs the runs up
    group_eg_split <- split(group_eg, paste0("output_", ceiling(rleid(group_eg$CID) / 2)))
    group_eg_split

    #> $output_1
    #>   CID PID_A PID_B text
    #> 1  91  1091  2091  eg1
    #> 2  91  1091  2091  eg2
    #> 3  91  1091  2091  eg3
    #> 4  91  1091  2091  eg3
    #> 5 101  1101  2101  eg4
    #> 6 101  1101  2101  eg5
    #> 7 101  1101  2101  eg6
    #>
    #> $output_2
    #>    CID PID_A PID_B text
    #> 8  102  1102  2102  eg7
    #> 9  102  1102  2102  eg8
    #> 10 102  1102  2102  eg9
    #> 11 103  1103  2103 eg10
    #> 12 103  1103  2103 eg11
    #>
    #> $output_3
    #>    CID PID_A PID_B text
    #> 13 104  1104  2104 eg12
    #> 14 104  1104  2104 eg13
    #> 15 104  1104  2104 eg14

To assign the list elements to individual objects, use `list2env`. Afterwards, three objects named `output_1` to `output_3` will exist in your global environment.

    list2env(group_eg_split, envir = .GlobalEnv)
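
One detail worth checking when naming the pieces with `paste0("output_", ...)`: `split()` turns a character grouping into a factor whose levels are sorted as text, so with ten or more groups the list order no longer follows the numeric order. The base R sketch below illustrates this with a made-up `chunk` vector (an assumption, standing in for `ceiling(rleid(...) / 2)` on a larger data set) and shows zero-padding via `sprintf()` as one possible workaround.

```r
# Hypothetical chunk numbers for 24 CIDs, two per chunk, giving 12 chunks.
chunk <- ceiling(seq_len(24) / 2)

# Plain names are ordered as text, so "output_10" sorts before "output_2".
plain <- paste0("output_", chunk)
head(levels(factor(plain)), 3)

# Zero-padded names keep the text order aligned with the numeric order.
padded <- sprintf("output_%02d", chunk)
head(levels(factor(padded)), 3)
```

With only three groups, as in the question, the plain names are already in the right order, so this matters only for larger inputs.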

Answer 3

Score: 2

With base R:

    # Unique CIDs in order of first appearance, as character keys
    CID_uniq <- as.character(unique(group_eg$CID))
    # Named lookup vector: CID -> chunk number (two CIDs per chunk)
    hash <- ceiling(setNames(seq_along(CID_uniq), CID_uniq) / 2)
    # Look up each row's chunk by its CID and split on it
    list_of_dataframes <- split(group_eg, f = hash[as.character(group_eg$CID)])

    ## > str(list_of_dataframes)
    ## List of 3
    ##  $ 1:'data.frame': 7 obs. of  4 variables:
    ##   ..$ CID  : num [1:7] 91 91 91 91 101 101 101
    ##   ..$ PID_A: num [1:7] 1091 1091 1091 1091 1101 ...
    ##   ..$ PID_B: num [1:7] 2091 2091 2091 2091 2101 ...
    ##   ..$ text : chr [1:7] "eg1" "eg2" "eg3" "eg3" ...
    ##  $ 2:'data.frame': 5 obs. of  4 variables:
    ##   ..$ CID  : num [1:5] 102 102 102 103 103
    ##   ..$ PID_A: num [1:5] 1102 1102 1102 1103 1103
    ##   ..$ PID_B: num [1:5] 2102 2102 2102 2103 2103
    ##   ..$ text : chr [1:5] "eg7" "eg8" "eg9" "eg10" ...
    ##  $ 3:'data.frame': 3 obs. of  4 variables:
    ##   ..$ CID  : num [1:3] 104 104 104
    ##   ..$ PID_A: num [1:3] 1104 1104 1104
    ##   ..$ PID_B: num [1:3] 2104 2104 2104
    ##   ..$ text : chr [1:3] "eg12" "eg13" "eg14"
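
The lookup above follows the order in which CIDs first appear, which matches ascending order only because the example data is already sorted. As a rough sketch of how the same idea might be generalized, the hypothetical helper below (`split_every_n_ids` is not from any answer, and neither is the `toy` data) ranks the sorted unique IDs with `match()`, so it does not depend on row order and takes the number of IDs per piece as a parameter.

```r
# Hypothetical helper: split `df` so that each piece holds `n` distinct
# values of the column `id_col`, taken in ascending order of that column.
split_every_n_ids <- function(df, id_col, n = 2) {
  ids <- sort(unique(df[[id_col]]))               # ascending unique IDs
  chunk <- ceiling(match(df[[id_col]], ids) / n)  # rank of each row's ID -> chunk
  split(df, chunk)
}

# Small made-up example with shuffled rows.
toy <- data.frame(
  CID  = c(103, 91, 102, 91, 104, 101),
  text = c("a", "b", "c", "d", "e", "f")
)

pieces <- split_every_n_ids(toy, "CID", n = 2)

# Distinct CIDs that ended up in each piece.
lapply(pieces, function(p) sort(unique(p$CID)))
```

Rows inside each piece keep their original relative order; if the pieces should also be sorted by CID, ordering the data frame first (for example with `df[order(df[[id_col]]), ]`) would be one option.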

huangapple
  • Published on June 6, 2023, 13:48:55
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76411740.html