June 6, 2023, 13:48:55

Subset dataframe based on a specific number of unique values in a column in R

Question

I have a dataframe:

group_eg <- data.frame(
  CID = c(091, 091, 091, 091,
          101, 101, 101,
          102, 102, 102, 
          103, 103,
          104, 104, 104),
  PID_A = c(1091, 1091, 1091, 1091,
            1101, 1101, 1101,
            1102, 1102, 1102,
            1103, 1103,
            1104, 1104, 1104),
  PID_B = c(2091, 2091, 2091, 2091,
            2101, 2101, 2101, 
            2102, 2102, 2102,
            2103, 2103,
            2104, 2104, 2104),
  text = c("eg1", "eg2", "eg3", "eg3", 
           "eg4","eg5", "eg6", 
           "eg7", "eg8","eg9", 
           "eg10", "eg11", 
           "eg12", "eg13", "eg14")
)

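I want to divide this dataframe into a list of smaller dataframes. Each smaller dataframe should contain only the rows that belong to 2 unique CIDs, taken in ascending order. If the number of CIDs is not divisible by 2, the last dataframe can contain the rows that belong to just 1 CID. Note that the number of observations differs for each unique CID, so I'm a bit stuck.

Here are the example outputs: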

output1 <- data.frame(
  CID = c(091, 091, 091, 091,
          101, 101, 101),
  PID_A = c(1091, 1091, 1091, 1091,
            1101, 1101, 1101),
  PID_B = c(2091, 2091, 2091, 2091,
            2101, 2101, 2101),
  text = c("eg1", "eg2", "eg3", "eg3", 
           "eg4","eg5", "eg6")
)
output2 <- data.frame(
  CID = c(102, 102, 102, 
          103, 103),
  PID_A = c(1102, 1102, 1102,
            1103, 1103),
  PID_B = c(2102, 2102, 2102,
            2103, 2103),
  text = c( "eg7", "eg8","eg9", 
           "eg10", "eg11")
)
output3 <- data.frame(
  CID = c(104, 104, 104),
  PID_A = c(1104, 1104, 1104),
  PID_B = c(2104, 2104, 2104),
  text = c("eg12", "eg13", "eg14")
)

Does anyone know how to do this? Thank you!

Answer 1

Score: 3

Using dplyr::consecutive_id and integer division %/% you could do:

# consecutive_id() requires dplyr >= 1.1.0
library(dplyr, warn.conflicts = FALSE)
group_eg |>
  group_by(group = (consecutive_id(CID) + 1) %/% 2) |>
  group_split()

#> <list_of<
#>   tbl_df<
#>     CID  : double
#>     PID_A: double
#>     PID_B: double
#>     text : character
#>     group: double
#>   >
#> >[3]>
#> [[1]]
#> # A tibble: 7 × 5
#>     CID PID_A PID_B text  group
#>   <dbl> <dbl> <dbl> <chr> <dbl>
#> 1    91  1091  2091 eg1       1
#> 2    91  1091  2091 eg2       1
#> 3    91  1091  2091 eg3       1
#> 4    91  1091  2091 eg3       1
#> 5   101  1101  2101 eg4       1
#> 6   101  1101  2101 eg5       1
#> 7   101  1101  2101 eg6       1
#>
#> [[2]]
#> # A tibble: 5 × 5
#>     CID PID_A PID_B text  group
#>   <dbl> <dbl> <dbl> <chr> <dbl>
#> 1   102  1102  2102 eg7       2
#> 2   102  1102  2102 eg8       2
#> 3   102  1102  2102 eg9       2
#> 4   103  1103  2103 eg10      2
#> 5   103  1103  2103 eg11      2
#>
#> [[3]]
#> # A tibble: 3 × 5
#>     CID PID_A PID_B text  group
#>   <dbl> <dbl> <dbl> <chr> <dbl>
#> 1   104  1104  2104 eg12      3
#> 2   104  1104  2104 eg13      3
#> 3   104  1104  2104 eg14      3
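
To see why the expression (consecutive_id(CID) + 1) %/% 2 pairs up the CIDs, it may help to look at the intermediate values. The following is only a small sketch on a shortened, hypothetical vector (cid, run_id and pair_id are illustrative names, not part of the answer):

```r
library(dplyr, warn.conflicts = FALSE)  # consecutive_id() requires dplyr >= 1.1.0

# Hypothetical shortened CID vector, already sorted in ascending order
cid <- c(91, 91, 101, 102, 102, 103, 104)

# One id per run of equal neighbouring values: 1 1 2 3 3 4 5
run_id <- consecutive_id(cid)

# Adding 1 and integer-dividing by 2 maps runs 1-2 to 1, runs 3-4 to 2, run 5 to 3
pair_id <- (run_id + 1) %/% 2

data.frame(cid, run_id, pair_id)
```

Replacing the divisor 2 (and the offset 1) with n and n - 1 should give chunks of n CIDs instead of pairs.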


Answer 2

Score: 3

With the help of data.table::rleid, you can do a split, which gives a list of three dataframes.

library(data.table)
group_eg_split <- split(group_eg, paste0("output_", ceiling(rleid(group_eg$CID)/2)))
group_eg_split 
$output_1
  CID PID_A PID_B text
1  91  1091  2091  eg1
2  91  1091  2091  eg2
3  91  1091  2091  eg3
4  91  1091  2091  eg3
5 101  1101  2101  eg4
6 101  1101  2101  eg5
7 101  1101  2101  eg6
$output_2
   CID PID_A PID_B text
8  102  1102  2102  eg7
9  102  1102  2102  eg8
10 102  1102  2102  eg9
11 103  1103  2103 eg10
12 103  1103  2103 eg11
$output_3
   CID PID_A PID_B text
13 104  1104  2104 eg12
14 104  1104  2104 eg13
15 104  1104  2104 eg14
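
One point worth checking before relying on rleid: it numbers runs of equal neighbouring values, not unique values, so it matches the "two unique CIDs" requirement only when rows with the same CID are adjacent (as they are in group_eg). A minimal sketch on a hypothetical unsorted vector shows the difference:

```r
library(data.table)

# Hypothetical unsorted vector: 91 appears again after 101
cid <- c(91, 101, 91, 102)

run_id  <- rleid(cid)                 # 1 2 3 4: the second 91 starts a new run
uniq_id <- match(cid, unique(cid))    # 1 2 1 3: both 91s share one id

ceiling(run_id / 2)    # 1 1 2 2: the two 91 rows land in different chunks
ceiling(uniq_id / 2)   # 1 1 1 2: the two 91 rows stay together
```

If the data might not be sorted, ordering it by CID first (for example with group_eg[order(group_eg$CID), ]) keeps the rleid approach in line with the requirement.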

To assign the list elements to individual objects, use list2env. After this, three objects named "output_1" to "output_3" will be created in your environment.

list2env(group_eg_split, envir = .GlobalEnv)
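
Writing into .GlobalEnv silently overwrites any existing objects with the same names. As a hedged alternative, the list can be kept as-is and indexed by name, or unpacked into a dedicated environment. The sketch below uses a small hypothetical list (pieces, split_env) in place of group_eg_split:

```r
# Hypothetical stand-in for group_eg_split
pieces <- list(
  output_1 = data.frame(CID = c(91, 101)),
  output_2 = data.frame(CID = c(102, 103))
)

# Option 1: keep the list and access elements by name
pieces[["output_1"]]

# Option 2: unpack into a dedicated environment instead of .GlobalEnv
split_env <- new.env()
list2env(pieces, envir = split_env)

ls(split_env)                        # names of the objects that were created
get("output_2", envir = split_env)   # retrieve one of them
```

Keeping the list is usually easier to work with afterwards, since lapply() and friends can then process all pieces at once.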

Answer 3

Score: 2

With base R:

CID_uniq <- as.character(unique(group_eg$CID))
hash <- ceiling(setNames(seq_along(CID_uniq), CID_uniq) / 2)
list_of_dataframes <- 
  split(group_eg,
        f = hash[as.character(group_eg$CID)]
        )

## > str(list_of_dataframes)
## List of 3
##  $ 1:'data.frame':	7 obs. of  4 variables:
##   ..$ CID  : num [1:7] 91 91 91 91 101 101 101
##   ..$ PID_A: num [1:7] 1091 1091 1091 1091 1101 ...
##   ..$ PID_B: num [1:7] 2091 2091 2091 2091 2101 ...
##   ..$ text : chr [1:7] "eg1" "eg2" "eg3" "eg3" ...
##  $ 2:'data.frame':	5 obs. of  4 variables:
##   ..$ CID  : num [1:5] 102 102 102 103 103
##   ..$ PID_A: num [1:5] 1102 1102 1102 1103 1103
##   ..$ PID_B: num [1:5] 2102 2102 2102 2103 2103
##   ..$ text : chr [1:5] "eg7" "eg8" "eg9" "eg10" ...
##  $ 3:'data.frame':	3 obs. of  4 variables:
##   ..$ CID  : num [1:3] 104 104 104
##   ..$ PID_A: num [1:3] 1104 1104 1104
##   ..$ PID_B: num [1:3] 2104 2104 2104
##   ..$ text : chr [1:3] "eg12" "eg13" "eg14"
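
The same idea can be wrapped in a small function that sorts the rows first and takes the number of CIDs per chunk as an argument. This is only a sketch: split_by_n_ids, its arguments and the toy data are hypothetical names introduced here, not part of the answer above.

```r
# Hypothetical helper: split `df` into chunks holding `n` unique values of `col`
split_by_n_ids <- function(df, col, n = 2) {
  df <- df[order(df[[col]]), , drop = FALSE]    # sort rows by id, ascending
  ids <- unique(df[[col]])                      # ascending because df is sorted
  chunk <- ceiling(match(df[[col]], ids) / n)   # chunk number for every row
  split(df, chunk)
}

# Toy data, deliberately unsorted
toy <- data.frame(
  CID  = c(103, 91, 91, 101, 102, 104),
  text = paste0("eg", 1:6)
)

res <- split_by_n_ids(toy, "CID", n = 2)
names(res)          # one list element per chunk
res[["2"]]$CID      # the CIDs that ended up in the second chunk
```

With an odd number of unique CIDs, the last chunk simply holds the single remaining CID, as the question asks.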
