Take random sample of rows from dataframe with grouping variables

Question

I have a dataframe with the following structure:

    library(dplyr)
    library(tidyr)

    dat <- tibble(
      item_type  = rep(1:36, each = 6),
      condition1 = rep(c("a", "b", "c"), times = 72),
      condition2 = rep(c("y", "z"), each = 3, times = 36)
    ) %>%
      unite(unique, item_type, condition1, condition2, sep = "-", remove = FALSE)

which looks like this:

    # A tibble: 216 × 4
       unique item_type condition1 condition2
       <chr>      <int> <chr>      <chr>
     1 1-a-y          1 a          y
     2 1-b-y          1 b          y
     3 1-c-y          1 c          y
     4 1-a-z          1 a          z
     5 1-b-z          1 b          z
     6 1-c-z          1 c          z
     7 2-a-y          2 a          y
     8 2-b-y          2 b          y
     9 2-c-y          2 c          y
    10 2-a-z          2 a          z

I would like to take a random sample of 36 rows. The sample should include 6 repetitions of the condition1 by condition2 combinations without repeating item_type.

Using slice_sample() it seems I can get the subset I want...

    set.seed(1)
    dat %>%
      slice_sample(n = 6, by = c("condition1", "condition2")) %>%
      count(condition1, condition2)

      condition1 condition2     n
      <chr>      <chr>      <int>
    1 a          y              6
    2 a          z              6
    3 b          y              6
    4 b          z              6
    5 c          y              6
    6 c          z              6

But on closer inspection I see that item_type is repeated.

    set.seed(1)
    dat %>%
      slice_sample(n = 6, by = c("condition1", "condition2")) %>%
      count(item_type) %>%
      arrange(desc(n))

    # A tibble: 22 × 2
       item_type     n
           <int> <int>
     1        10     3
     2        34     3
     3         1     2
     4         6     2
     5         7     2
     6        15     2
     7        20     2
     8        21     2
     9        23     2
    10        25     2
    # … with 12 more rows

In other words, I would like only unique pulls overall from item_type.
Is it possible to get slice_sample() to do this?

EDIT
Adding second toy data example.

    dat <- tibble(
      item_type  = rep(1:36, each = 3),
      condition1 = rep(c("a", "b"), each = 54),
      condition2 = rep(c("x", "y", "z"), times = 36)
    ) %>%
      unite(unique, item_type, condition1, condition2, sep = "-", remove = FALSE)

Which looks like this:

    # A tibble: 108 × 4
       unique item_type condition1 condition2
       <chr>      <int> <chr>      <chr>
     1 1-a-x          1 a          x
     2 1-a-y          1 a          y
     3 1-a-z          1 a          z
     4 2-a-x          2 a          x
     5 2-a-y          2 a          y
     6 2-a-z          2 a          z
     7 3-a-x          3 a          x
     8 3-a-y          3 a          y
     9 3-a-z          3 a          z
    10 4-a-x          4 a          x

Attempt to sample:

    inner_join(
      dat,
      distinct(dat, condition1, condition2) %>%
        uncount(n()) %>%
        mutate(item_type = sample(n()))
    )

Which produces a data frame with only 20 rows (rather than the 36 I want), with the following counts per combination:

      condition1 condition2     n
      <chr>      <chr>      <int>
    1 a          x              4
    2 a          y              4
    3 a          z              4
    4 b          x              3
    5 b          y              4
    6 b          z              5
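
One way to see where the missing rows go is to look at the sampled "slots" that find no partner in dat. The sketch below is a diagnostic only, not a fix; the names slots, keys, matched and unmatched are made up for illustration, and uncount(6) stands in for uncount(n()) because there are 6 combinations here.

```r
library(dplyr)
library(tidyr)

# Second toy dataset from the question.
dat <- tibble(
  item_type  = rep(1:36, each = 3),
  condition1 = rep(c("a", "b"), each = 54),
  condition2 = rep(c("x", "y", "z"), times = 36)
) %>%
  unite(unique, item_type, condition1, condition2, sep = "-", remove = FALSE)

set.seed(1)

# The 36 slots the attempt tries to fill: 6 per combination, each paired
# with a distinct item_type drawn from 1:36.
slots <- distinct(dat, condition1, condition2) %>%
  uncount(6) %>%
  mutate(item_type = sample(n()))

keys <- c("item_type", "condition1", "condition2")
matched   <- inner_join(dat, slots, by = keys)   # slots that exist in dat
unmatched <- anti_join(slots, dat, by = keys)    # slots that do not

# Every unmatched slot pairs a condition1 level with an item_type that only
# occurs under the other level (1:18 belong to "a", 19:36 to "b").
unmatched
```

In this dataset item_type is nested within condition1, so any slot that draws an id from the wrong half is silently dropped by the inner join.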

Answer 1

Score: 2


You could do this:

    inner_join(
      dat,
      # 6 combinations, each repeated 6 times, each given a distinct item_type
      distinct(dat, condition1, condition2) %>%
        uncount(n()) %>%
        mutate(item_type = sample(n()))
    )

Output:

    # A tibble: 36 × 4
       unique item_type condition1 condition2
       <chr>      <int> <chr>      <chr>
     1 1-b-z          1 b          z
     2 2-a-z          2 a          z
     3 3-c-y          3 c          y
     4 4-c-z          4 c          z
     5 5-b-z          5 b          z
     6 6-a-y          6 a          y
     7 7-c-y          7 c          y
     8 8-a-y          8 a          y
     9 9-a-y          9 a          y
    10 10-c-z        10 c          z
    # … with 26 more rows
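
A step-by-step sketch of what the join above does, assuming every item_type occurs with every condition1 by condition2 combination (true for the first dataset). The names combos, slots and smp are illustrative. Sampling from unique(dat$item_type) rather than sample(n()) is a small variation that is not part of the answer; it avoids assuming the ids are exactly 1 to 36.

```r
library(dplyr)
library(tidyr)

# First toy dataset from the question.
dat <- tibble(
  item_type  = rep(1:36, each = 6),
  condition1 = rep(c("a", "b", "c"), times = 72),
  condition2 = rep(c("y", "z"), each = 3, times = 36)
) %>%
  unite(unique, item_type, condition1, condition2, sep = "-", remove = FALSE)

set.seed(1)

# One row per condition1 x condition2 combination (6 rows here).
combos <- distinct(dat, condition1, condition2)

# Repeat each combination 6 times, then give every slot a distinct item_type
# by shuffling the ids that actually occur in the data.
slots <- combos %>%
  uncount(6) %>%
  mutate(item_type = sample(unique(dat$item_type), size = n()))

# Keep only the rows of dat that match a slot.
smp <- inner_join(dat, slots, by = c("item_type", "condition1", "condition2"))

# Check both requirements: 6 rows per combination, no repeated item_type.
count(smp, condition1, condition2)
anyDuplicated(smp$item_type)
```

Because each slot carries a different item_type and each (item_type, condition1, condition2) triple appears once in dat, the join returns exactly one row per slot.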

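On the second dataset, item_type is nested within condition1 (1 to 18 under "a", 19 to 36 under "b"), so you need to get the min/max range of item_type to sample from: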

    inner_join(
      dat,
      distinct(dat, condition1, condition2) %>%
        uncount(n()) %>%
        # Attach the item_type range available to each combination
        inner_join(
          dat %>%
            group_by(condition1, condition2) %>%
            summarize(imin = min(item_type), imax = max(item_type), .groups = "drop")
        ) %>%
        group_by(condition1) %>%
        # Draw distinct item_type values from within this condition1's range
        mutate(item_type = sample(imin[1]:imax[1], size = n())) %>%
        ungroup() %>%
        select(-c(imin:imax))
    )

Output:

    # A tibble: 36 × 4
       unique item_type condition1 condition2
       <chr>      <int> <chr>      <chr>
     1 1-a-y          1 a          y
     2 2-a-z          2 a          z
     3 3-a-z          3 a          z
     4 4-a-y          4 a          y
     5 5-a-z          5 a          z
     6 6-a-y          6 a          y
     7 7-a-x          7 a          x
     8 8-a-z          8 a          z
     9 9-a-y          9 a          y
    10 10-a-z        10 a          z
    # … with 26 more rows
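
The imin:imax range works here because the ids under each condition1 level are contiguous. A possible variation, sketched below under the assumption that each item_type belongs to exactly one condition1 level and occurs with every condition2 value, samples from the ids actually present instead. The names ids_by_c1, slots and smp are illustrative and not part of the answer.

```r
library(dplyr)
library(tidyr)

# Second toy dataset from the question.
dat <- tibble(
  item_type  = rep(1:36, each = 3),
  condition1 = rep(c("a", "b"), each = 54),
  condition2 = rep(c("x", "y", "z"), times = 36)
) %>%
  unite(unique, item_type, condition1, condition2, sep = "-", remove = FALSE)

set.seed(1)

# Item ids available under each condition1 level (1:18 for "a", 19:36 for "b").
ids_by_c1 <- lapply(split(dat$item_type, dat$condition1), unique)

# 6 slots per combination; within each condition1 level, draw distinct ids
# from that level's own pool.
slots <- dat %>%
  distinct(condition1, condition2) %>%
  uncount(6) %>%
  group_by(condition1) %>%
  mutate(item_type = sample(ids_by_c1[[condition1[1]]], size = n())) %>%
  ungroup()

smp <- inner_join(dat, slots, by = c("item_type", "condition1", "condition2"))

count(smp, condition1, condition2)
```

sample() raises an error if a level has fewer ids than slots, which is a useful signal that the requested design cannot be filled without repeats.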

Answer 2

Score: 1


Try

    library(nplyr)
    library(dplyr)
    library(tidyr)

    dat %>%
      nest(data = -item_type) %>%
      nest_slice_sample(data, n = 1) %>%
      unnest(data)

Output:

    # A tibble: 36 × 4
       item_type unique condition1 condition2
           <int> <chr>  <chr>      <chr>
     1         1 1-c-z  c          z
     2         2 2-b-z  b          z
     3         3 3-b-y  b          y
     4         4 4-c-y  c          y
     5         5 5-c-z  c          z
     6         6 6-b-z  b          z
     7         7 7-a-z  a          z
     8         8 8-c-z  c          z
     9         9 9-b-y  b          y
    10        10 10-a-y a          y
    # … with 26 more rows

Or perhaps we need to loop over the condition groups and exclude the item_type values already drawn:

    # One data frame per condition1/condition2 combination
    lst1 <- split(dat, dat[c("condition1", "condition2")], drop = TRUE)
    lst2 <- vector("list", length(lst1))
    item_type_rm <- numeric(0)  # item_type values drawn so far

    for (i in seq_along(lst1)) {
      tmp <- lst1[[i]]
      # Drop items already used by earlier groups, then draw 6 rows
      tmp1 <- tmp %>%
        filter(!item_type %in% item_type_rm) %>%
        slice_sample(n = 6)
      item_type_rm <- c(item_type_rm, tmp1$item_type)
      lst2[[i]] <- tmp1
    }

    out <- bind_rows(lst2)
    out

    # A tibble: 36 × 4
       unique item_type condition1 condition2
       <chr>      <int> <chr>      <chr>
     1 17-a-x        17 a          x
     2 5-a-x          5 a          x
     3 9-a-x          9 a          x
     4 2-a-x          2 a          x
     5 7-a-x          7 a          x
     6 3-a-x          3 a          x
     7 31-b-x        31 b          x
     8 27-b-x        27 b          x
     9 36-b-x        36 b          x
    10 19-b-x        19 b          x
    # … with 26 more rows

    > out %>% count(item_type) %>% pull(n)
     [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
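
The loop can also be wrapped in a function. The sketch below is one possible packaging, not the answer's own code: sample_unique_items and its n_per_group argument are hypothetical names, the column names are taken from the question, and the guard that stops when a group runs out of unused items is an addition.

```r
library(dplyr)

# Hypothetical helper: draw n_per_group rows from each condition1 x condition2
# group while never reusing an item_type across groups.
sample_unique_items <- function(data, n_per_group = 6) {
  groups <- split(data, data[c("condition1", "condition2")], drop = TRUE)
  used <- data$item_type[0]  # empty vector with the same type as item_type
  out <- vector("list", length(groups))
  for (i in seq_along(groups)) {
    pool <- groups[[i]] %>%
      filter(!item_type %in% used) %>%
      distinct(item_type, .keep_all = TRUE)  # at most one row per item
    if (nrow(pool) < n_per_group) {
      stop("Not enough unused item_type values left in group ", names(groups)[i])
    }
    out[[i]] <- slice_sample(pool, n = n_per_group)
    used <- c(used, out[[i]]$item_type)
  }
  bind_rows(out)
}

# First toy dataset from the question (without the `unique` column).
dat <- tibble(
  item_type  = rep(1:36, each = 6),
  condition1 = rep(c("a", "b", "c"), times = 72),
  condition2 = rep(c("y", "z"), each = 3, times = 36)
)

set.seed(1)
out <- sample_unique_items(dat)

# Both requirements can be checked directly.
count(out, condition1, condition2)
anyDuplicated(out$item_type)
```

Groups are filled in the order returned by split(), so in less balanced data a later group may find too few unused items even when a valid assignment exists; the stop() call makes that case visible rather than returning a short sample.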

Posted by huangapple on February 27, 2023, 07:22:54
When reposting, please keep the link to this article: https://go.coder-hub.com/75575621.html