r: sampling within groups, matching subgroup sizes

Question

I have a very unbalanced dataset containing multiple groups, each divided into two types of observations.
Here's a synthetic example of the structure (the actual data contains hundreds of groups and millions of observations):

    df <- data.frame(
      group = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2),
      type  = c(0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1),
      value = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 9)
    )

I need to compare the standard deviation of each type within each group. Something like this:

    df %>%
      group_by(group, type) %>%
      ## something is missing here...
      summarise(n = n(),
                mean = mean(value))

However, the number of observations of each type is unbalanced, and this can bias the comparison.

I thought of reducing the number of "type 0" observations by sampling them to match the number of "type 1" observations in each group.
I found some suggestions to use slice_sample(), but couldn't get it to work for this kind of group-type matching.

How can I sample from each pool of group+type0 observations, matching the size of each corresponding group+type1 pool?

Answer 1

Score: 1

You can simply use filter() instead of slice_sample(), as shown below:

    df %>%
      filter(
        row_number() %in% c(sample(which(!type), sum(type)), which(type == 1)),
        .by = group
      )

which gives, for example

       group type value
    1      1    0     3
    2      1    0     5
    3      1    1     6
    4      1    1     7
    5      2    0     1
    6      2    0     2
    7      2    0     4
    8      2    1     7
    9      2    1     8
    10     2    1     9
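
One caveat worth noting: sample(x, size) treats a single positive number x as 1:x, so the snippet above can pick the wrong rows when a group has exactly one type-0 row, and it throws an error when a group has more type-1 than type-0 rows. Below is a minimal sketch of a guarded variant. safe_sample is a hypothetical helper written for this illustration (it is not part of dplyr), the seed is arbitrary, and `.by` assumes dplyr 1.1.0 or later.

```r
library(dplyr)

df <- data.frame(
  group = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2),
  type  = c(0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1),
  value = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 9)
)

# Hypothetical helper: sample positions of x rather than x itself, so a
# length-one x is never expanded to 1:x, and cap the size at length(x).
safe_sample <- function(x, size) {
  x[sample.int(length(x), min(size, length(x)))]
}

set.seed(42)  # arbitrary seed, only for reproducibility

balanced <- df %>%
  filter(
    row_number() %in% c(safe_sample(which(type == 0), sum(type == 1)),
                        which(type == 1)),
    .by = group
  )

# Per-group counts of each type after sampling
table(balanced$group, balanced$type)
```

If a group has fewer type-0 than type-1 rows, this version keeps all of its type-0 rows instead of failing, so that group stays unbalanced; whether that is acceptable depends on the analysis.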

Answer 2

Score: 0

You can try this code:

    df %>%
      group_by(group) %>%
      tidyr::nest() %>%
      mutate(data = lapply(data, function(i) {
        type <- i$type
        # number of type-1 rows in this group
        ones_n <- length(type[type == 1])
        zeros <- type[type == 0]
        # pick as many type-0 positions as there are type-1 rows
        sampled_obs <- sample(seq_len(length(zeros)), size = ones_n)
        sampled_zeros <- seq_len(length(zeros)) %in% sampled_obs
        # keep every type-1 row and only the sampled type-0 rows
        take_this <- rep(TRUE, length(type))
        take_this[type == 1] <- TRUE
        take_this[type == 0] <- sampled_zeros
        i$take_this <- take_this
        i
      })) %>%
      tidyr::unnest(data) %>%
      filter(take_this)

Here I use nesting, which allows custom operations inside each group. I couldn't get slice_sample() to work, so I used filter() instead. The idea is to mark every observation with take_this = TRUE/FALSE: every type-1 row is kept, and a type-0 row is kept only if its index was sampled.
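
The same marking idea can probably be written without nesting, which may matter with millions of rows. The sketch below gives every row a random rank within its group+type pool and keeps type-0 rows whose rank does not exceed the group's type-1 count. The column names n_ones and rnk are made up for this example, the seed is arbitrary, and `.by` assumes dplyr 1.1.0 or later.

```r
library(dplyr)

df <- data.frame(
  group = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2),
  type  = c(0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1),
  value = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 9)
)

set.seed(1)  # arbitrary seed, only for reproducibility

marked <- df %>%
  # n_ones: how many type-1 rows the group has (the target sample size)
  mutate(n_ones = sum(type == 1), .by = group) %>%
  # rnk: a random permutation of 1..n within each group+type pool
  mutate(rnk = sample.int(n()), .by = c(group, type)) %>%
  # keep all type-1 rows, and the type-0 rows with the lowest random ranks
  filter(type == 1 | rnk <= n_ones) %>%
  select(-n_ones, -rnk)

table(marked$group, marked$type)
```

Keeping the rows with the smallest random ranks is equivalent to sampling without replacement, and a group with fewer type-0 than type-1 rows simply keeps all of its type-0 rows.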

Answer 3

Score: 0

Here is a somewhat inelegant approach, but it should accomplish this. First, I create a variable that sets the number of samples to the minimum number of observations when grouped by group and type. This part uses dplyr:

    library(dplyr)

    df2 <- df %>%
      mutate(nmax = max(row_number()), .by = c(group, type)) %>%
      mutate(sample_n = min(nmax), .by = group) %>%
      select(-nmax)

This creates a temporary variable sample_n that holds, for each group, the size of its smallest type subgroup. Here it should be 2 for group == 1 and 3 for group == 2, since the type == 1 subgroups have sizes 2 and 3, respectively:

    # unique(df2[, c("group", "sample_n")])
    #   group sample_n
    # 1     1        2
    # 8     2        3

Then in base R we can use split and sample with lapply:

    set.seed(123)
    ll <- lapply(split(df2, ~ df2$group + df2$type),
                 function(x) x[sample(nrow(x), max(x$sample_n)), -4])

Then combine with do.call:

    df_final <- do.call(rbind, ll)
    df_final <- df_final[order(df_final$group), ]

All together with output:

    library(dplyr)

    df2 <- df %>%
      mutate(nmax = max(row_number()), .by = c(group, type)) %>%
      mutate(sample_n = min(nmax), .by = group) %>%
      select(-nmax)

    set.seed(123)
    # sample sample_n rows from every group+type pool; -4 drops the sample_n column
    ll <- lapply(split(df2, ~ df2$group + df2$type),
                 function(x) x[sample(nrow(x), max(x$sample_n)), -4])
    df_final <- do.call(rbind, ll)
    df_final <- df_final[order(df_final$group), ]

    #        group type value
    # 1.0.3      1    0     3
    # 1.0.2      1    0     2
    # 1.1.7      1    1     7
    # 1.1.6      1    1     6
    # 2.0.10     2    0     3
    # 2.0.9      2    0     2
    # 2.0.12     2    0     5
    # 2.1.14     2    1     7
    # 2.1.15     2    1     8
    # 2.1.16     2    1     9
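
If the helper column feels like overhead, the same split-and-sample idea can be sketched in base R alone. This is only an illustration under the assumption that type is coded 0/1; the names parts and n_min are made up for the example, and the seed is arbitrary.

```r
df <- data.frame(
  group = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2),
  type  = c(0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1),
  value = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 9)
)

set.seed(123)  # arbitrary seed, only for reproducibility

parts <- lapply(split(df, df$group), function(g) {
  # size of the smaller type pool in this group (0 if a type is absent)
  n_min <- min(table(factor(g$type, levels = c(0, 1))))
  # draw n_min rows from each type pool and stack them
  do.call(rbind, lapply(split(g, g$type), function(x) {
    x[sample.int(nrow(x), n_min), ]
  }))
})

df_final <- do.call(rbind, parts)

# Every group should now have equally many rows of each type
table(df_final$group, df_final$type)
```

Like the answer above, this samples both types down to the smaller pool, so it also covers groups where type 1 happens to be the larger one.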

Answer 4

Score: 0

Here's an alternative with slightly less code:

    set.seed(123)
    df %>%
      group_by(group) %>%
      mutate(size = sum(type)) %>%
      filter(type == 0) %>%
      slice(sample.int(n(), size = size[1L])) %>%
      bind_rows(filter(df, type == 1))
    # A tibble: 10 × 4
    # Groups:   group [2]
    #    group  type value  size
    #    <dbl> <dbl> <dbl> <dbl>
    #  1     1     0     3     2
    #  2     1     0     2     2
    #  3     2     0     3     3
    #  4     2     0     2     3
    #  5     2     0     5     3
    #  6     1     1     6    NA
    #  7     1     1     7    NA
    #  8     2     1     7    NA
    #  9     2     1     8    NA
    # 10     2     1     9    NA

Answer 5

Score: 0

I've generalized ThomasIsCoding's answer, avoiding assumptions that hold for the example data but not in general.

This is also the approach I used in the end:

    df %>%
      arrange(group, type) %>%
      filter(row_number() %in% c(sample(which(type == 0),
                                        sum(type == 1)),
                                 which(type == 1)),
             .by = group) %>%
      group_by(group, type) %>%
      summarise(n = n(),
                mean = mean(value))
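
Since the question is about comparing standard deviations, the summary step could presumably include sd() as well. The sketch below is one way to do that, not the exact code used above: it adds a seed so the sample is reproducible (the seed value is arbitrary), stores the output in a made-up name, result, and uses `.by` in summarise(), which assumes dplyr 1.1.0 or later.

```r
library(dplyr)

df <- data.frame(
  group = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2),
  type  = c(0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1),
  value = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 9)
)

set.seed(2023)  # arbitrary seed, only for reproducibility

result <- df %>%
  # within each group, keep a random subset of type-0 rows as large as
  # the type-1 pool, plus all type-1 rows
  filter(row_number() %in% c(sample(which(type == 0), sum(type == 1)),
                             which(type == 1)),
         .by = group) %>%
  # one summary row per group+type, now including the standard deviation
  summarise(n = n(),
            mean = mean(value),
            sd = sd(value),
            .by = c(group, type))

result
```

The type-0 figures come from a single random subsample, so they will change with the seed; only the type-1 rows are fixed.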

huangapple
  • Published on August 8, 2023, 21:49:37
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76860196.html