R: sampling within groups, matching subgroup sizes

Question

I have a very unbalanced dataset containing multiple groups, each divided into two types of observations.
Here's a synthetic example of the structure (actual data contains hundreds of groups and millions of observations):

df <- data.frame(
  group = c(1, 1, 1, 1, 1,  1, 1,    2, 2, 2, 2, 2, 2,  2, 2, 2),
  type  = c(0, 0, 0, 0, 0,  1, 1,    0, 0, 0, 0, 0, 0,  1, 1, 1),
  value = c(1, 2, 3, 4, 5,  6, 7,    1, 2, 3, 4, 5, 6,  7, 8, 9)
)

I need to compare the standard deviation of each type within each group. Something like this:

df %>%
  group_by(group, type) %>%
  ## something is missing here...
  summarise(n    = n(), 
            mean = mean(value))

However, the number of observations of each type is unbalanced, and this can bias the comparison.

I thought of reducing the number of "type 0" observations by sampling them, to match the number of "type 1" observations in each group.
I found some suggestions to use slice_sample(), but couldn't get it to work with this kind of group-type matching.

How can I sample from each pool of group+type0 observations, matching the size of each corresponding group+type1 pool?

Answer 1

Score: 1

You can simply use filter() instead of slice_sample(), as below (the .by argument requires dplyr 1.1.0 or later):

df %>%
    filter(
        row_number() %in% c(sample(which(!type), sum(type)), which(type == 1)),
        .by = group
    )

which gives, for example

   group type value
1      1    0     3
2      1    0     5
3      1    1     6
4      1    1     7
5      2    0     1
6      2    0     2
7      2    0     4
8      2    1     7
9      2    1     8
10     2    1     9
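One caveat worth keeping in mind: when the first argument of sample() is a single number x (with x >= 1), base R samples from 1:x rather than from that one value, so a group with exactly one type-0 row could pick the wrong row. Below is a minimal sketch of a guard against that, assuming type is coded as 0/1. The helper name keep_balanced is hypothetical (not part of dplyr), and capping the draw with min() is my own assumption about what to do when a group has fewer type-0 than type-1 rows.

```r
library(dplyr)

df <- data.frame(
  group = c(1, 1, 1, 1, 1,  1, 1,    2, 2, 2, 2, 2, 2,  2, 2, 2),
  type  = c(0, 0, 0, 0, 0,  1, 1,    0, 0, 0, 0, 0, 0,  1, 1, 1),
  value = c(1, 2, 3, 4, 5,  6, 7,    1, 2, 3, 4, 5, 6,  7, 8, 9)
)

# Hypothetical helper: returns a logical vector marking the rows to keep
# within a single group.
keep_balanced <- function(type) {
  zeros <- which(type == 0)
  ones  <- which(type == 1)
  # Sample positions inside `zeros` with sample.int(), so a length-1 `zeros`
  # is not misread as "sample from 1:x"; min() caps the draw when there are
  # fewer type-0 rows than type-1 rows.
  picked <- zeros[sample.int(length(zeros), min(length(zeros), length(ones)))]
  seq_along(type) %in% c(picked, ones)
}

set.seed(1)
balanced <- df %>%
  filter(keep_balanced(type), .by = group)

# Per-group counts of each type after sampling
counts <- count(balanced, group, type)
counts
```

In the example data every group has more type-0 than type-1 rows, so the cap never kicks in and each group ends up with equal counts of both types.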

Answer 2

Score: 0

You can try this code:

df %>%
  group_by(group) %>%
  tidyr::nest() %>%
  mutate(data = lapply(data, function(i) {
    type <- i$type
    ones_n <- length(type[type == 1])
    zeros <- type[type == 0]
    sampled_obs <- sample(seq_len(length(zeros)), size = ones_n)
    sampled_zeros <- seq_len(length(zeros)) %in% sampled_obs
    take_this <- rep(TRUE, length(type))
    take_this[type == 1] <- TRUE
    take_this[type == 0] <- sampled_zeros
    i$take_this <- take_this
    i
  })) %>%
  tidyr::unnest(data) %>%
  filter(take_this)

Here I use nesting, which allows custom operations inside each group. I couldn't get slice_sample() to work, so I filter on a logical flag instead. The idea is to mark every observation with take_this = TRUE/FALSE: all type-1 rows are kept, and a type-0 row is kept only if its index was sampled.
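The same "custom operation per group" idea can probably be written without nest()/unnest() by using dplyr's group_modify(), which calls a function once per group with that group's rows. This is only a sketch, and it assumes type is coded 0/1 and that every group has at least as many type-0 rows as type-1 rows (otherwise sample.int() will raise an error).

```r
library(dplyr)

df <- data.frame(
  group = c(1, 1, 1, 1, 1,  1, 1,    2, 2, 2, 2, 2, 2,  2, 2, 2),
  type  = c(0, 0, 0, 0, 0,  1, 1,    0, 0, 0, 0, 0, 0,  1, 1, 1),
  value = c(1, 2, 3, 4, 5,  6, 7,    1, 2, 3, 4, 5, 6,  7, 8, 9)
)

set.seed(1)
balanced <- df %>%
  group_by(group) %>%
  group_modify(function(d, key) {
    # `d` holds the rows of one group (without the grouping column),
    # `key` is a one-row tibble identifying the group.
    zeros <- d[d$type == 0, ]
    ones  <- d[d$type == 1, ]
    # Draw as many type-0 rows as there are type-1 rows, then stack them.
    bind_rows(zeros[sample.int(nrow(zeros), nrow(ones)), ], ones)
  }) %>%
  ungroup()

balanced
```

The grouping column is added back automatically, so the result has the same three columns as df and no helper flag to clean up.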

Answer 3

Score: 0

Here is a somewhat inelegant approach, but it should accomplish this. First, I create a variable that sets the sample size for each group to the size of its smallest group-by-type subgroup. This part uses dplyr:

library(dplyr)
df2 <- df %>%
  mutate(nmax = max(row_number()), .by = c(group, type)) %>%
  mutate(sample_n = min(nmax), .by = group) %>% select(-nmax)

This creates a temporary variable sample_n that holds, for each group, the number of observations in its smallest type subgroup. Here we should have 2 for group == 1 and 3 for group == 2, since the type == 1 subgroups have sizes 2 and 3, respectively:

# unique(df2[, c("group", "sample_n")])
#   group sample_n
# 1     1        2
# 8     2        3

Then in base R we can use split and sample with lapply:

set.seed(123)
ll <- lapply(split(df2, ~ df2$group + df2$type), 
             function(x) x[sample(nrow(x), max(x$sample_n)), -4])

Then combine with do.call:

df_final <- do.call(rbind, ll)
df_final <- df_final[order(df_final$group),]

All together with output:

library(dplyr)
set.seed(123)  # same seed as in the step-by-step version above
df2 <- df %>%
  mutate(nmax = max(row_number()), .by = c(group, type)) %>%
  mutate(sample_n = min(nmax), .by = group) %>% select(-nmax)
ll <- lapply(split(df2, ~ df2$group + df2$type), 
             function(x) x[sample(nrow(x), max(x$sample_n)), -4])
df_final <- do.call(rbind, ll)
df_final <- df_final[order(df_final$group),]

#       group type value
#1.0.3      1    0     3
#1.0.2      1    0     2
#1.1.7      1    1     7
#1.1.6      1    1     6
#2.0.10     2    0     3
#2.0.9      2    0     2
#2.0.12     2    0     5
#2.1.14     2    1     7
#2.1.15     2    1     8
#2.1.16     2    1     9
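If you would rather skip the helper column altogether, the same split/sample/rbind idea can be sketched in base R only. This is an illustration rather than a drop-in replacement: it assumes every group contains both types, and it samples every type subgroup down to the size of the smallest one in that group.

```r
df <- data.frame(
  group = c(1, 1, 1, 1, 1,  1, 1,    2, 2, 2, 2, 2, 2,  2, 2, 2),
  type  = c(0, 0, 0, 0, 0,  1, 1,    0, 0, 0, 0, 0, 0,  1, 1, 1),
  value = c(1, 2, 3, 4, 5,  6, 7,    1, 2, 3, 4, 5, 6,  7, 8, 9)
)

set.seed(123)
pieces <- lapply(split(df, df$group), function(g) {
  # Size of the smaller type subgroup within this group
  n_keep <- min(table(g$type))
  # Sample n_keep rows from each type subgroup and stack them
  do.call(rbind, lapply(split(g, g$type), function(s) {
    s[sample.int(nrow(s), n_keep), ]
  }))
})
df_final <- do.call(rbind, pieces)

# Cross-tabulate to confirm the subgroup sizes match within each group
balance <- table(df_final$group, df_final$type)
balance
```

The cross-tabulation at the end is a cheap sanity check: each row (group) should show the same count in both type columns.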

Answer 4

Score: 0

Here's an alternative with slightly less code:

set.seed(123)
df %>%
  group_by(group) %>%
  mutate(size = sum(type)) %>%
  filter(type == 0) %>%
  slice(sample.int(n(), size = size[1L])) %>%
  bind_rows(filter(df, type == 1))
# A tibble: 10 × 4
# Groups:   group [2]
   group  type value  size
   <dbl> <dbl> <dbl> <dbl>
 1     1     0     3     2
 2     1     0     2     2
 3     2     0     3     3
 4     2     0     2     3
 5     2     0     5     3
 6     1     1     6    NA
 7     1     1     7    NA
 8     2     1     7    NA
 9     2     1     8    NA
10     2     1     9    NA
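Note that the result above is still grouped and carries the helper column size, which is NA for the type-1 rows. If that gets in the way, a small variation along these lines should tidy it up. This is a sketch that only adds ungroup(), select() and arrange() to the same pipeline, and it still assumes type is coded 0/1 so that sum(type) counts the type-1 rows.

```r
library(dplyr)

df <- data.frame(
  group = c(1, 1, 1, 1, 1,  1, 1,    2, 2, 2, 2, 2, 2,  2, 2, 2),
  type  = c(0, 0, 0, 0, 0,  1, 1,    0, 0, 0, 0, 0, 0,  1, 1, 1),
  value = c(1, 2, 3, 4, 5,  6, 7,    1, 2, 3, 4, 5, 6,  7, 8, 9)
)

set.seed(123)
balanced <- df %>%
  group_by(group) %>%
  mutate(size = sum(type)) %>%                   # number of type-1 rows per group
  filter(type == 0) %>%
  slice(sample.int(n(), size = size[1L])) %>%    # sample that many type-0 rows
  ungroup() %>%
  bind_rows(filter(df, type == 1)) %>%           # add all type-1 rows back
  select(-size) %>%                              # drop the helper column
  arrange(group, type)

balanced
```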

Answer 5

Score: 0

I've generalized ThomasIsCoding's answer, and avoided the assumptions that are true for the example data, but false otherwise.

This is also the approach I used in the end:

df %>%
  arrange(group, type) %>%
  filter(row_number() %in% c(sample(which(type == 0),
                                    sum(type == 1)),
                             which(type == 1)),
         .by = group) %>%
  group_by(group, type) %>%
  summarise(n    = n(), 
            mean = mean(value))
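Since the original goal was to compare standard deviations, the summary step can be extended with sd(). The following is a sketch on the example data; the seed value and the use of summarise(.by = ...) in place of group_by() are my own choices, and .by needs dplyr 1.1.0 or later.

```r
library(dplyr)

df <- data.frame(
  group = c(1, 1, 1, 1, 1,  1, 1,    2, 2, 2, 2, 2, 2,  2, 2, 2),
  type  = c(0, 0, 0, 0, 0,  1, 1,    0, 0, 0, 0, 0, 0,  1, 1, 1),
  value = c(1, 2, 3, 4, 5,  6, 7,    1, 2, 3, 4, 5, 6,  7, 8, 9)
)

set.seed(1)  # arbitrary seed, only to make the sampled rows reproducible
result <- df %>%
  arrange(group, type) %>%
  filter(row_number() %in% c(sample(which(type == 0), sum(type == 1)),
                             which(type == 1)),
         .by = group) %>%
  summarise(n    = n(),
            mean = mean(value),
            sd   = sd(value),      # the statistic the question asks about
            .by  = c(group, type))

result
```

The type-1 rows are never sampled, so their summaries are fixed; only the type-0 rows change from one seed to the next.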

huangapple
  • Posted on August 8, 2023 at 21:49:37
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76860196.html