August 8, 2023 21:49:37

R: sampling within groups, matching subgroup sizes

Question

I have a very unbalanced dataset containing multiple groups, each divided into two types of observations.
Here's a synthetic example of the structure (actual data contains hundreds of groups and millions of observations):

df <- data.frame(
  group = c(1, 1, 1, 1, 1,  1, 1,    2, 2, 2, 2, 2, 2,  2, 2, 2),
  type  = c(0, 0, 0, 0, 0,  1, 1,    0, 0, 0, 0, 0, 0,  1, 1, 1),
  value = c(1, 2, 3, 4, 5,  6, 7,    1, 2, 3, 4, 5, 6,  7, 8, 9)
)

I need to compare the standard deviation of each type within each group. Something like this:

df %>%
  group_by(group, type) %>%
  ## something is missing here...
  summarise(n    = n(), 
            mean = mean(value))

However, the number of observations of each type is unbalanced, and this can bias the comparison.

I would like to downsample the "type 0" observations so that their number matches the number of "type 1" observations in each group.
I found some suggestions to use slice_sample(), but couldn't get it to work for this kind of group-type matching.

How can I sample from each pool of group+type0 observations, matching the size of each corresponding group+type1 pool?

Answer 1

Score: 1

You can simply use filter() instead of slice_sample(), as below:

df %>%
    filter(
        row_number() %in% c(sample(which(!type), sum(type)), which(type == 1)),
        .by = group
    )

which gives, for example

   group type value
1      1    0     3
2      1    0     5
3      1    1     6
4      1    1     7
5      2    0     1
6      2    0     2
7      2    0     4
8      2    1     7
9      2    1     8
10     2    1     9

Answer 2

Score: 0

You can try this code:

df %>%
  group_by(group) %>%
  tidyr::nest() %>%
  mutate(data = lapply(data, function(i) {
    type <- i$type
    ones_n <- length(type[type == 1])     # target size: number of type-1 rows
    zeros <- type[type == 0]
    # positions (among the type-0 rows) that are drawn
    sampled_obs <- sample(seq_len(length(zeros)), size = ones_n)
    sampled_zeros <- seq_len(length(zeros)) %in% sampled_obs
    take_this <- rep(TRUE, length(type))  # type-1 rows are always kept
    take_this[type == 0] <- sampled_zeros
    i$take_this <- take_this
    i
  })) %>%
  tidyr::unnest(data) %>%
  filter(take_this)

Here I use nesting, which allows custom operations inside each group. I couldn't get slice_sample() to work, so I flag rows and use filter() instead. The idea is to mark every observation with take_this = TRUE/FALSE: type-1 rows are always kept, and a type-0 row is kept only if its index was sampled.
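For comparison, here is a minimal sketch of how slice_sample() might be made to work for this matching, assuming dplyr 1.0 or later. The target size is computed per group first and then passed to slice_sample() as a plain number inside group_modify(). The names n_target, rows, key and balanced are illustrative and not part of the original answers.

```r
library(dplyr)

df <- data.frame(
  group = c(1, 1, 1, 1, 1,  1, 1,    2, 2, 2, 2, 2, 2,  2, 2, 2),
  type  = c(0, 0, 0, 0, 0,  1, 1,    0, 0, 0, 0, 0, 0,  1, 1, 1),
  value = c(1, 2, 3, 4, 5,  6, 7,    1, 2, 3, 4, 5, 6,  7, 8, 9)
)

set.seed(123)
balanced <- df %>%
  group_by(group) %>%
  mutate(n_target = sum(type == 1)) %>%   # size of the type-1 pool in this group
  group_by(group, type) %>%
  group_modify(function(rows, key) {
    # rows holds one group+type pool; k is an ordinary number here
    k <- min(rows$n_target[1], nrow(rows))
    slice_sample(rows, n = k)
  }) %>%
  ungroup() %>%
  select(-n_target)

# Check that the pools are balanced within each group
count(balanced, group, type)
```

Type-1 pools are returned whole (in shuffled order) because their size equals the target. The min() guard is an added assumption: a group with fewer type-0 than type-1 rows is left with all of its type-0 rows instead of raising an error.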

Answer 3

Score: 0

Here is a somewhat inelegant approach, but it should accomplish this. First, I create a variable that sets the sample size to the smallest number of observations across the types within each group. This part uses dplyr:

library(dplyr)
df2 <- df %>%
  mutate(nmax = max(row_number()), .by = c(group, type)) %>%
  mutate(sample_n = min(nmax), .by = group) %>% select(-nmax)

This creates a temporary variable sample_n that holds, for each group, the number of observations in its smallest type. Here we should have 2 for group == 1 and 3 for group == 2, since the type == 1 pools have sizes 2 and 3, respectively:

# unique(df2[, c("group", "sample_n")])
#   group sample_n
# 1     1        2
# 8     2        3

Then in base R we can use split and sample with lapply:

set.seed(123)
ll <- lapply(split(df2, ~ df2$group + df2$type), 
             function(x) x[sample(nrow(x), max(x$sample_n)), -4])

Then combine with do.call:

df_final <- do.call(rbind, ll)
df_final <- df_final[order(df_final$group),]

All together with output:

library(dplyr)
df2 <- df %>%
  mutate(nmax = max(row_number()), .by = c(group, type)) %>%
  mutate(sample_n = min(nmax), .by = group) %>% select(-nmax)
set.seed(123)
ll <- lapply(split(df2, ~ df2$group + df2$type), 
             function(x) x[sample(nrow(x), max(x$sample_n)), -4])
df_final <- do.call(rbind, ll)
df_final <- df_final[order(df_final$group),]
#       group type value
#1.0.3      1    0     3
#1.0.2      1    0     2
#1.1.7      1    1     7
#1.1.6      1    1     6
#2.0.10     2    0     3
#2.0.9      2    0     2
#2.0.12     2    0     5
#2.1.14     2    1     7
#2.1.15     2    1     8
#2.1.16     2    1     9

Answer 4

Score: 0

Here's an alternative with slightly less code:

set.seed(123)
df %>%
  group_by(group) %>%
  mutate(size = sum(type)) %>%
  filter(type == 0) %>%
  slice(sample.int(n(), size = size[1L])) %>%
  bind_rows(filter(df, type == 1))
# A tibble: 10 × 4
# Groups:   group [2]
   group  type value  size
   <dbl> <dbl> <dbl> <dbl>
 1     1     0     3     2
 2     1     0     2     2
 3     2     0     3     3
 4     2     0     2     3
 5     2     0     5     3
 6     1     1     6    NA
 7     1     1     7    NA
 8     2     1     7    NA
 9     2     1     8    NA
10     2     1     9    NA

Answer 5

Score: 0

I've generalized ThomasIsCoding's answer, and avoided the assumptions that are true for the example data, but false otherwise.

This is also the approach I used in the end:

df %>%
  arrange(group, type) %>%
  filter(row_number() %in% c(sample(which(type == 0),
                                    sum(type == 1)),
                             which(type == 1)),
         .by = group) %>%
  group_by(group, type) %>%
  summarise(n    = n(), 
            mean = mean(value))
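One caveat with this pattern: base R's sample(x, size) treats a single number x >= 1 as 1:x, so a group with exactly one type-0 row could end up sampling the wrong row indices. The sketch below guards against that with a small helper and also adds the standard deviation asked for in the question. The names sample_idx and result are hypothetical, and the min() cap on the sample size is an added assumption.

```r
library(dplyr)

df <- data.frame(
  group = c(1, 1, 1, 1, 1,  1, 1,    2, 2, 2, 2, 2, 2,  2, 2, 2),
  type  = c(0, 0, 0, 0, 0,  1, 1,    0, 0, 0, 0, 0, 0,  1, 1, 1),
  value = c(1, 2, 3, 4, 5,  6, 7,    1, 2, 3, 4, 5, 6,  7, 8, 9)
)

# Hypothetical helper: draw up to k elements of idx by position, which
# avoids sample()'s "single number means 1:n" behaviour.
sample_idx <- function(idx, k) {
  idx[sample.int(length(idx), min(k, length(idx)))]
}

set.seed(123)
result <- df %>%
  group_by(group) %>%
  # keep all type-1 rows plus an equally sized random subset of type-0 rows
  filter(row_number() %in% c(sample_idx(which(type == 0), sum(type == 1)),
                             which(type == 1))) %>%
  group_by(group, type) %>%
  summarise(n    = n(),
            mean = mean(value),
            sd   = sd(value),
            .groups = "drop")

result
```

The type-1 rows of `result` are deterministic (all of them are kept), while the type-0 rows depend on the seed.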
