r: sampling within groups, matching subgroup sizes
Question
I have a very unbalanced dataset containing multiple groups, each divided into two types of observations.
Here's a synthetic example of the structure (actual data contains hundreds of groups and millions of observations):
df <- data.frame(
  group = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2),
  type  = c(0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1),
  value = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 9)
)
I need to compare the standard deviation of each type within each group. Something like this:
df %>%
  group_by(group, type) %>%
  ## something is missing here...
  summarise(n = n(),
            mean = mean(value))
However, the number of observations of each type is unbalanced, and this can bias the comparison.
I thought of reducing the number of "type 0" observations by sampling them, to match the number of "type 1" observations in each group.
I found some suggestions to use slice_sample(), but couldn't get it to work with this kind of group-type matching.
How can I sample from each pool of group+type0 observations, matching the size of each corresponding group+type1 pool?
Answer 1
Score: 1
You can simply use filter instead of slice_sample, like below:
df %>%
  filter(
    row_number() %in% c(sample(which(!type), sum(type)), which(type == 1)),
    .by = group
  )
which gives, for example:
   group type value
1      1    0     3
2      1    0     5
3      1    1     6
4      1    1     7
5      2    0     1
6      2    0     2
7      2    0     4
8      2    1     7
9      2    1     8
10     2    1     9
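One caveat with the call above: base R's sample(x, size) treats a single number x >= 1 as "sample from 1:x", and it raises an error when size is larger than length(x) without replacement. So a group with exactly one type-0 row, or with fewer type-0 than type-1 rows, may not behave as intended. The following is only a sketch of a more defensive variant; safe_sample is a hypothetical helper introduced here for illustration, not a dplyr or base R function.

```r
library(dplyr)

df <- data.frame(
  group = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2),
  type  = c(0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1),
  value = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 9)
)

# Hypothetical helper (an assumption, not an existing API): sample positions
# of `x` instead of `x` itself, and never request more elements than `x` has.
safe_sample <- function(x, size) {
  x[sample.int(length(x), min(size, length(x)))]
}

balanced <- df %>%
  filter(
    # keep a random subset of type-0 rows plus every type-1 row, per group
    row_number() %in% c(safe_sample(which(type == 0), sum(type == 1)),
                        which(type == 1)),
    .by = group
  )
```

Comparing with type == 0 and type == 1 explicitly also avoids relying on type being coded as numeric 0/1.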
Answer 2
Score: 0
You can try this code:
df %>%
  group_by(group) %>%
  tidyr::nest() %>%
  mutate(data = lapply(data, function(i) {
    type <- i$type
    ones_n <- length(type[type == 1])
    zeros <- type[type == 0]
    sampled_obs <- sample(seq_len(length(zeros)), size = ones_n)
    sampled_zeros <- seq_len(length(zeros)) %in% sampled_obs
    take_this <- rep(TRUE, length(type))
    take_this[type == 1] <- TRUE
    take_this[type == 0] <- sampled_zeros
    i$take_this <- take_this
    i
  })) %>%
  tidyr::unnest(data) %>%
  filter(take_this)
Here I use nesting, which allows custom operations inside each group. I couldn't get slice_sample to work, so I flag the rows to keep and filter on that flag instead. The idea is to mark every observation with take_this = TRUE/FALSE. A type-0 row is kept only if its index was sampled; type-1 rows are always kept.
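For readers who would still prefer slice_sample, one possible route is to call it once per (group, type) subgroup through group_modify, so that the sample size is an ordinary number computed from that subgroup's data. This is a sketch under that assumption, not a tested recommendation for millions of rows; n_keep is a helper column introduced here for illustration.

```r
library(dplyr)

df <- data.frame(
  group = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2),
  type  = c(0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1),
  value = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 9)
)

set.seed(42)
balanced <- df %>%
  group_by(group) %>%
  mutate(n_keep = sum(type == 1)) %>%   # target size: number of type-1 rows in the group
  group_by(group, type) %>%
  # .x is the plain tibble for one (group, type) subgroup, so `n` is a constant here;
  # type-1 subgroups are kept whole (in shuffled order), type-0 subgroups are sampled
  group_modify(~ slice_sample(.x, n = min(.x$n_keep[1], nrow(.x)))) %>%
  ungroup() %>%
  select(-n_keep)
```

The min() guard keeps every row of a type-0 subgroup that is already smaller than its type-1 counterpart.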
Answer 3
Score: 0
Here is a somewhat inelegant approach, but it should accomplish this. First, I create a variable that determines the number of rows to sample, based on the minimum number of observations when grouped by group and type. This part uses dplyr:
library(dplyr)
df2 <- df %>%
  mutate(nmax = max(row_number()), .by = c(group, type)) %>%
  mutate(sample_n = min(nmax), .by = group) %>% select(-nmax)
This creates a temporary variable sample_n that holds, for each group, the number of observations in its smallest type subgroup. Here we should have 2 for group == 1 and 3 for group == 2, since the type == 1 subgroups have sizes 2 and 3, respectively:
# unique(df2[, c("group", "sample_n")])
# group sample_n
# 1 1 2
# 8 2 3
Then in base R we can use split and sample with lapply:
set.seed(123)
ll <- lapply(split(df2, ~ df2$group + df2$type),
             function(x) x[sample(nrow(x), max(x$sample_n)), -4])
Then combine with do.call:
df_final <- do.call(rbind, ll)
df_final <- df_final[order(df_final$group),]
All together with output:
df2 <- df %>%
  mutate(nmax = max(row_number()), .by = c(group, type)) %>%
  mutate(sample_n = min(nmax), .by = group) %>% select(-nmax)
ll <- lapply(split(df2, ~ df2$group + df2$type),
             function(x) x[sample(nrow(x), max(x$sample_n)), -4])
df_final <- do.call(rbind, ll)
df_final <- df_final[order(df_final$group),]
#        group type value
# 1.0.3      1    0     3
# 1.0.2      1    0     2
# 1.1.7      1    1     7
# 1.1.6      1    1     6
# 2.0.10     2    0     3
# 2.0.9      2    0     2
# 2.0.12     2    0     5
# 2.1.14     2    1     7
# 2.1.15     2    1     8
# 2.1.16     2    1     9
Answer 4
Score: 0
Here's an alternative with slightly less code:
set.seed(123)
df %>%
  group_by(group) %>%
  mutate(size = sum(type)) %>%
  filter(type == 0) %>%
  slice(sample.int(n(), size = size[1L])) %>%
  bind_rows(filter(df, type == 1))
# A tibble: 10 × 4
# Groups:   group [2]
   group  type value  size
   <dbl> <dbl> <dbl> <dbl>
 1     1     0     3     2
 2     1     0     2     2
 3     2     0     3     3
 4     2     0     2     3
 5     2     0     5     3
 6     1     1     6    NA
 7     1     1     7    NA
 8     2     1     7    NA
 9     2     1     8    NA
10     2     1     9    NA
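In the output above, the helper column size is left in the result and is NA for the type-1 rows appended by bind_rows, and the result is still grouped. A possible tidy-up, sketched here as an assumption about what is usually wanted rather than as part of the original answer, is to ungroup and drop the helper column before binding, then sort:

```r
library(dplyr)

df <- data.frame(
  group = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2),
  type  = c(0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1),
  value = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 9)
)

set.seed(123)
balanced <- df %>%
  group_by(group) %>%
  mutate(size = sum(type == 1)) %>%            # number of type-1 rows per group
  filter(type == 0) %>%
  slice(sample.int(n(), size = size[1L])) %>%  # sample that many type-0 rows
  ungroup() %>%
  select(-size) %>%                            # drop the helper column
  bind_rows(filter(df, type == 1)) %>%         # add all type-1 rows back
  arrange(group, type)
```

As in the original, this assumes each group has at least as many type-0 rows as type-1 rows; otherwise sample.int raises an error.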
Answer 5
Score: 0
I've generalized ThomasIsCoding's answer and avoided the assumptions that hold for the example data but not in general.
This is also the approach I used in the end:
df %>%
  arrange(group, type) %>%
  filter(row_number() %in% c(sample(which(type == 0),
                                    sum(type == 1)),
                             which(type == 1)),
         .by = group) %>%
  group_by(group, type) %>%
  summarise(n = n(),
            mean = mean(value))
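Since the question asks about standard deviations while the snippets summarise the mean, the same pipeline can report sd(value) per subgroup. The sketch below assumes dplyr >= 1.1.0 (for .by) and uses sample.int on positions so that a single type-0 row is handled as expected; the seed is arbitrary.

```r
library(dplyr)

df <- data.frame(
  group = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2),
  type  = c(0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1),
  value = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 9)
)

set.seed(1)
res <- df %>%
  filter(
    # keep all type-1 rows, plus a random subset of type-0 rows of matching size
    type == 1 |
      row_number() %in% which(type == 0)[
        sample.int(sum(type == 0), min(sum(type == 0), sum(type == 1)))
      ],
    .by = group
  ) %>%
  summarise(n = n(), sd = sd(value), .by = c(group, type))

res
```

The type-1 rows are never sampled, so their standard deviations do not depend on the seed; the type-0 values change from draw to draw.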