R: Weighted Bootstrap in R
Question
I am working with the R programming language.
I am familiar with the general bootstrap procedure (https://en.wikipedia.org/wiki/Bootstrapping_(statistics)):
- Suppose you have a dataset of size "n"
- Take a random sample with replacement of size "n"
- Take the mean of this random sample
- Repeat the above steps many times
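For reference, the plain (unweighted) procedure listed above can be sketched in a few lines of R. This is only a minimal illustration: the data vector `x`, the repetition count `n_reps`, and the seed are assumptions made up for the example, not part of the original question.

```r
# Minimal sketch of the ordinary bootstrap of the mean.
# x, n_reps and the seed are illustrative assumptions.
set.seed(42)                      # only for reproducibility
x <- c(1, 2, 3, 4, 5)             # dataset of size n
n_reps <- 1000                    # number of bootstrap repetitions

# Each repetition: draw n values with replacement, then take their mean.
boot_means <- replicate(
  n_reps,
  mean(sample(x, size = length(x), replace = TRUE))
)

mean(boot_means)                  # centre of the bootstrap distribution
sd(boot_means)                    # bootstrap estimate of the standard error
```

Without a `prob` argument, `sample()` gives every observation the same chance of being drawn, which is the case the weighted version generalizes.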
My Question: I am interested in extending this to the "weighted bootstrap" - that is, now each observation has an associated probability of being selected.
Here is my attempt to write the R code for this:
# function to calculate the weighted mean (inputs: data x and weights w)
weighted_mean <- function(x, w) {
  sum(x * w) / sum(w)
}
# function that performs random sampling with replacement where the probability of selecting any point is proportional to the assigned weight (inputs: R is the number of bootstrap repetitions)
weighted_bootstrap <- function(data, weights, R) {
  estimates <- numeric(R)
  for (i in seq_len(R)) {
    bootstrap_sample <- sample(data, size = length(data), replace = TRUE, prob = weights)
    estimates[i] <- weighted_mean(bootstrap_sample, weights)
  }
  estimates
}
Here is how this weighted bootstrap function would be used on some data (note that the weights must add to 1):
data <- c(1, 2, 3, 4, 5)
weights <- c(0.1, 0.2, 0.3, 0.2, 0.2)
R <- 1000
estimates <- weighted_bootstrap(data, weights, R)
plot(hist(estimates))
Can someone please tell me if I have understood this correctly?
Thanks!
Answer 1
Score: 3
As currently implemented, the weights are used twice.
The first use, in the sample() function, is correct.
The second use, in the weighted_mean() function, produces wrong results. The weight vector is never reordered to follow the resampled values, so, for example, the fifth weight is always applied to whichever observation happens to land in the fifth position of the bootstrap sample, regardless of which original observation it is.
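The mismatch can be made visible by sampling indices instead of values. The sketch below is only an illustration; the index vector `idx`, the `comparison` data frame, and the seed are names and choices introduced here, not part of the original code.

```r
# Illustrative sketch: compare the weight that belongs to each drawn value
# with the weight the question's code applies by position.
set.seed(1)                       # arbitrary seed, only for reproducibility
data <- c(1, 2, 3, 4, 5)
weights <- c(0.1, 0.2, 0.3, 0.2, 0.2)

# Sample indices so we can track which original observation was drawn.
idx <- sample(seq_along(data), size = length(data), replace = TRUE, prob = weights)

comparison <- data.frame(
  position = seq_along(idx),
  drawn_value = data[idx],
  weight_of_drawn_value = weights[idx],    # weight tied to the observation
  weight_applied_by_position = weights     # weight used by weighted_mean()
)
print(comparison)
```

Wherever the last two columns differ, the weighted mean in the question multiplies a resampled value by a weight that belongs to a different observation.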
Therefore, to achieve what you want, the code would be:
weighted_bootstrap <- function(data, weights, R) {
  estimates <- numeric(R)
  for (i in seq_len(R)) {
    bootstrap_sample <- sample(data, size = length(data), replace = TRUE, prob = weights)
    estimates[i] <- mean(bootstrap_sample)
  }
  return(estimates)
}
data <- c(1, 2, 3, 4, 5)
weights <- c(0.1, 0.2, 0.3, 0.2, 0.2)
R <- 1000
estimates <- weighted_bootstrap(data, weights, R)
hist(estimates)
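One way to sanity-check the corrected function is to compare the average of the bootstrap estimates with the weighted mean of the original data, since each draw has expected value sum(data * weights) / sum(weights). The sketch below repeats the function so that it runs on its own; the seed, the larger repetition count, and the names `target` and `check_estimates` are assumptions chosen for the illustration.

```r
# Sanity-check sketch: the plain mean of weighted resamples should centre
# on the weighted mean of the original data.
weighted_bootstrap <- function(data, weights, R) {
  estimates <- numeric(R)
  for (i in seq_len(R)) {
    bootstrap_sample <- sample(data, size = length(data), replace = TRUE, prob = weights)
    estimates[i] <- mean(bootstrap_sample)
  }
  estimates
}

set.seed(123)                     # arbitrary seed, only for reproducibility
data <- c(1, 2, 3, 4, 5)
weights <- c(0.1, 0.2, 0.3, 0.2, 0.2)

target <- sum(data * weights) / sum(weights)   # weighted mean, 3.2 for this data
check_estimates <- weighted_bootstrap(data, weights, R = 10000)

target
mean(check_estimates)             # should be close to target
```

The two printed values will not match exactly because of sampling noise, but the gap should shrink as the number of repetitions grows.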