英文:
R: Randomly Sampling Groups of Coin Flips
问题
我正在使用R编程语言工作。
假设:
- 有一枚硬币,正面朝上的概率为0.6,反面朝上的概率也为0.6
- 一个班级里有100名学生
- 每名学生随机翻转这枚硬币多次
- 学生n的最后一次翻转不会影响学生n+1的第一次翻转(即当下一个学生翻转硬币时,第一次翻转为正面或反面的概率为0.5,但对于这名学生的下一次翻转取决于上一次翻转)
这是表示这个问题的一些R代码。
从这些数据中,我们可以计算HH、HT、TH和TT序列出现的次数。
我的问题是:使用这些数据,我尝试完成以下任务:
- 步骤1:使用带替换的抽样,选择100名学生
- 步骤2:对于在步骤1中选择的给定学生,随机选择他们翻转序列中的一个位置(例如称为位置“x”)
- 步骤3:对于同一名学生,随机选择他们翻转序列中的第二个位置(例如称为位置“y”),使y > x(即“y”在“x”之后)
- 步骤4:对步骤1中选择的所有学生重复步骤2和步骤3
- 步骤5:计算步骤1中选择的所有学生的HH、HT、TH和TT序列出现的次数
- 步骤6:重复步骤1到步骤5多次(例如1000次)
这是我自己尝试解决问题的方法。我不确定是否做得正确。能否有人帮我确认一下?
例如,假设学生15有6次翻转:H, H, T, H, T, T
- 如果x = 2,y = 5,那么我们将得到H, T, H, T
。
我的问题是:尽管代码似乎已经运行,但我不确定我是否做得正确。有人能帮我确认一下吗?
例如 - HH和TT的线条不应该几乎相同吗?TH和HT的线条不应该几乎相同吗?然而在我的图表中,这明显不是这样。我认为,在给定的迭代中,如果同一名学生在重新抽样的数据集中出现3次,那么第一次的最后一个转换将“泄漏”到第二次的第一次转换中,从而影响结果。
谢谢!
英文:
I am working with the R programming language.
Suppose:
- There is a coin where if it lands head then the probability of the next flip being heads is 0.6 (and if tails then the next flip being tails is also 0.6)
- There are 100 students in a class
- Each student flips this coin a random number of times
- The last flip of student_n does not influence the first flip of student_n+1 (i.e. when the next student flips the coin, the first flip has 0.5 probability of heads or tails, but the next flip for this student depends on the previous flip)
Here is some R code to represent this problem:
# https://stackoverflow.com/questions/76192042/r-verifying-the-results-of-coin-flips
library(tidyverse)
set.seed(123)
ids <- 1:100
student_id <- sort(sample(ids, 100000, replace = TRUE))
coin_result <- character(1000)
coin_result[1] <- sample(c("H", "T"), 1)
for (i in 2:length(coin_result)) {
if (student_id[i] != student_id[i-1]) {
coin_result[i] <- sample(c("H", "T"), 1)
} else if (coin_result[i-1] == "H") {
coin_result[i] <- sample(c("H", "T"), 1, prob = c(0.6, 0.4))
} else {
coin_result[i] <- sample(c("H", "T"), 1, prob = c(0.4, 0.6))
}
}
#tidy up
my_data <- data.frame(student_id, coin_result)
my_data <- my_data[order(my_data$student_id),]
final <- my_data %>%
group_by(student_id) %>%
mutate(flip_number = row_number())
From this data, we can count the number of times sequences of HH, HT, TH and TT occurred:
head(my_data)
# A tibble: 6 x 3
# Groups: student_id [1]
student_id coin_result flip_number
<int> <chr> <int>
1 1 H 1
2 1 H 2
3 1 H 3
4 1 H 4
5 1 T 5
6 1 H 6
my_data %>%
group_by(student_id) %>%
summarize(Sequence = str_c(coin_result, lead(coin_result)), .groups = 'drop') %>%
filter(!is.na(Sequence)) %>%
count(Sequence)
# A tibble: 4 × 2
Sequence n
<chr> <int>
1 HH 29763
2 HT 19782
3 TH 19775
4 TT 30580
My Problem: Using this data, I am trying to accomplish the following task:
- Step 1: Using sampling with replacement, select 100 students
- Step 2: For a given student that is selected in Step 1, randomly choose a position in their sequence of flips (e.g. call this position "x")
- Step 3: For this same student, randomly choose a second position in their sequence of flips (e.g. call this position "y") such that y > x (i.e. "y" comes after "x").
- Step 4: Repeat Step 2 and Step 3 for all students selected in Step 1
- Step 5: Count the number of times sequences of HH, HT, TH and TT occurred for all students selected in Step 1
- Step 6: Repeat Step 1 - Step 5 many times (e.g. 1000 times)
As an example, suppose Student 15 has 6 Flips : H, H, T, H, T, T
- if x = 2 and y = 5, then we would have H, T, H, T
Here is my own attempt at solving my problem:
# Set the number of iterations
k <- 1000
# Initialize a data frame
results <- data.frame(iteration_number = numeric(0),
h_given_h = numeric(0),
h_given_t = numeric(0),
t_given_h = numeric(0),
t_given_t = numeric(0))
# Set the number of students to sample
n_students <- length(unique(my_data$student_id))
# Loop over the number of iterations
for (i in 1:k) {
# Randomly sample student ids with replacement
sampled_ids <- sample(ids, n_students, replace = TRUE)
# Initialize a data frame to store the sampled data
sampled_data <- data.frame(student_id = integer(0), coin_result = character(0), stringsAsFactors = FALSE)
# LOOP
for (j in sampled_ids) {
# Get data for the current student
student_data <- my_data[my_data$student_id == j, ]
# Randomly choose a starting and ending point
x <- sample(nrow(student_data), 1)
y <- sample(x:nrow(student_data), 1)
# Select the data between the starting and ending point
selected_data <- student_data[x:y, ]
# Append the selected data to the sampled data frame
sampled_data <- rbind(sampled_data, selected_data)
}
final <- sampled_data %>%
group_by(student_id) %>%
mutate(flip_number = row_number())
# Calculate the conditional probabilities
cond_prob <- final %>%
group_by(student_id) %>%
summarize(Sequence = str_c(coin_result, lead(coin_result)), .groups = 'drop') %>%
filter(!is.na(Sequence)) %>%
count(Sequence) %>%
mutate(prob = n / sum(n))
# Extract probabilities
p_HH <- cond_prob$prob[cond_prob$Sequence == "HH"]
p_HT <- cond_prob$prob[cond_prob$Sequence == "HT"]
p_TH <- cond_prob$prob[cond_prob$Sequence == "TH"]
p_TT <- cond_prob$prob[cond_prob$Sequence == "TT"]
#print(i)
# Append
results[i, ] <- c(i, p_HH, p_HT, p_TH, p_TT)
}
colnames(results) <- c("iteration_number", "h_given_h", "h_given_t", "t_given_h", "t_given_t")
library(ggplot2)
# Convert to long
results_long <- tidyr::pivot_longer(results, cols = c(h_given_h, h_given_t, t_given_h, t_given_t), names_to = "condition", values_to = "probability")
# Plot
ggplot(results_long, aes(x = iteration_number, y = probability, color = condition)) +
geom_line() +
labs(x = "Iteration", y = "Probability", color = "Condition")
My Question: While the code seems to have run, I am not sure if I have done this correctly. Can someone please help me confirm this?
For example - shouldn't the lines for HH and TT be almost identical .... and shouldn't the lines of TH and HT be almost identical? Yet in my graph, this clearly not the case? The way I see it, within a given iteration, if the same student appears 3 times in the re-sampled dataset, the last transition from the first time will "leak" into the the first transition from the second time and thus compromise the results.
Thanks!
答案1
得分: 1
以下是您要翻译的内容:
"I believe that we can find the result you want by doing repetitions of the simulation process together with the sampling process. I will demonstrate this by creating functions to simulate the data and sample it according to the code you wrote
Function to simulate data
sim_data <- function(n_students){
ids <- 1:n_students
student_id <- sort(sample(ids, 100000, replace = TRUE))
coin_result <- character(1000)
coin_result[1] <- sample(c("H", "T"), 1)
for (i in 2:length(coin_result)) {
if (student_id[i] != student_id[i-1]) {
coin_result[i] <- sample(c("H", "T"), 1)
} else if (coin_result[i-1] == "H") {
coin_result[i] <- sample(c("H", "T"), 1, prob = c(0.6, 0.4))
} else {
coin_result[i] <- sample(c("H", "T"), 1, prob = c(0.4, 0.6))
}
}
data.frame(student_id, coin_result) %>%
arrange(student_id) %>%
group_by(student_id) %>%
mutate(flip_number = row_number())
}
dt <- sim_data(100)
Function for sampling
Sampling <- function(my_data, n_sample = 100){
# Step 1
n_students <- length(unique(my_data$student_id))
sampled_ids <- sample(1:n_students, n_sample, replace = TRUE)
sampled_data <- data.frame(student_id = integer(0),
coin_result = character(0),
stringsAsFactors = FALSE)
for (j in seq_along(sampled_ids)) {
# Step 2
student_data <- my_data[my_data$student_id == sampled_ids[j], ]
x <- sample(nrow(student_data), 1)
# Step 3
# There is a edge case where position x is the last flip in the sequence, for
# all other cases we shouldn't use position "x" in the sample
if (x == nrow(student_data)) y <- x else
y <- sample((x + 1):nrow(student_data), 1)
selected_data <- student_data[x:y,] %>%
mutate(loop_id = j)
sampled_data <- rbind(sampled_data, selected_data)
}
# Step 5
cond_prob <-
sampled_data %>%
group_by(loop_id) %>%
mutate(flip_number = row_number()) %>%
mutate(Sequence = str_c(coin_result, lead(coin_result))) %>%
filter(!is.na(Sequence)) %>%
ungroup() %>%
count(Sequence) %>%
mutate(prob = n / sum(n))
p_HH <- cond_prob$prob[cond_prob$Sequence == "HH"]
p_HT <- cond_prob$prob[cond_prob$Sequence == "HT"]
p_TH <- cond_prob$prob[cond_prob$Sequence == "TH"]
p_TT <- cond_prob$prob[cond_prob$Sequence == "TT"]
c("h_given_h" = p_HH,"h_given_t" = p_HT, "t_given_h" = p_TH, "t_given_t" = p_TT)
}
Comparing the methods across k = 100
replications we get the following results:
n_replicate <- 100
results_1 <-
replicate(n_replicate,Sampling(dt)) %>%
t() %>%
as.data.frame() %>%
rowid_to_column('iteration_number')
results_2 <- replicate(n_replicate,{
sim_data(100) %>%
Sampling()
} ) %>%
t() %>%
as.data.frame() %>%
rowid_to_column('iteration_number')
bind_rows("OP" = results_1,"Sample different data" = results_2, .id = "sample_method") %>%
tidyr::pivot_longer(cols = c(h_given_h, h_given_t, t_given_h, t_given_t), names_to = "condition", values_to = "probability") %>%
summarise(prob = mean(probability), .by = c(sample_method, condition)) %>%
pivot_wider(names_from = sample_method, values_from = prob )
#> A tibble: 4 × 3
#> condition OP `Sample different data`
#> <chr> <dbl> <dbl>
#> 1 h_given_h 0.283 0.305
#> 2 h_given_t 0.203 0.199
#> 3 t_given_h 0.203 0.199
#> 4 t_given_t 0.311 0.297
library(ggplot2)
bind_rows(results_1, results_2, .id = "sample_method") %>%
tidyr::pivot_longer(cols = c(h_given_h, h_given_t, t_given_h, t_given_t), names_to = "condition", values_to = "probability") %>%
ggplot(aes(x = iteration_number, y = probability, color = condition)) +
geom_line() +
labs(x = "Iteration", y = "Probability", color = "Condition") +
facet_wrap(~sample_method, labeller = labeller(sample_method = c("1" = "OP", "2" = "Sample different data") ) )
<!-- -->
<sup>Created on 2023-05-11 with reprex v2.0.2</sup>
Edit
The way
英文:
I believe that we can find the result you want by doing repetitions of the simulation process together with the sampling process. I will demonstrate this by creating functions to simulate the data and sample it according to the code you wrote
Function to simulate data
sim_data <- function(n_students){
ids <- 1:n_students
student_id <- sort(sample(ids, 100000, replace = TRUE))
coin_result <- character(1000)
coin_result[1] <- sample(c("H", "T"), 1)
for (i in 2:length(coin_result)) {
if (student_id[i] != student_id[i-1]) {
coin_result[i] <- sample(c("H", "T"), 1)
} else if (coin_result[i-1] == "H") {
coin_result[i] <- sample(c("H", "T"), 1, prob = c(0.6, 0.4))
} else {
coin_result[i] <- sample(c("H", "T"), 1, prob = c(0.4, 0.6))
}
}
data.frame(student_id, coin_result) %>%
arrange(student_id) %>%
group_by(student_id) %>%
mutate(flip_number = row_number())
}
dt <- sim_data(100)
Function for sampling
Sampling <- function(my_data, n_sample = 100){
# Step 1
n_students <- length(unique(my_data$student_id))
sampled_ids <- sample(1:n_students, n_sample, replace = TRUE)
sampled_data <- data.frame(student_id = integer(0),
coin_result = character(0),
stringsAsFactors = FALSE)
for (j in seq_along(sampled_ids)) {
# Step 2
student_data <- my_data[my_data$student_id == sampled_ids[j], ]
x <- sample(nrow(student_data), 1)
# Step 3
# There is a edge case where position x is the last flip in the sequence, for
# all other cases we shouldn't use position "x" in the sample
if (x == nrow(student_data)) y <- x else
y <- sample((x + 1):nrow(student_data), 1)
selected_data <- student_data[x:y,] %>%
mutate(loop_id = j)
sampled_data <- rbind(sampled_data, selected_data)
}
# Step 5
cond_prob <-
sampled_data %>%
group_by(loop_id) %>%
mutate(flip_number = row_number()) %>%
mutate(Sequence = str_c(coin_result, lead(coin_result))) %>%
filter(!is.na(Sequence)) %>%
ungroup() %>%
count(Sequence) %>%
mutate(prob = n / sum(n))
p_HH <- cond_prob$prob[cond_prob$Sequence == "HH"]
p_HT <- cond_prob$prob[cond_prob$Sequence == "HT"]
p_TH <- cond_prob$prob[cond_prob$Sequence == "TH"]
p_TT <- cond_prob$prob[cond_prob$Sequence == "TT"]
c("h_given_h" = p_HH,"h_given_t" = p_HT, "t_given_h" = p_TH, "t_given_t" = p_TT)
}
Comparing the methods across k = 100
replications we get the following results:
n_replicate <- 100
results_1 <-
replicate(n_replicate,Sampling(dt)) %>%
t() %>%
as.data.frame() %>%
rowid_to_column('iteration_number')
results_2 <- replicate(n_replicate,{
sim_data(100) %>%
Sampling()
} ) %>%
t() %>%
as.data.frame() %>%
rowid_to_column('iteration_number')
bind_rows("OP" = results_1,"Sample different data" = results_2, .id = "sample_method") %>%
tidyr::pivot_longer(cols = c(h_given_h, h_given_t, t_given_h, t_given_t), names_to = "condition", values_to = "probability") %>%
summarise(prob = mean(probability), .by = c(sample_method, condition)) %>%
pivot_wider(names_from = sample_method, values_from = prob )
#> A tibble: 4 × 3
#> condition OP `Sample different data`
#> <chr> <dbl> <dbl>
#> 1 h_given_h 0.283 0.305
#> 2 h_given_t 0.203 0.199
#> 3 t_given_h 0.203 0.199
#> 4 t_given_t 0.311 0.297
library(ggplot2)
bind_rows(results_1, results_2, .id = "sample_method") %>%
tidyr::pivot_longer(cols = c(h_given_h, h_given_t, t_given_h, t_given_t), names_to = "condition", values_to = "probability") %>%
ggplot(aes(x = iteration_number, y = probability, color = condition)) +
geom_line() +
labs(x = "Iteration", y = "Probability", color = "Condition") +
facet_wrap(~sample_method, labeller = labeller(sample_method = c("1" = "OP", "2" = "Sample different data") ) )
<!-- -->
<sup>Created on 2023-05-11 with reprex v2.0.2</sup>
Edit
The way the conditional probability is calculated is also not ideal. The best way to estimate the probability of "H" given "H", would be to do the number of coins that had the result "H" and on the next coin the result "H" again. The way you are presenting it is closer to estimating the marginal probability of the combinations P(HH) and P(HT) than P(H|H) and P(T|H).
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论