R: 随机抽样硬币投掷组

huangapple go评论66阅读模式
英文:

R: Randomly Sampling Groups of Coin Flips

问题

我正在使用R编程语言工作。

假设:

  • 有一枚硬币,正面朝上的概率为0.6,反面朝上的概率也为0.6
  • 一个班级里有100名学生
  • 每名学生随机翻转这枚硬币多次
  • 学生n的最后一次翻转不会影响学生n+1的第一次翻转(即当下一个学生翻转硬币时,第一次翻转为正面或反面的概率为0.5,但对于这名学生的下一次翻转取决于上一次翻转)

这是表示这个问题的一些R代码。

从这些数据中,我们可以计算HH、HT、TH和TT序列出现的次数。

我的问题是:使用这些数据,我尝试完成以下任务:

  • 步骤1:使用带替换的抽样,选择100名学生
  • 步骤2:对于在步骤1中选择的给定学生,随机选择他们翻转序列中的一个位置(例如称为位置“x”)
  • 步骤3:对于同一名学生,随机选择他们翻转序列中的第二个位置(例如称为位置“y”),使y > x(即“y”在“x”之后)
  • 步骤4:对步骤1中选择的所有学生重复步骤2和步骤3
  • 步骤5:计算步骤1中选择的所有学生的HH、HT、TH和TT序列出现的次数
  • 步骤6:重复步骤1到步骤5多次(例如1000次)

这是我自己尝试解决问题的方法。我不确定是否做得正确。能否有人帮我确认一下?

例如,假设学生15有6次翻转:H, H, T, H, T, T - 如果x = 2,y = 5,那么我们将得到H, T, H, T

我的问题是:尽管代码似乎已经运行,但我不确定我是否做得正确。有人能帮我确认一下吗?

例如 - HH和TT的线条不应该几乎相同吗?TH和HT的线条不应该几乎相同吗?然而在我的图表中,这明显不是这样。我认为,在给定的迭代中,如果同一名学生在重新抽样的数据集中出现3次,那么第一次的最后一个转换将“泄漏”到第二次的第一次转换中,从而影响结果。

谢谢!

英文:

I am working with the R programming language.

Suppose:

  • There is a coin where if it lands head then the probability of the next flip being heads is 0.6 (and if tails then the next flip being tails is also 0.6)
  • There are 100 students in a class
  • Each student flips this coin a random number of times
  • The last flip of student_n does not influence the first flip of student_n+1 (i.e. when the next student flips the coin, the first flip has 0.5 probability of heads or tails, but the next flip for this student depends on the previous flip)

Here is some R code to represent this problem:

# https://stackoverflow.com/questions/76192042/r-verifying-the-results-of-coin-flips
library(tidyverse)

set.seed(123)
ids <- 1:100
student_id <- sort(sample(ids, 100000, replace = TRUE))
coin_result <- character(1000)
coin_result[1] <- sample(c("H", "T"), 1)

for (i in 2:length(coin_result)) {
  if (student_id[i] != student_id[i-1]) {
    coin_result[i] <- sample(c("H", "T"), 1)
  } else if (coin_result[i-1] == "H") {
    coin_result[i] <- sample(c("H", "T"), 1, prob = c(0.6, 0.4))
  } else {
    coin_result[i] <- sample(c("H", "T"), 1, prob = c(0.4, 0.6))
  }
}

#tidy up
my_data <- data.frame(student_id, coin_result)
my_data <- my_data[order(my_data$student_id),]

final <- my_data %>%
    group_by(student_id) %>%
    mutate(flip_number = row_number())

From this data, we can count the number of times sequences of HH, HT, TH and TT occurred:

head(my_data)
# A tibble: 6 x 3
# Groups:   student_id [1]
  student_id coin_result  flip_number
       <int> <chr>              <int>
1          1 H                      1
2          1 H                      2
3          1 H                      3
4          1 H                      4
5          1 T                      5
6          1 H                      6

my_data %>%
  group_by(student_id) %>%
  summarize(Sequence = str_c(coin_result, lead(coin_result)), .groups = 'drop') %>%
  filter(!is.na(Sequence)) %>%
  count(Sequence)

# A tibble: 4 × 2
  Sequence     n
  <chr>    <int>
1 HH       29763
2 HT       19782
3 TH       19775
4 TT       30580

My Problem: Using this data, I am trying to accomplish the following task:

  • Step 1: Using sampling with replacement, select 100 students
  • Step 2: For a given student that is selected in Step 1, randomly choose a position in their sequence of flips (e.g. call this position "x")
  • Step 3: For this same student, randomly choose a second position in their sequence of flips (e.g. call this position "y") such that y > x (i.e. "y" comes after "x").
  • Step 4: Repeat Step 2 and Step 3 for all students selected in Step 1
  • Step 5: Count the number of times sequences of HH, HT, TH and TT occurred for all students selected in Step 1
  • Step 6: Repeat Step 1 - Step 5 many times (e.g. 1000 times)

As an example, suppose Student 15 has 6 Flips : H, H, T, H, T, T - if x = 2 and y = 5, then we would have H, T, H, T

Here is my own attempt at solving my problem:

# Set the number of iterations
k <- 1000

# Initialize a data frame 
results <- data.frame(iteration_number = numeric(0),
                      h_given_h = numeric(0),
                      h_given_t = numeric(0),
                      t_given_h = numeric(0),
                      t_given_t = numeric(0))

# Set the number of students to sample
n_students <- length(unique(my_data$student_id))

# Loop over the number of iterations
for (i in 1:k) {
  # Randomly sample student ids with replacement
  sampled_ids <- sample(ids, n_students, replace = TRUE)
  
  # Initialize a data frame to store the sampled data
  sampled_data <- data.frame(student_id = integer(0), coin_result = character(0), stringsAsFactors = FALSE)
  
  # LOOP
  for (j in sampled_ids) {
    # Get data for the current student
    student_data <- my_data[my_data$student_id == j, ]
    
    # Randomly choose a starting and ending point
    x <- sample(nrow(student_data), 1)
    y <- sample(x:nrow(student_data), 1)
    
    # Select the data between the starting and ending point
    selected_data <- student_data[x:y, ]
    
    # Append the selected data to the sampled data frame
    sampled_data <- rbind(sampled_data, selected_data)
  }
  
  final <- sampled_data %>%
    group_by(student_id) %>%
    mutate(flip_number = row_number())
  
  # Calculate the conditional probabilities
  cond_prob <- final %>%
    group_by(student_id) %>%
    summarize(Sequence = str_c(coin_result, lead(coin_result)), .groups = 'drop') %>%
    filter(!is.na(Sequence)) %>%
    count(Sequence) %>%
    mutate(prob = n / sum(n))
  
  # Extract probabilities
  p_HH <- cond_prob$prob[cond_prob$Sequence == "HH"]
  p_HT <- cond_prob$prob[cond_prob$Sequence == "HT"]
  p_TH <- cond_prob$prob[cond_prob$Sequence == "TH"]
  p_TT <- cond_prob$prob[cond_prob$Sequence == "TT"]

  #print(i)
  # Append 
  results[i, ] <- c(i, p_HH, p_HT, p_TH, p_TT)
}

colnames(results) <- c("iteration_number", "h_given_h", "h_given_t", "t_given_h", "t_given_t")

library(ggplot2)

# Convert to long 
results_long <- tidyr::pivot_longer(results, cols = c(h_given_h, h_given_t, t_given_h, t_given_t), names_to = "condition", values_to = "probability")

# Plot
ggplot(results_long, aes(x = iteration_number, y = probability, color = condition)) +
  geom_line() +
  labs(x = "Iteration", y = "Probability", color = "Condition")

R: 随机抽样硬币投掷组

My Question: While the code seems to have run, I am not sure if I have done this correctly. Can someone please help me confirm this?

For example - shouldn't the lines for HH and TT be almost identical .... and shouldn't the lines of TH and HT be almost identical? Yet in my graph, this clearly not the case? The way I see it, within a given iteration, if the same student appears 3 times in the re-sampled dataset, the last transition from the first time will "leak" into the the first transition from the second time and thus compromise the results.

Thanks!

答案1

得分: 1

以下是您要翻译的内容:

"I believe that we can find the result you want by doing repetitions of the simulation process together with the sampling process. I will demonstrate this by creating functions to simulate the data and sample it according to the code you wrote

Function to simulate data

sim_data <- function(n_students){
ids <- 1:n_students
student_id <- sort(sample(ids, 100000, replace = TRUE))
coin_result <- character(1000)
coin_result[1] <- sample(c("H", "T"), 1)

for (i in 2:length(coin_result)) {
  if (student_id[i] != student_id[i-1]) {
    coin_result[i] <- sample(c("H", "T"), 1)
  } else if (coin_result[i-1] == "H") {
    coin_result[i] <- sample(c("H", "T"), 1, prob = c(0.6, 0.4))
  } else {
    coin_result[i] <- sample(c("H", "T"), 1, prob = c(0.4, 0.6))
  }
}

data.frame(student_id, coin_result) %>% 
  arrange(student_id) %>% 
  group_by(student_id) %>% 
  mutate(flip_number = row_number())
}

dt <- sim_data(100)

Function for sampling

Sampling <- function(my_data, n_sample = 100){

  # Step 1
  n_students <- length(unique(my_data$student_id))
  sampled_ids <- sample(1:n_students, n_sample, replace = TRUE)
  sampled_data <- data.frame(student_id = integer(0), 
coin_result = character(0), 
stringsAsFactors = FALSE)
  
  for (j in seq_along(sampled_ids)) {
  # Step 2
    student_data <- my_data[my_data$student_id == sampled_ids[j], ]
    
    x <- sample(nrow(student_data), 1)
    
  # Step 3  
    # There is a edge case where position x is the last flip in the sequence, for 
    # all other cases we shouldn't use position "x" in the sample
    if (x == nrow(student_data)) y <- x else
    y <- sample((x + 1):nrow(student_data), 1)
    selected_data <- student_data[x:y,] %>% 
    mutate(loop_id = j)
    
    sampled_data <- rbind(sampled_data, selected_data)
  }
  
  # Step 5  
  cond_prob <- 
    sampled_data %>%
    group_by(loop_id) %>%
    mutate(flip_number = row_number()) %>% 
    mutate(Sequence = str_c(coin_result, lead(coin_result))) %>%
    filter(!is.na(Sequence)) %>%
    ungroup() %>%
    count(Sequence) %>%
    mutate(prob = n / sum(n))
  
  
  p_HH <- cond_prob$prob[cond_prob$Sequence == "HH"]
  p_HT <- cond_prob$prob[cond_prob$Sequence == "HT"]
  p_TH <- cond_prob$prob[cond_prob$Sequence == "TH"]
  p_TT <- cond_prob$prob[cond_prob$Sequence == "TT"]
  
  c("h_given_h" = p_HH,"h_given_t" =  p_HT, "t_given_h" = p_TH, "t_given_t" = p_TT)
}

Comparing the methods across k = 100 replications we get the following results:

n_replicate <- 100

results_1 <- 
  replicate(n_replicate,Sampling(dt)) %>% 
  t() %>% 
  as.data.frame() %>%
  rowid_to_column('iteration_number') 

results_2 <- replicate(n_replicate,{
  sim_data(100) %>% 
  Sampling()
  } ) %>% 
  t() %>% 
  as.data.frame() %>%
  rowid_to_column('iteration_number') 

 bind_rows("OP" = results_1,"Sample different data" = results_2, .id = "sample_method") %>% 
   tidyr::pivot_longer(cols = c(h_given_h, h_given_t, t_given_h, t_given_t), names_to = "condition", values_to = "probability") %>%
   summarise(prob = mean(probability), .by = c(sample_method, condition)) %>%
     pivot_wider(names_from = sample_method, values_from = prob ) 
#> A tibble: 4 × 3
#>  condition    OP `Sample different data`
#>  <chr>     <dbl>                   <dbl>
#> 1 h_given_h 0.283                   0.305
#> 2 h_given_t 0.203                   0.199
#> 3 t_given_h 0.203                   0.199
#> 4 t_given_t 0.311                   0.297

library(ggplot2)

  bind_rows(results_1, results_2, .id = "sample_method") %>% 
    tidyr::pivot_longer(cols = c(h_given_h, h_given_t, t_given_h, t_given_t), names_to = "condition", values_to = "probability") %>% 
    ggplot(aes(x = iteration_number, y = probability, color = condition)) +
  geom_line() +
  labs(x = "Iteration", y = "Probability", color = "Condition") +
    facet_wrap(~sample_method, labeller = labeller(sample_method = c("1" = "OP", "2" = "Sample different data") ) )

R: 随机抽样硬币投掷组<!-- -->

<sup>Created on 2023-05-11 with reprex v2.0.2</sup>

Edit

The way

英文:

I believe that we can find the result you want by doing repetitions of the simulation process together with the sampling process. I will demonstrate this by creating functions to simulate the data and sample it according to the code you wrote

Function to simulate data

sim_data &lt;- function(n_students){
ids &lt;- 1:n_students
student_id &lt;- sort(sample(ids, 100000, replace = TRUE))
coin_result &lt;- character(1000)
coin_result[1] &lt;- sample(c(&quot;H&quot;, &quot;T&quot;), 1)

for (i in 2:length(coin_result)) {
  if (student_id[i] != student_id[i-1]) {
    coin_result[i] &lt;- sample(c(&quot;H&quot;, &quot;T&quot;), 1)
  } else if (coin_result[i-1] == &quot;H&quot;) {
    coin_result[i] &lt;- sample(c(&quot;H&quot;, &quot;T&quot;), 1, prob = c(0.6, 0.4))
  } else {
    coin_result[i] &lt;- sample(c(&quot;H&quot;, &quot;T&quot;), 1, prob = c(0.4, 0.6))
  }
}

data.frame(student_id, coin_result) %&gt;% 
  arrange(student_id) %&gt;% 
  group_by(student_id) %&gt;%
  mutate(flip_number = row_number())
}

dt &lt;- sim_data(100)

Function for sampling

Sampling &lt;- function(my_data, n_sample = 100){

  # Step 1
  n_students &lt;- length(unique(my_data$student_id))
  sampled_ids &lt;- sample(1:n_students, n_sample, replace = TRUE)
  sampled_data &lt;- data.frame(student_id = integer(0), 
coin_result = character(0), 
stringsAsFactors = FALSE)
  
  for (j in seq_along(sampled_ids)) {
  # Step 2
    student_data &lt;- my_data[my_data$student_id == sampled_ids[j], ]
    
    x &lt;- sample(nrow(student_data), 1)
    
  # Step 3  
    # There is a edge case where position x is the last flip in the sequence, for 
    # all other cases we shouldn&#39;t use position &quot;x&quot; in the sample
    if (x == nrow(student_data)) y &lt;- x else
    y &lt;- sample((x + 1):nrow(student_data), 1)
    selected_data &lt;- student_data[x:y,] %&gt;% 
    mutate(loop_id = j)
    
    sampled_data &lt;- rbind(sampled_data, selected_data)
  }
  
  # Step 5  
  cond_prob &lt;- 
    sampled_data %&gt;%
    group_by(loop_id) %&gt;%
    mutate(flip_number = row_number()) %&gt;% 
    mutate(Sequence = str_c(coin_result, lead(coin_result))) %&gt;%
    filter(!is.na(Sequence)) %&gt;%
    ungroup() %&gt;% 
    count(Sequence) %&gt;%
    mutate(prob = n / sum(n))
  
  
  p_HH &lt;- cond_prob$prob[cond_prob$Sequence == &quot;HH&quot;]
  p_HT &lt;- cond_prob$prob[cond_prob$Sequence == &quot;HT&quot;]
  p_TH &lt;- cond_prob$prob[cond_prob$Sequence == &quot;TH&quot;]
  p_TT &lt;- cond_prob$prob[cond_prob$Sequence == &quot;TT&quot;]
  
  c(&quot;h_given_h&quot; = p_HH,&quot;h_given_t&quot; =  p_HT, &quot;t_given_h&quot; = p_TH, &quot;t_given_t&quot; = p_TT)
}

Comparing the methods across k = 100 replications we get the following results:

n_replicate &lt;- 100

results_1 &lt;- 
  replicate(n_replicate,Sampling(dt)) %&gt;% 
  t() %&gt;% 
  as.data.frame() %&gt;%
  rowid_to_column(&#39;iteration_number&#39;) 

results_2 &lt;- replicate(n_replicate,{
  sim_data(100) %&gt;% 
  Sampling()
  } ) %&gt;% 
  t() %&gt;% 
  as.data.frame() %&gt;%
  rowid_to_column(&#39;iteration_number&#39;) 

 bind_rows(&quot;OP&quot; = results_1,&quot;Sample different data&quot; = results_2, .id = &quot;sample_method&quot;) %&gt;% 
   tidyr::pivot_longer(cols = c(h_given_h, h_given_t, t_given_h, t_given_t), names_to = &quot;condition&quot;, values_to = &quot;probability&quot;) %&gt;%
   summarise(prob = mean(probability), .by = c(sample_method, condition)) %&gt;% 
     pivot_wider(names_from = sample_method, values_from = prob ) 
#&gt; A tibble: 4 &#215; 3
#&gt;  condition    OP `Sample different data`
#&gt;  &lt;chr&gt;     &lt;dbl&gt;                   &lt;dbl&gt;
#&gt; 1 h_given_h 0.283                   0.305
#&gt; 2 h_given_t 0.203                   0.199
#&gt; 3 t_given_h 0.203                   0.199
#&gt; 4 t_given_t 0.311                   0.297

library(ggplot2)

  bind_rows(results_1, results_2, .id = &quot;sample_method&quot;) %&gt;% 
    tidyr::pivot_longer(cols = c(h_given_h, h_given_t, t_given_h, t_given_t), names_to = &quot;condition&quot;, values_to = &quot;probability&quot;) %&gt;% 
    ggplot(aes(x = iteration_number, y = probability, color = condition)) +
  geom_line() +
  labs(x = &quot;Iteration&quot;, y = &quot;Probability&quot;, color = &quot;Condition&quot;) +
    facet_wrap(~sample_method, labeller = labeller(sample_method = c(&quot;1&quot; = &quot;OP&quot;, &quot;2&quot; = &quot;Sample different data&quot;) ) )

R: 随机抽样硬币投掷组<!-- -->

<sup>Created on 2023-05-11 with reprex v2.0.2</sup>

Edit

The way the conditional probability is calculated is also not ideal. The best way to estimate the probability of "H" given "H", would be to do the number of coins that had the result "H" and on the next coin the result "H" again. The way you are presenting it is closer to estimating the marginal probability of the combinations P(HH) and P(HT) than P(H|H) and P(T|H).

huangapple
  • 本文由 发表于 2023年5月8日 00:29:31
  • 转载请务必保留本文链接:https://go.coder-hub.com/76195071.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定