2023年5月8日 00:29:31go评论66阅读模式

英文:

R: Randomly Sampling Groups of Coin Flips

问题

我正在使用R编程语言工作。

假设：

有一枚硬币，正面朝上的概率为0.6，反面朝上的概率也为0.6
一个班级里有100名学生
每名学生随机翻转这枚硬币多次
学生n的最后一次翻转不会影响学生n+1的第一次翻转（即当下一个学生翻转硬币时，第一次翻转为正面或反面的概率为0.5，但对于这名学生的下一次翻转取决于上一次翻转）

这是表示这个问题的一些R代码。

从这些数据中，我们可以计算HH、HT、TH和TT序列出现的次数。

我的问题是：使用这些数据，我尝试完成以下任务：

步骤1：使用带替换的抽样，选择100名学生
步骤2：对于在步骤1中选择的给定学生，随机选择他们翻转序列中的一个位置（例如称为位置“x”）
步骤3：对于同一名学生，随机选择他们翻转序列中的第二个位置（例如称为位置“y”），使y > x（即“y”在“x”之后）
步骤4：对步骤1中选择的所有学生重复步骤2和步骤3
步骤5：计算步骤1中选择的所有学生的HH、HT、TH和TT序列出现的次数
步骤6：重复步骤1到步骤5多次（例如1000次）

这是我自己尝试解决问题的方法。我不确定是否做得正确。能否有人帮我确认一下？

例如，假设学生15有6次翻转：H, H, T, H, T, T - 如果x = 2，y = 5，那么我们将得到H, T, H, T。

我的问题是：尽管代码似乎已经运行，但我不确定我是否做得正确。有人能帮我确认一下吗？

例如 - HH和TT的线条不应该几乎相同吗？TH和HT的线条不应该几乎相同吗？然而在我的图表中，这明显不是这样。我认为，在给定的迭代中，如果同一名学生在重新抽样的数据集中出现3次，那么第一次的最后一个转换将“泄漏”到第二次的第一次转换中，从而影响结果。

谢谢！

英文:

I am working with the R programming language.

Suppose:

There is a coin where if it lands head then the probability of the next flip being heads is 0.6 (and if tails then the next flip being tails is also 0.6)
There are 100 students in a class
Each student flips this coin a random number of times
The last flip of student_n does not influence the first flip of student_n+1 (i.e. when the next student flips the coin, the first flip has 0.5 probability of heads or tails, but the next flip for this student depends on the previous flip)

Here is some R code to represent this problem:

# https://stackoverflow.com/questions/76192042/r-verifying-the-results-of-coin-flips
library(tidyverse)

set.seed(123)
ids &lt;- 1:100
student_id &lt;- sort(sample(ids, 100000, replace = TRUE))
coin_result &lt;- character(1000)
coin_result[1] &lt;- sample(c(&quot;H&quot;, &quot;T&quot;), 1)

for (i in 2:length(coin_result)) {
  if (student_id[i] != student_id[i-1]) {
    coin_result[i] &lt;- sample(c(&quot;H&quot;, &quot;T&quot;), 1)
  } else if (coin_result[i-1] == &quot;H&quot;) {
    coin_result[i] &lt;- sample(c(&quot;H&quot;, &quot;T&quot;), 1, prob = c(0.6, 0.4))
  } else {
    coin_result[i] &lt;- sample(c(&quot;H&quot;, &quot;T&quot;), 1, prob = c(0.4, 0.6))
  }
}

#tidy up
my_data &lt;- data.frame(student_id, coin_result)
my_data &lt;- my_data[order(my_data$student_id),]

final &lt;- my_data %&gt;%
    group_by(student_id) %&gt;%
    mutate(flip_number = row_number())

From this data, we can count the number of times sequences of HH, HT, TH and TT occurred:

head(my_data)
# A tibble: 6 x 3
# Groups:   student_id [1]
  student_id coin_result  flip_number
       &lt;int&gt; &lt;chr&gt;              &lt;int&gt;
1          1 H                      1
2          1 H                      2
3          1 H                      3
4          1 H                      4
5          1 T                      5
6          1 H                      6

my_data %&gt;%
  group_by(student_id) %&gt;%
  summarize(Sequence = str_c(coin_result, lead(coin_result)), .groups = &#39;drop&#39;) %&gt;%
  filter(!is.na(Sequence)) %&gt;%
  count(Sequence)

# A tibble: 4 &#215; 2
  Sequence     n
  &lt;chr&gt;    &lt;int&gt;
1 HH       29763
2 HT       19782
3 TH       19775
4 TT       30580

My Problem: Using this data, I am trying to accomplish the following task:

Step 1: Using sampling with replacement, select 100 students
Step 2: For a given student that is selected in Step 1, randomly choose a position in their sequence of flips (e.g. call this position "x")
Step 3: For this same student, randomly choose a second position in their sequence of flips (e.g. call this position "y") such that y > x (i.e. "y" comes after "x").
Step 4: Repeat Step 2 and Step 3 for all students selected in Step 1
Step 5: Count the number of times sequences of HH, HT, TH and TT occurred for all students selected in Step 1
Step 6: Repeat Step 1 - Step 5 many times (e.g. 1000 times)

As an example, suppose Student 15 has 6 Flips : H, H, T, H, T, T - if x = 2 and y = 5, then we would have H, T, H, T

Here is my own attempt at solving my problem:

# Set the number of iterations
k &lt;- 1000

# Initialize a data frame 
results &lt;- data.frame(iteration_number = numeric(0),
                      h_given_h = numeric(0),
                      h_given_t = numeric(0),
                      t_given_h = numeric(0),
                      t_given_t = numeric(0))

# Set the number of students to sample
n_students &lt;- length(unique(my_data$student_id))

# Loop over the number of iterations
for (i in 1:k) {
  # Randomly sample student ids with replacement
  sampled_ids &lt;- sample(ids, n_students, replace = TRUE)
  
  # Initialize a data frame to store the sampled data
  sampled_data &lt;- data.frame(student_id = integer(0), coin_result = character(0), stringsAsFactors = FALSE)
  
  # LOOP
  for (j in sampled_ids) {
    # Get data for the current student
    student_data &lt;- my_data[my_data$student_id == j, ]
    
    # Randomly choose a starting and ending point
    x &lt;- sample(nrow(student_data), 1)
    y &lt;- sample(x:nrow(student_data), 1)
    
    # Select the data between the starting and ending point
    selected_data &lt;- student_data[x:y, ]
    
    # Append the selected data to the sampled data frame
    sampled_data &lt;- rbind(sampled_data, selected_data)
  }
  
  final &lt;- sampled_data %&gt;%
    group_by(student_id) %&gt;%
    mutate(flip_number = row_number())
  
  # Calculate the conditional probabilities
  cond_prob &lt;- final %&gt;%
    group_by(student_id) %&gt;%
    summarize(Sequence = str_c(coin_result, lead(coin_result)), .groups = &#39;drop&#39;) %&gt;%
    filter(!is.na(Sequence)) %&gt;%
    count(Sequence) %&gt;%
    mutate(prob = n / sum(n))
  
  # Extract probabilities
  p_HH &lt;- cond_prob$prob[cond_prob$Sequence == &quot;HH&quot;]
  p_HT &lt;- cond_prob$prob[cond_prob$Sequence == &quot;HT&quot;]
  p_TH &lt;- cond_prob$prob[cond_prob$Sequence == &quot;TH&quot;]
  p_TT &lt;- cond_prob$prob[cond_prob$Sequence == &quot;TT&quot;]

  #print(i)
  # Append 
  results[i, ] &lt;- c(i, p_HH, p_HT, p_TH, p_TT)
}

colnames(results) &lt;- c(&quot;iteration_number&quot;, &quot;h_given_h&quot;, &quot;h_given_t&quot;, &quot;t_given_h&quot;, &quot;t_given_t&quot;)

library(ggplot2)

# Convert to long 
results_long &lt;- tidyr::pivot_longer(results, cols = c(h_given_h, h_given_t, t_given_h, t_given_t), names_to = &quot;condition&quot;, values_to = &quot;probability&quot;)

# Plot
ggplot(results_long, aes(x = iteration_number, y = probability, color = condition)) +
  geom_line() +
  labs(x = &quot;Iteration&quot;, y = &quot;Probability&quot;, color = &quot;Condition&quot;)

My Question: While the code seems to have run, I am not sure if I have done this correctly. Can someone please help me confirm this?

For example - shouldn't the lines for HH and TT be almost identical .... and shouldn't the lines of TH and HT be almost identical? Yet in my graph, this clearly not the case? The way I see it, within a given iteration, if the same student appears 3 times in the re-sampled dataset, the last transition from the first time will "leak" into the the first transition from the second time and thus compromise the results.

Thanks!

答案1

得分: 1

以下是您要翻译的内容：

"I believe that we can find the result you want by doing repetitions of the simulation process together with the sampling process. I will demonstrate this by creating functions to simulate the data and sample it according to the code you wrote

Function to simulate data

sim_data &lt;- function(n_students){
ids &lt;- 1:n_students
student_id &lt;- sort(sample(ids, 100000, replace = TRUE))
coin_result &lt;- character(1000)
coin_result[1] &lt;- sample(c(&quot;H&quot;, &quot;T&quot;), 1)

for (i in 2:length(coin_result)) {
  if (student_id[i] != student_id[i-1]) {
    coin_result[i] &lt;- sample(c(&quot;H&quot;, &quot;T&quot;), 1)
  } else if (coin_result[i-1] == &quot;H&quot;) {
    coin_result[i] &lt;- sample(c(&quot;H&quot;, &quot;T&quot;), 1, prob = c(0.6, 0.4))
  } else {
    coin_result[i] &lt;- sample(c(&quot;H&quot;, &quot;T&quot;), 1, prob = c(0.4, 0.6))
  }
}

data.frame(student_id, coin_result) %&gt;% 
  arrange(student_id) %&gt;% 
  group_by(student_id) %&gt;% 
  mutate(flip_number = row_number())
}

dt &lt;- sim_data(100)

Function for sampling

Sampling &lt;- function(my_data, n_sample = 100){

  # Step 1
  n_students &lt;- length(unique(my_data$student_id))
  sampled_ids &lt;- sample(1:n_students, n_sample, replace = TRUE)
  sampled_data &lt;- data.frame(student_id = integer(0), 
coin_result = character(0), 
stringsAsFactors = FALSE)
  
  for (j in seq_along(sampled_ids)) {
  # Step 2
    student_data &lt;- my_data[my_data$student_id == sampled_ids[j], ]
    
    x &lt;- sample(nrow(student_data), 1)
    
  # Step 3  
    # There is a edge case where position x is the last flip in the sequence, for 
    # all other cases we shouldn&#39;t use position &quot;x&quot; in the sample
    if (x == nrow(student_data)) y &lt;- x else
    y &lt;- sample((x + 1):nrow(student_data), 1)
    selected_data &lt;- student_data[x:y,] %&gt;% 
    mutate(loop_id = j)
    
    sampled_data &lt;- rbind(sampled_data, selected_data)
  }
  
  # Step 5  
  cond_prob &lt;- 
    sampled_data %&gt;%
    group_by(loop_id) %&gt;%
    mutate(flip_number = row_number()) %&gt;% 
    mutate(Sequence = str_c(coin_result, lead(coin_result))) %&gt;%
    filter(!is.na(Sequence)) %&gt;%
    ungroup() %&gt;%
    count(Sequence) %&gt;%
    mutate(prob = n / sum(n))
  
  
  p_HH &lt;- cond_prob$prob[cond_prob$Sequence == &quot;HH&quot;]
  p_HT &lt;- cond_prob$prob[cond_prob$Sequence == &quot;HT&quot;]
  p_TH &lt;- cond_prob$prob[cond_prob$Sequence == &quot;TH&quot;]
  p_TT &lt;- cond_prob$prob[cond_prob$Sequence == &quot;TT&quot;]
  
  c(&quot;h_given_h&quot; = p_HH,&quot;h_given_t&quot; =  p_HT, &quot;t_given_h&quot; = p_TH, &quot;t_given_t&quot; = p_TT)
}

Comparing the methods across k = 100 replications we get the following results:

n_replicate &lt;- 100

results_1 &lt;- 
  replicate(n_replicate,Sampling(dt)) %&gt;% 
  t() %&gt;% 
  as.data.frame() %&gt;%
  rowid_to_column(&#39;iteration_number&#39;) 

results_2 &lt;- replicate(n_replicate,{
  sim_data(100) %&gt;% 
  Sampling()
  } ) %&gt;% 
  t() %&gt;% 
  as.data.frame() %&gt;%
  rowid_to_column(&#39;iteration_number&#39;) 

 bind_rows(&quot;OP&quot; = results_1,&quot;Sample different data&quot; = results_2, .id = &quot;sample_method&quot;) %&gt;% 
   tidyr::pivot_longer(cols = c(h_given_h, h_given_t, t_given_h, t_given_t), names_to = &quot;condition&quot;, values_to = &quot;probability&quot;) %&gt;%
   summarise(prob = mean(probability), .by = c(sample_method, condition)) %&gt;%
     pivot_wider(names_from = sample_method, values_from = prob ) 
#&gt; A tibble: 4 &#215; 3
#&gt;  condition    OP `Sample different data`
#&gt;  &lt;chr&gt;     &lt;dbl&gt;                   &lt;dbl&gt;
#&gt; 1 h_given_h 0.283                   0.305
#&gt; 2 h_given_t 0.203                   0.199
#&gt; 3 t_given_h 0.203                   0.199
#&gt; 4 t_given_t 0.311                   0.297

library(ggplot2)

  bind_rows(results_1, results_2, .id = &quot;sample_method&quot;) %&gt;% 
    tidyr::pivot_longer(cols = c(h_given_h, h_given_t, t_given_h, t_given_t), names_to = &quot;condition&quot;, values_to = &quot;probability&quot;) %&gt;% 
    ggplot(aes(x = iteration_number, y = probability, color = condition)) +
  geom_line() +
  labs(x = &quot;Iteration&quot;, y = &quot;Probability&quot;, color = &quot;Condition&quot;) +
    facet_wrap(~sample_method, labeller = labeller(sample_method = c(&quot;1&quot; = &quot;OP&quot;, &quot;2&quot; = &quot;Sample different data&quot;) ) )

R: 随机抽样硬币投掷组

<sup>Created on 2023-05-11 with reprex v2.0.2</sup>

Edit

The way

英文:

I believe that we can find the result you want by doing repetitions of the simulation process together with the sampling process. I will demonstrate this by creating functions to simulate the data and sample it according to the code you wrote

Function to simulate data

sim_data &lt;- function(n_students){
ids &lt;- 1:n_students
student_id &lt;- sort(sample(ids, 100000, replace = TRUE))
coin_result &lt;- character(1000)
coin_result[1] &lt;- sample(c(&quot;H&quot;, &quot;T&quot;), 1)

for (i in 2:length(coin_result)) {
  if (student_id[i] != student_id[i-1]) {
    coin_result[i] &lt;- sample(c(&quot;H&quot;, &quot;T&quot;), 1)
  } else if (coin_result[i-1] == &quot;H&quot;) {
    coin_result[i] &lt;- sample(c(&quot;H&quot;, &quot;T&quot;), 1, prob = c(0.6, 0.4))
  } else {
    coin_result[i] &lt;- sample(c(&quot;H&quot;, &quot;T&quot;), 1, prob = c(0.4, 0.6))
  }
}

data.frame(student_id, coin_result) %&gt;% 
  arrange(student_id) %&gt;% 
  group_by(student_id) %&gt;%
  mutate(flip_number = row_number())
}

dt &lt;- sim_data(100)

Function for sampling

Sampling &lt;- function(my_data, n_sample = 100){

  # Step 1
  n_students &lt;- length(unique(my_data$student_id))
  sampled_ids &lt;- sample(1:n_students, n_sample, replace = TRUE)
  sampled_data &lt;- data.frame(student_id = integer(0), 
coin_result = character(0), 
stringsAsFactors = FALSE)
  
  for (j in seq_along(sampled_ids)) {
  # Step 2
    student_data &lt;- my_data[my_data$student_id == sampled_ids[j], ]
    
    x &lt;- sample(nrow(student_data), 1)
    
  # Step 3  
    # There is a edge case where position x is the last flip in the sequence, for 
    # all other cases we shouldn&#39;t use position &quot;x&quot; in the sample
    if (x == nrow(student_data)) y &lt;- x else
    y &lt;- sample((x + 1):nrow(student_data), 1)
    selected_data &lt;- student_data[x:y,] %&gt;% 
    mutate(loop_id = j)
    
    sampled_data &lt;- rbind(sampled_data, selected_data)
  }
  
  # Step 5  
  cond_prob &lt;- 
    sampled_data %&gt;%
    group_by(loop_id) %&gt;%
    mutate(flip_number = row_number()) %&gt;% 
    mutate(Sequence = str_c(coin_result, lead(coin_result))) %&gt;%
    filter(!is.na(Sequence)) %&gt;%
    ungroup() %&gt;% 
    count(Sequence) %&gt;%
    mutate(prob = n / sum(n))
  
  
  p_HH &lt;- cond_prob$prob[cond_prob$Sequence == &quot;HH&quot;]
  p_HT &lt;- cond_prob$prob[cond_prob$Sequence == &quot;HT&quot;]
  p_TH &lt;- cond_prob$prob[cond_prob$Sequence == &quot;TH&quot;]
  p_TT &lt;- cond_prob$prob[cond_prob$Sequence == &quot;TT&quot;]
  
  c(&quot;h_given_h&quot; = p_HH,&quot;h_given_t&quot; =  p_HT, &quot;t_given_h&quot; = p_TH, &quot;t_given_t&quot; = p_TT)
}

Comparing the methods across k = 100 replications we get the following results:

n_replicate &lt;- 100

results_1 &lt;- 
  replicate(n_replicate,Sampling(dt)) %&gt;% 
  t() %&gt;% 
  as.data.frame() %&gt;%
  rowid_to_column(&#39;iteration_number&#39;) 

results_2 &lt;- replicate(n_replicate,{
  sim_data(100) %&gt;% 
  Sampling()
  } ) %&gt;% 
  t() %&gt;% 
  as.data.frame() %&gt;%
  rowid_to_column(&#39;iteration_number&#39;) 

 bind_rows(&quot;OP&quot; = results_1,&quot;Sample different data&quot; = results_2, .id = &quot;sample_method&quot;) %&gt;% 
   tidyr::pivot_longer(cols = c(h_given_h, h_given_t, t_given_h, t_given_t), names_to = &quot;condition&quot;, values_to = &quot;probability&quot;) %&gt;%
   summarise(prob = mean(probability), .by = c(sample_method, condition)) %&gt;% 
     pivot_wider(names_from = sample_method, values_from = prob ) 
#&gt; A tibble: 4 &#215; 3
#&gt;  condition    OP `Sample different data`
#&gt;  &lt;chr&gt;     &lt;dbl&gt;                   &lt;dbl&gt;
#&gt; 1 h_given_h 0.283                   0.305
#&gt; 2 h_given_t 0.203                   0.199
#&gt; 3 t_given_h 0.203                   0.199
#&gt; 4 t_given_t 0.311                   0.297

library(ggplot2)

  bind_rows(results_1, results_2, .id = &quot;sample_method&quot;) %&gt;% 
    tidyr::pivot_longer(cols = c(h_given_h, h_given_t, t_given_h, t_given_t), names_to = &quot;condition&quot;, values_to = &quot;probability&quot;) %&gt;% 
    ggplot(aes(x = iteration_number, y = probability, color = condition)) +
  geom_line() +
  labs(x = &quot;Iteration&quot;, y = &quot;Probability&quot;, color = &quot;Condition&quot;) +
    facet_wrap(~sample_method, labeller = labeller(sample_method = c(&quot;1&quot; = &quot;OP&quot;, &quot;2&quot; = &quot;Sample different data&quot;) ) )

R: 随机抽样硬币投掷组

<sup>Created on 2023-05-11 with reprex v2.0.2</sup>

Edit

The way the conditional probability is calculated is also not ideal. The best way to estimate the probability of "H" given "H", would be to do the number of coins that had the result "H" and on the next coin the result "H" again. The way you are presenting it is closer to estimating the marginal probability of the combinations P(HH) and P(HT) than P(H|H) and P(T|H).

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

R: 随机抽样硬币投掷组

问题

答案1

Function to simulate data

Function for sampling

Edit

Function to simulate data

Function for sampling

Edit

从分组柱状图中删除Y轴。

Conditional REGEX in R to select text in front of or behind a specific character — "%" in this case

如何构建一个函数，根据变量来构建直方图或柱状图

创建一个从两个数据集中生成的’表格’。

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论