Count true/false sum values by time thresholds in R

Question

I have a student dataset that records each response to a question as right or wrong. There is also a time variable in seconds. I would like to create time flags that record the number of correct and incorrect responses at the `1-minute`, `2-minute`, and `3-minute` thresholds. Here is a sample dataset.

    df <- data.frame(id = c(1,2,3,4,5),
                     gender = c("m","f","m","f","m"),
                     age = c(11,12,12,13,14),
                     i1 = c(1,0,NA,1,0),
                     i2 = c(0,1,0,"1]",1),
                     i3 = c("1]",1,"1]",0,"0]"),
                     i4 = c(0,"0]",1,1,0),
                     i5 = c(1,1,NA,"0]","1]"),
                     i6 = c(0,0,"0]",1,1),
                     i7 = c(1,"1]",1,0,0),
                     i8 = c(0,0,0,"1]","1]"),
                     i9 = c(1,1,1,0,NA),
                     time = c(115,138,148,195, 225))

    > df
      id gender age i1 i2 i3 i4   i5 i6 i7 i8 i9 time
    1  1      m  11  1  0 1]  0    1  0  1  0  1  115
    2  2      f  12  0  1  1 0]    1  0 1]  0  1  138
    3  3      m  12 NA  0 1]  1 <NA> 0]  1  0  1  148
    4  4      f  13  1 1]  0  1   0]  1  0 1]  0  195
    5  5      m  14  0  1 0]  0   1]  1  0 1] NA  225

The minute thresholds are marked by a `]` sign to the right of the score.

For example, for `id = 3`, the `1-minute` threshold is at item `i3` and the `2-minute` threshold is at item `i6`. Each student might have different time thresholds.

I need to create flag variables that count the number of correct and incorrect responses up to the `1-minute`, `2-minute`, and `3-minute` thresholds.

How can I get the desired dataset shown below?

    > df1
      id gender age i1 i2 i3 i4   i5 i6 i7 i8 i9 time one_true one_false two_true two_false three_true three_false
    1  1      m  11  1  0 1]  0    1  0  1  0  1  115        2         1       NA        NA         NA          NA
    2  2      f  12  0  1  1 0]    1  0 1]  0  1  138        2         2        4         3         NA          NA
    3  3      m  12 NA  0 1]  1 <NA> 0]  1  0  1  148        1         1        2         2         NA          NA
    4  4      f  13  1 1]  0  1   0]  1  0 1]  0  195        2         0        3         2          5           3
    5  5      m  14  0  1 0]  0   1]  1  0 1] NA  225        1         2        2         3          4           4

Answer 1

Score: 1

Here's a dplyr pipe that produces what you want.

I'm optionally using `xfun::n2w` to convert 1 to "one", etc. If you can accept plain numbers, then you don't need this.

    library(dplyr)
    library(tidyr) # pivot_*
    # library(xfun) # n2w, convert numbers to words
    df %>%
      select(-gender, -age, -time) %>%
      mutate(across(-id, as.character)) %>%
      pivot_longer(-id) %>%
      arrange(id, name) %>%
      mutate(
        # a new group starts right after each value that carries a "]"
        grp = 1L + cumsum(grepl("]", lag(value), fixed = TRUE)),
        # strip the "]" before comparing; NA responses stay NA
        num = if_else(gsub("[^0-9]", "", value) == "1", "true", "false"),
        .by = id) %>%
      # drop trailing groups that never reach a "]"
      filter(
        any(grepl("]", value, fixed = TRUE)),
        .by = c(id, grp)) %>%
      count(id, grp, num) %>%
      filter(!is.na(num)) %>%
      mutate(n = cumsum(n), .by = c(id, num)) %>%
      # convert to words only after the cumulative sum: count() sorts by grp,
      # and "three" sorts before "two" alphabetically
      mutate(grp = xfun::n2w(grp)) %>%
      pivot_wider(
        id_cols = id, names_sep = "_",
        names_from = c("grp", "num"), values_from = "n"
      ) %>%
      left_join(df, ., by = "id")
    #   id gender age i1 i2 i3 i4   i5 i6 i7 i8 i9 time one_false one_true two_false two_true three_false three_true
    # 1  1      m  11  1  0 1]  0    1  0  1  0  1  115         1        2        NA       NA          NA         NA
    # 2  2      f  12  0  1  1 0]    1  0 1]  0  1  138         2        2         3        4          NA         NA
    # 3  3      m  12 NA  0 1]  1 <NA> 0]  1  0  1  148         1        1         2        2          NA         NA
    # 4  4      f  13  1 1]  0  1   0]  1  0 1]  0  195        NA        2         2        3           3          5
    # 5  5      m  14  0  1 0]  0   1]  1  0 1] NA  225         2        1         3        2           4          4

This uses `.by=`, so it requires dplyr 1.1.0 or newer.
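
If you are on an older dplyr, the same per-student grouping can be written with `group_by()` and `ungroup()`. The sketch below is only an illustration of that substitution on a made-up toy frame (`long` and `res` are hypothetical names, not part of the pipe above); it shows the group-numbering step alone.

```r
library(dplyr)

# Toy long-format data (made up): two students, responses in item order.
long <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2),
  value = c("1", "0]", "1", "0]", "1", "1]")
)

# Equivalent of mutate(..., .by = id) for dplyr versions before 1.1.0.
res <- long %>%
  group_by(id) %>%
  mutate(grp = 1L + cumsum(grepl("]", lag(value), fixed = TRUE))) %>%
  ungroup()

res$grp
```

Unlike `.by`, `group_by()` leaves the result grouped, which is why the explicit `ungroup()` is there.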

Most of the work is done on a temporary frame that holds `id` and the intended summary columns, `one_false` and so on.

  • `mutate(across(-id, as.character))` is needed because `pivot_longer` requires compatible classes, and your sample data has some numeric columns and some character columns.
  • `pivot_longer` reshapes from "wide" to three columns: `id`, `name` ("i1", "i2", ...), and `value` ("1", "0", "1]", ...).
  • `mutate(grp...)` numbers the groups so that everything up to and including the first `]` is in the first group, everything after it up to the second `]` is in the second group, and so on. Using `xfun::n2w` to go from 2 to "two" is purely aesthetic; you can do without it if you can accept column names starting with numbers, such as `1_true` (assuming you mutate `num` as well).
  • `mutate(num...)` converts your 1s and 0s to "true" and "false". This is mostly aesthetic and is included to match your intended output; if you would prefer just 0 and 1, you would still need to remove the `]` in order to count things correctly.
  • `filter(any(...))` removes the groups that do not end in a value with `]`, that is, the responses after a student's last threshold.
  • `count` tallies the rows for each combination of `id`, `grp`, and `num`; the rows where `num` is missing are then dropped.
  • `mutate(n = cumsum(n))`, with a different grouping (`id` and `num`), ensures that `one_true`, `two_true`, and so on are cumulative.
  • `pivot_wider` converts multiple rows into columns, undoing the earlier `pivot_longer`. A combination that never occurs, such as an incorrect response for `id = 4` before the first threshold, comes out as `NA` rather than 0.
  • `left_join` brings that summary back to the original `df`.
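
As a cross-check on the logic in the steps above, here is a minimal base R sketch that computes the same cumulative counts for a single student's responses. `count_by_threshold` is a hypothetical helper written for this illustration, not something from dplyr or tidyr, and it assumes at most three thresholds per student. It reports 0 rather than `NA` when a threshold is reached with no responses of one kind, which is what the desired `df1` shows.

```r
# Hypothetical helper: cumulative correct/incorrect counts at each "]" marker.
count_by_threshold <- function(x) {
  x <- as.character(x)
  ends  <- which(grepl("]", x, fixed = TRUE))  # items where a minute ends
  score <- gsub("[^0-9]", "", x)               # drop the marker; NA stays NA
  labels <- c("one", "two", "three")
  out <- rep(NA_integer_, 6L)
  names(out) <- paste(rep(labels, each = 2L), c("true", "false"), sep = "_")
  for (k in seq_len(min(length(ends), 3L))) {
    seen <- score[seq_len(ends[k])]            # everything up to threshold k
    out[2L * k - 1L] <- sum(seen == "1", na.rm = TRUE)
    out[2L * k]      <- sum(seen == "0", na.rm = TRUE)
  }
  out
}

# Responses i1..i9 for id = 4 from the sample data.
id4 <- c("1", "1]", "0", "1", "0]", "1", "0", "1]", "0")
res4 <- count_by_threshold(id4)
res4

# Responses i1..i9 for id = 3, which has missing values and two thresholds.
id3 <- c(NA, "0", "1]", "1", NA, "0]", "1", "0", "1")
res3 <- count_by_threshold(id3)
res3
```

Thresholds a student never reached stay `NA`, so the two cases (no responses of that kind versus threshold not reached) remain distinguishable.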

huangapple
  • Posted on June 8, 2023 at 01:10:20
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76425643.html