Count true/false sum values by time thresholds in R
Question
I have a student dataset that records responses to questions as right or wrong. There is also a time variable in seconds. I would like to create time flags that record the number of correct and incorrect responses at the 1-minute, 2-minute, and 3-minute thresholds. Here is a sample dataset.
    df <- data.frame(id = c(1,2,3,4,5),
                     gender = c("m","f","m","f","m"),
                     age = c(11,12,12,13,14),
                     i1 = c(1,0,NA,1,0),
                     i2 = c(0,1,0,"1]",1),
                     i3 = c("1]",1,"1]",0,"0]"),
                     i4 = c(0,"0]",1,1,0),
                     i5 = c(1,1,NA,"0]","1]"),
                     i6 = c(0,0,"0]",1,1),
                     i7 = c(1,"1]",1,0,0),
                     i8 = c(0,0,0,"1]","1]"),
                     i9 = c(1,1,1,0,NA),
                     time = c(115,138,148,195, 225))

    > df
      id gender age i1 i2 i3 i4   i5 i6 i7 i8 i9 time
    1  1      m  11  1  0 1]  0    1  0  1  0  1  115
    2  2      f  12  0  1  1 0]    1  0 1]  0  1  138
    3  3      m  12 NA  0 1]  1 <NA> 0]  1  0  1  148
    4  4      f  13  1 1]  0  1   0]  1  0 1]  0  195
    5  5      m  14  0  1 0]  0   1]  1  0 1] NA  225
Each minute threshold is marked by a ] sign to the right of the score.
For example, for id = 3, the 1-minute threshold is at item i3 and the 2-minute threshold is at item i6. Each student might have different time thresholds.
I need to create flag variables that count the number of correct and incorrect responses up to the 1-minute, 2-minute, and 3-minute thresholds.
How can I get the desired dataset shown below?
    > df1
      id gender age i1 i2 i3 i4   i5 i6 i7 i8 i9 time one_true one_false two_true two_false three_true three_false
    1  1      m  11  1  0 1]  0    1  0  1  0  1  115        2         1       NA        NA         NA          NA
    2  2      f  12  0  1  1 0]    1  0 1]  0  1  138        2         2        4         3         NA          NA
    3  3      m  12 NA  0 1]  1 <NA> 0]  1  0  1  148        1         1        2         2         NA          NA
    4  4      f  13  1 1]  0  1   0]  1  0 1]  0  195        2         0        3         2          5           3
    5  5      m  14  0  1 0]  0   1]  1  0 1] NA  225        1         2        2         3          4           4
Answer 1
Score: 1
Here's a dplyr pipe that produces what you want.
I'm optionally using xfun::n2w to convert 1 to "one", etc. If you can accept strict numbers, then you don't need this.
    library(dplyr)
    library(tidyr) # pivot_*
    # library(xfun) # n2w, convert numbers to words
    df %>%
      select(-gender, -age, -time) %>%
      mutate(across(-id, as.character)) %>%
      pivot_longer(-id) %>%
      arrange(id, name) %>%
      mutate(
        grp = 1L + cumsum(grepl("]", lag(value), fixed = TRUE)),
        num = if_else(gsub("[^0-9]", "", value) == "1", "true", "false"),
        .by = id) %>%
      filter(
        any(grepl("]", value, fixed = TRUE)),
        .by = c(id, grp)) %>%
      count(id, grp, num) %>%
      filter(!is.na(num)) %>%
      mutate(n = cumsum(n), .by = c(id, num)) %>%
      # convert grp to words only after cumsum(): count() sorts by grp, and
      # "three" sorts before "two" alphabetically, which would break the running totals
      mutate(grp = xfun::n2w(grp)) %>%
      pivot_wider(
        id_cols = id, names_sep = "_",
        names_from = c("grp", "num"), values_from = "n"
      ) %>%
      left_join(df, ., by = "id")
    #   id gender age i1 i2 i3 i4   i5 i6 i7 i8 i9 time one_false one_true two_false two_true three_false three_true
    # 1  1      m  11  1  0 1]  0    1  0  1  0  1  115         1        2        NA       NA          NA         NA
    # 2  2      f  12  0  1  1 0]    1  0 1]  0  1  138         2        2         3        4          NA         NA
    # 3  3      m  12 NA  0 1]  1 <NA> 0]  1  0  1  148         1        1         2        2          NA         NA
    # 4  4      f  13  1 1]  0  1   0]  1  0 1]  0  195        NA        2         2        3           3          5
    # 5  5      m  14  0  1 0]  0   1]  1  0 1] NA  225         2        1         3        2           4          4
This uses .by=, so it requires dplyr 1.1.0 or newer.
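If you are on an older dplyr, the per-operation .by= grouping can be written with group_by() and ungroup() instead. Here is a minimal sketch on toy data; the toy frame and its column names are made up for illustration and are not part of the question.

```r
library(dplyr)

# Toy data, assumed for illustration only
toy <- data.frame(id = c(1, 1, 2, 2), n = c(1L, 2L, 3L, 4L))

# dplyr >= 1.1.0: grouping applies to this one mutate() only
new_style <- toy %>%
  mutate(total = cumsum(n), .by = id)

# Older dplyr: group explicitly, then drop the grouping again
old_style <- toy %>%
  group_by(id) %>%
  mutate(total = cumsum(n)) %>%
  ungroup()
```

Both versions keep the original row order. The group_by() version returns a tibble, while .by= on a plain data.frame returns a data.frame.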
Most of the work is done on a temporary frame that returns id and the intended summary columns (one_false and so on). Note that a combination with no rows, such as one_false for id 4, comes out as NA rather than 0.
- mutate(across(-id, as.character)) because pivot_longer requires compatible classes, and your sample data has some numeric columns and some character columns
- pivot_longer reshapes from "wide" to three columns: id, name ("i1", "i2", ...), and value ("1", "0", "1]", ...)
- mutate(grp...) assigns the groups so that everything up to and including the first occurrence of ] is in group "one", everything after that up to the second ] is in "two", and so on; the use of xfun::n2w to go from 2 to "two" is purely aesthetic, and you can do without it if you can accept column names starting with numbers, à la 1_true (assuming you still convert num as well)
- mutate(num...) converts your 1s and 0s to "true" and "false"; this is mostly aesthetic, included to match your intended output; if you would prefer just 0 and 1, you would still need to remove the ] in order to count things correctly
- filter(any(...)) removes the groups that do not end in a value with ], that is, the trailing items after the last threshold
- count counts the rows by id, group, and true/false
- mutate(n = cumsum(n)), with a different grouping, ensures your one_true, two_true, etc. are cumulative
- pivot_wider converts multiple rows into columns, undoing the earlier pivot_longer
- we bring that summary back to the original df with left_join
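To see the grouping and cumulative-count logic in isolation, here is a base R sketch for a single respondent. The helper name count_by_threshold is hypothetical and not part of the pipeline above. It assumes the responses arrive as one vector in item order.

```r
# Hypothetical helper (an illustration, not the pipeline above): cumulative
# counts of correct/incorrect responses at each "]" marker for one respondent.
count_by_threshold <- function(resp) {
  resp <- as.character(resp)
  has_mark <- grepl("]", resp, fixed = TRUE)   # NA is treated as "no marker"
  # Segment index: goes up by one after every item that carries a "]"
  grp <- 1L + cumsum(c(FALSE, head(has_mark, -1)))
  # Drop trailing items that come after the last marker
  keep <- grp <= sum(has_mark)
  score <- gsub("[^0-9]", "", resp[keep])      # strip the "]"
  grp <- grp[keep]
  data.frame(
    threshold = sort(unique(grp)),
    true  = cumsum(as.vector(tapply(score == "1", grp, sum, na.rm = TRUE))),
    false = cumsum(as.vector(tapply(score == "0", grp, sum, na.rm = TRUE)))
  )
}

# Responses i1..i9 for id = 4 in the sample data
count_by_threshold(c("1", "1]", "0", "1", "0]", "1", "0", "1]", "0"))
```

For id = 4 this gives cumulative true counts of 2, 3, 5 and false counts of 0, 2, 3, which matches the row in the desired df1. Unlike the pipeline, this sketch reports 0 rather than NA when a segment has no incorrect responses.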