Count true/false sum values by time thresholds in R
Question
I have a student dataset that records each response to a question as right or wrong. There is also a time variable in seconds. I would like to create time flags that record the number of correct and incorrect responses by the `1 minute`, `2 minute`, and `3 minute` thresholds. Here is a sample dataset.
df <- data.frame(id = c(1,2,3,4,5),
                 gender = c("m","f","m","f","m"),
                 age = c(11,12,12,13,14),
                 i1 = c(1,0,NA,1,0),
                 i2 = c(0,1,0,"1]",1),
                 i3 = c("1]",1,"1]",0,"0]"),
                 i4 = c(0,"0]",1,1,0),
                 i5 = c(1,1,NA,"0]","1]"),
                 i6 = c(0,0,"0]",1,1),
                 i7 = c(1,"1]",1,0,0),
                 i8 = c(0,0,0,"1]","1]"),
                 i9 = c(1,1,1,0,NA),
                 time = c(115,138,148,195, 225))
> df
  id gender age i1 i2 i3 i4   i5 i6 i7 i8 i9 time
1  1      m  11  1  0 1]  0    1  0  1  0  1  115
2  2      f  12  0  1  1 0]    1  0 1]  0  1  138
3  3      m  12 NA  0 1]  1 <NA> 0]  1  0  1  148
4  4      f  13  1 1]  0  1   0]  1  0 1]  0  195
5  5      m  14  0  1 0]  0   1]  1  0 1] NA  225
The minute thresholds are represented by a `]` sign at the right side of the score.
For example, for `id = 3`, the `1-minute` threshold is at item `i3` and the `2-minute` threshold is at item `i6`. Each student might have different time thresholds.
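To make the encoding concrete, the following sketch lists which items carry a `]` marker for one student. It is only an illustration: the vector `responses` is copied by hand from the row for `id = 3`, and the name `threshold_items` is made up for this example.

```r
# Item responses for the student with id = 3, copied from the sample data.
responses <- c(i1 = NA, i2 = "0", i3 = "1]", i4 = "1", i5 = NA,
               i6 = "0]", i7 = "1", i8 = "0", i9 = "1")

# A "]" at the end of a score marks the last item answered within a minute.
# grepl() returns FALSE for NA, so missing responses never count as thresholds.
threshold_items <- names(responses)[grepl("]", responses, fixed = TRUE)]
threshold_items
# [1] "i3" "i6"
```

The first marked item closes minute 1 and the second closes minute 2; this student has no third marker, so no 3-minute counts exist for them.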
I need to create flagging variables that count the number of correct and incorrect responses by the `1-min`, `2-min`, and `3-min` thresholds.
How can I achieve the desired dataset shown below?
> df1
  id gender age i1 i2 i3 i4   i5 i6 i7 i8 i9 time one_true one_false two_true two_false three_true three_false
1  1      m  11  1  0 1]  0    1  0  1  0  1  115        2         1       NA        NA         NA          NA
2  2      f  12  0  1  1 0]    1  0 1]  0  1  138        2         2        4         3         NA          NA
3  3      m  12 NA  0 1]  1 <NA> 0]  1  0  1  148        1         1        2         2         NA          NA
4  4      f  13  1 1]  0  1   0]  1  0 1]  0  195        2         0        3         2          5           3
5  5      m  14  0  1 0]  0   1]  1  0 1] NA  225        1         2        2         3          4           4
Answer 1
Score: 1
Here's a dplyr pipe that produces what you want.
I'm optionally using `xfun::n2w` to convert `1` to `"one"`, etc. If you can accept strict numbers, then you don't need this.
library(dplyr)
library(tidyr) # pivot_*
# library(xfun) # n2w, convert numbers to words
df %>%
  select(-gender, -age, -time) %>%
  mutate(across(-id, as.character)) %>%
  pivot_longer(-id) %>%
  arrange(id, name) %>%
  mutate(
    # a new group starts right after each value that carries a "]"
    grp = 1L + cumsum(grepl("]", lag(value), fixed = TRUE)),
    num = if_else(gsub("[^0-9]", "", value) == "1", "true", "false"),
    .by = id) %>%
  # keep only the groups that end in a "]" threshold
  filter(
    any(grepl("]", value, fixed = TRUE)),
    .by = c(id, grp)) %>%
  count(id, grp, num) %>%
  filter(!is.na(num)) %>%
  mutate(n = cumsum(n), .by = c(id, num)) %>%
  # convert to words only after cumsum(): count() sorts its output by grp, and
  # word labels would sort alphabetically ("one", "three", "two")
  mutate(grp = xfun::n2w(grp)) %>%
  pivot_wider(
    id_cols = id, names_sep = "_",
    names_from = c("grp", "num"), values_from = "n"
  ) %>%
  left_join(df, ., by = "id")
#   id gender age i1 i2 i3 i4   i5 i6 i7 i8 i9 time one_false one_true two_false two_true three_false three_true
# 1  1      m  11  1  0 1]  0    1  0  1  0  1  115         1        2        NA       NA          NA         NA
# 2  2      f  12  0  1  1 0]    1  0 1]  0  1  138         2        2         3        4          NA         NA
# 3  3      m  12 NA  0 1]  1 <NA> 0]  1  0  1  148         1        1         2        2          NA         NA
# 4  4      f  13  1 1]  0  1   0]  1  0 1]  0  195        NA        2         2        3           3          5
# 5  5      m  14  0  1 0]  0   1]  1  0 1] NA  225         2        1         3        2           4          4
This is using `.by=`, so it requires dplyr 1.1.0 or newer.
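For readers on an older dplyr, a minimal sketch of the equivalence between `.by` and an explicit `group_by()`/`ungroup()` pair may help. The data frame `toy` and its values are invented for illustration and are not part of the question's data.

```r
library(dplyr)  # .by needs dplyr >= 1.1.0

# Toy data, made up for this example.
toy <- data.frame(id = c(1, 1, 1, 2, 2), n = c(2, 1, 3, 5, 4))

# Per-operation grouping: the result is not left grouped afterwards.
with_by <- toy %>%
  mutate(total = cumsum(n), .by = id)

# Equivalent spelling that also works on older dplyr versions.
with_group_by <- toy %>%
  group_by(id) %>%
  mutate(total = cumsum(n)) %>%
  ungroup() %>%
  as.data.frame()

with_by$total
# [1] 2 3 6 5 9
```

Both spellings restart the running total for each `id`, so every `.by = ...` in the pipe can be swapped for the `group_by()` form if upgrading is not an option.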
Most of the work is done on a temporary frame that returns `id` and the intended summary columns, `one_false` and beyond.
- `mutate(across(-id, as.character))` because `pivot_longer` requires compatible classes, and your sample data has some numeric columns and some character columns.
- `pivot_longer` reshapes from "wide" to three columns: `id`, `name` (`"i1"`, `"i2"`, ...), and `value` (`"1"`, `"0"`, `"1]"`, ...).
- `mutate(grp...)` conditions the groups such that everything up to and including the first occurrence of `]` is in group `"one"`, everything after that up to the next `]` is in `"two"`, etc. The use of `xfun::n2w` to go from `2` to `"two"` is purely aesthetic; you can do without it if you can accept column names starting with numbers, such as `1_true` (assuming you mutate `num` as well).
- `mutate(num...)` converts your `1`s and `0`s to `"true"` and `"false"`. This is mostly aesthetic, included to match your intended output; if you would prefer just `0` and `1`, you would still need to remove the `]` in order to count things correctly.
- `filter(any(...))` removes the rows of any group that does not end in a value with `]`.
- `count` counts the rows (weird wording, I know) by the groups.
- `mutate(n = cumsum(n))` with a different grouping ensures your `one_true` and `two_true` are cumulative.
- `pivot_wider` converts multiple rows into columns, undoing our first reshaping step.
- We bring that summary back to the original `df` with `left_join`.
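The steps above can be traced by hand for a single student. The following is a base-R sketch, not the dplyr pipe itself: the vector `value` is copied from the row for `id = 4`, and the intermediate names (`has_mark`, `keep`, `counts`) are invented for this illustration.

```r
# Responses for the student with id = 4, copied from the sample data.
value <- c("1", "1]", "0", "1", "0]", "1", "0", "1]", "0")

has_mark <- grepl("]", value, fixed = TRUE)

# Group number: 1 + how many "]" marks occurred strictly before this item
# (the base-R counterpart of cumsum(grepl("]", lag(value)))).
grp <- 1L + cumsum(c(FALSE, head(has_mark, -1)))
grp
# [1] 1 1 2 2 2 3 3 3 4

# Keep only groups that end in a "]" mark; the trailing group 4 is dropped.
keep <- grp %in% grp[has_mark]

# Strip the marker, then count correct and incorrect answers per group.
correct <- gsub("[^0-9]", "", value) == "1"
true_by_grp  <- tapply(correct[keep], grp[keep], sum)
false_by_grp <- tapply(!correct[keep], grp[keep], sum)

# Cumulate in numeric group order (1, 2, 3).
counts <- rbind(true = cumsum(true_by_grp), false = cumsum(false_by_grp))
counts
#       1 2 3
# true  2 3 5
# false 0 2 3
```

These numbers agree with the desired row for `id = 4` in the question. The sketch reports an explicit `0` for incorrect answers in the first minute, whereas the pipe yields `NA` there because `count()` emits no row for a combination that never occurs.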