Predict values using group_by in R
Question
I want to create a column with the predicted values of a regression fitted on previously grouped data.
I have tried this:
Data
city <- c("a", "a", "a", "b", "b", "b", "a")
gender <- c("male", "female", "female", "male", "male", "female", "male")
age <- c(24, 25, 26, 78, 65, 34, 23)
death <- c(0, 0, 1, 1, 0, 0, 0)
df <- data.frame(city, gender, age, death)
Code:
df_1 <- df %>%
  group_by(city) %>%
  glm(death ~ gender + age, data = df, family = "poisson") %>%
  mutate(death_p = predict(glm))
Result
Error in model.frame.default(formula = ., data = df, weights = death ~ :
invalid type (language) for variable '(weights)'
Answer 1
Score: 2
- "Never" use `df` in a pipe based off of `df`. In any case where the data is appended, filtered, augmented, or (as in this case) grouped, reusing `df` is not going to give you the intended results. Use `cur_data()` instead.
- We can store the model as a list-column. In this case, since we don't summarize it, it's a little inefficient because it stores redundant copies of the model in each row within a group, but we can live with that for now.

Try this:
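library(dplyr)

out <- df %>%
  group_by(city) %>%
  mutate(
    # fit one model per group; list() stores it as a list-column
    mdl = list(glm(death ~ gender + age, data = cur_data(), family = "poisson")),
    # predict() returns link-scale (log) values by default; it takes no `family` argument
    pred = predict(mdl[[1]], newdata = cur_data())
  ) %>%
  ungroup()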
out
# Warning: Problem while computing `mdl = list(glm(death ~ gender + age, data = cur_data(), family = "poisson"))`.
# ℹ glm.fit: fitted rates numerically 0 occurred
# ℹ The warning occurred in group 1: city = "a".
# # A tibble: 7 × 6
# city gender age death mdl pred
# <chr> <chr> <dbl> <dbl> <list> <dbl>
# 1 a male 24 0 <glm> -2.31e+ 1
# 2 a female 25 0 <glm> -2.31e+ 1
# 3 a female 26 1 <glm> 0
# 4 b male 78 1 <glm> -2.84e-14
# 5 b male 65 0 <glm> -2.33e+ 1
# 6 b female 34 0 <glm> -2.33e+ 1
# 7 a male 23 0 <glm> -4.61e+ 1
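The `pred` values above are on the link (log) scale, which is why they are negative. As a minimal sketch, assuming dplyr 1.1.0 or later (where `cur_data()` is deprecated in favor of `pick()`), the same per-group fit can also return expected counts via `type = "response"`. The names `out_resp`, `pred_link`, and `pred_rate` are illustrative and not part of the original answer:

```R
library(dplyr)

# same example data as in the question
df <- data.frame(
  city   = c("a", "a", "a", "b", "b", "b", "a"),
  gender = c("male", "female", "female", "male", "male", "female", "male"),
  age    = c(24, 25, 26, 78, 65, 34, 23),
  death  = c(0, 0, 1, 1, 0, 0, 0)
)

out_resp <- df %>%
  group_by(city) %>%
  mutate(
    # pick(everything()) is the current group's data (assumes dplyr >= 1.1.0)
    mdl = list(glm(death ~ gender + age, data = pick(everything()), family = "poisson")),
    pred_link = predict(mdl[[1]], type = "link"),      # log scale
    pred_rate = predict(mdl[[1]], type = "response")   # expected count
  ) %>%
  ungroup()
```

With so few rows per group the fits are close to saturated, so the same `glm.fit` warnings as above are to be expected.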
You can do other things with the `mdl` column, such as extracting some summaries:
out %>%
  group_by(city) %>%
  summarize(smry = list(summary(mdl[[1]]))) %>%
  pull(smry)
# [[1]]
# Call:
# glm(formula = death ~ gender + age, family = "poisson", data = cur_data())
# Deviance Residuals:
# 1 2 3 4
# -1.389e-05 -1.389e-05 0.000e+00 -2.110e-08
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -599.62 1606083.51 0 1
# gendermale 23.06 138127.43 0 1
# age 23.06 61772.44 0 1
# (Dispersion parameter for poisson family taken to be 1)
# Null deviance: 2.7726e+00 on 3 degrees of freedom
# Residual deviance: 3.8564e-10 on 1 degrees of freedom
# AIC: 8
# Number of Fisher Scoring iterations: 22
# [[2]]
# Call:
# glm(formula = death ~ gender + age, family = "poisson", data = cur_data())
# Deviance Residuals:
# [1] 0 0 0
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -84.248 195033.586 0 1
# gendermale -55.568 245825.832 0 1
# age 1.793 5357.985 0 1
# (Dispersion parameter for poisson family taken to be 1)
# Null deviance: 2.1972e+00 on 2 degrees of freedom
# Residual deviance: 3.0330e-10 on 0 degrees of freedom
# AIC: 8
# Number of Fisher Scoring iterations: 21
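If the redundant per-row model copies are a concern, one option is to fit once per city with `summarise()`, which keeps a single model per group. This is only a sketch, again assuming dplyr 1.1.0 or later for `pick()`; `models` and `coefs` are illustrative names:

```R
library(dplyr)

# same example data as in the question
df <- data.frame(
  city   = c("a", "a", "a", "b", "b", "b", "a"),
  gender = c("male", "female", "female", "male", "male", "female", "male"),
  age    = c(24, 25, 26, 78, 65, 34, 23),
  death  = c(0, 0, 1, 1, 0, 0, 0)
)

# one row per city, with the fitted model stored in a list-column
models <- df %>%
  group_by(city) %>%
  summarise(
    mdl = list(glm(death ~ gender + age, data = pick(everything()), family = "poisson"))
  )

# one coefficient vector per city, named by group
coefs <- lapply(setNames(models$mdl, models$city), coef)
```

The trade-off is that predictions then have to be joined back onto the original rows, so the `mutate()` approach is simpler when only the predicted column is needed.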
Answer 2
Score: 1
We could do it using `do()` with a few minor changes: pass the formula to `glm()` explicitly and supply each group's data with `data = .`, so that the formula is no longer matched to `weights` (which is then left unset, i.e. `NULL`):
library(dplyr)
df %>%
  group_by(city) %>%
  do(data.frame(., death_p = predict(glm(death ~ gender + age, data = ., family = "poisson"))))
  city  gender   age death   death_p
  <chr> <chr>  <dbl> <dbl>     <dbl>
1 a     male      24     0 -2.31e+ 1
2 a     female    25     0 -2.31e+ 1
3 a     female    26     1  0
4 a     male      23     0 -4.61e+ 1
5 b     male      78     1 -2.84e-14
6 b     male      65     0 -2.33e+ 1
7 b     female    34     0 -2.33e+ 1
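Note that `do()` is superseded in current dplyr releases. A possible equivalent, sketched here with `group_modify()`, fits the model on each group's rows, which are passed to the lambda as `.x`. The name `res` is illustrative:

```R
library(dplyr)

# same example data as in the question
df <- data.frame(
  city   = c("a", "a", "a", "b", "b", "b", "a"),
  gender = c("male", "female", "female", "male", "male", "female", "male"),
  age    = c(24, 25, 26, 78, 65, 34, 23),
  death  = c(0, 0, 1, 1, 0, 0, 0)
)

res <- df %>%
  group_by(city) %>%
  group_modify(~ {
    # .x holds the rows of the current city (without the grouping column)
    fit <- glm(death ~ gender + age, data = .x, family = "poisson")
    mutate(.x, death_p = predict(fit))  # link-scale predictions
  }) %>%
  ungroup()
```

As with `do()`, the result comes back ordered by group, with `city` restored as the first column.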
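Answer 3
Score: 0
Here is a data.table version:
```R
library(data.table)
# fit the model within each city; exp() converts link-scale predictions to expected counts
setDT(df)[, death_p := exp(predict(glm(death ~ age + gender, family = "poisson"))), by = city]
```
Output: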
city gender age death death_p
1: a male 24 0 9.640864e-11
2: a female 25 0 9.640864e-11
3: a female 26 1 1.000000e+00
4: b male 78 1 1.000000e+00
5: b male 65 0 7.582560e-11
6: b female 34 0 7.582560e-11
7: a male 23 0 9.294626e-21