Predict values using group_by in R

Question

I want to create a column with the predicted values of a regression fitted on previously grouped data.

I have tried this:

Data

city <- c("a", "a", "a", "b", "b", "b", "a")
gender <- c("male", "female", "female", "male", "male", "female", "male")
age <- c(24, 25, 26, 78, 65, 34, 23)
death <- c(0, 0, 1, 1, 0, 0, 0)

df <- data.frame(city, gender, age, death)

Code:

df_1 <- df %>%
  group_by(city) %>%
  glm(death ~ gender + age, data = df, family = "poisson") %>%
  mutate(death_p = predict(glm))

Result

Error in model.frame.default(formula = ., data = df, weights = death ~  : 
  invalid type (language) for variable '(weights)'

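Answer 1

Score: 2

  1. "Never" use df in a pipe based off of df. In any case where the data is appended, filtered, augmented, or (as in this case) grouped, reusing df will not give you the intended results. Use cur_data() instead (in dplyr 1.1.0 and later, cur_data() is deprecated in favor of pick(everything())). Note also that piping directly into glm() passes the data as its first argument, formula, which pushes the actual formula into weights; that is what the error message reports.

  2. We can store the model as a list-column. Since we are not summarizing here, this is a little inefficient, because it stores a redundant copy of the model in every row of each group, but we can live with that for now.

Try this: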

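library(dplyr)

out <- df %>%
  group_by(city) %>%
  mutate(
    # fit one model per group and store it as a list-column
    mdl = list(glm(death ~ gender + age, data = cur_data(), family = "poisson")),
    # predict() returns values on the link (log) scale by default
    pred = predict(mdl[[1]], newdata = cur_data())
  ) %>%
  ungroup()
out
# Warning: Problem while computing `mdl = list(glm(death ~ gender + age, data = cur_data(), family = "poisson"))`.
# ℹ glm.fit: fitted rates numerically 0 occurred
# ℹ The warning occurred in group 1: city = "a".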
# # A tibble: 7 × 6
#   city  gender   age death mdl         pred
#   <chr> <chr>  <dbl> <dbl> <list>     <dbl>
# 1 a     male      24     0 <glm>  -2.31e+ 1
# 2 a     female    25     0 <glm>  -2.31e+ 1
# 3 a     female    26     1 <glm>   0       
# 4 b     male      78     1 <glm>  -2.84e-14
# 5 b     male      65     0 <glm>  -2.33e+ 1
# 6 b     female    34     0 <glm>  -2.33e+ 1
# 7 a     male      23     0 <glm>  -4.61e+ 1
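The negative values in pred are on the link (log) scale, since predict() on a glm defaults to type = "link". A minimal sketch of getting expected counts instead is shown here; it assumes dplyr 1.1.0 or later, where pick(everything()) is the documented replacement for cur_data(), and the names out_resp, pred_link and pred_resp are only illustrative. For a log link, the response-scale value is essentially the exponentiated linear predictor (values extremely close to zero may be floored by the family's inverse link).

```r
library(dplyr)

df <- data.frame(
  city   = c("a", "a", "a", "b", "b", "b", "a"),
  gender = c("male", "female", "female", "male", "male", "female", "male"),
  age    = c(24, 25, 26, 78, 65, 34, 23),
  death  = c(0, 0, 1, 1, 0, 0, 0)
)

out_resp <- df %>%
  group_by(city) %>%
  mutate(
    # assumption: dplyr >= 1.1.0, so pick(everything()) stands in for cur_data()
    mdl       = list(glm(death ~ gender + age, data = pick(everything()), family = "poisson")),
    # linear predictor (log of the expected count)
    pred_link = predict(mdl[[1]], type = "link"),
    # expected count on the original scale
    pred_resp = predict(mdl[[1]], type = "response")
  ) %>%
  ungroup()
```

With such a tiny data set glm() still emits the same "fitted rates numerically 0" warning, so the numbers themselves are not meaningful; the point is only the type argument.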

You can also do other things with the mdl column, such as extracting summaries:

out %>%
  group_by(city) %>%
  summarize(smry = list(summary(mdl[[1]]))) %>%
  pull(smry)
# [[1]]
# Call:
# glm(formula = death ~ gender + age, family = "poisson", data = cur_data())
# Deviance Residuals: 
#          1           2           3           4  
# -1.389e-05  -1.389e-05   0.000e+00  -2.110e-08  
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)
# (Intercept)    -599.62 1606083.51       0        1
# gendermale       23.06  138127.43       0        1
# age              23.06   61772.44       0        1
# (Dispersion parameter for poisson family taken to be 1)
#     Null deviance: 2.7726e+00  on 3  degrees of freedom
# Residual deviance: 3.8564e-10  on 1  degrees of freedom
# AIC: 8
# Number of Fisher Scoring iterations: 22
# [[2]]
# Call:
# glm(formula = death ~ gender + age, family = "poisson", data = cur_data())
# Deviance Residuals: 
# [1]  0  0  0
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)
# (Intercept)    -84.248 195033.586       0        1
# gendermale     -55.568 245825.832       0        1
# age              1.793   5357.985       0        1
# (Dispersion parameter for poisson family taken to be 1)
#     Null deviance: 2.1972e+00  on 2  degrees of freedom
# Residual deviance: 3.0330e-10  on 0  degrees of freedom
# AIC: 8
# Number of Fisher Scoring iterations: 21
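If the redundant per-row copies of the model are a concern, one possible alternative is to keep the models in a separate named list, one entry per city, using only base R. This is a sketch rather than part of the answer above; models and coefs are illustrative names.

```r
df <- data.frame(
  city   = c("a", "a", "a", "b", "b", "b", "a"),
  gender = c("male", "female", "female", "male", "male", "female", "male"),
  age    = c(24, 25, 26, 78, 65, 34, 23),
  death  = c(0, 0, 1, 1, 0, 0, 0)
)

# Fit one Poisson model per city; each model is stored exactly once,
# under the name of its city.
models <- lapply(split(df, df$city), function(d) {
  glm(death ~ gender + age, data = d, family = "poisson")
})

# Collect the coefficients into a matrix with one row per city.
coefs <- t(sapply(models, coef))
```

Individual models are then available as models[["a"]] or models$b, and summary() can be applied to each with lapply(models, summary).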

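Answer 2

Score: 1

We could do it with do() and some minor changes: pass the formula as the first argument of glm() and refer to each group's data as `.`; weights can be left at its default of NULL: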

library(dplyr)

df %>%
  group_by(city) %>%
  do(data.frame(., death_p = predict(glm(death ~ gender + age, data = ., family = "poisson"))))

  city  gender   age death   death_p
  <chr> <chr>  <dbl> <dbl>     <dbl>
1 a     male      24     0 -2.31e+ 1
2 a     female    25     0 -2.31e+ 1
3 a     female    26     1  0
4 a     male      23     0 -4.61e+ 1
5 b     male      78     1 -2.84e-14
6 b     male      65     0 -2.33e+ 1
7 b     female    34     0 -2.33e+ 1
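The dplyr documentation marks do() as superseded, so a comparable sketch with group_modify() may be worth considering. This is an alternative, not the answer's own code; out_gm and fit are illustrative names, and death_p is still on the link scale, as above.

```r
library(dplyr)

df <- data.frame(
  city   = c("a", "a", "a", "b", "b", "b", "a"),
  gender = c("male", "female", "female", "male", "male", "female", "male"),
  age    = c(24, 25, 26, 78, 65, 34, 23),
  death  = c(0, 0, 1, 1, 0, 0, 0)
)

out_gm <- df %>%
  group_by(city) %>%
  group_modify(~ {
    # .x is the data for the current group, without the grouping column
    fit <- glm(death ~ gender + age, data = .x, family = "poisson")
    # append the link-scale predictions for this group's rows
    mutate(.x, death_p = as.numeric(predict(fit)))
  }) %>%
  ungroup()
```

As with do(), the result comes back ordered by group, with the grouping column city placed first.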





Answer 3

Score: 0

Here is a data.table version:

library(data.table)

setDT(df)[, death_p := exp(predict(glm(death ~ age + gender, family = "poisson"))), city]

Output:

   city gender age death      death_p
1:    a   male  24     0 9.640864e-11
2:    a female  25     0 9.640864e-11
3:    a female  26     1 1.000000e+00
4:    b   male  78     1 1.000000e+00
5:    b   male  65     0 7.582560e-11
6:    b female  34     0 7.582560e-11
7:    a   male  23     0 9.294626e-21
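The exp() is needed here because predict() returns the linear predictor by default, and the Poisson family uses a log link. A sketch of the same idea that fits each group's model once and asks predict() for both scales follows; dt, fit, death_link and death_resp are illustrative names, not part of the answer above.

```r
library(data.table)

dt <- data.table(
  city   = c("a", "a", "a", "b", "b", "b", "a"),
  gender = c("male", "female", "female", "male", "male", "female", "male"),
  age    = c(24, 25, 26, 78, 65, 34, 23),
  death  = c(0, 0, 1, 1, 0, 0, 0)
)

# Fit one model per city and add both prediction scales by reference.
dt[, c("death_link", "death_resp") := {
  fit <- glm(death ~ age + gender, family = "poisson")
  list(
    as.numeric(predict(fit, type = "link")),      # log of the expected count
    as.numeric(predict(fit, type = "response"))   # expected count
  )
}, by = city]
```

For values that are not vanishingly small, death_resp should agree with exp(death_link); extremely small fitted values may differ slightly because the family's inverse link floors them near machine precision.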

huangapple
  • Posted on April 17, 2023, 01:43:19
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76029387.html