为什么当我想在Python和R中查找我的模型的AIC时,会得到不同的结果?

huangapple go评论100阅读模式
英文:

Why do I get different results when I want to find the AIC of my model in Python and R?

问题

I see you want me to provide the translated code parts without any additional information. Here's the translation of the code:

Python code:

  1. # Imports
  2. import numpy as np
  3. import pandas as pd
  4. import statsmodels.formula.api as smf
  5. import statsmodels.api as sm
  6. # Let's create a dataframe of our data
  7. use = np.array([5, 7, 11, 20, 47, 49, 54, 53, 76])
  8. total = np.array([30, 33, 35, 40, 67, 61, 63, 59, 80])
  9. age = np.arange(16, 25)
  10. p = use / total
  11. df = pd.DataFrame({"Use": use, "Total": total, "Age": age, "P": p})
  12. # Let's create our model
  13. logit_model = smf.glm(formula="P ~ Age", data=df,
  14. family=sm.families.Binomial(link=sm.families.links.Logit()),
  15. var_weights=total).fit()
  16. logit_model.summary()

R code:

  1. use <- c(5, 7, 11, 20, 47, 49, 54, 53, 76)
  2. total <- c(30, 33, 35, 40, 67, 61, 63, 59, 80)
  3. age <- seq(16, 24)
  4. p <- use / total
  5. logit_temp <- glm(p ~ age, family=binomial, weights=total)
  6. logit_temp

If you have any specific questions or need further assistance, please let me know.

英文:

I have a dataset and I want to study the relationship between the age and the probability of using a credit card. So I decided to apply weighted logistic regression.

Python code:

  1. #Imports
  2. import numpy as np
  3. import pandas as pd
  4. import statsmodels.formula.api as smf
  5. import statsmodels.api as sm
  6. #Let&#39;s create a dataframe of our data
  7. use = np.array([5,7,11,20,47,49,54,53,76])
  8. total = np.array([30,33,35,40,67,61,63,59,80])
  9. age = np.arange(16, 25)
  10. p = use / total
  11. df = pd.DataFrame({&quot;Use&quot;: use, &quot;Total&quot;: total, &quot;Age&quot;: age, &quot;P&quot;: p})
  12. #Let&#39;s create our model
  13. logit_model = smf.glm(formula = &quot;P ~ Age&quot;, data = df,
  14. family = sm.families.Binomial(link=sm.families.links.Logit()),
  15. var_weights = total).fit()
  16. logit_model.summary()

Output

  1. Generalized Linear Model Regression Results
  2. ==============================================================================
  3. Dep. Variable: P No. Observations: 9
  4. Model: GLM Df Residuals: 7
  5. Model Family: Binomial Df Model: 1
  6. Link Function: Logit Scale: 1.0000
  7. Method: IRLS Log-Likelihood: -147.07
  8. Date: Mon, 08 May 2023 Deviance: 1.9469
  9. Time: 10:22:00 Pearson chi2: 1.95
  10. No. Iterations: 5 Pseudo R-squ. (CS): 1.000
  11. Covariance Type: nonrobust
  12. ==============================================================================
  13. coef std err z P&gt;|z| [0.025 0.975]
  14. ------------------------------------------------------------------------------
  15. Intercept -11.3464 1.172 -9.678 0.000 -13.644 -9.049
  16. Age 0.5996 0.059 10.231 0.000 0.485 0.715
  17. ==============================================================================

In this model, I want to calculate the AIC.

  1. logit_model.aic

Output

  1. 298.1385764743574

(In the above model I used the argument var_weights as Josef suggested in this thread)

Let's do the same in R.

Rcode:

  1. use = c(5,7,11,20,47,49,54,53,76)
  2. total = c(30,33,35,40,67,61,63,59,80)
  3. age = seq(16,24)
  4. p = use/total
  5. logit_temp = glm(p~age,family = binomial, weights = total)
  6. logit_temp

Output

  1. Call: glm(formula = p ~ age, family = binomial, weights = total)
  2. Coefficients:
  3. (Intercept) age
  4. -11.3464 0.5996
  5. Degrees of Freedom: 8 Total (i.e. Null); 7 Residual
  6. Null Deviance: 156.4
  7. Residual Deviance: 1.947 AIC: 40.12

As you can see now, the AIC of the model I created with R is very different from the AIC I found with Python. What should I change in order to have the same results?

答案1

得分: 2

以下是您要翻译的部分:

差异归结为对数似然差异:两个软件包似乎(正确地)将AIC计算为(-2*loglikelihood + 2*df)。

statsmodels GLM函数的帮助文档中写道:

警告:对于标度等于1的模型(即二项式、负二项式和泊松分布等),对数似然和偏差无效。如果指定了方差权重,则loglike和偏差等结果基于拟似然解释。在这种情况下,对数似然未正确指定,基于它的统计量,如AIC或似然比检验,都不适用。

我不太确定该怎么办...

(我花了一些时间尝试通过计算拟似然来复制Python中的-147.07的对数似然,但目前还没有成功)

英文:

The difference reduces to a difference in the log-likelihoods: both packages appear to (correctly) compute AIC as (-2*loglikelihood + 2*df).

The help for statsmodels GLM function says:

> WARNING: Loglikelihood and deviance are not valid in models where scale is equal to 1 (i.e., Binomial, NegativeBinomial, and Poisson). If variance weights are specified, then results such as loglike and deviance are based on a quasi-likelihood interpretation. The loglikelihood is not correctly specified in this case, and statistics based on it, such AIC or likelihood ratio tests, are not appropriate.

I'm not quite sure what to do about that ...

(I spent a little while digging down to try to reproduce the log-likelihood of -147.07 from Python by computing a quasi-likelihood, but so far without success)

答案2

得分: 2

I'm not sure (don't remember) how aic is defined with var_weights in statsmodels GLM.

update
var_weights are treated as scale or dispersion factor in statsmodels GLM. It does not assume that there is a likelihood interpretation of var_weights, and so we only have QMLE and not MLE. (see Ben Bolker's answer).
Parameter estimates and standard errors are the same in both cases and do not directly rely on the MLE/QMLE distinction. However, loglikelihood and statistics based on it, like aic, bic, will not correspond to a correctly specified likelihood in QMLE.

However, the example is just a standard binomial model.

In this case we can specify (success, failure) counts directly as dependent variable. This produces then the same estimate and same aic as the R version.
The results are interpreted as standard maximum likelihood estimate in this case.

  1. fail = total - use
  2. df["fail"] = fail
  3. logit_model2 = smf.glm(formula = "use + fail ~ Age", data = df,
  4. family = m.families.Binomial(link=sm.families.links.Logit()),
  5. ).fit()
  6. print(logit_model2.summary())
  7. Generalized Linear Model Regression Results
  8. ==============================================================================
  9. Dep. Variable: ['use', 'fail'] No. Observations: 9
  10. Model: GLM Df Residuals: 7
  11. Model Family: Binomial Df Model: 1
  12. Link Function: Logit Scale: 1.0000
  13. Method: IRLS Log-Likelihood: -18.059
  14. Date: Mon, 08 May 2023 Deviance: 1.9469
  15. Time: 11:46:25 Pearson chi2: 1.95
  16. No. Iterations: 5 Pseudo R-squ. (CS): 1.000
  17. Covariance Type: nonrobust
  18. ==============================================================================
  19. coef std err z P>|z| [0.025 0.975]
  20. ------------------------------------------------------------------------------
  21. Intercept -11.3464 1.172 -9.678 0.000 -13.644 -9.049
  22. Age 0.5996 0.059 10.231 0.000 0.485 0.715
  23. ==============================================================================
  24. logit_model.aic, logit_model2.aic
  25. (298.1385764743575, 40.11730698160575)

As background.
This notebook https://www.statsmodels.org/dev/examples/notebooks/generated/glm_weights.html illustrates some of the properties of var and freq weights and their equivalent standard model when the MLE assumption holds.
This relies to a large extent on averaging and aggregation properties of the linear exponential families in GLM.

英文:

I'm not sure (don't remember) how aic is defined with var_weights in statsmodels GLM.

update
var_weights are treated as scale or dispersion factor in statsmodels GLM. It does not assume that there is a likelihood interpretation of var_weights, and so we only have QMLE and not MLE. (see Ben Bolker's answer).
Parameter estimates and standard errors are the same in both cases and do not directly rely on the MLE/QMLE distinction. However, loglikelihood and statistics based on it, like aic, bic, will not correspond to a correctly specified likelihood in QMLE.

However, the example is just a standard binomial model.

In this case we can specify (success, failure) counts directly as dependent variable. This produces then the same estimate and same aic as the R version.
The results are interpreted as standard maximum likelihood estimate in this case.

  1. fail = total - use
  2. df[&quot;fail&quot;] = fail
  3. logit_model2 = smf.glm(formula = &quot;use + fail ~ Age&quot;, data = df,
  4. family = m.families.Binomial(link=sm.families.links.Logit()),
  5. ).fit()
  6. print(logit_model2.summary())
  7. Generalized Linear Model Regression Results
  8. ==============================================================================
  9. Dep. Variable: [&#39;use&#39;, &#39;fail&#39;] No. Observations: 9
  10. Model: GLM Df Residuals: 7
  11. Model Family: Binomial Df Model: 1
  12. Link Function: Logit Scale: 1.0000
  13. Method: IRLS Log-Likelihood: -18.059
  14. Date: Mon, 08 May 2023 Deviance: 1.9469
  15. Time: 11:46:25 Pearson chi2: 1.95
  16. No. Iterations: 5 Pseudo R-squ. (CS): 1.000
  17. Covariance Type: nonrobust
  18. ==============================================================================
  19. coef std err z P&gt;|z| [0.025 0.975]
  20. ------------------------------------------------------------------------------
  21. Intercept -11.3464 1.172 -9.678 0.000 -13.644 -9.049
  22. Age 0.5996 0.059 10.231 0.000 0.485 0.715
  23. ==============================================================================
  24. logit_model.aic, logit_model2.aic
  25. (298.1385764743575, 40.11730698160575)

As background.
This notebook https://www.statsmodels.org/dev/examples/notebooks/generated/glm_weights.html illustrates some of the properties of var and freq weights and their equivalent standard model when the MLE assumption holds.
This relies to a large extent on averaging and aggregation properties of the linear exponential families in GLM.

答案3

得分: 0

我想提供一个类似于Josef答案的答案,适用于想要使用sm.GLM()函数而不是smf.glm()的人。

  1. #导入
  2. import numpy as np
  3. import pandas as pd
  4. import statsmodels.api as sm
  5. #让我们创建我们的数据的数据框
  6. use = np.array([5,7,11,20,47,49,54,53,76])
  7. total = np.array([30,33,35,40,67,61,63,59,80])
  8. age = np.arange(16, 25)
  9. fail = total - use
  10. df = pd.DataFrame({&quot;Use&quot;: use, &quot;Total&quot;: total, &quot;Age&quot;: age, &quot;Fail&quot;: fail})
  11. endog_star = df[[&quot;Use&quot;,&quot;Fail&quot;]] #我们的响应变量形式为[成功,失败]
  12. exog_star = df[&quot;Age&quot;] #我们的自变量
  13. exog_star = sm.add_constant(exog_star) #我们必须包含一个常数,因为默认情况下
  14. #不包括常数
  15. #让我们创建我们的模型
  16. logit_model = sm.GLM(
  17. endog = endog_star,
  18. exog = exog_star,
  19. family = sm.families.Binomial(link=sm.families.links.Logit())).fit()
  20. logit_model.summary()
  21. logit_model.aic

输出:

  1. 广义线性模型回归结果
  2. ==============================================================================
  3. 依赖变量: [&#39;使用&#39;, &#39;失败&#39;] 观测数: 9
  4. 模型: GLM 残差自由度: 7
  5. 模型家族: 二项分布 模型自由度: 1
  6. 链接函数: 逻辑函数 缩放: 1.0000
  7. 方法: IRLS 对数似然: -18.059
  8. 日期: Fri, 12 May 2023 偏差: 1.9469
  9. 时间: 12:53:23 Pearson chi2: 1.95
  10. 迭代次数: 5 R平方(CS): 1.000
  11. 协方差类型: 非鲁棒
  12. ==============================================================================
  13. 系数 标准误 z P&gt;|z| [0.025 0.975]
  14. ------------------------------------------------------------------------------
  15. 常数 -11.3464 1.172 -9.678 0.000 -13.644 -9.049
  16. 年龄 0.5996 0.059 10.231 0.000 0.485 0.715
  17. ==============================================================================
  18. 40.11730698160573
英文:

I would like to provide an answer which is similar to Josef's answer for someone who wants to use sm.GLM() function instead of smf.glm()

  1. #Imports
  2. import numpy as np
  3. import pandas as pd
  4. import statsmodels.api as sm
  5. #Let&#39;s create a dataframe of our data
  6. use = np.array([5,7,11,20,47,49,54,53,76])
  7. total = np.array([30,33,35,40,67,61,63,59,80])
  8. age = np.arange(16, 25)
  9. fail = total - use
  10. df = pd.DataFrame({&quot;Use&quot;: use, &quot;Total&quot;: total, &quot;Age&quot;: age, &quot;Fail&quot;: fail})
  11. endog_star = df[[&quot;Use&quot;,&quot;Fail&quot;]] #our response variable in form [success, failure]
  12. exog_star = df[&quot;Age&quot;] #our independent variable
  13. exog_star = sm.add_constant(exog_star) #we have to include a constant because by default
  14. #constant is not included
  15. #Let&#39;s create our model
  16. logit_model = sm.GLM(
  17. endog = endog_star,
  18. exog = exog_star,
  19. family = sm.families.Binomial(link=sm.families.links.Logit())).fit()
  20. logit_model.summary()
  21. logit_model.aic

Output:

  1. Generalized Linear Model Regression Results
  2. ==============================================================================
  3. Dep. Variable: [&#39;Use&#39;, &#39;fail&#39;] No. Observations: 9
  4. Model: GLM Df Residuals: 7
  5. Model Family: Binomial Df Model: 1
  6. Link Function: Logit Scale: 1.0000
  7. Method: IRLS Log-Likelihood: -18.059
  8. Date: Fri, 12 May 2023 Deviance: 1.9469
  9. Time: 12:53:23 Pearson chi2: 1.95
  10. No. Iterations: 5 Pseudo R-squ. (CS): 1.000
  11. Covariance Type: nonrobust
  12. ==============================================================================
  13. coef std err z P&gt;|z| [0.025 0.975]
  14. ------------------------------------------------------------------------------
  15. const -11.3464 1.172 -9.678 0.000 -13.644 -9.049
  16. Age 0.5996 0.059 10.231 0.000 0.485 0.715
  17. ==============================================================================
  18. 40.11730698160573

huangapple
  • 本文由 发表于 2023年5月7日 05:04:42
  • 转载请务必保留本文链接:https://go.coder-hub.com/76191135.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定