Why do I get different results when I want to find the AIC of my model in Python and R?

Question

I have a dataset and I want to study the relationship between age and the probability of using a credit card, so I decided to apply weighted logistic regression.

Python code:

# Imports
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm

# Let's create a dataframe of our data
use = np.array([5, 7, 11, 20, 47, 49, 54, 53, 76])
total = np.array([30, 33, 35, 40, 67, 61, 63, 59, 80])
age = np.arange(16, 25)
p = use / total
df = pd.DataFrame({"Use": use, "Total": total, "Age": age, "P": p})

# Let's create our model
logit_model = smf.glm(formula="P ~ Age", data=df,
                      family=sm.families.Binomial(link=sm.families.links.Logit()),
                      var_weights=total).fit()

logit_model.summary()

Output

             Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      P   No. Observations:                    9
Model:                            GLM   Df Residuals:                        7
Model Family:                Binomial   Df Model:                            1
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -147.07
Date:                Mon, 08 May 2023   Deviance:                       1.9469
Time:                        10:22:00   Pearson chi2:                     1.95
No. Iterations:                     5   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust
==============================================================================
             coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -11.3464      1.172     -9.678      0.000     -13.644      -9.049
Age            0.5996      0.059     10.231      0.000       0.485       0.715
==============================================================================

In this model, I want to calculate the AIC.

logit_model.aic

Output

298.1385764743574

(In the above model I used the var_weights argument, as Josef suggested in a related thread.)

Let's do the same in R.

R code:

use <- c(5, 7, 11, 20, 47, 49, 54, 53, 76)
total <- c(30, 33, 35, 40, 67, 61, 63, 59, 80)
age <- seq(16, 24)
p <- use / total

logit_temp <- glm(p ~ age, family = binomial, weights = total)
logit_temp

Output

Call:  glm(formula = p ~ age, family = binomial, weights = total)

Coefficients:
(Intercept)          age  
   -11.3464       0.5996  

Degrees of Freedom: 8 Total (i.e. Null);  7 Residual
Null Deviance:	    156.4 
Residual Deviance: 1.947 	AIC: 40.12

As you can see, the AIC of the model fitted in R is very different from the AIC I found with Python. What should I change in order to get the same results?

Answer 1

Score: 2


The difference reduces to a difference in the log-likelihoods: both packages appear to (correctly) compute AIC as (-2*loglikelihood + 2*df).
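A quick sanity check on the Python side (a minimal sketch, assuming the logit_model from the question is still in scope):

# k = number of estimated parameters (df_model regressors plus the intercept)
k = logit_model.df_model + 1
print(-2 * logit_model.llf + 2 * k)   # ~298.14, i.e. logit_model.aic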

The help for statsmodels GLM function says:

> WARNING: Loglikelihood and deviance are not valid in models where scale is equal to 1 (i.e., Binomial, NegativeBinomial, and Poisson). If variance weights are specified, then results such as loglike and deviance are based on a quasi-likelihood interpretation. The loglikelihood is not correctly specified in this case, and statistics based on it, such AIC or likelihood ratio tests, are not appropriate.

I'm not quite sure what to do about that ...

(I spent a little while digging down to try to reproduce the log-likelihood of -147.07 from Python by computing a quasi-likelihood, but so far without success)

Answer 2

Score: 2

I'm not sure (don't remember) how AIC is defined with var_weights in statsmodels GLM.

Update: var_weights are treated as a scale or dispersion factor in statsmodels GLM. It does not assume that there is a likelihood interpretation of var_weights, so we only have QMLE and not MLE (see Ben Bolker's answer).
Parameter estimates and standard errors are the same in both cases and do not directly rely on the MLE/QMLE distinction. However, the log-likelihood and statistics based on it, like AIC and BIC, will not correspond to a correctly specified likelihood under QMLE.

However, the example is just a standard binomial model.

In this case we can specify the (success, failure) counts directly as the dependent variable. This produces the same estimates and the same AIC as the R version.
The results are then interpreted as a standard maximum likelihood estimate.

fail = total - use
df["fail"] = fail
# Note: the family must come from sm.families (the original post had a typo, m.families)
logit_model2 = smf.glm(formula="use + fail ~ Age", data=df,
                       family=sm.families.Binomial(link=sm.families.links.Logit()),
                       ).fit()

print(logit_model2.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:        ['use', 'fail']   No. Observations:                    9
Model:                            GLM   Df Residuals:                        7
Model Family:                Binomial   Df Model:                            1
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -18.059
Date:                Mon, 08 May 2023   Deviance:                       1.9469
Time:                        11:46:25   Pearson chi2:                     1.95
No. Iterations:                     5   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -11.3464      1.172     -9.678      0.000     -13.644      -9.049
Age            0.5996      0.059     10.231      0.000       0.485       0.715
==============================================================================

logit_model.aic, logit_model2.aic
(298.1385764743575, 40.11730698160575)
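To confirm that this is the ordinary binomial-counts AIC that R reports, the log-likelihood of logit_model2 can be recomputed by hand (a sketch assuming the arrays from the question; scipy is used here, which the original code does not import):

from scipy import stats

mu = logit_model2.fittedvalues                 # fitted success probabilities
llf = stats.binom.logpmf(use, total, mu).sum()
print(llf)                                     # ~ -18.059, as in the summary above
print(-2 * llf + 2 * 2)                        # ~ 40.117, matching R's AIC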

As background: this notebook https://www.statsmodels.org/dev/examples/notebooks/generated/glm_weights.html illustrates some of the properties of var_weights and freq_weights and their equivalent standard models when the MLE assumption holds.
This relies to a large extent on the averaging and aggregation properties of the linear exponential families used in GLM.
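To illustrate the aggregation property, each age group can be expanded into individual 0/1 observations and refit; the coefficients match the aggregated counts model (a sketch assuming the arrays and imports from above; note the Bernoulli log-likelihood omits the binomial-coefficient terms, so its AIC is not directly comparable):

# Expand each (use, total) pair into `total` Bernoulli observations
rows = []
for u, t, a in zip(use, total, age):
    rows += [(1, a)] * u + [(0, a)] * (t - u)
long_df = pd.DataFrame(rows, columns=["y", "Age"])

bernoulli_model = smf.glm("y ~ Age", data=long_df,
                          family=sm.families.Binomial()).fit()
print(bernoulli_model.params)   # same -11.3464 and 0.5996 as logit_model2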


Answer 3

Score: 0


I would like to provide an answer similar to Josef's for anyone who wants to use the sm.GLM() function instead of smf.glm().

# Imports
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Let's create a dataframe of our data
use = np.array([5, 7, 11, 20, 47, 49, 54, 53, 76])
total = np.array([30, 33, 35, 40, 67, 61, 63, 59, 80])
age = np.arange(16, 25)
fail = total - use
df = pd.DataFrame({"Use": use, "Total": total, "Age": age, "Fail": fail})

endog_star = df[["Use", "Fail"]]        # our response variable in [success, failure] form
exog_star = df["Age"]                   # our independent variable
exog_star = sm.add_constant(exog_star)  # a constant is not included by default, so add it

# Let's create our model
logit_model = sm.GLM(
    endog=endog_star,
    exog=exog_star,
    family=sm.families.Binomial(link=sm.families.links.Logit())).fit()

logit_model.summary()

logit_model.aic

Output:

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:        ['Use', 'Fail']   No. Observations:                    9
Model:                            GLM   Df Residuals:                        7
Model Family:                Binomial   Df Model:                            1
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -18.059
Date:                Fri, 12 May 2023   Deviance:                       1.9469
Time:                        12:53:23   Pearson chi2:                     1.95
No. Iterations:                     5   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        -11.3464      1.172     -9.678      0.000     -13.644      -9.049
Age            0.5996      0.059     10.231      0.000       0.485       0.715
==============================================================================

40.11730698160573
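(One difference worth noting between the two interfaces: smf.glm builds the intercept automatically from the formula, whereas the array-based sm.GLM only includes a constant if you add it yourself with sm.add_constant, as above.)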
