Why do I get different results when I want to find the AIC of my model in Python and R?
Question
I have a dataset and I want to study the relationship between age and the probability of using a credit card, so I decided to apply weighted logistic regression.
Python code:
# Imports
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm

# Let's create a dataframe of our data
use = np.array([5, 7, 11, 20, 47, 49, 54, 53, 76])
total = np.array([30, 33, 35, 40, 67, 61, 63, 59, 80])
age = np.arange(16, 25)
p = use / total
df = pd.DataFrame({"Use": use, "Total": total, "Age": age, "P": p})

# Let's create our model
logit_model = smf.glm(formula="P ~ Age", data=df,
                      family=sm.families.Binomial(link=sm.families.links.Logit()),
                      var_weights=total).fit()
logit_model.summary()
Output:
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      P   No. Observations:                    9
Model:                            GLM   Df Residuals:                        7
Model Family:                Binomial   Df Model:                            1
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -147.07
Date:                Mon, 08 May 2023   Deviance:                       1.9469
Time:                        10:22:00   Pearson chi2:                     1.95
No. Iterations:                     5   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -11.3464      1.172     -9.678      0.000     -13.644      -9.049
Age            0.5996      0.059     10.231      0.000       0.485       0.715
==============================================================================
In this model, I want to calculate the AIC.
logit_model.aic
Output:
298.1385764743574
(In the above model I used the var_weights argument, as Josef suggested in this thread.)
Let's do the same in R.
R code:
use <- c(5, 7, 11, 20, 47, 49, 54, 53, 76)
total <- c(30, 33, 35, 40, 67, 61, 63, 59, 80)
age <- seq(16, 24)
p <- use / total
logit_temp <- glm(p ~ age, family = binomial, weights = total)
logit_temp
Output:
Call:  glm(formula = p ~ age, family = binomial, weights = total)

Coefficients:
(Intercept)          age
   -11.3464       0.5996

Degrees of Freedom: 8 Total (i.e. Null);  7 Residual
Null Deviance:      156.4
Residual Deviance:  1.947        AIC: 40.12
As you can see now, the AIC of the model I created with R is very different from the AIC I found with Python. What should I change in order to have the same results?
Answer 1
Score: 2
The difference reduces to a difference in the log-likelihoods: both packages appear to (correctly) compute AIC as -2*loglikelihood + 2*df.
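As a quick sanity check of that formula (a minimal sketch, reusing the logit_model fitted in the question), the statsmodels AIC can be recomputed from its reported log-likelihood and the two estimated parameters:
# AIC = -2*loglikelihood + 2*df, where df counts the estimated
# parameters (intercept + slope, i.e. df_model + 1 in statsmodels)
aic_by_hand = -2 * logit_model.llf + 2 * (logit_model.df_model + 1)
print(aic_by_hand, logit_model.aic)  # both should print 298.1385...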
The help for the statsmodels GLM function says:
> WARNING: Loglikelihood and deviance are not valid in models where scale is equal to 1 (i.e., Binomial, NegativeBinomial, and Poisson). If variance weights are specified, then results such as loglike and deviance are based on a quasi-likelihood interpretation. The loglikelihood is not correctly specified in this case, and statistics based on it, such as AIC or likelihood ratio tests, are not appropriate.
I'm not quite sure what to do about that ...
(I spent a little while digging down to try to reproduce the log-likelihood of -147.07 from Python by computing a quasi-likelihood, but so far without success)
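Building on that, here is a speculative sketch (an assumption based on my reading of statsmodels' Binomial.loglike_obs, not documented behaviour) of where the -147.07 may come from: the fractional proportions appear to be plugged into the binomial log-density with n = 1 trials, with each observation's contribution multiplied by its var_weights value.
# Speculative reconstruction (assumption): evaluate the binomial
# log-density at the fractional proportions p with n = 1 trials and
# weight each observation's contribution by total_i (the var_weights).
import numpy as np
from scipy.special import gammaln
mu = logit_model.fittedvalues  # fitted probabilities
quasi_ll = np.sum(total * (gammaln(2.0) - gammaln(p + 1) - gammaln(2 - p)
                           + p * np.log(mu / (1 - mu)) + np.log(1 - mu)))
print(quasi_ll)  # if this reading is right, close to the reported -147.07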
Answer 2
Score: 2
I'm not sure (don't remember) how aic is defined with var_weights in statsmodels GLM.
Update:
var_weights are treated as a scale or dispersion factor in statsmodels GLM. It does not assume that there is a likelihood interpretation of var_weights, and so we only have QMLE and not MLE (see Ben Bolker's answer).
Parameter estimates and standard errors are the same in both cases and do not directly rely on the MLE/QMLE distinction. However, the loglikelihood and statistics based on it, such as aic and bic, will not correspond to a correctly specified likelihood under QMLE.
However, the example is just a standard binomial model. In this case we can specify (success, failure) counts directly as the dependent variable. This then produces the same estimates and the same aic as the R version. The results are interpreted as a standard maximum likelihood estimate in this case.
fail = total - use
df["fail"] = fail
# (success, failure) counts as the dependent variable
logit_model2 = smf.glm(formula="use + fail ~ Age", data=df,
                       family=sm.families.Binomial(link=sm.families.links.Logit()),
                       ).fit()
print(logit_model2.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:        ['use', 'fail']   No. Observations:                   9
Model:                            GLM   Df Residuals:                       7
Model Family:                Binomial   Df Model:                           1
Link Function:                  Logit   Scale:                         1.0000
Method:                          IRLS   Log-Likelihood:               -18.059
Date:                Mon, 08 May 2023   Deviance:                      1.9469
Time:                        11:46:25   Pearson chi2:                    1.95
No. Iterations:                     5   Pseudo R-squ. (CS):             1.000
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -11.3464      1.172     -9.678      0.000     -13.644      -9.049
Age            0.5996      0.059     10.231      0.000       0.485       0.715
==============================================================================
logit_model.aic, logit_model2.aic
(298.1385764743575, 40.11730698160575)
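As a quick check of the earlier point, a small sketch (reusing the two fitted models) confirming that the fits differ only in their likelihood interpretation:
import numpy as np
# Coefficients and standard errors agree between the QMLE (var_weights)
# fit and the MLE (success/failure) fit ...
print(np.allclose(logit_model.params, logit_model2.params))  # True
print(np.allclose(logit_model.bse, logit_model2.bse))        # True
# ... but the log-likelihoods, and hence the AICs, do not.
print(logit_model.llf, logit_model2.llf)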
As background: this notebook https://www.statsmodels.org/dev/examples/notebooks/generated/glm_weights.html illustrates some of the properties of var_weights and freq_weights and their equivalent standard model when the MLE assumption holds. This relies to a large extent on averaging and aggregation properties of the linear exponential families in GLM.
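To make the aggregation point concrete, here is a minimal sketch (my own illustration, not part of the original answer): disaggregating the counts into one Bernoulli observation per person reproduces the same coefficients; only the combinatorial constant of the binomial density, and with it the log-likelihood and AIC, differs between the two representations.
import numpy as np
import statsmodels.api as sm
use = np.array([5, 7, 11, 20, 47, 49, 54, 53, 76])
total = np.array([30, 33, 35, 40, 67, 61, 63, 59, 80])
age = np.arange(16, 25)
# One row per individual: 1 if they use a credit card, 0 otherwise
y = np.concatenate([np.repeat([1, 0], [u, t - u]) for u, t in zip(use, total)])
x = sm.add_constant(np.repeat(age, total))
bernoulli_fit = sm.GLM(y, x, family=sm.families.Binomial()).fit()
print(bernoulli_fit.params)  # approx. [-11.3464, 0.5996], same as before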
Answer 3
Score: 0
I would like to provide an answer similar to Josef's answer for someone who wants to use the sm.GLM() function instead of smf.glm().
# Imports
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Let's create a dataframe of our data
use = np.array([5, 7, 11, 20, 47, 49, 54, 53, 76])
total = np.array([30, 33, 35, 40, 67, 61, 63, 59, 80])
age = np.arange(16, 25)
fail = total - use
df = pd.DataFrame({"Use": use, "Total": total, "Age": age, "Fail": fail})

endog_star = df[["Use", "Fail"]]        # our response variable in [success, failure] form
exog_star = df["Age"]                   # our independent variable
exog_star = sm.add_constant(exog_star)  # we must add a constant explicitly,
                                        # because none is included by default

# Let's create our model
logit_model = sm.GLM(
    endog=endog_star,
    exog=exog_star,
    family=sm.families.Binomial(link=sm.families.links.Logit())).fit()
logit_model.summary()
logit_model.aic
Output:
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:        ['Use', 'Fail']   No. Observations:                   9
Model:                            GLM   Df Residuals:                       7
Model Family:                Binomial   Df Model:                           1
Link Function:                  Logit   Scale:                         1.0000
Method:                          IRLS   Log-Likelihood:               -18.059
Date:                Fri, 12 May 2023   Deviance:                      1.9469
Time:                        12:53:23   Pearson chi2:                    1.95
No. Iterations:                     5   Pseudo R-squ. (CS):             1.000
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        -11.3464      1.172     -9.678      0.000     -13.644      -9.049
Age            0.5996      0.059     10.231      0.000       0.485       0.715
==============================================================================

40.11730698160573