Why do I get different results when I want to find the AIC of my model in Python and R?

Question


I have a dataset and I want to study the relationship between the age and the probability of using a credit card. So I decided to apply weighted logistic regression.

Python code:

# Imports
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm

# Let's create a dataframe of our data
use = np.array([5, 7, 11, 20, 47, 49, 54, 53, 76])
total = np.array([30, 33, 35, 40, 67, 61, 63, 59, 80])
age = np.arange(16, 25)
p = use / total
df = pd.DataFrame({"Use": use, "Total": total, "Age": age, "P": p})

# Let's create our model
logit_model = smf.glm(formula="P ~ Age", data=df,
                      family=sm.families.Binomial(link=sm.families.links.Logit()),
                      var_weights=total).fit()

logit_model.summary()

Output

             Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      P   No. Observations:                    9
Model:                            GLM   Df Residuals:                        7
Model Family:                Binomial   Df Model:                            1
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -147.07
Date:                Mon, 08 May 2023   Deviance:                       1.9469
Time:                        10:22:00   Pearson chi2:                     1.95
No. Iterations:                     5   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust
==============================================================================
             coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -11.3464      1.172     -9.678      0.000     -13.644      -9.049
Age            0.5996      0.059     10.231      0.000       0.485       0.715
==============================================================================

In this model, I want to calculate the AIC.

logit_model.aic

Output

298.1385764743574

(In the above model I used the argument var_weights as Josef suggested in this thread)

Let's do the same in R.

R code:

use <- c(5, 7, 11, 20, 47, 49, 54, 53, 76)
total <- c(30, 33, 35, 40, 67, 61, 63, 59, 80)
age <- seq(16, 24)
p <- use / total

logit_temp <- glm(p ~ age, family = binomial, weights = total)
logit_temp

Output

Call:  glm(formula = p ~ age, family = binomial, weights = total)

Coefficients:
(Intercept)          age  
   -11.3464       0.5996  

Degrees of Freedom: 8 Total (i.e. Null);  7 Residual
Null Deviance:	    156.4 
Residual Deviance: 1.947 	AIC: 40.12

As you can see now, the AIC of the model I created with R is very different from the AIC I found with Python. What should I change in order to have the same results?

Answer 1

Score: 2


The difference reduces to a difference in the log-likelihoods: both packages appear to (correctly) compute AIC as (-2*loglikelihood + 2*df).

The help for statsmodels' GLM function says:

> WARNING: Loglikelihood and deviance are not valid in models where scale is equal to 1 (i.e., Binomial, NegativeBinomial, and Poisson). If variance weights are specified, then results such as loglike and deviance are based on a quasi-likelihood interpretation. The loglikelihood is not correctly specified in this case, and statistics based on it, such AIC or likelihood ratio tests, are not appropriate.

I'm not quite sure what to do about that ...

(I spent a little while digging down to try to reproduce the log-likelihood of -147.07 from Python by computing a quasi-likelihood, but so far without success)
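As a quick sanity check (a sketch that assumes logit_model from the question is still in scope), the Python AIC is reproducible from that formula:

# statsmodels GLM computes AIC as -2*llf + 2*(df_model + 1)
k = logit_model.df_model + 1           # 2 estimated parameters (Intercept + Age)
print(-2 * logit_model.llf + 2 * k)    # 298.138..., i.e. -2*(-147.07) + 4

The R side satisfies the same formula: -2*logLik(logit_temp) + 2*2 is -2*(-18.06) + 4 ≈ 40.12, so the AIC gap is entirely a log-likelihood gap (-147.07 vs about -18.06).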

Answer 2

Score: 2


I'm not sure (don't remember) how AIC is defined with var_weights in statsmodels GLM.

Update: var_weights are treated as a scale or dispersion factor in statsmodels GLM. It does not assume that there is a likelihood interpretation of var_weights, so we only have QMLE and not MLE (see Ben Bolker's answer).
Parameter estimates and standard errors are the same in both cases and do not directly rely on the MLE/QMLE distinction. However, the log-likelihood and statistics based on it, like AIC and BIC, will not correspond to a correctly specified likelihood under QMLE.

However, the example here is just a standard binomial model.

In this case we can specify the (success, failure) counts directly as the dependent variable. This then produces the same estimates and the same AIC as the R version. The results are interpreted as a standard maximum likelihood estimate in this case.

fail = total - use
df["fail"] = fail
logit_model2 = smf.glm(formula="use + fail ~ Age", data=df,
                       family=sm.families.Binomial(link=sm.families.links.Logit()),
                       ).fit()

print(logit_model2.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:        ['use', 'fail']   No. Observations:                    9
Model:                            GLM   Df Residuals:                        7
Model Family:                Binomial   Df Model:                            1
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -18.059
Date:                Mon, 08 May 2023   Deviance:                       1.9469
Time:                        11:46:25   Pearson chi2:                     1.95
No. Iterations:                     5   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -11.3464      1.172     -9.678      0.000     -13.644      -9.049
Age            0.5996      0.059     10.231      0.000       0.485       0.715
==============================================================================

logit_model.aic, logit_model2.aic
(298.1385764743575, 40.11730698160575)
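Two quick checks of the claims above (a sketch, assuming logit_model and logit_model2 from this thread are both in scope): the two fits agree on estimates and standard errors, and the grouped-binomial log-likelihood evaluated at the fitted probabilities reproduces the -18.059 from the summary (and hence the R AIC):

import numpy as np
from scipy import stats

# Same parameter estimates and standard errors under both weightings
print(np.allclose(logit_model.params, logit_model2.params))   # True
print(np.allclose(logit_model.bse, logit_model2.bse))         # True

# Grouped-binomial log-likelihood at the fitted success probabilities
p_hat = logit_model2.predict()                      # fitted P(use) per age group
ll = stats.binom.logpmf(use, total, p_hat).sum()    # about -18.059
print(-2 * ll + 2 * 2)                              # about 40.117, matching R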

As background, this notebook illustrates some of the properties of var_weights and freq_weights, and their equivalent standard models when the MLE assumption holds: https://www.statsmodels.org/dev/examples/notebooks/generated/glm_weights.html
This relies to a large extent on the averaging and aggregation properties of the linear exponential families in GLM.

Answer 3

Score: 0


I would like to provide an answer, similar to Josef's, for someone who wants to use the sm.GLM() function instead of smf.glm().

# Imports
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Let's create a dataframe of our data
use = np.array([5, 7, 11, 20, 47, 49, 54, 53, 76])
total = np.array([30, 33, 35, 40, 67, 61, 63, 59, 80])
age = np.arange(16, 25)
fail = total - use
df = pd.DataFrame({"Use": use, "Total": total, "Age": age, "Fail": fail})

endog_star = df[["Use", "Fail"]]        # our response variable in the form [success, failure]
exog_star = df["Age"]                   # our independent variable
exog_star = sm.add_constant(exog_star)  # we have to add a constant ourselves,
                                        # because one is not included by default

# Let's create our model
logit_model = sm.GLM(
    endog=endog_star,
    exog=exog_star,
    family=sm.families.Binomial(link=sm.families.links.Logit())).fit()

logit_model.summary()

logit_model.aic

Output:

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:        ['Use', 'Fail']   No. Observations:                    9
Model:                            GLM   Df Residuals:                        7
Model Family:                Binomial   Df Model:                            1
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -18.059
Date:                Fri, 12 May 2023   Deviance:                       1.9469
Time:                        12:53:23   Pearson chi2:                     1.95
No. Iterations:                     5   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        -11.3464      1.172     -9.678      0.000     -13.644      -9.049
Age            0.5996      0.059     10.231      0.000       0.485       0.715
==============================================================================

40.11730698160573
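As a footnote on the sm.add_constant step (a small sketch using exog_star from above): it simply prepends a column of ones named "const" to the design matrix, which is what produces the const row in the summary.

print(exog_star.head())
#    const  Age
# 0    1.0   16
# 1    1.0   17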
