Posted February 7, 2023, 04:42:11

Why is `summary_col` ignoring the `info_dict` parameter?

Question

I need to run some linear regressions and output LaTeX code with statsmodels in Python. I am using the summary_col function for that. However, either there is a bug or I am misunderstanding something. Consider the following code:

import numpy as np 
import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col
np.random.seed(123)
nsample = 100
x = np.linspace(0, 10, 100)
X = np.column_stack((x, x ** 2))
beta = np.array([1, 0.1, 10])
e = np.random.normal(size=nsample)
X = sm.add_constant(X)
y1 = np.dot(X, beta) + e
y2 = np.dot(X, beta) + 2 * e
model1 = sm.OLS(y1, X).fit()
model2 = sm.OLS(y2, X).fit()

Now, to have a table with the two models side by side:

out_table = summary_col(
    [model1, model2],
    stars=True, 
    float_format='%.2f',
    info_dict={
        'N': lambda x: "{0:d}".format(int(x.nobs)),
        'R2': lambda x: "{:.2f}".format(x.rsquared)
    }
)

I would therefore expect the table to report only the number of observations and $R^2$, since I pass the info_dict argument explicitly. The result I get, however, is the following:

==============================
                 y I     y II 
------------------------------
const          0.81**  0.63   
               (0.34)  (0.67) 
x1             0.22    0.35   
               (0.16)  (0.31) 
x2             9.99*** 9.98***
               (0.02)  (0.03) 
R-squared      1.00    1.00   
R-squared Adj. 1.00    1.00   
N              100     100    
R2             1.00    1.00   
==============================
Standard errors in
parentheses.
* p<.1, ** p<.05, ***p<.01

Note the two extra rows with the regular R-squared and the adjusted R-squared. The output I want is:

==============================
                 y I     y II 
------------------------------
const          0.81**  0.63   
               (0.34)  (0.67) 
x1             0.22    0.35   
               (0.16)  (0.31) 
x2             9.99*** 9.98***
               (0.02)  (0.03) 
N              100     100    
R2             1.00    1.00   
==============================
Standard errors in
parentheses.
* p<.1, ** p<.05, ***p<.01

The documentation is not the best yet: https://tedboy.github.io/statsmodels_doc/generated/statsmodels.iolib.summary2.summary_col.html

Any ideas on how to display only the information requested through the info_dict argument?

Answer 1

Score: 1

Let's have a look at the source code at

https://github.com/statsmodels/statsmodels/blob/main/statsmodels/iolib/summary2.py

We can see that the function summary_col takes info_dict as an argument and uses it as follows:

if info_dict:
    cols = [_col_info(x, info_dict.get(x.model.__class__.__name__, info_dict))
            for x in results]

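In your case, info_dict has no key matching the model class name (OLS), so the .get call falls back to the whole dictionary. The calls are therefore effectively _col_info(model1, info_dict) and _col_info(model2, info_dict), and they generate your N and R2 rows. The lack of type annotations and comments makes these functions rather obscure.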
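The fallback in that lookup can be illustrated in isolation. This is only a sketch of the dict.get pattern quoted above, not the library's code: the OLS class here is a hypothetical empty stand-in, and pick_info is a name made up for the example.

```python
# Sketch of the lookup pattern info_dict.get(class_name, info_dict).
# `OLS` is a hypothetical stand-in class; `pick_info` is an illustrative helper.

class OLS:
    pass


def pick_info(model, info_dict):
    """Return the sub-dict keyed by the model's class name, else the whole dict."""
    return info_dict.get(model.__class__.__name__, info_dict)


# A flat dictionary, as in the question: no key is called "OLS".
flat = {"N": lambda r: "100", "R2": lambda r: "1.00"}

# A nested dictionary keyed by model class name.
nested = {"OLS": {"N": lambda r: "100"}}

model = OLS()
flat_choice = pick_info(model, flat)      # falls back to the whole dict
nested_choice = pick_info(model, nested)  # picks the "OLS" sub-dict
```

With the flat dictionary from the question, the whole dictionary is used for every model, which is why both columns get the same N and R2 rows.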

Later on, the cols list will be added to the variable summ that will be part of a Summary object.

smry = Summary()
smry._merge_latex = True
smry.add_df(summ, header=True, align='l')

However, this assignment rebinds cols; earlier in the function it was defined as

cols = [_col_params(x, stars=stars, float_format=float_format) for x in results]

and that constituted the first part of summ.

The issue is that _col_params adds the R-squared and R-squared Adj. rows whether you like it or not, without consulting info_dict. Here is the relevant source code:

rsquared = getattr(result, 'rsquared', np.nan)
rsquared_adj = getattr(result, 'rsquared_adj', np.nan)
r2 = pd.Series({('R-squared', ""): rsquared,
                ('R-squared Adj.', ""): rsquared_adj})
if r2.notnull().any():
    r2 = r2.apply(lambda x: float_format % x)
    res = pd.concat([res, r2], axis=0)
res = pd.DataFrame(res)
res.columns = [str(result.model.endog_names)]
return res
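To make the two-stage assembly concrete, here is a deliberately simplified, pure-Python sketch. The functions col_params_sketch and col_info_sketch are hypothetical stand-ins written for this illustration (they skip standard errors, stars, and pandas entirely), and the result object is a fake built with SimpleNamespace rather than a fitted model.

```python
from types import SimpleNamespace


def col_params_sketch(result, float_format="%.2f"):
    """Hypothetical stand-in for _col_params: coefficient rows plus R-squared rows."""
    rows = [(name, float_format % value) for name, value in result.params.items()]
    # The R-squared rows are appended here without looking at info_dict at all.
    rows.append(("R-squared", float_format % result.rsquared))
    rows.append(("R-squared Adj.", float_format % result.rsquared_adj))
    return rows


def col_info_sketch(result, info_dict):
    """Hypothetical stand-in for _col_info: one row per info_dict entry."""
    return [(label, func(result)) for label, func in info_dict.items()]


# Fake result object with made-up numbers, for illustration only.
result = SimpleNamespace(
    params={"const": 0.81, "x1": 0.22},
    rsquared=0.9991,
    rsquared_adj=0.9990,
    nobs=100.0,
)
info_dict = {
    "N": lambda r: "{0:d}".format(int(r.nobs)),
    "R2": lambda r: "{:.2f}".format(r.rsquared),
}

# The info rows are stacked below the parameter rows, so they add to the
# R-squared rows instead of replacing them.
column = col_params_sketch(result) + col_info_sketch(result, info_dict)
labels = [label for label, _ in column]
```

The resulting labels contain both the built-in R-squared rows and the user-supplied N and R2 rows, mirroring the table in the question.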

So I would suggest manually modifying the tables attribute of your output, keeping the first six rows (coefficients and standard errors) and the last two (N and R2):

rm_extra_rows = lambda t : t.iloc[list(range(6)) + [8,9],:]
out_table.tables = [rm_extra_rows(el) for el in out_table.tables]

After that I get

In [53]: out_table
Out[53]:
<class 'statsmodels.iolib.summary2.Summary'>
"""
=====================
        y I     y II 
---------------------
const 0.81**  0.63   
      (0.34)  (0.67) 
x1    0.22    0.35   
      (0.16)  (0.31) 
x2    9.99*** 9.98***
      (0.02)  (0.03) 
N     100     100    
R2    1.00    1.00   
=====================
Standard errors in
parentheses.
* p<.1, ** p<.05,
***p<.01
"""

which should be what you wanted to get.
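
The positional indices above break as soon as the number of regressors changes. A possible alternative, sketched here under the assumption that the row labels in out_table.tables are the plain strings shown in the printed table, is to drop rows by label. The helper name drop_rows_by_label and the small demo DataFrame are made up for this illustration; the demo is not real summary_col output.

```python
import pandas as pd


def drop_rows_by_label(table, labels=("R-squared", "R-squared Adj.")):
    """Return `table` without the rows whose index label is in `labels`."""
    return table.loc[~table.index.isin(labels)]


# Hand-built stand-in for one table (assumption: string index labels,
# with empty strings on the standard-error rows).
demo = pd.DataFrame(
    {
        "y I": ["0.81**", "(0.34)", "1.00", "1.00", "100", "1.00"],
        "y II": ["0.63", "(0.67)", "1.00", "1.00", "100", "1.00"],
    },
    index=["const", "", "R-squared", "R-squared Adj.", "N", "R2"],
)

trimmed = drop_rows_by_label(demo)
```

If the assumption about the index holds for your statsmodels version, the same helper could be applied as out_table.tables = [drop_rows_by_label(t) for t in out_table.tables]; it is worth printing out_table.tables[0].index first to check the actual labels.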
