Why is `summary_col` ignoring the `info_dict` parameter?


Question

I need to run some linear regressions and output LaTeX code with statsmodels in Python. I am using the summary_col function to do that. However, there is either a bug or a misunderstanding on my side. Please see the following code:

import numpy as np 
import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col

np.random.seed(123)

nsample = 100
x = np.linspace(0, 10, 100)
X = np.column_stack((x, x ** 2))
beta = np.array([1, 0.1, 10])
e = np.random.normal(size=nsample)

X = sm.add_constant(X)
y1 = np.dot(X, beta) + e
y2 = np.dot(X, beta) + 2 * e

model1 = sm.OLS(y1, X).fit()
model2 = sm.OLS(y2, X).fit()

Now, to have a table with the two models side by side:

out_table = summary_col(
    [model1, model2],
    stars=True, 
    float_format='%.2f',
    info_dict={
        'N': lambda x: "{0:d}".format(int(x.nobs)),
        'R2': lambda x: "{:.2f}".format(x.rsquared)
    }
)

Since I pass the info_dict argument explicitly, I'd expect the table to report only the number of observations and the $R^2$. The result I get, however, is the following:

==============================
                 y I     y II 
------------------------------
const          0.81**  0.63   
               (0.34)  (0.67) 
x1             0.22    0.35   
               (0.16)  (0.31) 
x2             9.99*** 9.98***
               (0.02)  (0.03) 
R-squared      1.00    1.00   
R-squared Adj. 1.00    1.00   
N              100     100    
R2             1.00    1.00   
==============================
Standard errors in
parentheses.
* p<.1, ** p<.05, ***p<.01

Notice the two extra rows with the plain R-squared and the adjusted one. My desired behavior would be:

==============================
                 y I     y II 
------------------------------
const          0.81**  0.63   
               (0.34)  (0.67) 
x1             0.22    0.35   
               (0.16)  (0.31) 
x2             9.99*** 9.98***
               (0.02)  (0.03) 
N              100     100    
R2             1.00    1.00   
==============================
Standard errors in
parentheses.
* p<.1, ** p<.05, ***p<.01

The documentation is not the best yet: https://tedboy.github.io/statsmodels_doc/generated/statsmodels.iolib.summary2.summary_col.html

Any ideas on how to display only the information requested through the info_dict argument?


Answer 1

Score: 1

Let's have a look at the source code at

https://github.com/statsmodels/statsmodels/blob/main/statsmodels/iolib/summary2.py

We can see that the function summary_col takes info_dict as an argument and uses it in the following way:

if info_dict:
    cols = [_col_info(x, info_dict.get(x.model.__class__.__name__, info_dict))
            for x in results]

In this case, that means _col_info(model1, info_dict) and _col_info(model2, info_dict) are called to generate your N and R2 rows. The lack of type annotations (nothing for mypy to check) and of comments makes these functions quite obscure, actually.
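To make the lookup in that snippet concrete, here is a minimal sketch in plain Python. It does not call statsmodels: `FakeResult`, `pick_info`, `render`, and the empty `OLS`/`Logit` classes are hypothetical stand-ins invented for illustration, and skipping nested dicts inside `render` is an assumption of this sketch rather than a statement about the library's internals. It only shows what `info_dict.get(class_name, info_dict)` selects: a nested dict keyed by the model's class name if one exists, otherwise the whole `info_dict`.

```python
# Hypothetical stand-ins for model classes; only their names matter here.
class OLS: ...
class Logit: ...


class FakeResult:
    """Hypothetical results object exposing only .model and .nobs."""

    def __init__(self, model, nobs):
        self.model = model
        self.nobs = nobs


def pick_info(result, info_dict):
    # Same expression as in the quoted snippet: prefer a nested dict keyed
    # by the model's class name, otherwise fall back to the whole dict.
    return info_dict.get(result.model.__class__.__name__, info_dict)


def render(result, info_dict):
    chosen = pick_info(result, info_dict)
    # Assumption of this sketch: entries that are themselves dicts belong
    # to another model type and are skipped.
    return {label: func(result)
            for label, func in chosen.items()
            if not isinstance(func, dict)}


flat = {'N': lambda r: str(int(r.nobs))}
nested = {'N': lambda r: str(int(r.nobs)),
          'Logit': {'Obs': lambda r: str(int(r.nobs))}}

ols_res = FakeResult(OLS(), 100.0)
logit_res = FakeResult(Logit(), 50.0)

print(render(ols_res, flat))      # flat dict is used as-is
print(render(ols_res, nested))    # no 'OLS' key, so the whole dict is used
print(render(logit_res, nested))  # the 'Logit' entry takes over
```

In the question, `info_dict` is flat and has no `'OLS'` key, so both models fall through to the whole dict. That is why N and R2 show up for both columns.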

Later on, the cols list will be added to the variable summ that will be part of a Summary object.

smry = Summary()
smry._merge_latex = True
smry.add_df(summ, header=True, align='l')

However, cols is actually being redefined here. It was defined earlier as

cols = [_col_params(x, stars=stars, float_format=float_format) for x in results]

and that constituted the first part of summ.
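The two-stage construction described above can be pictured with a small pandas sketch. This is only an illustration with made-up frames (`params_block`, `info_block`, and the values in them are hypothetical, copied from the first column of the question's table); the library assembles its table with its own internal code, which may differ in detail. The point is that the `info_dict` rows are stacked below a block that already contains the R-squared rows.

```python
import pandas as pd

# Stand-in for what the parameter stage produces for one model: coefficient
# and standard-error rows followed by the two automatic R-squared rows.
params_block = pd.DataFrame(
    {'y I': ['0.81**', '(0.34)', '1.00', '1.00']},
    index=['const', '', 'R-squared', 'R-squared Adj.'],
)

# Stand-in for what the info_dict stage produces.
info_block = pd.DataFrame({'y I': ['100', '1.00']}, index=['N', 'R2'])

# The info rows are appended to the parameter block; nothing is replaced.
table = pd.concat([params_block, info_block])
print(table)
```

Seen this way, `info_dict` can only add rows to the bottom of the table; it has no say over what the first block contains.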

The issue is that _col_params adds R-squared and R-squared Adj. whether you like it or not. Here is the source code:

rsquared = getattr(result, 'rsquared', np.nan)
rsquared_adj = getattr(result, 'rsquared_adj', np.nan)
r2 = pd.Series({('R-squared', ""): rsquared,
                ('R-squared Adj.', ""): rsquared_adj})

if r2.notnull().any():
    r2 = r2.apply(lambda x: float_format % x)
    res = pd.concat([res, r2], axis=0)
res = pd.DataFrame(res)
res.columns = [str(result.model.endog_names)]
return res

So what I would suggest is to manually modify the tables attribute of your output:

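# Keep rows 0-5 (coefficients and standard errors) and rows 8-9 (N and R2);
# rows 6 and 7 are the automatic R-squared and R-squared Adj. rows.
rm_extra_rows = lambda t: t.iloc[list(range(6)) + [8, 9], :]
out_table.tables = [rm_extra_rows(el) for el in out_table.tables]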

After that I get:

In [53]: out_table
Out[53]:
<class 'statsmodels.iolib.summary2.Summary'>
"""

=====================
        y I     y II 
---------------------
const 0.81**  0.63   
      (0.34)  (0.67) 
x1    0.22    0.35   
      (0.16)  (0.31) 
x2    9.99*** 9.98***
      (0.02)  (0.03) 
N     100     100    
R2    1.00    1.00   
=====================
Standard errors in
parentheses.
* p<.1, ** p<.05,
***p<.01
"""

which should be what you wanted to get.

huangapple
  • Published on February 7, 2023, 04:42:11
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75366372.html