Why is `summary_col` ignoring the `info_dict` parameter?
Question
I need to run some linear regressions and output LaTeX code with `statsmodels` in Python. I am using the `summary_col` function to achieve that. However, there is either a bug or a misunderstanding on my side. Please see the following code:
import numpy as np
import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col
np.random.seed(123)
nsample = 100
x = np.linspace(0, 10, 100)
X = np.column_stack((x, x ** 2))
beta = np.array([1, 0.1, 10])
e = np.random.normal(size=nsample)
X = sm.add_constant(X)
y1 = np.dot(X, beta) + e
y2 = np.dot(X, beta) + 2 * e
model1 = sm.OLS(y1, X).fit()
model2 = sm.OLS(y2, X).fit()
Now, to have a table with the two models side by side:
out_table = summary_col(
    [model1, model2],
    stars=True,
    float_format='%.2f',
    info_dict={
        'N': lambda x: "{0:d}".format(int(x.nobs)),
        'R2': lambda x: "{:.2f}".format(x.rsquared)
    }
)
Hence I'd expect a table providing only the number of observations and the $R^2$, since I am explicit about the `info_dict` argument. The result I get, however, is the following:
==============================
               y I     y II
------------------------------
const          0.81**  0.63
               (0.34)  (0.67)
x1             0.22    0.35
               (0.16)  (0.31)
x2             9.99*** 9.98***
               (0.02)  (0.03)
R-squared      1.00    1.00
R-squared Adj. 1.00    1.00
N              100     100
R2             1.00    1.00
==============================
Standard errors in
parentheses.
* p<.1, ** p<.05, ***p<.01
Please notice how there are two extra rows with the normal r-squared and the adjusted one. My desired behavior would be:
==============================
               y I     y II
------------------------------
const          0.81**  0.63
               (0.34)  (0.67)
x1             0.22    0.35
               (0.16)  (0.31)
x2             9.99*** 9.98***
               (0.02)  (0.03)
N              100     100
R2             1.00    1.00
==============================
Standard errors in
parentheses.
* p<.1, ** p<.05, ***p<.01
The documentation is not the best yet: https://tedboy.github.io/statsmodels_doc/generated/statsmodels.iolib.summary2.summary_col.html
Any ideas on how to display only the information requested by the `info_dict` argument?
Answer 1
Score: 1
Let's have a look at the source code at
https://github.com/statsmodels/statsmodels/blob/main/statsmodels/iolib/summary2.py
We can see that the function `summary_col` takes `info_dict` as an argument and uses it in the following way:
if info_dict:
    cols = [_col_info(x, info_dict.get(x.model.__class__.__name__, info_dict))
            for x in results]
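In this case, `info_dict` has no key named after the model class (`OLS`), so `.get` falls back to the whole dictionary. The calls made are therefore `_col_info(model1, info_dict)` and `_col_info(model2, info_dict)`, which generate your `N` and `R2` rows. The absence of type hints (mypy) and comments makes these functions rather obscure.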
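To make the fallback concrete, here is a minimal pure-Python sketch of the same `.get` lookup. It does not import statsmodels: the classes `OLS` and `FakeResult` and the helper `resolve_info` are hypothetical stand-ins invented for illustration, and only the lookup expression itself is taken from the snippet above.

```python
# Hypothetical stand-ins; only the attributes used below are modelled.
class OLS:
    """Placeholder for a model class whose name is 'OLS'."""


class FakeResult:
    """Placeholder for a fitted-results object."""

    def __init__(self, model, nobs, rsquared):
        self.model = model
        self.nobs = nobs
        self.rsquared = rsquared


def resolve_info(result, info_dict):
    # Same lookup as in the snippet above: try the model's class name first,
    # and fall back to the whole dictionary when that key is absent.
    return info_dict.get(result.model.__class__.__name__, info_dict)


# A flat dictionary, as in the question.
flat = {
    'N': lambda r: "{0:d}".format(int(r.nobs)),
    'R2': lambda r: "{:.2f}".format(r.rsquared),
}
# A dictionary keyed by model class name, which the lookup also accepts.
nested = {'OLS': {'N': flat['N']}}

res = FakeResult(OLS(), nobs=100.0, rsquared=0.998)

flat_rows = {k: f(res) for k, f in resolve_info(res, flat).items()}
nested_rows = {k: f(res) for k, f in resolve_info(res, nested).items()}
print(flat_rows)
print(nested_rows)
```

With the flat dictionary both entries are used for every model, whereas with the nested one only the entries listed under `'OLS'` are. Either way, this lookup only decides which extra rows are appended; it has no say over rows produced elsewhere.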
Later on, the `cols` list is added to the variable `summ`, which becomes part of a `Summary` object:
smry = Summary()
smry._merge_latex = True
smry.add_df(summ, header=True, align='l')
However, `cols` is actually a redefinition; it was defined earlier as
cols = [_col_params(x, stars=stars, float_format=float_format) for x in results]
and that constituted the first part of `summ`.
The issue is that `_col_params` adds `R-squared` and `R-squared Adj.` whether you like it or not. Here is the source code:
rsquared = getattr(result, 'rsquared', np.nan)
rsquared_adj = getattr(result, 'rsquared_adj', np.nan)
r2 = pd.Series({('R-squared', ""): rsquared,
                ('R-squared Adj.', ""): rsquared_adj})
if r2.notnull().any():
    r2 = r2.apply(lambda x: float_format % x)
    res = pd.concat([res, r2], axis=0)
res = pd.DataFrame(res)
res.columns = [str(result.model.endog_names)]
return res
So what I would suggest is to manually modify the `tables` attribute of your output:
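# Keep rows 0-5 (coefficients and standard errors) and rows 8-9 (N, R2); drop rows 6-7 (R-squared, R-squared Adj.).
rm_extra_rows = lambda t: t.iloc[list(range(6)) + [8, 9], :]
out_table.tables = [rm_extra_rows(el) for el in out_table.tables]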
After that I get
In [53]: out_table
Out[53]:
<class 'statsmodels.iolib.summary2.Summary'>
"""
=====================
      y I     y II
---------------------
const 0.81**  0.63
      (0.34)  (0.67)
x1    0.22    0.35
      (0.16)  (0.31)
x2    9.99*** 9.98***
      (0.02)  (0.03)
N     100     100
R2    1.00    1.00
=====================
Standard errors in
parentheses.
* p<.1, ** p<.05,
***p<.01
"""
which should be what you wanted to get.
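The positional indices above (`range(6)` plus `[8, 9]`) only fit a model with exactly three regressors. A possible variation is to drop the two rows by label instead. The sketch below is not run against statsmodels: it uses a hand-built `pandas` DataFrame filled with the values from the output above, and it assumes the table's row labels are plain strings, with `''` on the standard-error rows. The helper name `drop_rows_by_label` is made up for illustration.

```python
import pandas as pd

# Hand-built stand-in for one entry of out_table.tables (values copied from
# the output above). Assumption: row labels are plain strings, with ''
# on the standard-error rows.
table = pd.DataFrame(
    {
        'y I': ['0.81**', '(0.34)', '0.22', '(0.16)', '9.99***', '(0.02)',
                '1.00', '1.00', '100', '1.00'],
        'y II': ['0.63', '(0.67)', '0.35', '(0.31)', '9.98***', '(0.03)',
                 '1.00', '1.00', '100', '1.00'],
    },
    index=['const', '', 'x1', '', 'x2', '',
           'R-squared', 'R-squared Adj.', 'N', 'R2'],
)


def drop_rows_by_label(t, labels=('R-squared', 'R-squared Adj.')):
    """Return t without the rows whose index label is in labels."""
    return t.loc[~t.index.isin(labels)]


trimmed = drop_rows_by_label(table)
print(trimmed.index.tolist())
```

If the real tables have that layout, the same helper could replace `rm_extra_rows` in `out_table.tables = [drop_rows_by_label(t) for t in out_table.tables]`, regardless of how many regressors there are. It is worth checking `out_table.tables[0].index` first, since the index layout may differ between statsmodels versions.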