R-style formula misbehaves when raising a variable to a power (e.g. a cube) in a GLM
Question
In the Python code below, the GLM specification does not include the third power in model1, but it does in model2:
model1 = glm(formula="wage ~ workhours + workhours**3 + C(gender)", data=df, family=sm.families.Gaussian())
model2 = glm(formula="wage ~ workhours + np.power(workhours, 3) + C(gender)", data=df, family=sm.families.Gaussian())
Is this a bug? According to the documentation, ** raises something to a power.
Answer 1
Score: 6
In a formula, ** is treated as a formula operator, not as regular exponentiation. (This is similar to how ^ works in an R formula.)
(a+b+c+d)**3 means that the model should include a, b, c, d, and all interactions between these variables up to 3rd order.
workhours**3 means that the model should include workhours and all interactions between... just workhours... up to 3rd order... but there are no such interaction terms, so it's equivalent to just workhours.
In contrast, np.power(workhours, 3) is treated as Python code, and computes the power you wanted.
statsmodels uses patsy for formula handling, so for full details on the formula language, you can check the patsy docs.
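One way to check this without fitting a model is to build the design matrices directly with patsy and compare their columns. The sketch below is only illustrative: the workhours values are made up, and the I(...) variant is an addition not mentioned in the answer above (I is patsy's identity wrapper, which makes its argument evaluate as plain Python).

```python
import numpy as np
from patsy import dmatrix

# Made-up data, used only to inspect which columns each formula produces.
data = {"workhours": np.array([20.0, 30.0, 40.0, 50.0])}

# Here ** is the formula operator, so no cubic column should appear.
as_operator = dmatrix("workhours + workhours**3", data)

# np.power(...) is evaluated as ordinary Python/NumPy code.
with_power = dmatrix("workhours + np.power(workhours, 3)", data)

# I(...) shields its contents from the formula parser, so ** is arithmetic here.
with_identity = dmatrix("workhours + I(workhours**3)", data)

for name, matrix in [
    ("workhours**3", as_operator),
    ("np.power(workhours, 3)", with_power),
    ("I(workhours**3)", with_identity),
]:
    print(name, "->", matrix.design_info.column_names)
```

If patsy behaves as described above, the first matrix should contain only the intercept and workhours, while the other two should carry an extra cubic column.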