“pscl”和 “countreg” 包中的 “zeroinfl” 提供非常不同的结果。为什么?

huangapple go评论120阅读模式
英文:

`zeroinfl` from the `pscl` and `countreg` packages give very different results. Why?

问题

The reason for the difference in results between zeroinfl from the pscl package and zeroinfl from the countreg package could be due to differences in the underlying algorithms or implementations of these functions. Each package may use slightly different optimization methods or default settings, which can lead to variations in the estimated coefficients and model outputs.

Since you mentioned that the results from Stata closely resemble the results of zeroinfl from the pscl package, it's possible that pscl and Stata may share similar algorithmic approaches or settings for this type of model.

To investigate further and potentially resolve the discrepancy, you may consider the following:

  1. Double-check the input data and ensure that it is identical for both models.
  2. Examine the documentation for both packages to understand any differences in default settings or algorithms.
  3. Try adjusting the optimization settings or algorithms used by the countreg package to see if it brings the results closer to those of pscl.
  4. Contact the maintainers or authors of the countreg package to inquire about the differences and seek their guidance on achieving consistency.

It's not uncommon for different software packages to produce slightly different results, even when fitting the same type of statistical model, due to variations in numerical optimization techniques or other factors.

英文:

I am experimenting with the mdvis dataset from the COUNT package of R for a teaching purpose. I fitted a zero-inflated negative-binomial model using the zeroinfl function from the pscl and countreg packages. However, the results of zeroinfl from the pscl package and from countreg package differ a lot.

The models and the outputs are provided below.

zeroinfl from pscl:

  1. mdvisit_zeroinf<-pscl::zeroinfl(numvisit ~ reform + badh + agegrp + educ + loginc | reform + badh + agegrp + educ + loginc, dist = "negbin", data = mdvis, control = zeroinfl.control(method = "BFGS", EM=F)
  2. summary(mdvisit_zeroinf)
  3. Call:
  4. pscl::zeroinfl(formula = numvisit ~ reform + badh + agegrp + educ + loginc |
  5. reform + badh + agegrp + educ + loginc, data = mdvis, dist = "negbin")
  6. Pearson residuals:
  7. Min 1Q Median 3Q Max
  8. -0.9667 -0.8037 -0.3597 0.3154 9.4878
  9. Count model coefficients (negbin with log link):
  10. Estimate Std. Error z value Pr(>|z|)
  11. (Intercept) -0.30316 0.56565 -0.536 0.5920
  12. reform -0.09916 0.05543 -1.789 0.0736 .
  13. badh 1.11555 0.07517 14.841 <2e-16 ***
  14. agegrp2 0.11376 0.06990 1.628 0.1036
  15. agegrp3 0.21126 0.08620 2.451 0.0143 *
  16. educ2 0.07605 0.07209 1.055 0.2914
  17. educ3 -0.03979 0.07295 -0.545 0.5854
  18. loginc 0.13209 0.07499 1.761 0.0782 .
  19. Log(theta) 0.04747 0.05815 0.816 0.4144
  20. Zero-inflation model coefficients (binomial with logit link):
  21. Estimate Std. Error z value Pr(>|z|)
  22. (Intercept) -30.7110 772.2930 -0.040 0.968
  23. reform 14.3274 763.7283 0.019 0.985
  24. badh -1.9503 3.5392 -0.551 0.582
  25. agegrp2 10.9236 113.6292 0.096 0.923
  26. agegrp3 9.7723 113.6558 0.086 0.931
  27. educ2 -0.6632 1.9878 -0.334 0.739
  28. educ3 -0.8743 1.3501 -0.648 0.517
  29. loginc 0.5437 1.9718 0.276 0.783
  30. ---
  31. Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  32. Theta = 1.0486
  33. Number of iterations in BFGS optimization: 106
  34. Log-likelihood: -4557 on 17 Df`

zeroinfl from countreg:

  1. mdvisit_zeroinf2<-countreg::zeroinfl(numvisit ~ reform + badh + agegrp + educ + loginc | reform + badh + agegrp + educ + loginc, dist = "negbin", data = mdvis, control = zeroinfl.control(method = "BFGS", EM=F))
  2. summary(mdvisit_zeroinf2)
  3. Call:
  4. countreg::zeroinfl(formula = numvisit ~ reform + badh + agegrp + educ + loginc |
  5. reform + badh + agegrp + educ + loginc, data = mdvis, dist = "negbin",
  6. control = zeroinfl.control(method = "BFGS", EM = F))
  7. Pearson residuals:
  8. Min 1Q Median 3Q Max
  9. -0.9523 -0.8057 -0.3615 0.3106 9.6300
  10. Count model coefficients (negbin with log link):
  11. Estimate Std. Error z value Pr(>|z|)
  12. (Intercept) -0.44473 0.53752 -0.827 0.40803
  13. reform -0.12962 0.05106 -2.538 0.01113 *
  14. badh 1.13558 0.07413 15.319 < 2e-16 ***
  15. agegrp2 0.05851 0.06359 0.920 0.35756
  16. agegrp3 0.18678 0.07176 2.603 0.00925 **
  17. educ2 0.06398 0.06621 0.966 0.33385
  18. educ3 -0.04825 0.07087 -0.681 0.49599
  19. loginc 0.15338 0.07056 2.174 0.02973 *
  20. Log(theta) 0.01232 0.04778 0.258 0.79652
  21. Zero-inflation model coefficients (binomial with logit link):
  22. Estimate Std. Error z value Pr(>|z|)
  23. (Intercept) -119.137 95.259 -1.251 0.2111
  24. reform 4.821 2.646 1.822 0.0684 .
  25. badh -21.461 4614.989 -0.005 0.9963
  26. agegrp2 9.261 28.341 0.327 0.7438
  27. agegrp3 3.502 27.984 0.125 0.9004
  28. educ2 -19.790 203.802 -0.097 0.9226
  29. educ3 -7.894 4.758 -1.659 0.0971 .
  30. loginc 13.010 10.463 1.243 0.2137
  31. ---
  32. Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  33. Theta = 1.0124
  34. Number of iterations in BFGS optimization: 197
  35. Log-likelihood: -4554 on 17 Df

What could be the reason for such a difference? Results from Stata closely resemble the results of zeroinfl from the pscl package (especially, the results for the count part of the model).

I fited the models exactly similarly by specifying the same options using zeroinfl.control(). I also tried to search online to see if others have reported similar issues previously, and if answers are already available. I searched for answers also on stackoverflow and CV. But I couldn't get any.

答案1

得分: 3

以下是您要翻译的内容:

问题的根源:

问题在于没有零膨胀,但实际上比负二项模型预期的零值少 - 尤其是在某些子组中,涉及到reformagegrp

症状:

由于缺乏零膨胀,二进制零膨胀部分会尽其所能地生成对于某些子组非常接近零的预测零膨胀概率。特别注意该模型部分中的截距非常小,具有大的标准误差。

总体上,这与二进制回归模型中的准完全分离类似。似然性变得非常平坦,并且它取决于优化器的设置在哪里停止(这些设置在psclcountreg之间不同)。

由于零膨胀模型的两个组成部分(计数与二进制部分)无法分开估计,一个组件的估计问题也可能导致另一个组件的估计问题/差异。

备选方案:

在这里,障碍模型有几个优点:(1) 它不仅可以处理零膨胀,还可以处理缺乏零值。(2) 模型的两个组成部分可以分开估计,因此问题不会波及到其他部分。

示例:

让我们比较基本的负二项模型与其零膨胀和障碍对应模型:

  1. data("mdvis", package = "COUNT")
  2. mdvis <- transform(mdvis,
  3. reform = factor(reform),
  4. badh = factor(badh),
  5. agegrp = factor(agegrp),
  6. educ = factor(educ)
  7. )
  8. library("countreg")
  9. f <- numvisit ~ reform + badh + agegrp + educ + loginc
  10. m <- glm.nb(f, data = mdvis)
  11. m2 <- zeroinfl(f, dist = "negbin", data = mdvis)
  12. m3 <- hurdle(f, dist = "negbin", data = mdvis)

就对数似然而言,零膨胀模型仅在基本模型上略有改善,尽管需要的参数几乎是基本模型的两倍。障碍模型改进了一些,但也不多:

  1. c(logLik(m), logLik(m2), logLik(m3))
  2. ## [1] -4560.910 -4554.146 -4550.520

就AIC和BIC而言,零膨胀模型是这三个模型中最差的。AIC略微倾向于障碍模型,而BIC更明确地倾向于基本模型。

  1. AIC(m, m2, m3)
  2. ## df AIC
  3. ## m 9 9139.819
  4. ## m2 17 9142.293
  5. ## m3 17 9135.040
  6. BIC(m, m2, m3)
  7. ## df BIC
  8. ## m 9 9191.195
  9. ## m2 17 9239.336
  10. ## m3 17 9232.083

查看基本模型的根图作为诊断显示,我们可以看到总体上观察到的零值比负二项模型预期的零值要少。但总体上基本模型已经相当合适。

  1. rootogram(m)

“pscl”和 “countreg” 包中的 “zeroinfl” 提供非常不同的结果。为什么?

(注: 具有置信限的根图版本实际上是由另一个正在开发中的包生成的。countreg中的版本没有这些置信限。)

英文:

Source of the problem:

The problem is that there is no zero inflation but actually less zeros than expected from a negative binomial model - espectially in some subgroups with respect to reform and agegrp.

Symptoms:

Due to the lack of zero inflation, the binary zero inflation part tries as hard as it can to produce predicted zero inflation probabilities that are very close to zero for certain subgroups. Notice especially the intercept in that part of the model that is extremely small with a large standard error.

Overall this looks similar to quasi-complete separation in binary regression models. The likelihood becomes very flat and it depends on the settings of the optimizer where exactly it stops (and these settings differ between pscl and countreg).

As the two components of the zero inflation model (count vs. binary part) cannot be estimated separately, problems in the estimation of one component can lead to problems/differences in the estimation of the other component as well.

Alternative:

The hurdle model has several advantages here: (1) It can not only deal with zero inflation but also with a lack of zeros. (2) The two components of the model can be estimated separately and hence problems cannot spill over.

Illustration:

Let's compare the basic negative binomial model with its zero-inflation and hurdle counterparts:

  1. data(&quot;mdvis&quot;, package = &quot;COUNT&quot;)
  2. mdvis &lt;- transform(mdvis,
  3. reform = factor(reform),
  4. badh = factor(badh),
  5. agegrp = factor(agegrp),
  6. educ = factor(educ)
  7. )
  8. library(&quot;countreg&quot;)
  9. f &lt;- numvisit ~ reform + badh + agegrp + educ + loginc
  10. m &lt;- glm.nb(f, data = mdvis)
  11. m2 &lt;- zeroinfl(f, dist = &quot;negbin&quot;, data = mdvis)
  12. m3 &lt;- hurdle(f, dist = &quot;negbin&quot;, data = mdvis)

In terms of the log-likelihood, the zero inflation model only improves very slightly on the basic model despite needing almost twice as many parameters. The hurdle model improves a bit more but also not much:

  1. c(logLik(m), logLik(m2), logLik(m3))
  2. ## [1] -4560.910 -4554.146 -4550.520

In terms of both AIC and BIC the zero inflation model is the worst out of these three models. AIC slightly prefers the hurdle model while BIC prefers the basic model somewhat more clearly.

  1. AIC(m, m2, m3)
  2. ## df AIC
  3. ## m 9 9139.819
  4. ## m2 17 9142.293
  5. ## m3 17 9135.040
  6. BIC(m, m2, m3)
  7. ## df BIC
  8. ## m 9 9191.195
  9. ## m2 17 9239.336
  10. ## m3 17 9232.083

Looking at the rootogram as a diagnostic display for the basic model, we see that overall there are actually less observed zeros than expected from a negative binomial model. But by and large the basic model already fits reasonably well.

  1. rootogram(m)

“pscl”和 “countreg” 包中的 “zeroinfl” 提供非常不同的结果。为什么?

(Note: The version of the rootogram with confidence limits is actually produced by another package under development. The version in countreg does not have these confidence limits.)

huangapple
  • 本文由 发表于 2023年4月13日 17:20:42
  • 转载请务必保留本文链接:https://go.coder-hub.com/76003780.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定