2023年4月13日 21:12:29go评论149阅读模式

英文:

Kolmogorov-Smirnov Test and Regression to derive the goodness of fit for negative binomial and Poisson distribution in R

问题

I have a vector dataset dt and only a few of 500 observations is given below.

[1] 0 4 8 13 15 1 0 5 4 6 0 0 0 1 2 3 0 0 0 0 0 3 0 10 15 0 23 0 6 0 10 2 15 0 3 0 2 0 [39] 3 0 42 4 6 12 3 2 0 0 0 2 3 9 6 0 3 2 1 6 0 0 0 0 0 1 0 0 2 0 0 0 0 7 0 0 0 1

I want to check if the distribution of data follows a negative binomial distribution or Poisson distribution better using the KS test and regression. There are only a few R packages available for the KS test of which not all give a good summary and some are no longer valid.

What I have tried?

Package iZID

install.packages("iZID")
library("iZID")
dis.kstest(dt, nsim = 100, bootstrap = TRUE, distri = 'nb')$pvalue
dis.kstest(dt, nsim = 100, bootstrap = TRUE, distri = 'Poisson')$pvalue

Returns

> dis.kstest(dt, nsim = 100, bootstrap = TRUE, distri = 'nb')$pvalue
[1] 0
> dis.kstest(dt, nsim = 100, bootstrap = TRUE, distri = 'Poisson')$pvalue
[1] 0

Question
First, With the above results it’s hard to recommend which distribution is a better fit. Second, I have also used ks.test() in dgof package but it needs initialization of lambda value for Poisson which is unknown and also Negative binomial should have a probability value initialized which is unknown. Third, I receive an error suggesting that goodfit() in vcd package is not available hence couldn't use this tool.

Can I use a ks-test package other than dgof, iZID, and vcd?
Can I use regression to compare the AIC of two distributions using glm in R? In that case, how to build the model? What is X and Y?

Data in `dput` format

dt <- c(0L, 4L, 8L, 13L, 15L, 1L, 0L, 5L, 4L, 6L, 0L, 0L, 0L, 1L, 2L, 
3L, 0L, 0L, 0L, 0L, 0L, 3L, 0L, 10L, 15L, 0L, 23L, 0L, 6L, 0L, 
10L, 2L, 15L, 0L, 3L, 0L, 2L, 0L, 3L, 0L, 42L, 4L, 6L, 12L, 3L, 
2L, 0L, 0L, 0L, 2L, 3L, 9L, 6L, 0L, 3L, 2L, 1L, 6L, 0L, 0L, 0L, 
0L, 0L, 1L, 0L, 0L, 2L, 0L, 0L, 0L, 0L, 7L, 0L, 0L, 0L, 1L)

英文:

I have a vector dataset dt and only a few of 500 observations is given below.

[1]  0  4  8 13 15  1  0  5  4  6  0  0  0  1  2  3  0  0  0  0  0  3  0 10 15  0 23  0  6  0 10  2 15  0  3  0  2  0
 [39]  3  0 42  4  6 12  3  2  0  0  0  2  3  9  6  0  3  2  1  6  0  0  0  0  0  1  0  0  2  0  0  0  0  7  0  0  0  1

What I have tried?

Package iZID

install.packages(&quot;iZID&quot;)
library(&quot;iZID&quot;)
dis.kstest(dt,nsim=100,bootstrap=TRUE,distri=&#39;nb&#39;)$pvalue
dis.kstest(dt,nsim=100,bootstrap=TRUE,distri=&#39;Poisson&#39;)$pvalue

Returns

&gt; dis.kstest(dt,nsim=100,bootstrap=TRUE,distri=&#39;nb&#39;)$pvalue
[1] 0
&gt; dis.kstest(dt,nsim=100,bootstrap=TRUE,distri=&#39;Poisson&#39;)$pvalue
[1] 0

Can I use a ks-test package other than dgof, iZID, and vcd?
Can I use regression to compare the AIC of two distributions using glm in R? In that case, how to build the model? What is X and Y?

Data in `dput` format

dt &lt;-
c(0L, 4L, 8L, 13L, 15L, 1L, 0L, 5L, 4L, 6L, 0L, 0L, 0L, 1L, 2L, 
3L, 0L, 0L, 0L, 0L, 0L, 3L, 0L, 10L, 15L, 0L, 23L, 0L, 6L, 0L, 
10L, 2L, 15L, 0L, 3L, 0L, 2L, 0L, 3L, 0L, 42L, 4L, 6L, 12L, 3L, 
2L, 0L, 0L, 0L, 2L, 3L, 9L, 6L, 0L, 3L, 2L, 1L, 6L, 0L, 0L, 0L, 
0L, 0L, 1L, 0L, 0L, 2L, 0L, 0L, 0L, 0L, 7L, 0L, 0L, 0L, 1L)

答案1

得分: 1

First, to answer your question about KS test packages, yes, you can try the ks package in R.

It provides the ks.test() function that you could use to perform a KS test on your dataset.

For example, to perform a KS test on your dataset using a Poisson distribution, you can use the following code (of course I am assuming that you already have installed the package, before running the below code):

install.packages("ks", dependencies = T)

library(ks)
ks.test(dt, "ppois", lambda=mean(dt))

Here,
"ppois" is the Poisson distribution,
"lambda" is an estimate for the rate parameter of the Poisson distribution and we have set it to the mean of your data

The above code returns:

Asymptotic one-sample Kolmogorov-Smirnov test

data:  dt
D = 0.44349, p-value = 2.078e-13
alternative hypothesis: two-sided

Which means:

The p-value is very small (2.078e-13), thus, suggesting a strong evidence
against the null hypothesis.
Therefore, you can reject the null hypothesis that the data follow the specified theoretical distribution at the significance level of 0.05, and hence, conclude that the data do not follow the theoretical distribution.

Similarly, you could use the qnbinom distribution for the negative binomial distribution.

Now, to answer your second question about comparing the AIC of two distributions using regression, what you could use is glm() native function in R.

All you need to do is, first, create an output variable Y that represents the count data, and then an explanatory (input) variable X that refers to the index of each observation in your dataset.

Now you can fit two models; one, with the Poisson Distributions and, second, with the Negative Binomial Distributions, respectively, and compare their AIC values to determine which distribution provides a better fit.

library(ks)
library(MASS)

Y <- dt            # create response var

X <- 1:length(Y)   # create independent (input) var

modelPD <- glm(Y ~ X, family = "poisson")   # fit the Poisson Distribution

modelNBD <- glm.nb(Y ~ X) # fit the Negative Binomial Distribution from MASS packages

# compare two AICs
AIC(modelPD)
AIC(modelNBD)

Which returns:

688.5675 and 334.3564

And as we know, lower the AIC value, the better the model fits the data.
Therefore, we can conclude that the Negative Binomial Distribution provides a better fit to the data compared to the Poisson Distribution.

Let me know if this helps...

英文:

First, to answer your question about KS test packages, yes, you can try the ks package in R.

It provides the ks.test() function that you could use to perform a KS test on your dataset.

install.packages(&quot;ks&quot;, dependencies = T)

library(ks)
ks.test(dt, &quot;ppois&quot;, lambda=mean(dt))

Here,
"ppois" is the Poisson distribution,
"lambda" is an estimate for the rate parameter of the Poisson distribution and we have set it to the mean of your data

The above code returns:

Asymptotic one-sample Kolmogorov-Smirnov test

data:  dt
D = 0.44349, p-value = 2.078e-13
alternative hypothesis: two-sided

Which means:

> The p-value is very small (2.078e-13), thus, suggesting a strong evidence
> against the null hypothesis.
> Therefore, you can reject the null hypothesis that the data follow the specified theoretical distribution at the significance level of 0.05, and hence, conclude that the data do not follow the theoretical distribution.

Similarly, you could use the qnbinom distribution for the negative binomial distribution.

Now, to answer your second question about comparing the AIC of two distributions using regression, what you could use is glm() native function in R.

All you need to do is, first, create an output variable Y that represents the count data, and then an explanatory (input) variable X that refers to the index of each observation in your dataset.

library(ks)
library(MASS)

Y &lt;- dt            # create response var

X &lt;- 1:length(Y)   # create independent (input) var

modelPD &lt;- glm(Y ~ X, family = &quot;poisson&quot;)   # fit the Poisson Distribution

modelNBD &lt;- glm.nb(Y ~ X) # fit the Negative Binomial Distribution from MASS packages

# compare two AICs
AIC(modelPD)
AIC(modelNBD)

Which returns:

688.5675 and 334.3564

Let me know if this helps...

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Kolmogorov-Smirnov测试和回归，以在R中推导负二项分布和泊松分布的拟合度。

问题

Data in `dput` format

Data in `dput` format

答案1

How to use shinymanager along with session$onSessionEnded() to close the app upon browser close

使用`rows_append()`时添加因子水平。

在保存绘图之前存储ggplot对象，计算时间会发生很大的变化。

使用lapply函数来修改R中的多个矩阵。

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论

问题

Data in dput format

Data in dput format

答案1

发表评论

Data in `dput` format

Data in `dput` format