How to measure the horizontal distances between two CDFs with unequal numbers of data points

Question


I have one dataset with 261 data points and another with 373 data points. Here is the data:

```r
dataset_1 <- data.frame(dataset_name = rep("dataset_1", 261),
                        value = seq(40, 10000, length.out = 261))
dataset_2 <- data.frame(dataset_name = rep("dataset_2", 373),
                        value = seq(50, 5000, length.out = 373))
dataset <- rbind(dataset_1, dataset_2)
```

The KS test:

```r
test_result <- ks.test(dataset$value[dataset$dataset_name == "dataset_1"],
                       dataset$value[dataset$dataset_name == "dataset_2"],
                       alternative = "less")
```

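Plotting the ECDFs:

```r
library(ggplot2)

# Pass the data frame directly so no pipe operator from another package is needed
ggplot(dataset, aes(x = value, group = dataset_name, color = dataset_name)) +
  stat_ecdf(linewidth = 2)  # use size = 2 on ggplot2 versions before 3.4.0
```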


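Now I need to measure the horizontal distance between the two curves at each probability level. For example, at 0.25 we have about 2500 from dataset_1 and about 1250 from dataset_2, so the distance is about 1250. Because dataset_1 has 261 points and dataset_2 has 373, the two ECDFs do not step at the same probabilities. How can I generate a data frame that shows these distances?

I modified dataset_1 using linear interpolation to create 373 data points and then checked the results.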

```r
# Linearly interpolate dataset_1 to 373 points
interpolated_dataset_1 <- approx(dataset_1$value, n = 373)

# Create the data frame
interpolated_dataset_1_dataframe <- data.frame(
  dataset_name = "modified_dataset_1",
  value = interpolated_dataset_1$y
)

# Combine the data
modified_dataset <- rbind(dataset, interpolated_dataset_1_dataframe)

# KS test
modified_test_result <- ks.test(
  modified_dataset$value[modified_dataset$dataset_name == "modified_dataset_1"],
  modified_dataset$value[modified_dataset$dataset_name == "dataset_2"],
  alternative = "less"
)

# Plot the ECDFs
library(ggplot2)
ggplot(modified_dataset,
       aes(x = value, group = dataset_name, color = dataset_name)) +
  stat_ecdf(linewidth = 2)  # use size = 2 on ggplot2 versions before 3.4.0
```

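The D statistic is almost the same but not identical, although the result is still significant.

Is there a better way to do this, using a step function, that gives exactly the same test statistic?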


Answer 1

Score: 1

Update

I think quantile should be helpful for your purpose; it is more efficient than the previous solution (ecdf + uniroot).

```r
dstat2 <- function(p, df1 = dataset_1, df2 = dataset_2) {
  # Absolute difference between the p-th quantiles of the two samples
  abs(quantile(df1$value, p) - quantile(df2$value, p))
}
```

such that

```r
> dstat2(0.25)
   25%
1242.5
> dstat2(0.5)
 50%
2495
> dstat2(0.75)
   75%
3747.5
```
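To get the data frame the question asks for, the same idea can be applied over a grid of probabilities. This is only a sketch: `horizontal_distances`, the probability grid, and the column names are made up for illustration and are not part of the original answer. quantile's default `type = 7` interpolates linearly between order statistics, while `type = 1` inverts the step-shaped ECDF, which may be closer to what a step-function approach would give.

```r
# Data from the question
dataset_1 <- data.frame(dataset_name = "dataset_1",
                        value = seq(40, 10000, length.out = 261))
dataset_2 <- data.frame(dataset_name = "dataset_2",
                        value = seq(50, 5000, length.out = 373))

# Hypothetical helper: horizontal gap between two CDFs at each probability.
# type is passed to quantile(); 7 is R's default, 1 is the inverse of the ECDF.
horizontal_distances <- function(probs, x, y, type = 7) {
  q_x <- quantile(x, probs, type = type, names = FALSE)
  q_y <- quantile(y, probs, type = type, names = FALSE)
  data.frame(p = probs, q_1 = q_x, q_2 = q_y, distance = abs(q_x - q_y))
}

# Illustrative grid; any probabilities in [0, 1] work
probs <- seq(0.05, 0.95, by = 0.05)

dist_interp <- horizontal_distances(probs, dataset_1$value, dataset_2$value)
dist_step   <- horizontal_distances(probs, dataset_1$value, dataset_2$value,
                                    type = 1)
head(dist_interp)
```

Because quantiles are evaluated at the same probabilities for both samples, the different sample sizes (261 and 373) do not need to be reconciled first.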

Here is the earlier solution using ecdf + uniroot:

```r
dstat <- function(p, df1 = dataset_1, df2 = dataset_2) {
  # For each data frame, find the x where its ECDF reaches p,
  # then take the absolute gap between the two roots
  abs(
    diff(
      sapply(
        list(df1, df2),
        \(v) {
          with(
            v,
            uniroot(
              \(x) ecdf(value)(x) - p,
              range(value)
            )$root
          )
        }
      )
    )
  )
}
```

and we can obtain

```r
> dstat(0.25)
[1] 1242.5
> dstat(0.5)
[1] 2495
> dstat(0.75)
[1] 3747.5
```
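On the question of getting exactly the same test statistic: the KS statistic is a vertical distance between the two ECDFs, so as far as I can tell the samples do not need the same number of points. Resampling dataset_1 to 373 points changes its ECDF slightly, which would explain the small difference in D. Below is a rough sketch that computes the one-sided statistic directly from the step functions, for comparison with ks.test. The variable names are illustrative, and the sign convention for "less" is my reading of how ks.test defines it.

```r
x <- seq(40, 10000, length.out = 261)  # dataset_1$value from the question
y <- seq(50, 5000, length.out = 373)   # dataset_2$value from the question

# Step-function ECDFs; each keeps its own sample size
F_x <- ecdf(x)
F_y <- ecdf(y)

# Both ECDFs only change at observed values, so the largest vertical gap
# must occur at one of the pooled sample points
pooled <- sort(unique(c(x, y)))
diffs <- F_x(pooled) - F_y(pooled)

d_two_sided <- max(abs(diffs))  # two-sided D
d_greater   <- max(diffs)       # D^+ : largest F_x - F_y
d_less      <- -min(diffs)      # D^- : largest F_y - F_x (assumed "less" convention)

# Compare with the statistic reported by ks.test on the original samples
ks_less <- unname(ks.test(x, y, alternative = "less")$statistic)
all.equal(d_less, ks_less)
```

If the comparison holds, the original unequal-sized samples can be passed to ks.test as they are, and the quantile-based horizontal distances can be reported separately.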

Posted by huangapple on May 18, 2023, 04:59:50
When reposting, please keep the link to this article: https://go.coder-hub.com/76276152.html