How to measure the horizontal distances between two CDFs with uneven data points

Question

I have one dataset with 261 data points and another with 373 data points. Here is the data:
```r
dataset_1 <- data.frame(dataset_name = rep("dataset_1", 261),
                        value = seq(40, 10000, length.out = 261))
dataset_2 <- data.frame(dataset_name = rep("dataset_2", 373),
                        value = seq(50, 5000, length.out = 373))

# Stack both samples into one long data frame
dataset <- rbind(dataset_1, dataset_2)
```

The KS test:

```r
test_result <- ks.test(dataset$value[dataset$dataset_name == "dataset_1"],
                       dataset$value[dataset$dataset_name == "dataset_2"],
                       alternative = "less")
```

Plotting the ECDFs:

```r
library(ggplot2)
library(magrittr)  # provides the %>% pipe

dataset %>%
  ggplot(aes(x = value, group = dataset_name, color = dataset_name)) +
  stat_ecdf(size = 2)
```

Now I need to measure the horizontal distance between the two curves at each probability level. For example, at 0.25 the value is about 2500 for dataset_1 and about 1250 for dataset_2, so the distance is about 1250. Since dataset_1 has 261 points and dataset_2 has 373, how can I generate a data frame that shows these distances?

I resampled dataset_1 to 373 data points using linear interpolation and then checked the results:

```r
interpolated_dataset_1 <- approx(dataset_1$value, n = 373)

# Creating the data frame
interpolated_dataset_1_dataframe <- data.frame(
  dataset_name = "modified_dataset_1",
  value = interpolated_dataset_1$y
)

# Combining the data
modified_dataset <- rbind(dataset, interpolated_dataset_1_dataframe)

# The KS test
modified_test_result <- ks.test(
  modified_dataset$value[modified_dataset$dataset_name == "modified_dataset_1"],
  modified_dataset$value[modified_dataset$dataset_name == "dataset_2"],
  alternative = "less"
)

# The ECDFs
library(ggplot2)
library(magrittr)  # provides the %>% pipe

modified_dataset %>%
  ggplot(aes(x = value, group = dataset_name, color = dataset_name)) +
  stat_ecdf(size = 2)
```

The D statistic is almost the same but not quite, although the result is still significant.

Is there a better way to do this with a step function, so that I get exactly the same test statistic?
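
For reference, the statistic that `ks.test` reports can be reproduced from the two step functions directly, without resampling either dataset. The following is a minimal sketch, not the asker's or the answerer's method: the names `F_x`, `F_y`, `pooled`, and `gap` are illustrative, and it assumes the `"less"` alternative corresponds to the largest amount by which the first ECDF lies below the second, as in base R's two-sample test.

```r
# Same synthetic data as in the question
x <- seq(40, 10000, length.out = 261)  # dataset_1 values
y <- seq(50, 5000, length.out = 373)   # dataset_2 values

# Empirical CDFs are right-continuous step functions
F_x <- ecdf(x)
F_y <- ecdf(y)

# The vertical gap can only change at observed values, so it is enough
# to evaluate both step functions on the pooled sample
pooled <- sort(unique(c(x, y)))
gap <- F_x(pooled) - F_y(pooled)

d_two_sided <- max(abs(gap))
d_greater   <- max(gap)    # largest amount by which F_x lies above F_y
d_less      <- max(-gap)   # largest amount by which F_x lies below F_y

# Compare with ks.test() on the original, non-interpolated samples
reference <- ks.test(x, y, alternative = "less")
all.equal(d_less, unname(reference$statistic))
```

Interpolating dataset_1 to 373 points replaces the sample with a different one, so its ECDF (and therefore D) can shift slightly; evaluating the original step functions avoids that.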

Answer 1

Score: 1

Update

`quantile` should work for your purpose, and it is more efficient than the previous solution (`ecdf` + `uniroot`):

```r
dstat2 <- function(p, df1 = dataset_1, df2 = dataset_2) {
    # Horizontal distance between the two distributions at probability p
    abs(quantile(df1$value, p) - quantile(df2$value, p))
}
```

such that

```r
> dstat2(0.25)
   25%
1242.5

> dstat2(0.5)
 50%
2495

> dstat2(0.75)
   75%
3747.5
```
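
Because `quantile` accepts a vector of probabilities, the same idea extends to the data frame of distances the question asks for. This is a sketch under assumptions: the probability grid `probs` and the column names are arbitrary choices, not part of the original answer, and `quantile` is left at its default interpolating type.

```r
# Same synthetic data as in the question
dataset_1 <- data.frame(value = seq(40, 10000, length.out = 261))
dataset_2 <- data.frame(value = seq(50, 5000, length.out = 373))

# Probability levels at which to compare the two curves (arbitrary grid)
probs <- seq(0.05, 0.95, by = 0.05)

# Quantiles of each sample at those levels
q1 <- quantile(dataset_1$value, probs, names = FALSE)
q2 <- quantile(dataset_2$value, probs, names = FALSE)

# One row per probability level, with the horizontal distance
distances <- data.frame(
    p           = probs,
    q_dataset_1 = q1,
    q_dataset_2 = q2,
    distance    = abs(q1 - q2)
)

head(distances)
```

The two samples never need to have the same length, since each one is summarized at the same probability levels.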

Here is the original solution using `ecdf` + `uniroot`:

```r
dstat <- function(p, df1 = dataset_1, df2 = dataset_2) {
    # For each data frame, locate the value where its ECDF reaches p
    roots <- sapply(
        list(df1, df2),
        \(v) uniroot(\(x) ecdf(v$value)(x) - p, range(v$value))$root
    )
    # Horizontal distance between the two located values
    abs(diff(roots))
}
```

which gives

```r
> dstat(0.25)
[1] 1242.5

> dstat(0.5)
[1] 2495

> dstat(0.75)
[1] 3747.5
```
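
If the distance should follow the step function exactly, rather than a root found numerically or an interpolated quantile, one option is the generalized inverse of the ECDF: the smallest observed value whose ECDF is at least `p`. The sketch below assumes that definition; `inverse_ecdf` is a hypothetical helper, and `type = 1` in `quantile` is used as a cross-check because it is documented as the inverse of the empirical distribution function.

```r
# Same synthetic data as in the question
x <- seq(40, 10000, length.out = 261)  # dataset_1 values
y <- seq(50, 5000, length.out = 373)   # dataset_2 values

# Smallest observed value whose ECDF is at least p (hypothetical helper)
inverse_ecdf <- function(values, p) {
    sorted <- sort(values)
    sorted[pmax(1, ceiling(p * length(sorted)))]
}

p <- c(0.25, 0.5, 0.75)
step_distance <- abs(inverse_ecdf(x, p) - inverse_ecdf(y, p))

# quantile() with type = 1 uses the same inverse-ECDF definition
type1_distance <- abs(quantile(x, p, type = 1, names = FALSE) -
                      quantile(y, p, type = 1, names = FALSE))

all.equal(step_distance, type1_distance)
```

These distances are always differences of observed values, so they can differ slightly from the interpolated results above.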

huangapple
  • Posted on May 18, 2023, 04:59:50
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76276152.html