May 18, 2023 04:59:50

How to measure the horizontal distances between two CDFs with unequal numbers of data points

Question

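I have one dataset with 261 data points and another with 373 data points. Here is the data:

```r
dataset_1 = data.frame(dataset_name = rep("dataset_1", 261), 
                       value = seq(40, 10000, length.out = 261))
dataset_2 = data.frame(dataset_name = rep("dataset_2", 373), 
                       value = seq(50, 5000, length.out = 373))
dataset <- rbind(dataset_1, dataset_2)
```

The KS test:

```r
ks.test(dataset$value[dataset$dataset_name=="dataset_1"],
        dataset$value[dataset$dataset_name=="dataset_2"],
        alternative = c("less")) -> test_result
```

Plotting the ECDFs:

```r
library(ggplot2)
library(dplyr)  # provides the %>% pipe, which ggplot2 does not export
dataset %>%
  ggplot(aes(x= value, group = dataset_name, color = dataset_name)) +
  stat_ecdf(size =2)
```

[Figure: ECDF plot of the two datasets]

Now I need to measure the horizontal distance between the two curves at each probability point. For example, at 0.25 we have roughly 2500 from dataset_1 and roughly 1250 from dataset_2, so the distance is about 1250. Since dataset_1 has 261 points and dataset_2 has 373 points, the two ECDFs do not share the same probability steps. How can I generate a data frame that shows these distances?

I modified dataset_1 using linear approximation to create 373 data points and then checked the results:

```r
interpolated_dataset_1  <- approx(dataset_1$value, n = 373)
# create the data frame
interpolated_dataset_1_dataframe <- data.frame(dataset_name = 
              "modified_dataset_1", value = interpolated_dataset_1$y)
# combine the data
modified_dataset <- rbind(dataset, interpolated_dataset_1_dataframe)
# the KS test
ks.test(modified_dataset$value[modified_dataset$dataset_name==
                               "modified_dataset_1"],
        modified_dataset$value[modified_dataset$dataset_name=="dataset_2"],
        alternative = c("less")) -> modified_test_result
# plot the ECDFs
library(ggplot2)
library(dplyr)  # provides the %>% pipe
modified_dataset %>%
  ggplot(aes(x= value, group = dataset_name, color = dataset_name)) +
  stat_ecdf(size =2)
```

The D statistic is almost the same as before but not identical, although the result is still significant.

Is there a better way to do this with a step function, so that I get exactly the same test statistic?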


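Answer 1

Score: 1

Update

I think `quantile` should serve your purpose, and it is more efficient than the earlier solution (`ecdf` + `uniroot`):

```r
# Horizontal distance between the two distributions at probability p
dstat2 <- function(p, df1 = dataset_1, df2 = dataset_2) {
    abs(quantile(df1$value, p) - quantile(df2$value, p))
}
```

such that

```
> dstat2(0.25)
   25%
1242.5
> dstat2(0.5)
 50%
2495
> dstat2(0.75)
   75%
3747.5
```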
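To get a data frame of distances over a whole grid of probabilities, as the question asks, one possible sketch follows. The helper name `horizontal_distances`, the probability grid, and the choice of `type = 1` are assumptions of this sketch, not part of the original answer. With `type = 1`, `quantile()` returns the inverse of the empirical step function rather than the default interpolated quantile (type 7).

```r
# Two samples of unequal size, as defined in the question.
dataset_1 <- data.frame(dataset_name = "dataset_1",
                        value = seq(40, 10000, length.out = 261))
dataset_2 <- data.frame(dataset_name = "dataset_2",
                        value = seq(50, 5000, length.out = 373))

# Hypothetical helper: horizontal distance between two ECDFs at each
# probability in p. type = 1 inverts the ECDF step function directly.
horizontal_distances <- function(p, x1, x2, type = 1) {
  q1 <- quantile(x1, probs = p, type = type, names = FALSE)
  q2 <- quantile(x2, probs = p, type = type, names = FALSE)
  data.frame(p = p, q1 = q1, q2 = q2, distance = abs(q1 - q2))
}

# Illustrative grid of probabilities (an assumption; pick any grid you need).
p_grid <- seq(0.01, 0.99, by = 0.01)
distances <- horizontal_distances(p_grid, dataset_1$value, dataset_2$value)
head(distances)
```

With the default `type = 7`, as used by `dstat2`, quantiles are linearly interpolated between order statistics, so the two choices can differ slightly at probabilities that fall between data points.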

Here is a solution using `ecdf` + `uniroot`:

```r
dstat <- function(p, df1 = dataset_1, df2 = dataset_2) {
    abs(
        diff(
            # for each data frame, find the x at which its ECDF reaches p
            sapply(
                list(df1, df2),
                \(v) {
                    with(
                        v,
                        uniroot(
                            \(x) ecdf(value)(x) - p,
                            range(value)
                        )$root
                    )
                }
            )
        )
    )
}
```

and we can obtain

```
> dstat(0.25)
[1] 1242.5
> dstat(0.5)
[1] 2495
> dstat(0.75)
[1] 3747.5
```
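On the question's last point, the D statistic from `ks.test` is a vertical distance between the two ECDFs, so it can be computed from the step functions of the original samples without resampling dataset_1 to 373 points. The sketch below is a minimal illustration. It assumes that the statistic for `alternative = "less"` is the largest amount by which the ECDF of the second sample exceeds that of the first. The names `F_x`, `F_y`, `pooled`, and `d_less` are illustrative only.

```r
# Two samples of unequal size, as defined in the question.
dataset_1 <- data.frame(dataset_name = "dataset_1",
                        value = seq(40, 10000, length.out = 261))
dataset_2 <- data.frame(dataset_name = "dataset_2",
                        value = seq(50, 5000, length.out = 373))
x <- dataset_1$value
y <- dataset_2$value

# ECDFs as step functions.
F_x <- ecdf(x)
F_y <- ecdf(y)

# Step functions only change at observed values, so the largest vertical
# gap is attained at one of the pooled data points.
pooled <- sort(unique(c(x, y)))

# Assumed definition of the one-sided statistic for alternative = "less".
d_less <- max(F_y(pooled) - F_x(pooled))

# Compare with the statistic reported by ks.test on the original samples.
ks_result <- suppressWarnings(ks.test(x, y, alternative = "less"))
all.equal(d_less, unname(ks_result$statistic))
```

Interpolating dataset_1 to 373 points changes its ECDF, which is one plausible reason the statistic in the question shifted slightly after resampling.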
