How to measure the horizontal distances between two CDFs with unequal numbers of data points
Question
I have one dataset with 261 data points and another with 373 data points. Here is the data:
```r
dataset_1 <- data.frame(dataset_name = rep("dataset_1", 261),
                        value = seq(40, 10000, length.out = 261))
dataset_2 <- data.frame(dataset_name = rep("dataset_2", 373),
                        value = seq(50, 5000, length.out = 373))
dataset <- rbind(dataset_1, dataset_2)
```
The KS test:

```r
test_result <- ks.test(dataset$value[dataset$dataset_name == "dataset_1"],
                       dataset$value[dataset$dataset_name == "dataset_2"],
                       alternative = "less")
```
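Plotting the ECDFs:

```r
library(ggplot2)

# The native pipe |> (R >= 4.1) replaces %>%, which ggplot2 alone does not provide
dataset |>
  ggplot(aes(x = value, group = dataset_name, color = dataset_name)) +
  stat_ecdf(linewidth = 2)  # use size = 2 on ggplot2 versions before 3.4.0
```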
Now I need to measure the horizontal distance between the two curves at each probability point. For example, at 0.25 we have roughly 2500 from dataset_1 and roughly 1250 from dataset_2, so the distance is about 1250. Since dataset_1 has 261 points and dataset_2 has 373 points, the two ECDFs do not share the same probability steps. How can I generate a data frame that shows these distances?
I modified dataset_1 using linear interpolation to create 373 data points and then checked the results.
```r
# With only one vector supplied, approx() interpolates the values against their index
interpolated_dataset_1 <- approx(dataset_1$value, n = 373)

# creating the data frame
interpolated_dataset_1_dataframe <- data.frame(dataset_name = "modified_dataset_1",
                                               value = interpolated_dataset_1$y)

# combining the data
modified_dataset <- rbind(dataset, interpolated_dataset_1_dataframe)

# the KS test
modified_test_result <- ks.test(
  modified_dataset$value[modified_dataset$dataset_name == "modified_dataset_1"],
  modified_dataset$value[modified_dataset$dataset_name == "dataset_2"],
  alternative = "less")

# the ECDFs
library(ggplot2)
modified_dataset |>
  ggplot(aes(x = value, group = dataset_name, color = dataset_name)) +
  stat_ecdf(linewidth = 2)  # use size = 2 on ggplot2 versions before 3.4.0
```
The D statistic is almost the same as before but not identical, although the result is still significant.

Is there a better way to do this, using a step function, that gives exactly the same test statistic?
Answer 1

Score: 1
Update

I think `quantile` should be helpful for your purpose. It is more efficient than the previous solution (`ecdf` + `uniroot`):
```r
dstat2 <- function(p, df1 = dataset_1, df2 = dataset_2) {
  # absolute difference between the p-th quantiles of the two samples
  abs(quantile(df1$value, p) - quantile(df2$value, p))
}
```
such that

```
> dstat2(0.25)
   25%
1242.5
> dstat2(0.5)
 50%
2495
> dstat2(0.75)
   75%
3747.5
```
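To get the data frame the question asks for, the same idea can be applied to a whole grid of probabilities at once. The following is only a minimal sketch: `horizontal_distances()` and the 0.05-spaced grid are illustrative choices, not part of the original answer. It also assumes `type = 1` in `quantile()`, which inverts the empirical step function instead of interpolating linearly as the default `type = 7` does.

```r
# Example data from the question
dataset_1 <- data.frame(dataset_name = "dataset_1",
                        value = seq(40, 10000, length.out = 261))
dataset_2 <- data.frame(dataset_name = "dataset_2",
                        value = seq(50, 5000, length.out = 373))

# Hypothetical helper: horizontal distance between two ECDFs at probabilities p.
# type = 1 is an assumption here (inverse of the empirical distribution function).
horizontal_distances <- function(p, x1, x2, type = 1) {
  q1 <- quantile(x1, probs = p, type = type, names = FALSE)
  q2 <- quantile(x2, probs = p, type = type, names = FALSE)
  data.frame(p = p, q1 = q1, q2 = q2, distance = abs(q1 - q2))
}

# Illustrative probability grid; any vector of values in [0, 1] works
p_grid <- seq(0.05, 0.95, by = 0.05)
distances <- horizontal_distances(p_grid, dataset_1$value, dataset_2$value)
head(distances)
```

Because both quantiles are computed from the original samples, the unequal sample sizes do not matter and no interpolation of either dataset is needed.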
Here is a solution using `ecdf` + `uniroot`:
```r
dstat <- function(p, df1 = dataset_1, df2 = dataset_2) {
  abs(
    diff(
      sapply(
        list(df1, df2),
        \(v) {
          with(
            v,
            # find the x at which the ECDF crosses probability p
            uniroot(
              \(x) ecdf(value)(x) - p,
              range(value)
            )$root
          )
        }
      )
    )
  )
}
```
and we can obtain

```
> dstat(0.25)
[1] 1242.5
> dstat(0.5)
[1] 2495
> dstat(0.75)
[1] 3747.5
```