Filter numbers that are closest to target values and eliminate duplicated observations
Question
I have this dataframe:
data_a <- read.csv(text = "
date,treatment,stage
1,a,1
2,a,10
3,a,20
4,a,30
5,a,60
6,a,70
7,a,89
8,a,91
9,a,92
1,b,1
2,b,10
3,b,20
4,b,30
5,b,59.8
6,b,60.2
7,b,88.8
8,b,90.2
9,b,92
1,c,1
2,c,10
3,c,20
4,c,60
5,c,66
6,c,70
7,c,80
8,c,85
9,c,85")
Within each treatment, I need to keep the observations whose stage matches 10, 60, and 89 (or, where there is no exact match, the single observation closest to each target value). The code I have is this:
library(dplyr)

filtered_data <- data_a %>%
  group_by(treatment) %>%
  filter(abs(stage - 10) == min(abs(stage - 10)) |
           abs(stage - 60) == min(abs(stage - 60)) |
           abs(stage - 89) == min(abs(stage - 89)))
This code partially does the trick, but there are problems for treatments b and c.
In b, two observations (59.8 and 60.2) are equally distant from the target 60, so both are kept, which is not desired.
In c, two observations share the same value (85), which is the closest to the target 89, so both are kept, which is also not desired.
The desired output is this:
filtered_data <- read.csv(text = "
date,treatment,stage
2,a,10
5,a,60
7,a,89
2,b,10
5,b,59.8
7,b,88.8
2,c,10
4,c,60
8,c,85")
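The tie in treatment c can be reproduced in base R on the stage column alone. This is a minimal sketch; `stage_c` and `d` are illustrative names that are not part of the original code. Comparing against `min()` with `==` returns every tied position, whereas `which.min()` returns only the first one.

```r
# Stage values of treatment "c", copied from the data above.
stage_c <- c(1, 10, 20, 60, 66, 70, 80, 85, 85)

# Absolute distance of every observation from the target 89.
d <- abs(stage_c - 89)

# Equality against the minimum keeps all tied positions (both 85s).
tied_idx <- which(d == min(d))

# which.min() keeps only the first position that attains the minimum.
first_idx <- which.min(d)

print(tied_idx)
print(first_idx)
```

Any fix therefore needs a step that keeps exactly one row per treatment and target, rather than every row that attains the minimum distance.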
Answer 1
Score: 2
I would do it like this:
library(tidyverse)

crossing(
  data_a,
  target_stage = c(10, 60, 89)
) %>%
  group_by(treatment, target_stage) %>%
  slice_min(
    abs(stage - target_stage),
    with_ties = FALSE
  )
#> # A tibble: 9 × 4
#> # Groups: treatment, target_stage [9]
#> date treatment stage target_stage
#> <int> <chr> <dbl> <dbl>
#> 1 2 a 10 10
#> 2 5 a 60 60
#> 3 7 a 89 89
#> 4 2 b 10 10
#> 5 5 b 59.8 60
#> 6 7 b 88.8 89
#> 7 2 c 10 10
#> 8 4 c 60 60
#> 9 8 c 85 89
Created on 2023-05-22 with reprex v2.0.2
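If you expand the data with crossing(), so that every row is paired with every target value, you can then group by treatment and target and take the row with the smallest distance. Setting with_ties = FALSE keeps only one row when distances are tied.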
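To see what the expansion step produces, here is a small sketch on a hypothetical three-row subset of treatment c. The names `toy`, `pairs`, and `nearest` are illustrative and do not appear in the answer; the calls are the same `crossing()` and `slice_min()` used above.

```r
library(dplyr)
library(tidyr)

# Hypothetical subset mirroring the last rows of treatment "c".
toy <- data.frame(
  date = c(7, 8, 9),
  treatment = "c",
  stage = c(80, 85, 85)
)

# Pair every row with every target: 3 rows x 2 targets = 6 rows.
pairs <- crossing(toy, target_stage = c(60, 89))

# Within each (treatment, target) group keep a single nearest row;
# with_ties = FALSE returns one row even when distances are equal.
nearest <- pairs %>%
  group_by(treatment, target_stage) %>%
  slice_min(abs(stage - target_stage), n = 1, with_ties = FALSE) %>%
  ungroup()

print(nearest)
```

For the target 89, dates 8 and 9 are both at distance 4, and only the row with date 8 is kept, which matches the tie handling in the full output above.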
Answer 2
Score: 1
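Using dplyr and purrr:
library(dplyr)
library(purrr)

# For each target, keep the nearest rows per treatment, then reduce ties to one row.
map_dfr(c(10, 60, 89),
        ~ data_a %>%
          filter(abs(stage - .x) == min(abs(stage - .x)),
                 .by = treatment) %>%
          # among tied rows, keep the one with the smaller stage
          slice_min(stage, n = 1, with_ties = FALSE, by = treatment)) %>%
  arrange(treatment, date)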
#> date treatment stage
#> 1 2 a 10.0
#> 2 5 a 60.0
#> 3 7 a 89.0
#> 4 2 b 10.0
#> 5 5 b 59.8
#> 6 7 b 88.8
#> 7 2 c 10.0
#> 8 4 c 60.0
#> 9 8 c 85.0
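Using data.table (note that setDT() converts data_a to a data.table by reference, and that in the output below the tie in treatment c resolves to date 9 rather than date 8 as in the desired output):
library(data.table)

setDT(data_a)[
  data_a[CJ(stage = c(10, 60, 89), treatment = unique(data_a$treatment)),
         on = .(treatment, stage),
         roll = "nearest",
         .(date, treatment)],
  on = .(treatment, date)][
    order(treatment, date)]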
#> date treatment stage
#> 1: 2 a 10.0
#> 2: 5 a 60.0
#> 3: 7 a 89.0
#> 4: 2 b 10.0
#> 5: 5 b 59.8
#> 6: 7 b 88.8
#> 7: 2 c 10.0
#> 8: 4 c 60.0
#> 9: 9 c 85.0
Answer 3
Score: 1
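You can use outer() + max.col() to find the values closest to (or, without the negation, farthest from) 10, 60, and 89.
library(dplyr)

data_a %>%
  slice({
    # rows = targets, columns = observations of the current treatment
    mat <- abs(outer(c(10, 60, 89), stage, '-'))
    # column index of the smallest distance per target; "first" breaks ties
    max.col(-mat, "first")
  }, .by = treatment)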
# date treatment stage
# 1 2 a 10.0
# 2 5 a 60.0
# 3 7 a 89.0
# 4 2 b 10.0
# 5 5 b 59.8
# 6 7 b 88.8
# 7 2 c 10.0
# 8 4 c 60.0
# 9 8 c 85.0
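The matrix step can be inspected on its own outside of slice(). The sketch below uses the stage values of treatment c; `targets`, `dist_mat`, and `nearest_idx` are illustrative names that are not part of the answer.

```r
# Stage values of treatment "c", copied from the question.
stage   <- c(1, 10, 20, 60, 66, 70, 80, 85, 85)
targets <- c(10, 60, 89)

# 3 x 9 matrix: entry [i, j] is |targets[i] - stage[j]|.
dist_mat <- abs(outer(targets, stage, "-"))

# max.col() gives, per row, the column index of the maximum.
# Negating the distances turns that into the index of the nearest value;
# ties.method = "first" keeps the earliest column among ties.
nearest_idx <- max.col(-dist_mat, ties.method = "first")

print(nearest_idx)
print(stage[nearest_idx])
```

Because slice() is called with .by = treatment, these indices are row positions within each treatment, which is why a single call returns one row per target per group.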