Filter numbers that are closest to target values and eliminate duplicated observations

Question

I have this dataframe:

data_a <- read.csv(text = "
date,treatment,stage
1,a,1
2,a,10
3,a,20
4,a,30
5,a,60
6,a,70
7,a,89
8,a,91
9,a,92
1,b,1
2,b,10
3,b,20
4,b,30
5,b,59.8
6,b,60.2
7,b,88.8
8,b,90.2
9,b,92
1,c,1
2,c,10
3,c,20
4,c,60
5,c,66
6,c,70
7,c,80
8,c,85
9,c,85")

Within each treatment, I need to keep the observations whose stage matches 10, 60, and 89 (or, where there is no exact match, the single observation closest to each target value). The code I have is this:

library(dplyr)

filtered_data <- data_a %>%
  group_by(treatment) %>%
  filter(abs(stage - 10) == min(abs(stage - 10)) |
         abs(stage - 60) == min(abs(stage - 60)) |
         abs(stage - 89) == min(abs(stage - 89)))

This code partially does the trick, but there are problems for treatments b and c.

In b, two observations (59.8 and 60.2) are equally distant from the target 60, so both are kept, which is not desired.

In c, two observations have the same value (85) and are the closest to the target 89, so both are selected, which is also not desired.
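
To see exactly where the filter above returns extra rows, here is a minimal base R sketch. It is not part of the original post, and the helper count_ties and the object tie_counts are names made up for illustration. It rebuilds the same data and counts, per treatment and target, how many rows sit at the minimal distance:

```r
# Same values as data_a in the question, built directly so the block runs on its own.
data_a <- data.frame(
  date = rep(1:9, times = 3),
  treatment = rep(c("a", "b", "c"), each = 9),
  stage = c(1, 10, 20, 30, 60, 70, 89, 91, 92,
            1, 10, 20, 30, 59.8, 60.2, 88.8, 90.2, 92,
            1, 10, 20, 60, 66, 70, 80, 85, 85)
)

targets <- c(10, 60, 89)

# Hypothetical helper: for one treatment's stage values, count the rows
# that share the minimal distance to each target.
count_ties <- function(stage) {
  vapply(targets, function(target) {
    d <- abs(stage - target)
    sum(d == min(d))
  }, integer(1))
}

# One column per treatment, one row per target.
tie_counts <- sapply(split(data_a$stage, data_a$treatment), count_ties)
rownames(tie_counts) <- targets
tie_counts
```

Any count above 1 marks a group where the == comparison keeps more than one row. With this data that is target 60 in b and target 89 in c.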

The desired output is this:

filtered_data <- read.csv(text = "
date,treatment,stage
2,a,10
5,a,60
7,a,89
2,b,10
5,b,59.8
7,b,88.8
2,c,10
4,c,60
8,c,85")

Answer 1

Score: 2

I would do it thusly

library(tidyverse)

crossing(
  data_a,
  target_stage = c(10, 60, 89)
  ) %>%
  group_by(treatment, target_stage) %>%
  slice_min(
    abs(stage - target_stage),
    with_ties = FALSE
    )
#> # A tibble: 9 × 4
#> # Groups:   treatment, target_stage [9]
#>    date treatment stage target_stage
#>   <int> <chr>     <dbl>        <dbl>
#> 1     2 a          10             10
#> 2     5 a          60             60
#> 3     7 a          89             89
#> 4     2 b          10             10
#> 5     5 b          59.8           60
#> 6     7 b          88.8           89
#> 7     2 c          10             10
#> 8     4 c          60             60
#> 9     8 c          85             89

<sup>Created on 2023-05-22 with reprex v2.0.2</sup>

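Expanding the data with crossing() gives one copy of every row per target. You can then group by treatment and target_stage and take the row with the smallest distance, and with_ties = FALSE keeps only the first row when several are tied.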
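
For readers without the tidyverse, the same expand-then-pick-one idea can be sketched in base R. This is only an illustration, not the answer's code: merge(..., by = NULL) stands in for crossing(), and ordering plus !duplicated() stands in for slice_min(with_ties = FALSE). The column name dist and the explicit tie-break on date are assumptions added here.

```r
# Same values as data_a in the question, built directly so the block runs on its own.
data_a <- data.frame(
  date = rep(1:9, times = 3),
  treatment = rep(c("a", "b", "c"), each = 9),
  stage = c(1, 10, 20, 30, 60, 70, 89, 91, 92,
            1, 10, 20, 30, 59.8, 60.2, 88.8, 90.2, 92,
            1, 10, 20, 60, 66, 70, 80, 85, 85)
)

# Cartesian product: every row of data_a paired with every target (27 x 3 = 81 rows).
grid <- merge(data_a, data.frame(target_stage = c(10, 60, 89)), by = NULL)
grid$dist <- abs(grid$stage - grid$target_stage)

# Sort so the closest row comes first within each treatment/target pair;
# rows tied on distance fall back to the earliest date (an assumed tie-break).
grid <- grid[order(grid$treatment, grid$target_stage, grid$dist, grid$date), ]

# Keep only the first row of each treatment/target pair.
nearest <- grid[!duplicated(grid[c("treatment", "target_stage")]), ]
nearest[c("date", "treatment", "stage", "target_stage")]
```

Sorting on date as the last key makes the tie-break explicit instead of relying on the row order of the input.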

Answer 2

Score: 1

Using dplyr and purrr:

library(dplyr)
library(purrr)

map_dfr(c(10, 60, 89),
        ~ data_a %>%
          filter(abs(stage - .x) == min(abs(stage - .x)),
                 .by = treatment) %>%
          slice_min(stage, n = 1, with_ties = FALSE, by = treatment)) %>%
  arrange(treatment, date)

#>   date treatment stage
#> 1    2         a  10.0
#> 2    5         a  60.0
#> 3    7         a  89.0
#> 4    2         b  10.0
#> 5    5         b  59.8
#> 6    7         b  88.8
#> 7    2         c  10.0
#> 8    4         c  60.0
#> 9    8         c  85.0

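Using data.table (note that the rolling join resolves the tie in treatment c differently: the output has date 9 rather than the desired date 8):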

library(data.table)

setDT(data_a)[
  data_a[CJ(stage = c(10, 60, 89), treatment = unique(data_a$treatment)),
                on = .(treatment, stage),
                roll = "nearest",
                .(date, treatment)],
  on = .(treatment, date)][
    order(treatment, date)]

#>    date treatment stage
#> 1:    2         a  10.0
#> 2:    5         a  60.0
#> 3:    7         a  89.0
#> 4:    2         b  10.0
#> 5:    5         b  59.8
#> 6:    7         b  88.8
#> 7:    2         c  10.0
#> 8:    4         c  60.0
#> 9:    9         c  85.0

Answer 3

Score: 1

You can use outer() + max.col() to find, for each of the targets 10, 60, and 89, the position of the closest stage value (or of the farthest one, if you drop the negation).

library(dplyr)

data_a %>%
  slice({
    mat <- abs(outer(c(10, 60, 89), stage, '-'))
    max.col(-mat, "first")
  }, .by = treatment)

#   date treatment stage
# 1    2         a  10.0
# 2    5         a  60.0
# 3    7         a  89.0
# 4    2         b  10.0
# 5    5         b  59.8
# 6    7         b  88.8
# 7    2         c  10.0
# 8    4         c  60.0
# 9    8         c  85.0
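
To make the tie handling concrete, here is a small base R sketch on the stage values of treatment c alone (the variable names idx_first and idx_last are made up for illustration). Each row of the matrix belongs to one target and each column to one observation, so max.col() on the negated distances returns, per target, the column of the nearest observation:

```r
stage <- c(1, 10, 20, 60, 66, 70, 80, 85, 85)  # treatment c from the question
targets <- c(10, 60, 89)

# 3 x 9 matrix of absolute distances: rows are targets, columns are observations.
mat <- abs(outer(targets, stage, "-"))

# Negating turns "smallest distance" into "largest value", which max.col() locates.
idx_first <- max.col(-mat, ties.method = "first")  # earliest observation wins a tie
idx_last  <- max.col(-mat, ties.method = "last")   # latest observation wins a tie

stage[idx_first]
idx_first
idx_last
```

The two index vectors differ only for target 89, where both 85s are equally close: "first" picks position 8 and "last" picks position 9.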

Posted by huangapple on 2023-05-22 22:45:31
When reposting, please keep the link to this article: https://go.coder-hub.com/76307361.html