使用dplyr根据另一个数据集的范围值筛选数据集中的分组变量。

huangapple go评论68阅读模式
英文:

Filter a grouped variable from a dataset based on the range values of another dataset using dplyr

问题

I want to take the values of a (large) data frame:

我想获取(大型)数据帧的值:

library(tidyverse)
df.grid = expand.grid(x = letters, y = 1:60)
head(df.grid)

Which eventually reaches a 2, a 3, etc.

最终会到达 a 2a 3 等等。

And I have a second data frame which contains some variables (x) that I want just part of a range (min max) that is different for each "x" variables:

我有第二个数据帧,其中包含一些变量(x),我只想要每个 "x" 变量的一部分范围(最小和最大值)都不同:

sub.data = data.frame(x = c("a", "c", "d"), min = c(2, 50, 25), max = c(6, 53, 30))

sub.data

The output should look like something like this:

输出应该类似于这样:

     x    y
1    a    2
2    a    3
3    a    4
4    a    5
5    a    6
6    c   50
7    c   51
8    c   52
9    c   53
10   d   25
11   d   26
12   d   27
13   d   28
14   d   29
15   d   30

I've tried this:

我尝试过这个:

df.grid %>%
  group_by(x) %>%
  filter_if(y > sub.data$min)

But it doesn't work as the min column has multiple values and the "if" part complains.

但它不起作用,因为最小列有多个值,而 "if" 部分出现问题。

I also found this post, but it doesn't seem to work for me as there is no "matching" variables to guide the filtering process.

我还找到了这篇帖子,但似乎不适用于我,因为没有“匹配”变量来指导筛选过程。

I want to avoid using for loops since I want to apply this to a data frame that is 11GB in size.

我想避免使用for循环,因为我想将其应用于大小为11GB的数据帧。

英文:

I want to take the values of a (large) data frame:

library(tidyverse)
df.grid = expand.grid(x = letters, y = 1:60)
head(df.grid)

  x y
1 a 1
2 b 1
3 c 1
4 d 1
5 e 1
6 f 1
[...]

Which eventually reaches a 2, a 3, etc.

And I have a second data frame which contains some variables (x) that I want just part of a range (min max) that is different for each "x" variables

sub.data = data.frame(x = c("a","c","d"), min = c(2,50,25), max = c(6,53,30))

sub.data
  x min max
1 a   2   6
2 c  50  53
3 d  25  30

The output should look like something like this:

     x       y
1    a       2
2    a       3
3    a       4
4    a       5
5    a       6
6    c      50
7    c      51
8    c      52
9    c      53
10   d      25
11   d      26
12   d      27
13   d      28
14   d      29
15   d      30

I've tried this:

df.grid %>% 
  group_by(x) %>% 
  filter_if(y > sub.data$min)

But it doesn't work as the min column has multiple values and the 'if' part complains.

I also found this post, but it doesn't seem to work for me as there is no 'matching' variables to guide the filtering process.

I want to avoid using for loops since I want to apply this to a data frame that is 11GB in size.

答案1

得分: 2

We could use a non-equi join

library(data.table)
setDT(df.grid)[, y1 := y][sub.data, .(x, y), on = .(x, y1 >= min, y1 <= max)]

-output

    x  y
 1: a  2
 2: a  3
 3: a  4
 4: a  5
 5: a  6
 6: c 50
 7: c 51
 8: c 52
 9: c 53
10: d 25
11: d 26
12: d 27
13: d 28
14: d 29
15: d 30

With dplyr version 1.1.0, we could also use non-equi joins with join_by

library(dplyr)
inner_join(df.grid, sub.data, by = join_by(x, y >= min, y <= max)) %>%
    select(x, y)

-output

   x  y
1  a  2
2  a  3
3  a  4
4  a  5
5  a  6
6  d 25
7  d 26
8  d 27
9  d 28
10 d 29
11 d 30
12 c 50
13 c 51
14 c 52
15 c 53

Or as @Davis Vaughan mentioned, use between with a left_join

left_join(sub.data, df.grid, by = join_by(x, between(y$y, x$min, 
     x$max))) %>%
    select(names(df.grid))
英文:

We could use a non-equi join

library(data.table)
setDT(df.grid)[, y1 := y][sub.data, .(x, y), on = .(x, y1 &gt;= min, y1 &lt;= max)]

-output

    x  y
 1: a  2
 2: a  3
 3: a  4
 4: a  5
 5: a  6
 6: c 50
 7: c 51
 8: c 52
 9: c 53
10: d 25
11: d 26
12: d 27
13: d 28
14: d 29
15: d 30

With dplyr version 1.1.0, we could also use non-equi joins with join_by

library(dplyr)
inner_join(df.grid, sub.data, by = join_by(x, y &gt;= min , y &lt;= max)) %&gt;%
    select(x, y)

-output

   x  y
1  a  2
2  a  3
3  a  4
4  a  5
5  a  6
6  d 25
7  d 26
8  d 27
9  d 28
10 d 29
11 d 30
12 c 50
13 c 51
14 c 52
15 c 53

Or as @Davis Vaughan mentioned, use between with a left_joion

left_join(sub.data, df.grid, by = join_by(x, between(y$y, x$min, 
     x$max))) %&gt;%
    select(names(df.grid))

huangapple
  • 本文由 发表于 2023年2月10日 06:07:31
  • 转载请务必保留本文链接:https://go.coder-hub.com/75404937.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定