英文:
Filter a grouped variable from a dataset based on the range values of another dataset using dplyr
问题
I want to take the values of a (large) data frame:
我想获取(大型)数据帧的值:
library(tidyverse)
df.grid = expand.grid(x = letters, y = 1:60)
head(df.grid)
Which eventually reaches a 2
, a 3
, etc.
最终会到达 a 2
,a 3
等等。
And I have a second data frame which contains some variables (x) that I want just part of a range (min max) that is different for each "x" variables:
我有第二个数据帧,其中包含一些变量(x),我只想要每个 "x" 变量的一部分范围(最小和最大值)都不同:
sub.data = data.frame(x = c("a", "c", "d"), min = c(2, 50, 25), max = c(6, 53, 30))
sub.data
The output should look like something like this:
输出应该类似于这样:
x y
1 a 2
2 a 3
3 a 4
4 a 5
5 a 6
6 c 50
7 c 51
8 c 52
9 c 53
10 d 25
11 d 26
12 d 27
13 d 28
14 d 29
15 d 30
I've tried this:
我尝试过这个:
df.grid %>%
group_by(x) %>%
filter_if(y > sub.data$min)
But it doesn't work as the min column has multiple values and the "if" part complains.
但它不起作用,因为最小列有多个值,而 "if" 部分出现问题。
I also found this post, but it doesn't seem to work for me as there is no "matching" variables to guide the filtering process.
我还找到了这篇帖子,但似乎不适用于我,因为没有“匹配”变量来指导筛选过程。
I want to avoid using for loops since I want to apply this to a data frame that is 11GB in size.
我想避免使用for循环,因为我想将其应用于大小为11GB的数据帧。
英文:
I want to take the values of a (large) data frame:
library(tidyverse)
df.grid = expand.grid(x = letters, y = 1:60)
head(df.grid)
x y
1 a 1
2 b 1
3 c 1
4 d 1
5 e 1
6 f 1
[...]
Which eventually reaches a 2
, a 3
, etc.
And I have a second data frame which contains some variables (x) that I want just part of a range (min max) that is different for each "x" variables
sub.data = data.frame(x = c("a","c","d"), min = c(2,50,25), max = c(6,53,30))
sub.data
x min max
1 a 2 6
2 c 50 53
3 d 25 30
The output should look like something like this:
x y
1 a 2
2 a 3
3 a 4
4 a 5
5 a 6
6 c 50
7 c 51
8 c 52
9 c 53
10 d 25
11 d 26
12 d 27
13 d 28
14 d 29
15 d 30
I've tried this:
df.grid %>%
group_by(x) %>%
filter_if(y > sub.data$min)
But it doesn't work as the min column has multiple values and the 'if' part complains.
I also found this post, but it doesn't seem to work for me as there is no 'matching' variables to guide the filtering process.
I want to avoid using for loops since I want to apply this to a data frame that is 11GB in size.
答案1
得分: 2
We could use a non-equi join
library(data.table)
setDT(df.grid)[, y1 := y][sub.data, .(x, y), on = .(x, y1 >= min, y1 <= max)]
-output
x y
1: a 2
2: a 3
3: a 4
4: a 5
5: a 6
6: c 50
7: c 51
8: c 52
9: c 53
10: d 25
11: d 26
12: d 27
13: d 28
14: d 29
15: d 30
With dplyr
version 1.1.0
, we could also use non-equi
joins with join_by
library(dplyr)
inner_join(df.grid, sub.data, by = join_by(x, y >= min, y <= max)) %>%
select(x, y)
-output
x y
1 a 2
2 a 3
3 a 4
4 a 5
5 a 6
6 d 25
7 d 26
8 d 27
9 d 28
10 d 29
11 d 30
12 c 50
13 c 51
14 c 52
15 c 53
Or as @Davis Vaughan mentioned, use between
with a left_join
left_join(sub.data, df.grid, by = join_by(x, between(y$y, x$min,
x$max))) %>%
select(names(df.grid))
英文:
We could use a non-equi join
library(data.table)
setDT(df.grid)[, y1 := y][sub.data, .(x, y), on = .(x, y1 >= min, y1 <= max)]
-output
x y
1: a 2
2: a 3
3: a 4
4: a 5
5: a 6
6: c 50
7: c 51
8: c 52
9: c 53
10: d 25
11: d 26
12: d 27
13: d 28
14: d 29
15: d 30
With dplyr
version 1.1.0
, we could also use non-equi
joins with join_by
library(dplyr)
inner_join(df.grid, sub.data, by = join_by(x, y >= min , y <= max)) %>%
select(x, y)
-output
x y
1 a 2
2 a 3
3 a 4
4 a 5
5 a 6
6 d 25
7 d 26
8 d 27
9 d 28
10 d 29
11 d 30
12 c 50
13 c 51
14 c 52
15 c 53
Or as @Davis Vaughan mentioned, use between
with a left_joion
left_join(sub.data, df.grid, by = join_by(x, between(y$y, x$min,
x$max))) %>%
select(names(df.grid))
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论