使用dplyr根据另一个数据集的范围值筛选数据集中的分组变量。

huangapple go评论103阅读模式
英文:

Filter a grouped variable from a dataset based on the range values of another dataset using dplyr

问题

I want to take the values of a (large) data frame:

我想获取(大型)数据帧的值:

  1. library(tidyverse)
  2. df.grid = expand.grid(x = letters, y = 1:60)
  3. head(df.grid)

Which eventually reaches a 2, a 3, etc.

最终会到达 a 2a 3 等等。

And I have a second data frame which contains some variables (x) that I want just part of a range (min max) that is different for each "x" variables:

我有第二个数据帧,其中包含一些变量(x),我只想要每个 "x" 变量的一部分范围(最小和最大值)都不同:

  1. sub.data = data.frame(x = c("a", "c", "d"), min = c(2, 50, 25), max = c(6, 53, 30))
  2. sub.data

The output should look like something like this:

输出应该类似于这样:

  1. x y
  2. 1 a 2
  3. 2 a 3
  4. 3 a 4
  5. 4 a 5
  6. 5 a 6
  7. 6 c 50
  8. 7 c 51
  9. 8 c 52
  10. 9 c 53
  11. 10 d 25
  12. 11 d 26
  13. 12 d 27
  14. 13 d 28
  15. 14 d 29
  16. 15 d 30

I've tried this:

我尝试过这个:

  1. df.grid %>%
  2. group_by(x) %>%
  3. filter_if(y > sub.data$min)

But it doesn't work as the min column has multiple values and the "if" part complains.

但它不起作用,因为最小列有多个值,而 "if" 部分出现问题。

I also found this post, but it doesn't seem to work for me as there is no "matching" variables to guide the filtering process.

我还找到了这篇帖子,但似乎不适用于我,因为没有“匹配”变量来指导筛选过程。

I want to avoid using for loops since I want to apply this to a data frame that is 11GB in size.

我想避免使用for循环,因为我想将其应用于大小为11GB的数据帧。

英文:

I want to take the values of a (large) data frame:

  1. library(tidyverse)
  2. df.grid = expand.grid(x = letters, y = 1:60)
  3. head(df.grid)
  4. x y
  5. 1 a 1
  6. 2 b 1
  7. 3 c 1
  8. 4 d 1
  9. 5 e 1
  10. 6 f 1
  11. [...]

Which eventually reaches a 2, a 3, etc.

And I have a second data frame which contains some variables (x) that I want just part of a range (min max) that is different for each "x" variables

  1. sub.data = data.frame(x = c("a","c","d"), min = c(2,50,25), max = c(6,53,30))
  2. sub.data
  3. x min max
  4. 1 a 2 6
  5. 2 c 50 53
  6. 3 d 25 30

The output should look like something like this:

  1. x y
  2. 1 a 2
  3. 2 a 3
  4. 3 a 4
  5. 4 a 5
  6. 5 a 6
  7. 6 c 50
  8. 7 c 51
  9. 8 c 52
  10. 9 c 53
  11. 10 d 25
  12. 11 d 26
  13. 12 d 27
  14. 13 d 28
  15. 14 d 29
  16. 15 d 30

I've tried this:

  1. df.grid %>%
  2. group_by(x) %>%
  3. filter_if(y > sub.data$min)

But it doesn't work as the min column has multiple values and the 'if' part complains.

I also found this post, but it doesn't seem to work for me as there is no 'matching' variables to guide the filtering process.

I want to avoid using for loops since I want to apply this to a data frame that is 11GB in size.

答案1

得分: 2

We could use a non-equi join

  1. library(data.table)
  2. setDT(df.grid)[, y1 := y][sub.data, .(x, y), on = .(x, y1 >= min, y1 <= max)]

-output

  1. x y
  2. 1: a 2
  3. 2: a 3
  4. 3: a 4
  5. 4: a 5
  6. 5: a 6
  7. 6: c 50
  8. 7: c 51
  9. 8: c 52
  10. 9: c 53
  11. 10: d 25
  12. 11: d 26
  13. 12: d 27
  14. 13: d 28
  15. 14: d 29
  16. 15: d 30

With dplyr version 1.1.0, we could also use non-equi joins with join_by

  1. library(dplyr)
  2. inner_join(df.grid, sub.data, by = join_by(x, y >= min, y <= max)) %>%
  3. select(x, y)

-output

  1. x y
  2. 1 a 2
  3. 2 a 3
  4. 3 a 4
  5. 4 a 5
  6. 5 a 6
  7. 6 d 25
  8. 7 d 26
  9. 8 d 27
  10. 9 d 28
  11. 10 d 29
  12. 11 d 30
  13. 12 c 50
  14. 13 c 51
  15. 14 c 52
  16. 15 c 53

Or as @Davis Vaughan mentioned, use between with a left_join

  1. left_join(sub.data, df.grid, by = join_by(x, between(y$y, x$min,
  2. x$max))) %>%
  3. select(names(df.grid))
英文:

We could use a non-equi join

  1. library(data.table)
  2. setDT(df.grid)[, y1 := y][sub.data, .(x, y), on = .(x, y1 &gt;= min, y1 &lt;= max)]

-output

  1. x y
  2. 1: a 2
  3. 2: a 3
  4. 3: a 4
  5. 4: a 5
  6. 5: a 6
  7. 6: c 50
  8. 7: c 51
  9. 8: c 52
  10. 9: c 53
  11. 10: d 25
  12. 11: d 26
  13. 12: d 27
  14. 13: d 28
  15. 14: d 29
  16. 15: d 30

With dplyr version 1.1.0, we could also use non-equi joins with join_by

  1. library(dplyr)
  2. inner_join(df.grid, sub.data, by = join_by(x, y &gt;= min , y &lt;= max)) %&gt;%
  3. select(x, y)

-output

  1. x y
  2. 1 a 2
  3. 2 a 3
  4. 3 a 4
  5. 4 a 5
  6. 5 a 6
  7. 6 d 25
  8. 7 d 26
  9. 8 d 27
  10. 9 d 28
  11. 10 d 29
  12. 11 d 30
  13. 12 c 50
  14. 13 c 51
  15. 14 c 52
  16. 15 c 53

Or as @Davis Vaughan mentioned, use between with a left_joion

  1. left_join(sub.data, df.grid, by = join_by(x, between(y$y, x$min,
  2. x$max))) %&gt;%
  3. select(names(df.grid))

huangapple
  • 本文由 发表于 2023年2月10日 06:07:31
  • 转载请务必保留本文链接:https://go.coder-hub.com/75404937.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定