2023年2月10日 06:07:31go评论103阅读模式

英文:

Filter a grouped variable from a dataset based on the range values of another dataset using dplyr

问题

I want to take the values of a (large) data frame:

我想获取（大型）数据帧的值：

library(tidyverse)
df.grid = expand.grid(x = letters, y = 1:60)
head(df.grid)

Which eventually reaches a 2, a 3, etc.

最终会到达 a 2，a 3 等等。

And I have a second data frame which contains some variables (x) that I want just part of a range (min max) that is different for each "x" variables:

我有第二个数据帧，其中包含一些变量（x），我只想要每个 "x" 变量的一部分范围（最小和最大值）都不同：

sub.data = data.frame(x = c("a", "c", "d"), min = c(2, 50, 25), max = c(6, 53, 30))
sub.data

The output should look like something like this:

输出应该类似于这样：

     x    y
1    a    2
2    a    3
3    a    4
4    a    5
5    a    6
6    c   50
7    c   51
8    c   52
9    c   53
10   d   25
11   d   26
12   d   27
13   d   28
14   d   29
15   d   30

I've tried this:

我尝试过这个：

df.grid %>%
  group_by(x) %>%
  filter_if(y > sub.data$min)

But it doesn't work as the min column has multiple values and the "if" part complains.

但它不起作用，因为最小列有多个值，而 "if" 部分出现问题。

I also found this post, but it doesn't seem to work for me as there is no "matching" variables to guide the filtering process.

我还找到了这篇帖子，但似乎不适用于我，因为没有“匹配”变量来指导筛选过程。

I want to avoid using for loops since I want to apply this to a data frame that is 11GB in size.

我想避免使用for循环，因为我想将其应用于大小为11GB的数据帧。

英文:

I want to take the values of a (large) data frame:

library(tidyverse)
df.grid = expand.grid(x = letters, y = 1:60)
head(df.grid)
  x y
1 a 1
2 b 1
3 c 1
4 d 1
5 e 1
6 f 1
[...]

Which eventually reaches a 2, a 3, etc.

And I have a second data frame which contains some variables (x) that I want just part of a range (min max) that is different for each "x" variables

sub.data = data.frame(x = c(&quot;a&quot;,&quot;c&quot;,&quot;d&quot;), min = c(2,50,25), max = c(6,53,30))
sub.data
  x min max
1 a   2   6
2 c  50  53
3 d  25  30

The output should look like something like this:

     x       y
1    a       2
2    a       3
3    a       4
4    a       5
5    a       6
6    c      50
7    c      51
8    c      52
9    c      53
10   d      25
11   d      26
12   d      27
13   d      28
14   d      29
15   d      30

I've tried this:

df.grid %&gt;% 
  group_by(x) %&gt;% 
  filter_if(y &gt; sub.data$min)

But it doesn't work as the min column has multiple values and the 'if' part complains.

I also found this post, but it doesn't seem to work for me as there is no 'matching' variables to guide the filtering process.

I want to avoid using for loops since I want to apply this to a data frame that is 11GB in size.

答案1

得分: 2

We could use a non-equi join

library(data.table)
setDT(df.grid)[, y1 := y][sub.data, .(x, y), on = .(x, y1 >= min, y1 <= max)]

-output

    x  y
 1: a  2
 2: a  3
 3: a  4
 4: a  5
 5: a  6
 6: c 50
 7: c 51
 8: c 52
 9: c 53
10: d 25
11: d 26
12: d 27
13: d 28
14: d 29
15: d 30

With dplyr version 1.1.0, we could also use non-equi joins with join_by

library(dplyr)
inner_join(df.grid, sub.data, by = join_by(x, y >= min, y <= max)) %>%
    select(x, y)

-output

Or as @Davis Vaughan mentioned, use between with a left_join

left_join(sub.data, df.grid, by = join_by(x, between(y$y, x$min, 
     x$max))) %>%
    select(names(df.grid))

英文:

We could use a non-equi join

library(data.table)
setDT(df.grid)[, y1 := y][sub.data, .(x, y), on = .(x, y1 &gt;= min, y1 &lt;= max)]

-output

    x  y
 1: a  2
 2: a  3
 3: a  4
 4: a  5
 5: a  6
 6: c 50
 7: c 51
 8: c 52
 9: c 53
10: d 25
11: d 26
12: d 27
13: d 28
14: d 29
15: d 30

With dplyr version 1.1.0, we could also use non-equi joins with join_by

library(dplyr)
inner_join(df.grid, sub.data, by = join_by(x, y &gt;= min , y &lt;= max)) %&gt;%
    select(x, y)

-output

Or as @Davis Vaughan mentioned, use between with a left_joion

left_join(sub.data, df.grid, by = join_by(x, between(y$y, x$min, 
     x$max))) %&gt;%
    select(names(df.grid))

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

使用dplyr根据另一个数据集的范围值筛选数据集中的分组变量。

问题

答案1

Boxplot with additional lines for 10th and 90th percentile in R

使用Rvest进行网页抓取

增加按组重复数字的列

颜色空间转换用于图像的R代码部分

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

发表评论