June 26, 2023, 22:04:44

Dplyr filter using range across two columns

Question

Apologies if this is a duplicate, but I cannot seem to find what I am looking for.

I am trying to filter data into four subsets using a specified range between two columns. I got as far as the following before getting stuck. I think the solution lies somewhere among between(), filter_all(), or filter_if(), but I cannot get it to work. Any thoughts or solutions would be great, thank you. I am trying to split the data into the following periods: 1: 1815-1865; 2: 1865-1915; 3: 1915-1965; and 4: 1965-2013.

Expected output for P2 (1865-1915): as long as either the upper or lower threshold could place the age within the range, I would include it.

Code

## Read in data
dat <- read_csv("data/Iceland.csv")
dat <- rbind(dat)
## Arrange the whole data set by age
dat <- dat %>% arrange(dat)
## Calculate ranges for periods
dat$upper <- dat$year + dat$year_error
dat$lower <- dat$year - dat$year_error
### Split into periods
p1 <- dat %>% filter(between(upper:lower, 1815, 1865))

Data

structure(list(year = c(1800.13122, 1816.105967, 1836.292674, 
1840.330016, 1856.479382, 1864.566799), y = c(-0.41, -0.31, -0.27, 
-0.25, -0.15, -0.15), y_error = c(0.18, 0.19, 0.19, 0.19, 0.18, 
0.19), year_error = c(24.59408182, 29.61627391, 29.34693518, 
30.02607322, 34.75414444, 30.78615486), upper = c(1824.72530182, 
1845.72224091, 1865.63960918, 1870.35608922, 1891.23352644, 1895.35295386
), lower = c(1775.53713818, 1786.48969309, 1806.94573882, 1810.30394278, 
1821.72523756, 1833.78064414)), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

Answer 1

Score: 1

Unless I am misunderstanding what you are trying to do, I think you can do this using & (AND) inside filter(). For your four sets this would be:

# P1
dat %>% filter(upper >= 1815 & lower <= 1865)
# P2
dat %>% filter(upper >= 1865 & lower <= 1915)
# P3 (empty)
dat %>% filter(upper >= 1915 & lower <= 1965)
# P4 (empty)
dat %>% filter(upper >= 1965 & lower <= 2013)

This is the output for the "P2" set:

# A tibble: 4 × 6
   year     y y_error year_error upper lower
  <dbl> <dbl>   <dbl>      <dbl> <dbl> <dbl>
1 1836. -0.27    0.19       29.3 1866. 1807.
2 1840. -0.25    0.19       30.0 1870. 1810.
3 1856. -0.15    0.18       34.8 1891. 1822.
4 1865. -0.15    0.19       30.8 1895. 1834.

I am using >= and <= since you commented that the intervals should be inclusive.
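
The overlap test above can be wrapped in a small helper so the four periods do not have to be written out by hand. This is only a sketch: `overlaps_period()`, `periods`, and `subsets` are hypothetical names introduced here, and `dat` is rebuilt from the question's sample data so the block runs on its own. It also hints at why `between(upper:lower, 1815, 1865)` in the question does not work: `upper:lower` builds one sequence from the first element of each column rather than a per-row range, so comparing the two bounds separately is the simpler route.

```r
library(dplyr)

# Sample data from the question (bounds recomputed from year and year_error)
dat <- tibble(
  year = c(1800.13122, 1816.105967, 1836.292674,
           1840.330016, 1856.479382, 1864.566799),
  year_error = c(24.59408182, 29.61627391, 29.34693518,
                 30.02607322, 34.75414444, 30.78615486)
) %>%
  mutate(upper = year + year_error,
         lower = year - year_error)

# Hypothetical helper: TRUE where [lower, upper] overlaps [start, end],
# with both ends inclusive
overlaps_period <- function(lower, upper, start, end) {
  upper >= start & lower <= end
}

# Period boundaries as given in the question
periods <- list(
  p1 = c(1815, 1865),
  p2 = c(1865, 1915),
  p3 = c(1915, 1965),
  p4 = c(1965, 2013)
)

# One filtered tibble per period, returned as a named list
subsets <- lapply(periods, function(p) {
  dat %>% filter(overlaps_period(lower, upper, p[1], p[2]))
})

# Number of rows that fall in each period
sapply(subsets, nrow)
```

With the sample data, `subsets$p2` holds the same four rows as the P2 output shown above, and `subsets$p3` and `subsets$p4` are empty.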

Answer 2

Score: 0

We could also set up a data.frame / tibble with periods and join it with join_by(overlaps()):

library(dplyr)
# adjust sigfig for printing
options(pillar.sigfig = 6)
# define periods:
periods <- tribble(~period, ~start, ~end,
                         1,   1815, 1865,
                         2,   1865, 1915,
                         3,   1915, 1965,
                         4,   1965, 2013)
# join by overlaps, default bounds are [], i.e. <= and >=
dat <- dat %>% left_join(periods, join_by(overlaps(lower, upper, start, end)))
dat
#> # A tibble: 10 × 9
#>       year     y y_error year_error   upper   lower period start   end
#>      <dbl> <dbl>   <dbl>      <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl>
#>  1 1800.13 -0.41    0.18    24.5941 1824.73 1775.54      1  1815  1865
#>  2 1816.11 -0.31    0.19    29.6163 1845.72 1786.49      1  1815  1865
#>  3 1836.29 -0.27    0.19    29.3469 1865.64 1806.95      1  1815  1865
#>  4 1836.29 -0.27    0.19    29.3469 1865.64 1806.95      2  1865  1915
#>  5 1840.33 -0.25    0.19    30.0261 1870.36 1810.30      1  1815  1865
#>  6 1840.33 -0.25    0.19    30.0261 1870.36 1810.30      2  1865  1915
#>  7 1856.48 -0.15    0.18    34.7541 1891.23 1821.73      1  1815  1865
#>  8 1856.48 -0.15    0.18    34.7541 1891.23 1821.73      2  1865  1915
#>  9 1864.57 -0.15    0.19    30.7862 1895.35 1833.78      1  1815  1865
#> 10 1864.57 -0.15    0.19    30.7862 1895.35 1833.78      2  1865  1915
# for p2:
dat %>% filter(period == 2)
#> # A tibble: 4 × 9
#>      year     y y_error year_error   upper   lower period start   end
#>     <dbl> <dbl>   <dbl>      <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl>
#> 1 1836.29 -0.27    0.19    29.3469 1865.64 1806.95      2  1865  1915
#> 2 1840.33 -0.25    0.19    30.0261 1870.36 1810.30      2  1865  1915
#> 3 1856.48 -0.15    0.18    34.7541 1891.23 1821.73      2  1865  1915
#> 4 1864.57 -0.15    0.19    30.7862 1895.35 1833.78      2  1865  1915

Input data:

dat <- structure(list(year = c(1800.13122, 1816.105967, 1836.292674, 
1840.330016, 1856.479382, 1864.566799), y = c(-0.41, -0.31, -0.27, 
-0.25, -0.15, -0.15), y_error = c(0.18, 0.19, 0.19, 0.19, 0.18, 
0.19), year_error = c(24.59408182, 29.61627391, 29.34693518, 
30.02607322, 34.75414444, 30.78615486), upper = c(1824.72530182, 
1845.72224091, 1865.63960918, 1870.35608922, 1891.23352644, 1895.35295386
), lower = c(1775.53713818, 1786.48969309, 1806.94573882, 1810.30394278, 
1821.72523756, 1833.78064414)), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

Created on 2023-06-26 with reprex v2.0.2
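
Note that `join_by()` and `overlaps()` were added in dplyr 1.1.0, so the join above needs at least that version. As a rough sketch under that assumption, the joined table can also be split into one tibble per period instead of filtering each period separately. The names `joined` and `by_period` are introduced here for illustration only, `dat` is rebuilt from the sample data so the block is self-contained, and `inner_join()` is swapped in for `left_join()` so that rows matching no period are dropped.

```r
library(dplyr)

# Sample data from the question (bounds recomputed from year and year_error)
dat <- tibble(
  year = c(1800.13122, 1816.105967, 1836.292674,
           1840.330016, 1856.479382, 1864.566799),
  year_error = c(24.59408182, 29.61627391, 29.34693518,
                 30.02607322, 34.75414444, 30.78615486)
) %>%
  mutate(upper = year + year_error,
         lower = year - year_error)

# Period definitions, as in the answer above
periods <- tribble(~period, ~start, ~end,
                         1,   1815, 1865,
                         2,   1865, 1915,
                         3,   1915, 1965,
                         4,   1965, 2013)

# inner_join keeps only rows whose [lower, upper] overlaps some period
joined <- dat %>%
  inner_join(periods, join_by(overlaps(lower, upper, start, end)))

# Named list with one tibble per period that has at least one match
by_period <- split(joined, joined$period)

names(by_period)
sapply(by_period, nrow)
```

Rows whose interval straddles 1865 match both period 1 and period 2, so they appear in both list elements; periods 3 and 4 have no matches in the sample data and therefore do not appear in `by_period`.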
