Dplyr filter using range across two columns

Question

Apologies if this is a duplicate, but I cannot seem to find what I am looking for.

I am trying to filter data into 4 subsets using a specified range between two columns. I have gotten as far as the following before getting stuck. I think the solution lies somewhere in between(), filter_all() or filter_if(), but I cannot get it to work. Any thoughts or solutions would be great, thank you. I am trying to split into the following periods: 1 - 1815-1865; 2 - 1865-1915; 3 - 1915-1965; and 4 - 1965-2013.

Expected output for P2 (1865-1915): as long as either the upper or lower threshold allows the age to potentially sit within the range, I would include it.

Code

## Read in data
dat <- read_csv("data/Iceland.csv")

dat <- rbind(dat)

## Arrange the whole data set by age

dat <- dat %>% arrange(dat)

## Calculate ranges for periods

dat$upper <- dat$year + dat$year_error
dat$lower <- dat$year - dat$year_error

### Split into periods

p1 <- dat %>% filter(between(upper:lower, 1815, 1865))

Data

structure(list(year = c(1800.13122, 1816.105967, 1836.292674, 
1840.330016, 1856.479382, 1864.566799), y = c(-0.41, -0.31, -0.27, 
-0.25, -0.15, -0.15), y_error = c(0.18, 0.19, 0.19, 0.19, 0.18, 
0.19), year_error = c(24.59408182, 29.61627391, 29.34693518, 
30.02607322, 34.75414444, 30.78615486), upper = c(1824.72530182, 
1845.72224091, 1865.63960918, 1870.35608922, 1891.23352644, 1895.35295386
), lower = c(1775.53713818, 1786.48969309, 1806.94573882, 1810.30394278, 
1821.72523756, 1833.78064414)), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

Answer 1

Score: 1

Unless I am misunderstanding what you are trying to do, I think you can do this using & (AND) inside filter(). A row's range overlaps a period when its upper bound is at or above the period start and its lower bound is at or below the period end. For your four sets this would be:

# P1
dat %>% filter(upper >= 1815 & lower <= 1865)

# P2
dat %>% filter(upper >= 1865 & lower <= 1915)

# P3 (empty)
dat %>% filter(upper >= 1915 & lower <= 1965)

# P4 (empty)
dat %>% filter(upper >= 1965 & lower <= 2013)

This is the output for the "P2" set:

# A tibble: 4 × 6
   year     y y_error year_error upper lower
  <dbl> <dbl>   <dbl>      <dbl> <dbl> <dbl>
1 1836. -0.27    0.19       29.3 1866. 1807.
2 1840. -0.25    0.19       30.0 1870. 1810.
3 1856. -0.15    0.18       34.8 1891. 1822.
4 1865. -0.15    0.19       30.8 1895. 1834.

I am using >= and <= since you commented that the intervals should be inclusive.
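
The same inclusive overlap test can be wrapped in a small helper so that all four subsets are built in one pass. This is only a sketch under stated assumptions: the helper name `overlaps_period` and the `starts` / `ends` vectors are made up for illustration, and the data is a trimmed copy of the question's sample with `upper` and `lower` recomputed.

```r
library(dplyr)

# Trimmed copy of the question's sample data; upper/lower are recomputed
dat <- tibble(
  year = c(1800.13122, 1816.105967, 1836.292674,
           1840.330016, 1856.479382, 1864.566799),
  year_error = c(24.59408182, 29.61627391, 29.34693518,
                 30.02607322, 34.75414444, 30.78615486)
) %>%
  mutate(upper = year + year_error, lower = year - year_error)

# Hypothetical helper: TRUE where [lower, upper] overlaps [start, end],
# with both ends inclusive
overlaps_period <- function(lower, upper, start, end) {
  upper >= start & lower <= end
}

# Period boundaries taken from the question
starts <- c(p1 = 1815, p2 = 1865, p3 = 1915, p4 = 1965)
ends   <- c(p1 = 1865, p2 = 1915, p3 = 1965, p4 = 2013)

# One filtered tibble per period, returned as a named list
subsets <- Map(
  function(s, e) filter(dat, overlaps_period(lower, upper, s, e)),
  starts, ends
)

# Number of rows that fall into each period
sapply(subsets, nrow)
```

With the sample data, `subsets$p2` holds the same four rows as the P2 output above, and `subsets$p3` and `subsets$p4` are empty.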

Answer 2

Score: 0

We could also set up a data.frame / tibble with periods and join it with join_by(overlaps()):

library(dplyr)
# adjust sigfig for printing
options(pillar.sigfig = 6)
# define periods:
periods <- tribble(~period, ~start, ~end,
                         1,   1815, 1865, 
                         2,   1865, 1915,
                         3,   1915, 1965,
                         4,   1965, 2013)
# join by overlaps, default bounds are [], i.e. <= and >=
dat <- dat %>% left_join(periods, join_by(overlaps(lower, upper, start, end)))
dat
#> # A tibble: 10 × 9
#>       year     y y_error year_error   upper   lower period start   end
#>      <dbl> <dbl>   <dbl>      <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl>
#>  1 1800.13 -0.41    0.18    24.5941 1824.73 1775.54      1  1815  1865
#>  2 1816.11 -0.31    0.19    29.6163 1845.72 1786.49      1  1815  1865
#>  3 1836.29 -0.27    0.19    29.3469 1865.64 1806.95      1  1815  1865
#>  4 1836.29 -0.27    0.19    29.3469 1865.64 1806.95      2  1865  1915
#>  5 1840.33 -0.25    0.19    30.0261 1870.36 1810.30      1  1815  1865
#>  6 1840.33 -0.25    0.19    30.0261 1870.36 1810.30      2  1865  1915
#>  7 1856.48 -0.15    0.18    34.7541 1891.23 1821.73      1  1815  1865
#>  8 1856.48 -0.15    0.18    34.7541 1891.23 1821.73      2  1865  1915
#>  9 1864.57 -0.15    0.19    30.7862 1895.35 1833.78      1  1815  1865
#> 10 1864.57 -0.15    0.19    30.7862 1895.35 1833.78      2  1865  1915

# for p2:
dat %>% filter(period == 2)
#> # A tibble: 4 × 9
#>      year     y y_error year_error   upper   lower period start   end
#>     <dbl> <dbl>   <dbl>      <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl>
#> 1 1836.29 -0.27    0.19    29.3469 1865.64 1806.95      2  1865  1915
#> 2 1840.33 -0.25    0.19    30.0261 1870.36 1810.30      2  1865  1915
#> 3 1856.48 -0.15    0.18    34.7541 1891.23 1821.73      2  1865  1915
#> 4 1864.57 -0.15    0.19    30.7862 1895.35 1833.78      2  1865  1915

Input data:

dat <- structure(list(year = c(1800.13122, 1816.105967, 1836.292674, 
1840.330016, 1856.479382, 1864.566799), y = c(-0.41, -0.31, -0.27, 
-0.25, -0.15, -0.15), y_error = c(0.18, 0.19, 0.19, 0.19, 0.18, 
0.19), year_error = c(24.59408182, 29.61627391, 29.34693518, 
30.02607322, 34.75414444, 30.78615486), upper = c(1824.72530182, 
1845.72224091, 1865.63960918, 1870.35608922, 1891.23352644, 1895.35295386
), lower = c(1775.53713818, 1786.48969309, 1806.94573882, 1810.30394278, 
1821.72523756, 1833.78064414)), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

Created on 2023-06-26 with reprex v2.0.2
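
Because a date range can straddle a period boundary, the overlap join returns one row per matching (observation, period) pair, which is why six observations become ten rows. The sketch below shows one possible way to count the matches and split the joined result into one tibble per period. It assumes dplyr 1.1.0 or later (needed for `join_by()` and `overlaps()`); the names `joined` and `by_period` are illustrative, and the data is a trimmed copy of the sample.

```r
library(dplyr)

# Trimmed copy of the sample data; lower/upper are recomputed
dat <- tibble(
  year = c(1800.13122, 1816.105967, 1836.292674,
           1840.330016, 1856.479382, 1864.566799),
  year_error = c(24.59408182, 29.61627391, 29.34693518,
                 30.02607322, 34.75414444, 30.78615486)
) %>%
  mutate(lower = year - year_error, upper = year + year_error)

periods <- tribble(~period, ~start, ~end,
                         1,   1815, 1865,
                         2,   1865, 1915,
                         3,   1915, 1965,
                         4,   1965, 2013)

# Overlap join: each observation is repeated once per period it touches
joined <- dat %>%
  left_join(periods, join_by(overlaps(lower, upper, start, end)))

# Matches per period; periods without any match do not appear in the count
joined %>% count(period)

# Named list with one tibble per matched period ("1" and "2" for this data)
by_period <- split(joined, joined$period)
by_period[["2"]]
```

Compared with four separate filter() calls, the period definitions live in a single table, so adding or changing a period only means editing `periods`.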

huangapple
  • Posted on June 26, 2023, 22:04:44
  • When reposting, please retain the link to this article: https://go.coder-hub.com/76557437.html