Dplyr filter using range across two columns
Question
Apologies if this is a duplicate, but I cannot seem to find what I am looking for.
I am trying to filter data into 4 subsets using a specified range between two columns. I have gotten as far as the following before getting stuck. I think the solution lies somewhere in between(), filter_all() or filter_if(), but I cannot get it to work. Any thoughts or solutions would be great, thank you. I am trying to split into the following periods: 1 - 1815-1865; 2 - 1865-1915; 3 - 1915-1965; and 4 - 1965-2013.
Expected output for P2 (1865-1915): as long as either the upper or lower threshold means the age could potentially sit within the range, I would include it.
Code
## Read in data
dat<-read_csv("data/Iceland.csv")
dat <- rbind(dat)
## Arrange the whole data set by age
dat <-dat %>% arrange(dat)
## Calculate ranges for periods
dat$upper<-dat$year+dat$year_error
dat$lower<-dat$year-dat$year_error
### Split into periods
p1<-dat %>% filter(between(upper:lower, 1815, 1865))
Data
structure(list(year = c(1800.13122, 1816.105967, 1836.292674,
1840.330016, 1856.479382, 1864.566799), y = c(-0.41, -0.31, -0.27,
-0.25, -0.15, -0.15), y_error = c(0.18, 0.19, 0.19, 0.19, 0.18,
0.19), year_error = c(24.59408182, 29.61627391, 29.34693518,
30.02607322, 34.75414444, 30.78615486), upper = c(1824.72530182,
1845.72224091, 1865.63960918, 1870.35608922, 1891.23352644, 1895.35295386
), lower = c(1775.53713818, 1786.48969309, 1806.94573882, 1810.30394278,
1821.72523756, 1833.78064414)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
Answer 1
Score: 1
Unless I am misunderstanding what you are trying to do, I think you can do this using & (AND) inside filter(). For your four sets this would be:
# P1
dat %>% filter(upper >= 1815 & lower <= 1865)
# P2
dat %>% filter(upper >= 1865 & lower <= 1915)
# P3 (empty)
dat %>% filter(upper >= 1915 & lower <= 1965)
# P4 (empty)
dat %>% filter(upper >= 1965 & lower <= 2013)
This is the output for the "P2" set:
# A tibble: 4 × 6
   year     y y_error year_error upper lower
  <dbl> <dbl>   <dbl>      <dbl> <dbl> <dbl>
1 1836. -0.27    0.19       29.3 1866. 1807.
2 1840. -0.25    0.19       30.0 1870. 1810.
3 1856. -0.15    0.18       34.8 1891. 1822.
4 1865. -0.15    0.19       30.8 1895. 1834.
I am using >= and <= since you commented that the intervals should be inclusive.
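The same overlap test can be wrapped in a small helper so that all four subsets are built in one pass. This is only a sketch: filter_period(), starts, ends and subsets are hypothetical names introduced here, and the data are rebuilt from the year and year_error values in the question.

```r
library(dplyr)

# Sample rows from the question; upper/lower are recomputed from year_error
dat <- tibble(
  year = c(1800.13122, 1816.105967, 1836.292674,
           1840.330016, 1856.479382, 1864.566799),
  year_error = c(24.59408182, 29.61627391, 29.34693518,
                 30.02607322, 34.75414444, 30.78615486)
) %>%
  mutate(upper = year + year_error, lower = year - year_error)

# Hypothetical helper: keep rows whose [lower, upper] interval
# overlaps the inclusive period [from, to]
filter_period <- function(df, from, to) {
  df %>% filter(upper >= from, lower <= to)
}

starts <- c(1815, 1865, 1915, 1965)
ends   <- c(1865, 1915, 1965, 2013)

# One subset per period, collected in a named list
subsets <- Map(function(s, e) filter_period(dat, s, e), starts, ends)
names(subsets) <- paste0("p", 1:4)

# Number of rows that fall in each period
sapply(subsets, nrow)
```

With the six sample rows, subsets$p2 holds the same four rows as the "P2" output above, and p3 and p4 are empty.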
Answer 2
Score: 0
We could also set up a data.frame / tibble with periods and join it with join_by(overlaps()):
library(dplyr)
# adjust sigfig for printing
options(pillar.sigfig = 6)
# define periods:
periods <- tribble(~period, ~start, ~end,
1, 1815, 1865,
2, 1865, 1915,
3, 1915, 1965,
4, 1965, 2013)
# join by overlaps, default bounds are [], i.e. <= and >=
dat <- dat %>% left_join(periods, join_by(overlaps(lower, upper, start, end)))
dat
#> # A tibble: 10 × 9
#>       year     y y_error year_error   upper   lower period start   end
#>      <dbl> <dbl>   <dbl>      <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl>
#>  1 1800.13 -0.41    0.18    24.5941 1824.73 1775.54      1  1815  1865
#>  2 1816.11 -0.31    0.19    29.6163 1845.72 1786.49      1  1815  1865
#>  3 1836.29 -0.27    0.19    29.3469 1865.64 1806.95      1  1815  1865
#>  4 1836.29 -0.27    0.19    29.3469 1865.64 1806.95      2  1865  1915
#>  5 1840.33 -0.25    0.19    30.0261 1870.36 1810.30      1  1815  1865
#>  6 1840.33 -0.25    0.19    30.0261 1870.36 1810.30      2  1865  1915
#>  7 1856.48 -0.15    0.18    34.7541 1891.23 1821.73      1  1815  1865
#>  8 1856.48 -0.15    0.18    34.7541 1891.23 1821.73      2  1865  1915
#>  9 1864.57 -0.15    0.19    30.7862 1895.35 1833.78      1  1815  1865
#> 10 1864.57 -0.15    0.19    30.7862 1895.35 1833.78      2  1865  1915
# for p2:
dat %>% filter(period == 2)
#> # A tibble: 4 × 9
#>      year     y y_error year_error   upper   lower period start   end
#>     <dbl> <dbl>   <dbl>      <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl>
#> 1 1836.29 -0.27    0.19    29.3469 1865.64 1806.95      2  1865  1915
#> 2 1840.33 -0.25    0.19    30.0261 1870.36 1810.30      2  1865  1915
#> 3 1856.48 -0.15    0.18    34.7541 1891.23 1821.73      2  1865  1915
#> 4 1864.57 -0.15    0.19    30.7862 1895.35 1833.78      2  1865  1915
Input data:
dat <- structure(list(year = c(1800.13122, 1816.105967, 1836.292674,
1840.330016, 1856.479382, 1864.566799), y = c(-0.41, -0.31, -0.27,
-0.25, -0.15, -0.15), y_error = c(0.18, 0.19, 0.19, 0.19, 0.18,
0.19), year_error = c(24.59408182, 29.61627391, 29.34693518,
30.02607322, 34.75414444, 30.78615486), upper = c(1824.72530182,
1845.72224091, 1865.63960918, 1870.35608922, 1891.23352644, 1895.35295386
), lower = c(1775.53713818, 1786.48969309, 1806.94573882, 1810.30394278,
1821.72523756, 1833.78064414)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
Created on 2023-06-26 with reprex v2.0.2
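If separate objects per period are wanted, as in the question, the joined table could be split on the period column. The sketch below is one possible variant rather than part of the answer above: it uses inner_join() instead of left_join() so that rows matching no period are dropped, keeps only the lower and upper columns for brevity, and the names joined and by_period are assumptions made here. join_by() and overlaps() require dplyr 1.1.0 or later.

```r
library(dplyr)  # join_by() and overlaps() need dplyr >= 1.1.0

# Interval bounds taken from the question's sample data
dat <- tibble(
  lower = c(1775.53713818, 1786.48969309, 1806.94573882,
            1810.30394278, 1821.72523756, 1833.78064414),
  upper = c(1824.72530182, 1845.72224091, 1865.63960918,
            1870.35608922, 1891.23352644, 1895.35295386)
)

periods <- tribble(~period, ~start, ~end,
                   1, 1815, 1865,
                   2, 1865, 1915,
                   3, 1915, 1965,
                   4, 1965, 2013)

# Keep only row/period pairs whose intervals overlap (inclusive bounds)
joined <- inner_join(dat, periods,
                     join_by(overlaps(lower, upper, start, end)))

# One tibble per period that has at least one matching row
by_period <- split(joined, joined$period)

names(by_period)
sapply(by_period, nrow)
```

Periods 3 and 4 have no overlapping rows in the sample data, so they do not appear in by_period; by_period[["2"]] corresponds to the filter(period == 2) result shown earlier.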