Dplyr filter using range across two columns

Question

Apologies if this is a duplicate, but I cannot find what I am looking for.

I am trying to filter data into four subsets using a specified range across two columns. I got as far as the code below before getting stuck. I think the solution lies somewhere in between(), filter_all() or filter_if(), but I cannot get it to work. Any thoughts or solutions would be great, thank you. I am trying to split the data into the following periods: 1: 1815-1865; 2: 1865-1915; 3: 1915-1965; and 4: 1965-2013.

For the expected output for P2 (1865-1915), I would include a row as long as either the upper or the lower threshold means the age could potentially sit within the range.

Code

    ## Read in data
    dat <- read_csv("data/Iceland.csv")
    dat <- rbind(dat)
    ## Arrange the whole data set by age
    dat <- dat %>% arrange(dat)
    ## Calculate ranges for periods
    dat$upper <- dat$year + dat$year_error
    dat$lower <- dat$year - dat$year_error
    ### Split into periods
    p1 <- dat %>% filter(between(upper:lower, 1815, 1865))

Data

    structure(list(year = c(1800.13122, 1816.105967, 1836.292674,
    1840.330016, 1856.479382, 1864.566799), y = c(-0.41, -0.31, -0.27,
    -0.25, -0.15, -0.15), y_error = c(0.18, 0.19, 0.19, 0.19, 0.18,
    0.19), year_error = c(24.59408182, 29.61627391, 29.34693518,
    30.02607322, 34.75414444, 30.78615486), upper = c(1824.72530182,
    1845.72224091, 1865.63960918, 1870.35608922, 1891.23352644, 1895.35295386
    ), lower = c(1775.53713818, 1786.48969309, 1806.94573882, 1810.30394278,
    1821.72523756, 1833.78064414)), row.names = c(NA, -6L), class = c("tbl_df",
    "tbl", "data.frame"))

Answer 1

Score: 1

Unless I am misunderstanding what you are trying to do, I think you can do this using & (AND) inside filter(). For your four sets, using the dat object from your question, this would be:

    # P1
    dat %>% filter(upper >= 1815 & lower <= 1865)
    # P2
    dat %>% filter(upper >= 1865 & lower <= 1915)
    # P3 (empty)
    dat %>% filter(upper >= 1915 & lower <= 1965)
    # P4 (empty)
    dat %>% filter(upper >= 1965 & lower <= 2013)

This is the output for the "P2" set:

    # A tibble: 4 × 6
       year     y y_error year_error upper lower
      <dbl> <dbl>   <dbl>      <dbl> <dbl> <dbl>
    1 1836. -0.27    0.19       29.3 1866. 1807.
    2 1840. -0.25    0.19       30.0 1870. 1810.
    3 1856. -0.15    0.18       34.8 1891. 1822.
    4 1865. -0.15    0.19       30.8 1895. 1834.

I am using >= and <= since you commented that the intervals should be inclusive.
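
The condition in each filter() call is the usual interval-overlap test: a row's range [lower, upper] touches a period [start, end] exactly when upper >= start and lower <= end. As a rough sketch only, the four calls could be generated from two vectors of period boundaries. The names overlaps_period, period_start, period_end, starts, ends and period_list are made up for illustration and are not part of the original answer; the data is rebuilt from the year and year_error values in the question.

```r
library(dplyr)

# Data from the question; upper and lower are recomputed from year and year_error
dat <- tibble(
  year       = c(1800.13122, 1816.105967, 1836.292674,
                 1840.330016, 1856.479382, 1864.566799),
  year_error = c(24.59408182, 29.61627391, 29.34693518,
                 30.02607322, 34.75414444, 30.78615486)
) %>%
  mutate(upper = year + year_error, lower = year - year_error)

# Hypothetical helper: keep rows whose [lower, upper] range overlaps
# the closed interval [period_start, period_end]
overlaps_period <- function(df, period_start, period_end) {
  df %>% filter(upper >= period_start & lower <= period_end)
}

# Period boundaries taken from the question
starts <- c(1815, 1865, 1915, 1965)
ends   <- c(1865, 1915, 1965, 2013)

# One tibble per period, named p1 to p4
period_list <- Map(function(s, e) overlaps_period(dat, s, e), starts, ends)
names(period_list) <- paste0("p", seq_along(starts))

# Number of rows that fall into each period
sapply(period_list, nrow)
#> p1 p2 p3 p4
#>  6  4  0  0
```

With this sample, every row overlaps period 1 and the last four rows also overlap period 2, so a row can appear in more than one subset.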

Answer 2

Score: 0

We could also set up a data.frame / tibble with periods and join it with join_by(overlaps()):

    library(dplyr)
    # adjust sigfig for printing
    options(pillar.sigfig = 6)
    # define periods:
    periods <- tribble(~period, ~start, ~end,
                       1, 1815, 1865,
                       2, 1865, 1915,
                       3, 1915, 1965,
                       4, 1965, 2013)
    # join by overlaps, default bounds are [], i.e. <= and >=
    dat <- dat %>% left_join(periods, join_by(overlaps(lower, upper, start, end)))
    dat
    #> # A tibble: 10 × 9
    #>       year     y y_error year_error   upper   lower period start   end
    #>      <dbl> <dbl>   <dbl>      <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl>
    #>  1 1800.13 -0.41    0.18    24.5941 1824.73 1775.54      1  1815  1865
    #>  2 1816.11 -0.31    0.19    29.6163 1845.72 1786.49      1  1815  1865
    #>  3 1836.29 -0.27    0.19    29.3469 1865.64 1806.95      1  1815  1865
    #>  4 1836.29 -0.27    0.19    29.3469 1865.64 1806.95      2  1865  1915
    #>  5 1840.33 -0.25    0.19    30.0261 1870.36 1810.30      1  1815  1865
    #>  6 1840.33 -0.25    0.19    30.0261 1870.36 1810.30      2  1865  1915
    #>  7 1856.48 -0.15    0.18    34.7541 1891.23 1821.73      1  1815  1865
    #>  8 1856.48 -0.15    0.18    34.7541 1891.23 1821.73      2  1865  1915
    #>  9 1864.57 -0.15    0.19    30.7862 1895.35 1833.78      1  1815  1865
    #> 10 1864.57 -0.15    0.19    30.7862 1895.35 1833.78      2  1865  1915
    # for p2:
    dat %>% filter(period == 2)
    #> # A tibble: 4 × 9
    #>      year     y y_error year_error   upper   lower period start   end
    #>     <dbl> <dbl>   <dbl>      <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl>
    #> 1 1836.29 -0.27    0.19    29.3469 1865.64 1806.95      2  1865  1915
    #> 2 1840.33 -0.25    0.19    30.0261 1870.36 1810.30      2  1865  1915
    #> 3 1856.48 -0.15    0.18    34.7541 1891.23 1821.73      2  1865  1915
    #> 4 1864.57 -0.15    0.19    30.7862 1895.35 1833.78      2  1865  1915

Input data:

    dat <- structure(list(year = c(1800.13122, 1816.105967, 1836.292674,
    1840.330016, 1856.479382, 1864.566799), y = c(-0.41, -0.31, -0.27,
    -0.25, -0.15, -0.15), y_error = c(0.18, 0.19, 0.19, 0.19, 0.18,
    0.19), year_error = c(24.59408182, 29.61627391, 29.34693518,
    30.02607322, 34.75414444, 30.78615486), upper = c(1824.72530182,
    1845.72224091, 1865.63960918, 1870.35608922, 1891.23352644, 1895.35295386
    ), lower = c(1775.53713818, 1786.48969309, 1806.94573882, 1810.30394278,
    1821.72523756, 1833.78064414)), row.names = c(NA, -6L), class = c("tbl_df",
    "tbl", "data.frame"))

Created on 2023-06-26 with reprex v2.0.2
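
A possible variation on the overlap join, offered as a sketch rather than as part of the original answer: overlaps() takes a bounds argument ("[]", "[)", "(]" or "()"), so the inclusive behaviour can be spelled out explicitly, and inner_join() drops any row that matches no period instead of keeping it with NA. The result can then be split into one tibble per period with base split(). The object names joined and by_period are illustrative; join_by() and overlaps() need dplyr 1.1.0 or later.

```r
library(dplyr)  # join_by() and overlaps() require dplyr >= 1.1.0

# Range columns from the question's sample data
dat <- tibble(
  year  = c(1800.13122, 1816.105967, 1836.292674,
            1840.330016, 1856.479382, 1864.566799),
  upper = c(1824.72530182, 1845.72224091, 1865.63960918,
            1870.35608922, 1891.23352644, 1895.35295386),
  lower = c(1775.53713818, 1786.48969309, 1806.94573882,
            1810.30394278, 1821.72523756, 1833.78064414)
)

periods <- tribble(
  ~period, ~start, ~end,
  1, 1815, 1865,
  2, 1865, 1915,
  3, 1915, 1965,
  4, 1965, 2013
)

# bounds = "[]" is the default (both ends inclusive); it is written out
# here only to make the choice visible
joined <- dat %>%
  inner_join(periods, join_by(overlaps(lower, upper, start, end, bounds = "[]")))

# One tibble per matched period; periods with no matching rows
# (3 and 4 for this sample) do not appear in the list
by_period <- split(joined, joined$period)
sapply(by_period, nrow)
```

For this sample the list has two elements, named "1" and "2", with 6 and 4 rows respectively, matching the join output shown above.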

huangapple
  • Posted on June 26, 2023, 22:04:44
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76557437.html