如何在R中执行具有固定日期范围的筛选连接?

huangapple go评论82阅读模式
英文:

How to perform a filtering join with fixed date range in R?

问题

我尝试使用dplyr执行一个带有固定日期范围(例如2天)的过滤连接。请参见下面的示例:

  1. library(tibble)
  2. # 创建 tibble1
  3. patient_id <- c("patient1", "patient2")
  4. laterality <- c("L", "R")
  5. date <- as.Date(c("2020-10-24", "2010-09-24"))
  6. tibble1 <- tibble(patient_id, laterality, date)
  7. # 创建 tibble2
  8. patient_id <- c("patient1", "patient2", "patient1", "patient1")
  9. laterality <- c("L", "R", "R", "L")
  10. date <- as.Date(c("2020-10-24", "2010-09-24", "2010-09-18", "2020-10-25"))
  11. type <- c("dark", "light", "dark", "light")
  12. tibble2 <- tibble(patient_id, laterality, date, type)
  13. # 创建输出
  14. patient_id <- c("patient1", "patient2", "patient1")
  15. laterality <- c("L", "R", "L")
  16. date <- as.Date(c("2020-10-24", "2010-09-24", "2020-10-25"))
  17. type <- c("dark", "light", "light")
  18. output <- tibble(patient_id, laterality, date, type)

我想要使用tibble1过滤tibble2,并使用固定的日期范围(+/- 2天),应该得到output。我尝试使用semi_join,但不确定如何结合日期范围。

  1. semi_join(tibble2, tibble1)

我已经看到其他解决方案,它们具有单独的fromto列来定义范围,但希望对所有行使用单一的固定范围。

英文:

I am trying to use dplyr to perform a filtering join with a fixed date range e.g. 2 days. Please see example below:

  1. library(tibble)
  2. #Create tibble1
  3. patient_id &lt;- c(&quot;patient1&quot;, &quot;patient2&quot;)
  4. laterality &lt;- c(&quot;L&quot;, &quot;R&quot;)
  5. date &lt;- as.Date(c(&quot;2020-10-24&quot;, &quot;2010-09-24&quot;))
  6. tibble1 &lt;- tibble(patient_id, laterality, date)
  7. #Create tibble2
  8. patient_id &lt;- c(&quot;patient1&quot;, &quot;patient2&quot;, &quot;patient1&quot;, &quot;patient1&quot;)
  9. laterality &lt;- c(&quot;L&quot;, &quot;R&quot;, &quot;R&quot;, &quot;L&quot;)
  10. date &lt;- as.Date(c(&quot;2020-10-24&quot;, &quot;2010-09-24&quot;, &quot;2010-09-18&quot;, &quot;2020-10-25&quot;))
  11. type &lt;- c(&quot;dark&quot;, &quot;light&quot;, &quot;dark&quot;, &quot;light&quot;)
  12. tibble2 &lt;- tibble(patient_id, laterality, date, type)
  13. #Create output
  14. patient_id &lt;- c(&quot;patient1&quot;, &quot;patient2&quot;, &quot;patient1&quot;)
  15. laterality &lt;- c(&quot;L&quot;, &quot;R&quot;, &quot;L&quot;)
  16. date &lt;- as.Date(c(&quot;2020-10-24&quot;, &quot;2010-09-24&quot;, &quot;2020-10-25&quot;))
  17. type &lt;- c(&quot;dark&quot;, &quot;light&quot;, &quot;light&quot;)
  18. output &lt;- tibble(patient_id, laterality, date, type)

I would like to filter tibble2 using tibble1, with a fixed date range of +/- 2 days, which should give output. I have tried using semi_join but not sure how to incorporate the date range.

  1. semi_join(tibble2,tibble1)

I have seen other solution which have separate from and to columns to define the range, but was hoping to do it with a single fixed range for all rows.

Thank you!

答案1

得分: 2

使用 mutate 来为每一行创建起始/结束值,然后连接数据。

  1. library(dplyr)
  2. tibble2 %>%
  3. mutate(start = date - 2, end = date + 2, date = NULL) %>%
  4. right_join(tibble1, by = c("patient_id", "laterality")) %>%
  5. filter(date >= start & date <= end) %>%
  6. select(-start, -end)
  7. # patient_id laterality type date
  8. # <chr> <chr> <chr> <date>
  9. # 1 patient1 L dark 2020-10-24
  10. # 2 patient2 R light 2010-09-24
  11. # 3 patient1 L light 2020-10-24

这是给定代码的翻译。

英文:

Use mutate to create the start/end values for each row then join the data

  1. library(dplyr)
  2. tibble2 %&gt;%
  3. mutate(start=date-2, end=date+2, date=NULL) %&gt;%
  4. right_join(tibble1, join_by(patient_id, laterality, between(y$date, x$start, x$end))) %&gt;%
  5. select(-start, -end)
  6. # patient_id laterality type date
  7. # &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;date&gt;
  8. # 1 patient1 L dark 2020-10-24
  9. # 2 patient2 R light 2010-09-24
  10. # 3 patient1 L light 2020-10-24

答案2

得分: 0

你可以使用dplyr::filter()和dplyr::between()来实现这样的结果:

  1. filter(tibble2, between(date,
  2. min(tibble1$date) - 2,
  3. max(tibble1$date) + 2))
英文:

I presume you want the range to be specified by the max and min dates in tibble1?

You can achieve such result using dplyr::filter() and dplyr::between():

  1. filter(tibble2, between(date,
  2. min(tibble1$date) - 2,
  3. max(tibble1$date) + 2))

huangapple
  • 本文由 发表于 2023年5月17日 21:08:32
  • 转载请务必保留本文链接:https://go.coder-hub.com/76272470.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定