子查询中的内部查询引用的表与外部查询不同。

huangapple go评论62阅读模式
英文:

Subquery with inner query referring to a different table than the outer query

问题

如何使用dplyr语法编写一个子查询,其中内部查询引用不同于外部查询的表?

考虑以下示例:

library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

table1 <- data.frame(
  id = c(1,2,3),
  date = c("2019-01-04", "2019-01-04", "2019-01-05")
)

table2 <- data.frame(
  id = c(1,1,2,3),
  date = c("AAA", "BBB", "CCC", "DDD")
)

dbWriteTable(con, "table1", table1, overwrite = TRUE)
dbWriteTable(con, "table2", table2, overwrite = TRUE)

dbGetQuery(con, "
SELECT * 
FROM table2
WHERE id in (
  SELECT id
  FROM table1
  WHERE date='2019-01-04'
)
")

一种方法是首先编写内部查询,然后在filter()中使用结果。但是,对于我的实际示例,这将不起作用,因为有数十万个匹配的ID,当执行第二个查询时,数据库(MS SQL Server)会引发错误。

ids <- con %>%
  tbl("table1") %>%
  filter(date == "2019-01-04") %>%
  pull(id)

con %>%
  tbl("table2") %>%
  filter(
    id %in% ids
  ) %>%
  show_query()

<SQL>
SELECT *
FROM `table2`
WHERE (`id` IN (1.0, 2.0)) # 这将在匹配的ID数量较多时出现问题

根据我理解的方式(请参见此问题),在数据库中编写子查询并执行它的方式如下:

result <- con %>%
  tbl("table2") %>%
  filter(
    id %in% (
      con %>%
      tbl("table1") %>%
        filter(date == "2019-01-04") %>%
        pull(id)
    )
  )

这样可以正常运行,但在result上使用collect()show_query()会引发以下错误:

Error in `purrr::map_chr()`:
ℹ In index: 2.
Caused by error in `UseMethod()`:
! no applicable method for 'escape' applied to an object of class "c('SQLiteConnection', 'DBIConnection', 'DBIObject')"
Run `rlang::last_trace()` to see where the error occurred.
英文:

How does one write a subquery with dplyr syntax where the inner query refers to a different table than the outer query?

Consider this example

library(DBI)
library(dplyr)

con &lt;- dbConnect(RSQLite::SQLite(), &quot;:memory:&quot;)

table1 &lt;- data.frame(
  id = c(1,2,3),
  date = c(&quot;2019-01-04&quot;, &quot;2019-01-04&quot;, &quot;2019-01-05&quot;)
)

table2 &lt;- data.frame(
  id = c(1,1,2,3),
  date = c(&quot;AAA&quot;, &quot;BBB&quot;, &quot;CCC&quot;, &quot;DDD&quot;)
)

dbWriteTable(con, &quot;table1&quot;, table1, overwrite = T)
dbWriteTable(con, &quot;table2&quot;, table2, overwrite = T)

dbGetQuery(con, &quot;
SELECT * 
FROM table2
WHERE id in (
  SELECT id
  FROM table1
  WHERE date=&#39;2019-01-04&#39;
)
&quot;)

One idea is to write the inner query first and then use the result in filter(). But this will not work with my real example because there are hundreds of thousands of matching ids and the database (MS SQL Server) throws an error when executing the second query.

ids &lt;- con %&gt;%
  tbl(&quot;table1&quot;) %&gt;%
  filter(date == &quot;2019-01-04&quot;) %&gt;%
  pull(id)

con %&gt;%
  tbl(&quot;table2&quot;) %&gt;%
  filter(
    id %in% ids
  ) %&gt;%
  show_query()

&lt;SQL&gt;
SELECT *
FROM `table2`
WHERE (`id` IN (1.0, 2.0)) # This is going to be a problem with a large number of matching ids

As far as I understand (see this question), the way to write the subquery and execute it in the database is this

result &lt;- con %&gt;%
  tbl(&quot;table2&quot;) %&gt;%
  filter(
    id %in% (
      con %&gt;%
      tbl(&quot;table1&quot;) %&gt;%
        filter(date == &quot;2019-01-04&quot;) %&gt;%
        pull(id)
    )
  )

This runs without error, but using collect() or show_query() on the result throws this error:

Error in `purrr::map_chr()`:
ℹ In index: 2.
Caused by error in `UseMethod()`:
! no applicable method for &#39;escape&#39; applied to an object of class &quot;c(&#39;SQLiteConnection&#39;, &#39;DBIConnection&#39;, &#39;DBIObject&#39;)&quot;
Run `rlang::last_trace()` to see where the error occurred.

答案1

得分: 1

在内部连接中,我认为您可以直接使用tbl(.)引用。

inner_join(tbl(con, "table1"), tbl(con, "table2"), by = "id") %>%
  filter(date.x == "2019-01-04") %>%
  select(id, date = date.y) %>%
  collect()
# # A tibble: 3 × 2
#      id date 
#   <dbl> <chr>
# 1     1 AAA  
# 2     1 BBB  
# 3     2 CCC  

正如下面所示,它仍然是一个"lazy"操作:

inner_join(tbl(con, "table1"), tbl(con, "table2"), by = "id") %>%
  filter(date.x == "2019-01-04") %>%
  select(id, date = date.y) %>%
  show_query()
# <SQL>
# SELECT `id`, `date.y` AS `date`
# FROM (
#   SELECT `LHS`.`id` AS `id`, `LHS`.`date` AS `date.x`, `RHS`.`date` AS `date.y`
#   FROM `table1` AS `LHS`
#   INNER JOIN `table2` AS `RHS`
#     ON (`LHS`.`id` = `RHS`.`id`)
# )
# WHERE (`date.x` = '2019-01-04')
英文:

I think you can just use the tbl(.) references within an inner join.

inner_join(tbl(con, &quot;table1&quot;), tbl(con, &quot;table2&quot;), by = &quot;id&quot;) %&gt;%
  filter(date.x == &quot;2019-01-04&quot;) %&gt;%
  select(id, date = date.y) %&gt;%
  collect()
# # A tibble: 3 &#215; 2
#      id date 
#   &lt;dbl&gt; &lt;chr&gt;
# 1     1 AAA  
# 2     1 BBB  
# 3     2 CCC  

It's still a "lazy" operation, as seen here:

inner_join(tbl(con, &quot;table1&quot;), tbl(con, &quot;table2&quot;), by = &quot;id&quot;) %&gt;%
  filter(date.x == &quot;2019-01-04&quot;) %&gt;%
  select(id, date = date.y) %&gt;%
  show_query()
# &lt;SQL&gt;
# SELECT `id`, `date.y` AS `date`
# FROM (
#   SELECT `LHS`.`id` AS `id`, `LHS`.`date` AS `date.x`, `RHS`.`date` AS `date.y`
#   FROM `table1` AS `LHS`
#   INNER JOIN `table2` AS `RHS`
#     ON (`LHS`.`id` = `RHS`.`id`)
# )
# WHERE (`date.x` = &#39;2019-01-04&#39;)

huangapple
  • 本文由 发表于 2023年3月31日 20:49:38
  • 转载请务必保留本文链接:https://go.coder-hub.com/75898740.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定