Subquery with inner query referring to a different table than the outer query
Question
How does one write a subquery with dplyr syntax where the inner query refers to a different table than the outer query?
Consider this example:
library(DBI)
library(dplyr)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
table1 <- data.frame(
  id = c(1, 2, 3),
  date = c("2019-01-04", "2019-01-04", "2019-01-05")
)
table2 <- data.frame(
  id = c(1, 1, 2, 3),
  date = c("AAA", "BBB", "CCC", "DDD")
)
dbWriteTable(con, "table1", table1, overwrite = TRUE)
dbWriteTable(con, "table2", table2, overwrite = TRUE)
dbGetQuery(con, "
  SELECT *
  FROM table2
  WHERE id IN (
    SELECT id
    FROM table1
    WHERE date = '2019-01-04'
  )
")
One idea is to write the inner query first and then use the result in filter(). But this will not work with my real example, because there are hundreds of thousands of matching ids and the database (MS SQL Server) throws an error when executing the second query.
ids <- con %>%
  tbl("table1") %>%
  filter(date == "2019-01-04") %>%
  pull(id)
con %>%
  tbl("table2") %>%
  filter(id %in% ids) %>%
  show_query()
<SQL>
SELECT *
FROM `table2`
WHERE (`id` IN (1.0, 2.0)) # This is going to be a problem with a large number of matching ids
As far as I understand (see this question), the way to write the subquery and execute it in the database is this:
result <- con %>%
  tbl("table2") %>%
  filter(
    id %in% (
      con %>%
        tbl("table1") %>%
        filter(date == "2019-01-04") %>%
        pull(id)
    )
  )
This runs without error, but calling collect() or show_query() on result throws this error:
Error in `purrr::map_chr()`:
ℹ In index: 2.
Caused by error in `UseMethod()`:
! no applicable method for 'escape' applied to an object of class "c('SQLiteConnection', 'DBIConnection', 'DBIObject')"
Run `rlang::last_trace()` to see where the error occurred.
Answer 1
Score: 1
I think you can just use the tbl(.) references within an inner join.
inner_join(tbl(con, "table1"), tbl(con, "table2"), by = "id") %>%
  filter(date.x == "2019-01-04") %>%
  select(id, date = date.y) %>%
  collect()
# # A tibble: 3 × 2
# id date
# <dbl> <chr>
# 1 1 AAA
# 2 1 BBB
# 3 2 CCC
It's still a "lazy" operation, as seen here:
inner_join(tbl(con, "table1"), tbl(con, "table2"), by = "id") %>%
  filter(date.x == "2019-01-04") %>%
  select(id, date = date.y) %>%
  show_query()
# <SQL>
# SELECT `id`, `date.y` AS `date`
# FROM (
# SELECT `LHS`.`id` AS `id`, `LHS`.`date` AS `date.x`, `RHS`.`date` AS `date.y`
# FROM `table1` AS `LHS`
# INNER JOIN `table2` AS `RHS`
# ON (`LHS`.`id` = `RHS`.`id`)
# )
# WHERE (`date.x` = '2019-01-04')
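A related option, closer to the `WHERE id IN (...)` shape of the original SQL, is a semi join. The sketch below is a minimal example under stated assumptions: it rebuilds the question's two tables in an in-memory SQLite database, assumes dbplyr is installed (it is needed for `tbl()` on a connection), and uses the illustrative name `wanted_ids`, which is not from the original post. A semi join keeps each matching row of table2 once, whereas an inner join would repeat rows if table1 ever held the same id more than once for the chosen date. The SQL that dbplyr generates is usually a `WHERE EXISTS` subquery rather than `IN`, and its exact form may differ between dbplyr versions.

```r
library(DBI)
library(dplyr)

# Rebuild the example tables from the question in an in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "table1", data.frame(
  id = c(1, 2, 3),
  date = c("2019-01-04", "2019-01-04", "2019-01-05")
))
dbWriteTable(con, "table2", data.frame(
  id = c(1, 1, 2, 3),
  date = c("AAA", "BBB", "CCC", "DDD")
))

# Inner query: kept lazy (no pull() or collect()), so the ids never leave the database.
# `wanted_ids` is an illustrative name, not part of the original post.
wanted_ids <- tbl(con, "table1") %>%
  filter(date == "2019-01-04") %>%
  select(id)

# Outer query: keep only the table2 rows whose id appears in the inner query
result <- tbl(con, "table2") %>%
  semi_join(wanted_ids, by = "id")

show_query(result)      # prints the single SQL statement sent to the database
out <- collect(result)  # runs the query and returns a local tibble

dbDisconnect(con)
```

Because `wanted_ids` is a lazy table rather than a vector of values, no list of literal ids is inlined into the statement, which is what the question was trying to avoid on MS SQL Server. The error in the question likely comes from `con` being evaluated inside `filter()` and then treated as a value to escape into SQL, so building the inner query outside of `filter()` sidesteps it.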