英文:
Joining multiple tables with WHERE clause for each table
问题
选择 *
从 A 表
加入 B 表 在 A.idx = B.idx
加入 C 表 在 A.idx = C.idx
其中 A.create_dt 或 A.last_dt 在 '2023-05-01' 和 '2023-05-31' 之间
或 B.create_dt 或 B.last_dt 在 '2023-05-01' 和 '2023-05-31' 之间
或 C.create_dt 或 C.last_dt 在 '2023-05-01' 和 '2023-05-31' 之间;
这是一个 PostgreSQL 数据库。
在多表联接时,我只想选择那些每个表的 create_dt
或 last_dt
是最新的数据。然而,存在速度太慢的问题。在这种情况下如何解决?
当我只保留 A 表的 WHERE
条件时,性能是正常的。
英文:
Ex.
SELECT *
FROM A
JOIN B ON A.idx = B.idx
JOIN C ON A.idx = C.idx
WHERE A.create_dt BETWEEN '2023-05-01' AND '2023-05-31'
OR A.last_dt BETWEEN '2023-05-01' AND '2023-05-31'
OR B.create_dt BETWEEN '2023-05-01' AND '2023-05-31'
OR B.last_dt BETWEEN '2023-05-01' AND '2023-05-31'
OR C.create_dt BETWEEN '2023-05-01' AND '2023-05-31'
OR C.last_dt BETWEEN '2023-05-01' AND '2023-05-31';
It is a PostgreSQL DB
While multi-joining the tables, I want to SELECT
only the data where the create_dt
or last_dt
of each table is the latest.
However, there is a problem that the speed is too slow. How to solve in this case?
Performance is fine when I leave only the A table WHERE
condition.
答案1
得分: 2
我只会翻译代码部分,以下是翻译好的代码部分:
我要 `SELECT` 只有每个表的 `create_dt` 或 `last_dt` 是最新的数据。
然后你肯定不能在 `idx` 上连接,因为每个表的最新数据不会共享相同的 `idx` 值。
你的查询不是你说的那样。这个是:
~~~pgsql
SELECT *
FROM (
(
SELECT GREATEST(create_dt, last_dt) AS latest_a, *
FROM a
WHERE create_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER BY create_dt DESC
LIMIT 1
)
UNION ALL
(
SELECT GREATEST(create_dt, last_dt) AS latest_a, *
FROM a
WHERE last_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER BY last_dt DESC
LIMIT 1
)
ORDER BY latest_a
LIMIT 1
) a
CROSS JOIN (
(
SELECT GREATEST(create_dt, last_dt) AS latest_b, *
FROM b
WHERE create_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER BY create_dt DESC
LIMIT 1
)
UNION ALL
(
SELECT GREATEST(create_dt, last_dt) AS latest_b, *
FROM b
WHERE last_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER BY last_dt DESC
LIMIT 1
)
ORDER BY latest_b
LIMIT 1
) b
CROSS JOIN (
(
SELECT GREATEST(create_dt, last_dt) AS latest_c, *
FROM a
WHERE create_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER BY create_dt DESC
LIMIT 1
)
UNION ALL
(
SELECT GREATEST(create_dt, last_dt) AS latest_c, *
FROM a
WHERE last_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER BY last_dt DESC
LIMIT 1
)
ORDER BY latest_c
LIMIT 1
) c -- USING (idx);
~~~
所有括号都是必需的。
有点啰嗦,但这是速度最快的方式 - 前提是你有这些索引:
~~~pgsql
CREATE INDEX a_create_dt_idx ON A (create_dt);
CREATE INDEX a_last_dt_idx ON A (last_dt);
CREATE INDEX b_create_dt_idx ON B (create_dt);
CREATE INDEX b_last_dt_idx ON B (last_dt);
CREATE INDEX c_create_dt_idx ON C (create_dt);
CREATE INDEX c_last_dt_idx ON C (last_dt);
~~~
每个表格将进行两次索引查找,直接选择符合条件的一行。
我使用了无条件的 `CROSS JOIN` 进行连接,因为每个子查询只返回一行,前提是至少有一个符合条件的。
如果其中一个子查询找不到行,结果将为空。也许你真的希望使用 `FULL OUTER JOIN` 来保留其他表的结果,如果有一个为空。或者只返回3行结果。
再次说明,如果你没有确切地说明你真正需要什么,我不会感到惊讶。我的猜测是你想要类似这样的东西(每个表的每个 `idx` 的最新行):
```sql
(
SELECT DISTINCT ON (idx) 'a_latest_create_dt' AS what, *
FROM a
WHERE create_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER BY idx DESC, create_dt DESC, last_dt DESC
)
UNION ALL
(
SELECT DISTINCT ON (idx) 'a_latest_last_dt' AS what, *
FROM a
WHERE last_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER BY idx DESC, last_dt DESC, create_dt DESC
)
UNION ALL
(
SELECT DISTINCT ON (idx) 'b_latest_create_dt' AS what, *
FROM b
WHERE create_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER BY idx DESC, create_dt DESC, last_dt DESC
)
UNION ALL
(
SELECT DISTINCT ON (idx) 'b_latest_last_dt' AS what, *
FROM b
WHERE last_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER BY idx DESC, last_dt DESC, create_dt DESC
)
UNION ALL
(
SELECT DISTINCT ON (idx) 'c_latest_create_dt' AS what, *
FROM c
WHERE create_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER BY idx DESC, create_dt DESC, last_dt DESC
)
UNION ALL
(
SELECT DISTINCT ON (idx) 'c_latest_last_dt' AS what, *
FROM c
WHERE last_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER BY idx DESC, last_dt DESC, create_dt DESC
)
ORDER BY idx, what;
~~~
这次我列出了每个表的每个 `idx` 的最新行:一个是 `create_dt`,一个是 `last_dt`。如果在每个类别中都有一个符合时间范围的行,就会有6行。
关于 `DISTINCT ON`:
- https://stackoverflow.com/questions/3800551/select-first-row-in-each-group-by-group/7630564#7630564
假设所有涉及的列都是 `NOT NULL`。否则你可能需要做更多的工作...
匹配的索引:
~~~pgsql
CREATE INDEX a_create_dt_idx ON A (idx, create_dt);
CREATE INDEX a_last_dt_idx ON A (idx, last_dt);
CREATE INDEX b_create_dt_idx ON B (idx, create_dt);
CREATE INDEX b_last_dt_idx ON B (idx
<details>
<summary>英文:</summary>
> I want to `SELECT` only the data where the `create_dt` or `last_dt` of each table is the latest.
Then you certainly cannot join on `idx` since the latest per table won't share the same `idx` value.
Your query does not do what you say. This one does:
~~~pgsql
SELECT *
FROM (
(
SELECT GREATEST(create_dt, last_dt) AS latest_a, *
FROM a
WHERE create_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER BY create_dt DESC
LIMIT 1
)
UNION ALL
(
SELECT GREATEST(create_dt, last_dt) AS latest_a, *
FROM a
WHERE last_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER BY last_dt DESC
LIMIT 1
)
ORDER BY latest_a
LIMIT 1
) a
CROSS JOIN (
(
SELECT GREATEST(create_dt, last_dt) AS latest_b, *
FROM b
WHERE create_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER BY create_dt DESC
LIMIT 1
)
UNION ALL
(
SELECT GREATEST(create_dt, last_dt) AS latest_b, *
FROM b
WHERE last_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER BY last_dt DESC
LIMIT 1
)
ORDER BY latest_b
LIMIT 1
) b
CROSS JOIN (
(
SELECT GREATEST(create_dt, last_dt) AS latest_c, *
FROM a
WHERE create_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER BY create_dt DESC
LIMIT 1
)
UNION ALL
(
SELECT GREATEST(create_dt, last_dt) AS latest_c, *
FROM a
WHERE last_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER BY last_dt DESC
LIMIT 1
)
ORDER BY latest_c
LIMIT 1
) c -- USING (idx);
~~~
All parentheses required.
A bit verbose. But it's as fast as this gets - provided you have these indexes:
~~~pgsql
CREATE INDEX a_create_dt_idx ON A (create_dt);
CREATE INDEX a_last_dt_idx ON A (last_dt);
CREATE INDEX b_create_dt_idx ON B (create_dt);
CREATE INDEX b_last_dt_idx ON B (last_dt);
CREATE INDEX c_create_dt_idx ON C (create_dt);
CREATE INDEX c_last_dt_idx ON C (last_dt);
~~~
It will be two index seeks per table, directly picking the one qualifying row every time.
I am joining with an unconditional `CROSS JOIN`, since each subquery returns exactly *one* row, provided at least one qualifies.
If one of the subqueries finds no row, the result is empty. Maybe you really want a `FULL OUTER JOIN` to preserve results from the other tables if one comes up empty. Or just 3 result rows.
Then again, I wouldn't be surprised if you didn't exactly say what you really need. My educated guess: you want something like this (the latest row **per idx** from each table):
~~~pgsql
(
SELECT DISTINCT ON (idx) 'a_latest_create_dt' AS what, *
FROM a
WHERE create_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER BY idx DESC, create_dt DESC, last_dt DESC
)
UNION ALL
(
SELECT DISTINCT ON (idx) 'a_latest_last_dt' AS what, *
FROM a
WHERE last_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER BY idx DESC, last_dt DESC, create_dt DESC
)
UNION ALL
(
SELECT DISTINCT ON (idx) 'b_latest_create_dt' AS what, *
FROM b
WHERE create_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER BY idx DESC, create_dt DESC, last_dt DESC
)
UNION ALL
(
SELECT DISTINCT ON (idx) 'b_latest_last_dt' AS what, *
FROM b
WHERE last_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER BY idx DESC, last_dt DESC, create_dt DESC
)
UNION ALL
(
SELECT DISTINCT ON (idx) 'c_latest_create_dt' AS what, *
FROM c
WHERE create_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER BY idx DESC, create_dt DESC, last_dt DESC
)
UNION ALL
(
SELECT DISTINCT ON (idx) 'c_latest_last_dt' AS what, *
FROM c
WHERE last_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER BY idx DESC, last_dt DESC, create_dt DESC
)
ORDER BY idx, what;
~~~
This time I list the latest rows per idx for each table: one for `create_dt`, one for `last_dt`. Makes 6 rows if there is one in the time frame for every flavor.
About `DISTINCT ON`:
- https://stackoverflow.com/questions/3800551/select-first-row-in-each-group-by-group/7630564#7630564
Assuming all involved columns are `NOT NULL`. Else you may have to do more ...
Matching indexes:
~~~pgsql
CREATE INDEX a_create_dt_idx ON A (idx, create_dt);
CREATE INDEX a_last_dt_idx ON A (idx, last_dt);
CREATE INDEX b_create_dt_idx ON B (idx, create_dt);
CREATE INDEX b_last_dt_idx ON B (idx, last_dt);
CREATE INDEX c_create_dt_idx ON C (idx, create_dt);
CREATE INDEX c_last_dt_idx ON C (idx, last_dt);
~~~
</details>
# 答案2
**得分**: 1
尝试向三个表中添加以下索引:
```sql
CREATE INDEX idx_a ON A (idx, create_dt, last_dt);
CREATE INDEX idx_b ON B (idx, create_dt, last_dt);
CREATE INDEX idx_c ON C (idx, create_dt, last_dt);
如果使用这些索引,应该会加速查询中的连接操作。
英文:
You might try adding the following indices to the three tables:
<!-- language: sql -->
CREATE INDEX idx_a ON A (idx, create_dt, last_dt);
CREATE INDEX idx_b ON B (idx, create_dt, last_dt);
CREATE INDEX idx_c ON C (idx, create_dt, last_dt);
These indices, if used, should speed up the joins in your query.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论