使用每个表的WHERE子句连接多个表。

huangapple go评论72阅读模式
英文:

Joining multiple tables with WHERE clause for each table

问题

选择 *
从 A 表
加入 B 表 在 A.idx = B.idx
加入 C 表 在 A.idx = C.idx
其中 A.create_dt 或 A.last_dt 在 '2023-05-01' 和 '2023-05-31' 之间
或 B.create_dt 或 B.last_dt 在 '2023-05-01' 和 '2023-05-31' 之间
或 C.create_dt 或 C.last_dt 在 '2023-05-01' 和 '2023-05-31' 之间;

这是一个 PostgreSQL 数据库。

在多表联接时,我只想选择那些每个表的 create_dtlast_dt 是最新的数据。然而,存在速度太慢的问题。在这种情况下如何解决?

当我只保留 A 表的 WHERE 条件时,性能是正常的。

英文:

Ex.

SELECT *
FROM A
JOIN B ON A.idx = B.idx
JOIN C ON A.idx = C.idx
WHERE A.create_dt BETWEEN '2023-05-01' AND '2023-05-31'
  OR A.last_dt BETWEEN '2023-05-01' AND '2023-05-31'
  OR B.create_dt BETWEEN '2023-05-01' AND '2023-05-31'
  OR B.last_dt BETWEEN '2023-05-01' AND '2023-05-31'
  OR C.create_dt BETWEEN '2023-05-01' AND '2023-05-31'
  OR C.last_dt BETWEEN '2023-05-01' AND '2023-05-31';

It is a PostgreSQL DB

While multi-joining the tables, I want to SELECT only the data where the create_dt or last_dt of each table is the latest.
However, there is a problem that the speed is too slow. How to solve in this case?

Performance is fine when I leave only the A table WHERE condition.

答案1

得分: 2

我只会翻译代码部分,以下是翻译好的代码部分:

我要 `SELECT` 只有每个表的 `create_dt`  `last_dt` 是最新的数据。
然后你肯定不能在 `idx` 上连接,因为每个表的最新数据不会共享相同的 `idx` 值。
你的查询不是你说的那样。这个是:

~~~pgsql
SELECT *
FROM  (
   (
   SELECT GREATEST(create_dt, last_dt) AS latest_a, *
   FROM   a
   WHERE  create_dt BETWEEN '2023-05-01' AND '2023-05-31'
   ORDER  BY create_dt DESC
   LIMIT 1
   )
   UNION ALL
   (
   SELECT GREATEST(create_dt, last_dt) AS latest_a, *
   FROM   a
   WHERE  last_dt BETWEEN '2023-05-01' AND '2023-05-31'
   ORDER  BY last_dt DESC
   LIMIT 1
   )
   ORDER BY latest_a
   LIMIT 1
   ) a
CROSS JOIN  (
   (
   SELECT GREATEST(create_dt, last_dt) AS latest_b, *
   FROM   b
   WHERE  create_dt BETWEEN '2023-05-01' AND '2023-05-31'
   ORDER  BY create_dt DESC
   LIMIT 1
   )
   UNION ALL
   (
   SELECT GREATEST(create_dt, last_dt) AS latest_b, *
   FROM   b
   WHERE  last_dt BETWEEN '2023-05-01' AND '2023-05-31'
   ORDER  BY last_dt DESC
   LIMIT 1
   )
   ORDER BY latest_b
   LIMIT 1
   ) b
CROSS JOIN  (
   (
   SELECT GREATEST(create_dt, last_dt) AS latest_c, *
   FROM   a
   WHERE  create_dt BETWEEN '2023-05-01' AND '2023-05-31'
   ORDER  BY create_dt DESC
   LIMIT 1
   )
   UNION ALL
   (
   SELECT GREATEST(create_dt, last_dt) AS latest_c, *
   FROM   a
   WHERE  last_dt BETWEEN '2023-05-01' AND '2023-05-31'
   ORDER  BY last_dt DESC
   LIMIT 1
   )
   ORDER BY latest_c
   LIMIT 1
   ) c -- USING (idx);
~~~

所有括号都是必需的。
有点啰嗦,但这是速度最快的方式 - 前提是你有这些索引:

~~~pgsql
CREATE INDEX a_create_dt_idx ON A (create_dt);
CREATE INDEX a_last_dt_idx ON A (last_dt);

CREATE INDEX b_create_dt_idx ON B (create_dt);
CREATE INDEX b_last_dt_idx ON B (last_dt);

CREATE INDEX c_create_dt_idx ON C (create_dt);
CREATE INDEX c_last_dt_idx ON C (last_dt);
~~~

每个表格将进行两次索引查找,直接选择符合条件的一行。

我使用了无条件的 `CROSS JOIN` 进行连接,因为每个子查询只返回一行,前提是至少有一个符合条件的。

如果其中一个子查询找不到行,结果将为空。也许你真的希望使用 `FULL OUTER JOIN` 来保留其他表的结果,如果有一个为空。或者只返回3行结果。

再次说明,如果你没有确切地说明你真正需要什么,我不会感到惊讶。我的猜测是你想要类似这样的东西(每个表的每个 `idx` 的最新行):

```sql
(
SELECT DISTINCT ON (idx) 'a_latest_create_dt' AS what, *
FROM   a
WHERE  create_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER  BY idx DESC, create_dt DESC, last_dt DESC
)
UNION ALL
(
SELECT DISTINCT ON (idx) 'a_latest_last_dt' AS what, *
FROM   a
WHERE  last_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER  BY idx DESC, last_dt DESC, create_dt DESC
)

UNION ALL
(
SELECT DISTINCT ON (idx) 'b_latest_create_dt' AS what, *
FROM   b
WHERE  create_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER  BY idx DESC, create_dt DESC, last_dt DESC
)
UNION ALL
(
SELECT DISTINCT ON (idx) 'b_latest_last_dt' AS what, *
FROM   b
WHERE  last_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER  BY idx DESC, last_dt DESC, create_dt DESC
)

UNION ALL
(
SELECT DISTINCT ON (idx) 'c_latest_create_dt' AS what, *
FROM   c
WHERE  create_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER  BY idx DESC, create_dt DESC, last_dt DESC
)
UNION ALL
(
SELECT DISTINCT ON (idx) 'c_latest_last_dt' AS what, *
FROM   c
WHERE  last_dt BETWEEN '2023-05-01' AND '2023-05-31'
ORDER  BY idx DESC, last_dt DESC, create_dt DESC
)
ORDER BY idx, what;
~~~

这次我列出了每个表的每个 `idx` 的最新行:一个是 `create_dt`,一个是 `last_dt`。如果在每个类别中都有一个符合时间范围的行,就会有6行。

关于 `DISTINCT ON`

- https://stackoverflow.com/questions/3800551/select-first-row-in-each-group-by-group/7630564#7630564

假设所有涉及的列都是 `NOT NULL`。否则你可能需要做更多的工作...

匹配的索引:

~~~pgsql
CREATE INDEX a_create_dt_idx ON A (idx, create_dt);
CREATE INDEX a_last_dt_idx ON A (idx, last_dt);

CREATE INDEX b_create_dt_idx ON B (idx, create_dt);
CREATE INDEX b_last_dt_idx ON B (idx

<details>
<summary>英文:</summary>

&gt; I want to `SELECT` only the data where the `create_dt` or `last_dt` of each table is the latest.

Then you certainly cannot join on `idx` since the latest per table won&#39;t share the same `idx` value.

Your query does not do what you say. This one does:

~~~pgsql
SELECT *
FROM  (
   (
   SELECT GREATEST(create_dt, last_dt) AS latest_a, *
   FROM   a
   WHERE  create_dt BETWEEN &#39;2023-05-01&#39; AND &#39;2023-05-31&#39;
   ORDER  BY create_dt DESC
   LIMIT 1
   )
   UNION ALL
   (
   SELECT GREATEST(create_dt, last_dt) AS latest_a, *
   FROM   a
   WHERE  last_dt BETWEEN &#39;2023-05-01&#39; AND &#39;2023-05-31&#39;
   ORDER  BY last_dt DESC
   LIMIT 1
   )
   ORDER BY latest_a
   LIMIT 1
   ) a
CROSS JOIN  (
   (
   SELECT GREATEST(create_dt, last_dt) AS latest_b, *
   FROM   b
   WHERE  create_dt BETWEEN &#39;2023-05-01&#39; AND &#39;2023-05-31&#39;
   ORDER  BY create_dt DESC
   LIMIT 1
   )
   UNION ALL
   (
   SELECT GREATEST(create_dt, last_dt) AS latest_b, *
   FROM   b
   WHERE  last_dt BETWEEN &#39;2023-05-01&#39; AND &#39;2023-05-31&#39;
   ORDER  BY last_dt DESC
   LIMIT 1
   )
   ORDER BY latest_b
   LIMIT 1
   ) b
CROSS JOIN  (
   (
   SELECT GREATEST(create_dt, last_dt) AS latest_c, *
   FROM   a
   WHERE  create_dt BETWEEN &#39;2023-05-01&#39; AND &#39;2023-05-31&#39;
   ORDER  BY create_dt DESC
   LIMIT 1
   )
   UNION ALL
   (
   SELECT GREATEST(create_dt, last_dt) AS latest_c, *
   FROM   a
   WHERE  last_dt BETWEEN &#39;2023-05-01&#39; AND &#39;2023-05-31&#39;
   ORDER  BY last_dt DESC
   LIMIT 1
   )
   ORDER BY latest_c
   LIMIT 1
   ) c -- USING (idx);
~~~

All parentheses required.  
A bit verbose. But it&#39;s as fast as this gets - provided you have these indexes:

~~~pgsql
CREATE INDEX a_create_dt_idx ON A (create_dt);
CREATE INDEX a_last_dt_idx ON A (last_dt);

CREATE INDEX b_create_dt_idx ON B (create_dt);
CREATE INDEX b_last_dt_idx ON B (last_dt);

CREATE INDEX c_create_dt_idx ON C (create_dt);
CREATE INDEX c_last_dt_idx ON C (last_dt);
~~~

It will be two index seeks per table, directly picking the one qualifying row every time.

I am joining with an unconditional `CROSS JOIN`, since each subquery returns exactly *one* row, provided at least one qualifies.

If one of the subqueries finds no row, the result is empty. Maybe you really want a `FULL OUTER JOIN` to preserve results from the other tables if one comes up empty. Or just 3 result rows.

Then again, I wouldn&#39;t be surprised if you didn&#39;t exactly say what you really need. My educated guess: you want something like this (the latest row **per idx** from each table):

~~~pgsql
(
SELECT DISTINCT ON (idx) &#39;a_latest_create_dt&#39; AS what, *
FROM   a
WHERE  create_dt BETWEEN &#39;2023-05-01&#39; AND &#39;2023-05-31&#39;
ORDER  BY idx DESC, create_dt DESC, last_dt DESC
)
UNION ALL
(
SELECT DISTINCT ON (idx) &#39;a_latest_last_dt&#39; AS what, *
FROM   a
WHERE  last_dt BETWEEN &#39;2023-05-01&#39; AND &#39;2023-05-31&#39;
ORDER  BY idx DESC, last_dt DESC, create_dt DESC
)

UNION ALL
(
SELECT DISTINCT ON (idx) &#39;b_latest_create_dt&#39; AS what, *
FROM   b
WHERE  create_dt BETWEEN &#39;2023-05-01&#39; AND &#39;2023-05-31&#39;
ORDER  BY idx DESC, create_dt DESC, last_dt DESC
)
UNION ALL
(
SELECT DISTINCT ON (idx) &#39;b_latest_last_dt&#39; AS what, *
FROM   b
WHERE  last_dt BETWEEN &#39;2023-05-01&#39; AND &#39;2023-05-31&#39;
ORDER  BY idx DESC, last_dt DESC, create_dt DESC
)

UNION ALL
(
SELECT DISTINCT ON (idx) &#39;c_latest_create_dt&#39; AS what, *
FROM   c
WHERE  create_dt BETWEEN &#39;2023-05-01&#39; AND &#39;2023-05-31&#39;
ORDER  BY idx DESC, create_dt DESC, last_dt DESC
)
UNION ALL
(
SELECT DISTINCT ON (idx) &#39;c_latest_last_dt&#39; AS what, *
FROM   c
WHERE  last_dt BETWEEN &#39;2023-05-01&#39; AND &#39;2023-05-31&#39;
ORDER  BY idx DESC, last_dt DESC, create_dt DESC
)
ORDER BY idx, what;
~~~

This time I list the latest rows per idx for each table: one for `create_dt`, one for `last_dt`. Makes 6 rows if there is one in the time frame for every flavor.

About `DISTINCT ON`:

- https://stackoverflow.com/questions/3800551/select-first-row-in-each-group-by-group/7630564#7630564

Assuming all involved columns are `NOT NULL`. Else you may have to do more ...

Matching indexes:

~~~pgsql
CREATE INDEX a_create_dt_idx ON A (idx, create_dt);
CREATE INDEX a_last_dt_idx ON A (idx, last_dt);

CREATE INDEX b_create_dt_idx ON B (idx, create_dt);
CREATE INDEX b_last_dt_idx ON B (idx, last_dt);

CREATE INDEX c_create_dt_idx ON C (idx, create_dt);
CREATE INDEX c_last_dt_idx ON C (idx, last_dt);
~~~

</details>



# 答案2
**得分**: 1

尝试向三个表中添加以下索引:

```sql
CREATE INDEX idx_a ON A (idx, create_dt, last_dt);
CREATE INDEX idx_b ON B (idx, create_dt, last_dt);
CREATE INDEX idx_c ON C (idx, create_dt, last_dt);

如果使用这些索引,应该会加速查询中的连接操作。

英文:

You might try adding the following indices to the three tables:

<!-- language: sql -->

CREATE INDEX idx_a ON A (idx, create_dt, last_dt);
CREATE INDEX idx_b ON B (idx, create_dt, last_dt);
CREATE INDEX idx_c ON C (idx, create_dt, last_dt);

These indices, if used, should speed up the joins in your query.

huangapple
  • 本文由 发表于 2023年6月2日 10:01:57
  • 转载请务必保留本文链接:https://go.coder-hub.com/76386704.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定