英文:
How to include columns from two tables based on a common column value but include rows only from the left table?
问题
我有两个表格 tb1 和 tb2。tb2 通过一个名为 tk_id 的外键与 tb1 相连接。以下是我的两个表格的样子:
tb1
列名 | 数据类型 | 校对规则 | 允许空值 | 默认值
-----------------+-----------------------------+-----------+----------+--------------------------------------------------
id | 整数 | | 非空 | 下一个值('tb2_id_seq'::regclass)
创建时间 | 无时区时间戳 | | 非空 |
修改时间 | 无时区时间戳 | | 非空 |
状态 | 双精度浮点数 | | 非空 |
tk_id | UUID | | 非空 |
tb2
列名 | 数据类型 | 校对规则 | 允许空值 | 默认值
----------------+-----------------------------+-----------+----------+---------
id | UUID | | 非空 |
创建时间 | 无时区时间戳 | | 非空 |
修改时间 | 无时区时间戳 | | 非空 |
目标ID | UUID | | 非空 |
源ID | UUID | | 非空 |
tk_id | UUID | | 非空 |
现在我需要从 tb1 获取所有行,这些行具有来自 tb1 和 tb2 的列,当 tk_id 值匹配两个行时。
这是我尝试过的:
select tb1.created_at, tb1.status, tb2.source_id, tb2.destination_id from tb1
inner join tb2 on tb1.tk_id = tb2.tk_id where
tb1.created_at > timezone('utc', now()) - interval '40 minutes';
但我得到了太多的行。通常在 40分钟 的时间间隔内,会有大约 800 条记录,但在连接后,我得到了大约 100,000+ 条记录。
编辑: 经过一些阅读和几次尝试,我在查询中进行了一些更改,并成功将行数减少到了预期的行数。这是我的查询现在:
SELECT count(*) FROM tb1 LEFT OUTER JOIN (SELECT DISTINCT tk_id FROM tb2) t2
ON tb1.tk_id = t2.tk_id where tb1.created_at > timezone('utc', now()) -
interval '40 minutes';
但现在我无法在我的选择查询中获取 tb2 的列。
我做错了什么?
编辑2: 如果我之前没有表达清楚,我很抱歉。连接条件应该基于右表中 tk_id 的最新出现。因此,对于左表的每一行 (tb1),它应该与右表的 tk_id 的最新出现匹配,并提取右表的列。
英文:
I have two tables tb1 and tb2. tb2 is connected to tb1 via a foreign key named tk_id. Here is how my two tables look like
tb1
Column | Type | Collation | Nullable | Default
-----------------+-----------------------------+-----------+----------+--------------------------------------------------
id | integer | | not null | nextval('tb2_id_seq'::regclass)
created_at | timestamp without time zone | | not null |
modified_at | timestamp without time zone | | not null |
status | double precision | | not null |
tk_id | uuid | | not null |
tb2
Column | Type | Collation | Nullable | Default
----------------+-----------------------------+-----------+----------+---------
id | uuid | | not null |
created_at | timestamp without time zone | | not null |
modified_at | timestamp without time zone | | not null |
destination_id | uuid | | not null |
source_id | uuid | | not null |
tk_id | uuid | | not null |
Now I need to get all rows from tb1 which has columns from both tb1 and tb2 when the tk_id value matches for both the rows.
This is what I tried:
select tb1.created_at, tb1.status, tb2.source_id, tb2.destination_id from tb1
inner join tb2 on tb1.tk_id = tb2.tk_id where
tb1.created_at > timezone('utc', now()) - interval '40 minutes';
But I am getting way too many rows. Usually in a 40min interval, there would be around 800 records but after the join, I am getting around 100,000+ records.
EDIT: After some reading and a few tries, I made some changes in my query and managed to get the rows down to the expected number of rows. This is my query now
SELECT count(*) FROM tb1 LEFT OUTER JOIN (SELECT DISTINCT tk_id FROM tb2) t2
ON tb1.tk_id = t2.tk_id where tb1.created_at > timezone('utc', now()) -
interval '40 minutes';
But now I can't get the columns of tb2 in my select query.
What am I doing wrong?
EDIT2: Sorry if I couldn't make this clearer earlier. The join condition should be based on the latest occurrence of tk_id of the right table. So for every row of left table (tb1), it should match against the latest occurrence of tk_id of right table (tb2) and fetch the right table's columns.
答案1
得分: 1
以下是翻译好的内容:
由于在 tb1
中有明显的多行与 tb1
中的单行匹配,您需要定义要选择哪一行。或者需要一些汇总?
此查询返回所有来自 tb1
的符合条件的行,并从 tb2
中最后创建的一个匹配行添加列(如果有的话):
SELECT tb1.created_at, tb1.status, tb2.*
FROM tb1
LEFT JOIN LATERAL (
SELECT tb2.source_id, tb2.destination_id
FROM tb2
WHERE tb2.tk_id = tb1.tk_id
ORDER BY created_at DESC, id DESC
LIMIT 1
) tb2 ON true
WHERE tb1.created_at > timezone('utc', now()) - interval '40 minutes';
在这里添加了 id DESC
作为补充,因为 created_at
可能不是唯一的。根据您未透露的需求进行调整。
相关链接:
请在 tb2(tk_id, created_at, id)
上添加索引来支持此查询。当然,还需要在 tb1(created_at)
上添加另一个索引。或者甚至可以使用覆盖索引在 tb1(created_at)
上包含 (tk_id, status) 列。参考:
英文:
Since there are obviously many rows in tb2
for a single row in tb1
, you need to define which row to pick. Or some aggregation?
This query returns all qualifying rows from tb1
, and adds columns from the one matching row in tb2
that was created last (if any):
SELECT tb1.created_at, tb1.status, tb2.*
FROM tb1
LEFT JOIN LATERAL (
SELECT tb2.source_id, tb2.destination_id
FROM tb2
WHERE tb2.tk_id = tb1.tk_id
ORDER BY created_at DESC, id DESC
LIMIT 1
) tb2 ON true
WHERE tb1.created_at > timezone('utc', now()) - interval '40 minutes';
Added id DESC
as tiebreaker since created_at
might not be unique.
Adapt to your undisclosed needs.
Related:
- https://stackoverflow.com/questions/28550679/what-is-the-difference-between-a-lateral-join-and-a-subquery-in-postgresql/28557803#28557803
- https://stackoverflow.com/questions/76514703/sql-two-aggregate-functions-without-duplicates/76514912#76514912
Support this query with an index on tb2(tk_id, created_at, id)
.
And another index on tb1(created_at)
, obviously. Or even a covering index on tb1(created_at) INCLUDE (tk_id, status)
. See:
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论