英文:
Return values from previous row based on date ordered union CTE running slowly
问题
以下是您提供的内容的翻译部分:
从以下的体育比赛表格中:
id_ | p1_id | p2_id | match_date | p1_stat | p2_stat |
---|---|---|---|---|---|
852666 | 1 | 2 | 01/01/1997 | 1301 | 249 |
852842 | 1 | 2 | 13/01/1997 | 2837 | 2441 |
853471 | 2 | 1 | 05/05/1997 | 1474 | 952 |
4760 | 2 | 1 | 25/05/1998 | 1190 | 1486 |
6713 | 2 | 1 | 18/01/1999 | 2084 | 885 |
9365 | 2 | 1 | 01/11/1999 | 2894 | 2040 |
11456 | 1 | 2 | 15/05/2000 | 2358 | 1491 |
13022 | 1 | 2 | 14/08/2000 | 2722 | 2401 |
29159 | 1 | 2 | 26/08/2002 | 431 | 2769 |
44915 | 1 | 2 | 07/10/2002 | 1904 | 482 |
对于所选的比赛 id_
,我想返回两位球员的他们各自的上一场比赛统计数据,无论该球员在他们上一场比赛中是 p1 还是 p2。请参阅 id_ = 11456
的预期输出如下:
id_ | p1_id | p2_id | match_date | p1_stat | p2_stat | p1_prev_stat | p2_prev_stat |
---|---|---|---|---|---|---|---|
11456 | 1 | 2 | 15/05/2000 | 2358 | 1491 | 2040 | 2894 |
以下是用于在这个大小的表格上运行的 SQL 查询:
WITH cte_1 AS (
(
SELECT
id_,
match_date,
p1_id AS player_id,
p1_stat AS stat
FROM
test.match_table
UNION ALL
SELECT
id_,
match_date,
p2_id AS player_id,
p2_stat AS stat
FROM
test.match_table
)
),
cte_2 AS (
SELECT
id_,
player_id,
LAG(stat) OVER (
PARTITION BY player_id
ORDER BY
match_date,
id_
) AS prev_stat
FROM
cte_1
)
SELECT
m.*,
cte_p1.prev_stat AS p1_prev_stat,
cte_p2.prev_stat AS p2_prev_stat
FROM
test.match_table AS m
JOIN cte_2 AS cte_p1 ON cte_p1.id_ = m.id_
AND cte_p1.player_id = m.p1_id
JOIN cte_2 AS cte_p2 ON cte_p2.id_ = m.id_
AND cte_p2.player_id = m.p2_id
WHERE m.id_ = 11456
ORDER BY m.match_date
然而,实际的表格有130万行,需要大约12秒的时间。从这个答案中,问题似乎是CTE加载了所有/大多数表格行,而不仅仅是需要的行。然而,该解决方案未涵盖这种用例。是否有人有任何关于如何改善性能的建议?这里是创建小表格的SQL:
CREATE TABLE `match_table` (
`id_` int NOT NULL AUTO_INCREMENT,
`p1_id` int NOT NULL,
`p2_id` int NOT NULL,
`match_date` date NOT NULL,
`p1_stat` int DEFAULT NULL,
`p2_stat` int DEFAULT NULL,
PRIMARY KEY (`id_`),
KEY `ix__p1_id` (`p1_id`),
KEY `ix__p2_id` (`p2_id`),
KEY `ix__match_date` (`match_date`),
KEY `ix__comp` (`p1_id`, `p2_id`, `match_date`)
);
INSERT INTO `match_table`
VALUES
(
4760, 2, 1, '1998-05-25', 1190, 1486
),
(6713, 2, 1, '1999-01-18', 2084, 885),
(
9365, 2, 1, '1999-11-01', 2894, 2040
),
(
11456, 1, 2, '2000-05-15', 2358, 1491
),
(
13022, 1, 2, '2000-08-14', 2722, 2401
),
(
29159, 1, 2, '2002-08-26', 431, 2769
),
(
44915, 1, 2, '2002-10-07', 1904, 482
),
(
852666, 1, 2, '1997-01-01', 1301, 249
),
(
852842, 1, 2, '1997-01-13', 2837, 2441
),
(
853471, 2, 1, '1997-05-05', 1474, 952
);
如果您需要进一步的帮助,请告诉我。
英文:
From the following table of sport matches:
id_ | p1_id | p2_id | match_date | p1_stat | p2_stat |
---|---|---|---|---|---|
852666 | 1 | 2 | 01/01/1997 | 1301 | 249 |
852842 | 1 | 2 | 13/01/1997 | 2837 | 2441 |
853471 | 2 | 1 | 05/05/1997 | 1474 | 952 |
4760 | 2 | 1 | 25/05/1998 | 1190 | 1486 |
6713 | 2 | 1 | 18/01/1999 | 2084 | 885 |
9365 | 2 | 1 | 01/11/1999 | 2894 | 2040 |
11456 | 1 | 2 | 15/05/2000 | 2358 | 1491 |
13022 | 1 | 2 | 14/08/2000 | 2722 | 2401 |
29159 | 1 | 2 | 26/08/2002 | 431 | 2769 |
44915 | 1 | 2 | 07/10/2002 | 1904 | 482 |
For a selected match id_
I'd like to return both players' stats for their respective previous matches no matter whether the player in question was p1 or p2 in their last match. See below for the expected output for id_ = 11456
:
id_ | p1_id | p2_id | match_date | p1_stat | p2_stat | p1_prev_stat | p2_prev_stat |
---|---|---|---|---|---|---|---|
11456 | 1 | 2 | 15/05/2000 | 2358 | 1491 | 2040 | 2894 |
The following SQL works just fine on a table this size:
WITH cte_1 AS (
(
SELECT
id_,
match_date,
p1_id AS player_id,
p1_stat AS stat
FROM
test.match_table
UNION ALL
SELECT
id_,
match_date,
p2_id AS player_id,
p2_stat AS stat
FROM
test.match_table
)
),
cte_2 AS (
SELECT
id_,
player_id,
LAG(stat) OVER (
PARTITION BY player_id
ORDER BY
match_date,
id_
) AS prev_stat
FROM
cte_1
)
SELECT
m.*,
cte_p1.prev_stat AS p1_prev_stat,
cte_p2.prev_stat AS p2_prev_stat
FROM
test.match_table AS m
JOIN cte_2 AS cte_p1 ON cte_p1.id_ = m.id_
AND cte_p1.player_id = m.p1_id
JOIN cte_2 AS cte_p2 ON cte_p2.id_ = m.id_
AND cte_p2.player_id = m.p2_id
WHERE m.id_ = 11456
ORDER BY m.match_date
However, the actual table is 1.3m rows which takes circa 12 seconds. From this answer the issue seems to be that the CTEs are loading all/most of the table rows rather than just those that are needed. However, the solution doesn't cover this use case.
Would anyone have any suggestions as to how I might be able to improve this performance?
Here's the SQL to create the small table:
CREATE TABLE `match_table` (
`id_` int NOT NULL AUTO_INCREMENT,
`p1_id` int NOT NULL,
`p2_id` int NOT NULL,
`match_date` date NOT NULL,
`p1_stat` int DEFAULT NULL,
`p2_stat` int DEFAULT NULL,
PRIMARY KEY (`id_`),
KEY `ix__p1_id` (`p1_id`),
KEY `ix__p2_id` (`p2_id`),
KEY `ix__match_date` (`match_date`),
KEY `ix__comp` (`p1_id`, `p2_id`, `match_date`)
);
INSERT INTO `match_table`
VALUES
(
4760, 2, 1, '1998-05-25', 1190, 1486
),
(6713, 2, 1, '1999-01-18', 2084, 885),
(
9365, 2, 1, '1999-11-01', 2894, 2040
),
(
11456, 1, 2, '2000-05-15', 2358, 1491
),
(
13022, 1, 2, '2000-08-14', 2722, 2401
),
(
29159, 1, 2, '2002-08-26', 431, 2769
),
(
44915, 1, 2, '2002-10-07', 1904, 482
),
(
852666, 1, 2, '1997-01-01', 1301, 249
),
(
852842, 1, 2, '1997-01-13', 2837, 2441
),
(
853471, 2, 1, '1997-05-05', 1474, 952
);
答案1
得分: 1
以下是已翻译的代码部分:
首先,数据库中拥有正确的索引非常关键。您应该确保`id_`、`p1_id`、`p2_id`和`match_date`被索引,因为它们在您的查询中被使用。
其次,不要使用两个CTE(Common Table Expressions),直接在主查询中使用`LAG()`函数。这样可以避免重复行的需要,提高查询速度。以下是如何做到这一点的示例:
SELECT
m.*,
LAG(m.p1_stat) OVER (PARTITION BY m.p1_id ORDER BY m.match_date, m.id_) AS p1_prev_stat,
LAG(m.p2_stat) OVER (PARTITION BY m.p2_id ORDER BY m.match_date, m.id_) AS p2_prev_stat
FROM
test.match_table AS m
WHERE
m.id_ = 11456
ORDER BY
m.match_date;
此查询执行以下操作:
- 删除了联合操作,这会使行数翻倍。
- 跳过了两次连接原始表的步骤。
- 在主查询中直接应用`LAG()`函数以获取先前的统计信息。
但是,如果这不满足您的性能需求,考虑创建一个摘要表。该表将跟踪每场比赛中每个玩家的统计信息。每当添加新的比赛结果时,更新摘要表。这会增加存储需求并影响写入性能,但极大提高了读取性能。
<h4>根据评论更新</h4>
提高速度的一个潜在解决方案是仅考虑在您感兴趣的比赛之前发生的比赛。因此,不再查看每场比赛,而是只查看可能包含相关信息的比赛。可以尝试以下方法:
WITH selected_match AS (
SELECT * FROM match_table WHERE id_ = 11456
),
previous_matches AS (
SELECT
id_,
match_date,
p1_id AS player_id,
p1_stat AS stat
FROM
match_table
WHERE
match_date < (SELECT match_date FROM selected_match)
UNION ALL
SELECT
id_,
match_date,
p2_id AS player_id,
p2_stat AS stat
FROM
match_table
WHERE
match_date < (SELECT match_date FROM selected_match)
),
previous_stats AS (
SELECT
id_,
player_id,
LAG(stat) OVER (
PARTITION BY player_id
ORDER BY
match_date,
id_
) AS prev_stat
FROM
previous_matches
)
SELECT
m.*,
prev_stat_p1.prev_stat AS p1_prev_stat,
prev_stat_p2.prev_stat AS p2_prev_stat
FROM
selected_match AS m
JOIN previous_stats AS prev_stat_p1 ON prev_stat_p1.id_ = m.id_
AND prev_stat_p1.player_id = m.p1_id
JOIN previous_stats AS prev_stat_p2 ON prev_stat_p2.id_ = m.id_
AND prev_stat_p2.player_id = m.p2_id
ORDER BY m.match_date
根据您的要求,这是已经翻译好的代码部分,没有其他内容。
英文:
First, having the right indexes on your database is key. You should make sure that id_
, p1_id
, p2_id
, match_date
are indexed since they are used in your query.
Second, instead of using two CTEs (Common Table Expressions), use the LAG()
function directly in your main query. This removes the need to duplicate rows and can speed up your query. Here's how you can do it:
SELECT
m.*,
LAG(m.p1_stat) OVER (PARTITION BY m.p1_id ORDER BY m.match_date, m.id_) AS p1_prev_stat,
LAG(m.p2_stat) OVER (PARTITION BY m.p2_id ORDER BY m.match_date, m.id_) AS p2_prev_stat
FROM
test.match_table AS m
WHERE
m.id_ = 11456
ORDER BY
m.match_date;
This query does the following:
- Removes the union, which was doubling your row count.
- Skips the step of joining the original table twice.
- Applies the
LAG()
function directly in the main query to get previous
stats.
However, if this doesn't give you the performance you need, consider creating a summary table. This table would track each player's stats for each match. Whenever a new match result is added, update the summary table. This requires more storage and affects write performance, but greatly improves read performance.
<h4>Update based on comment</h4>
One potential solution to speed it up is to only consider matches that occurred before the match you're interested in. So, instead of looking at every single match, we only look at those that could contain relevant information. Something like this:
WITH selected_match AS (
SELECT * FROM match_table WHERE id_ = 11456
),
previous_matches AS (
SELECT
id_,
match_date,
p1_id AS player_id,
p1_stat AS stat
FROM
match_table
WHERE
match_date < (SELECT match_date FROM selected_match)
UNION ALL
SELECT
id_,
match_date,
p2_id AS player_id,
p2_stat AS stat
FROM
match_table
WHERE
match_date < (SELECT match_date FROM selected_match)
),
previous_stats AS (
SELECT
id_,
player_id,
LAG(stat) OVER (
PARTITION BY player_id
ORDER BY
match_date,
id_
) AS prev_stat
FROM
previous_matches
)
SELECT
m.*,
prev_stat_p1.prev_stat AS p1_prev_stat,
prev_stat_p2.prev_stat AS p2_prev_stat
FROM
selected_match AS m
JOIN previous_stats AS prev_stat_p1 ON prev_stat_p1.id_ = m.id_
AND prev_stat_p1.player_id = m.p1_id
JOIN previous_stats AS prev_stat_p2 ON prev_stat_p2.id_ = m.id_
AND prev_stat_p2.player_id = m.p2_id
ORDER BY m.match_date
Here, we first select the match of interest. Then, we only look at matches that happened before this match. We collect stats for these earlier matches, and then join this with the match of interest to get the previous stats for the players in the match.
This might not be a big improvement if most matches occurred before the match of interest. It could be more effective to create an index on match_date
, if one doesn't already exist.
<h4>Final Update based on comment</h4>
WITH selected_match AS (
SELECT * FROM match_table WHERE id_ = 11456
),
previous_matches AS (
SELECT
m.id_,
m.match_date,
m.p1_id AS player_id,
m.p1_stat AS stat
FROM
match_table AS m
JOIN selected_match AS sm ON sm.p1_id = m.p1_id OR sm.p1_id = m.p2_id
UNION ALL
SELECT
m.id_,
m.match_date,
m.p2_id AS player_id,
m.p2_stat AS stat
FROM
match_table AS m
JOIN selected_match AS sm ON sm.p2_id = m.p1_id OR sm.p2_id = m.p2_id
),
previous_stats AS (
SELECT
id_,
player_id,
LAG(stat) OVER (
PARTITION BY player_id
ORDER BY
match_date,
id_
) AS prev_stat
FROM
previous_matches
)
SELECT
m.*,
prev_stat_p1.prev_stat AS p1_prev_stat,
prev_stat_p2.prev_stat AS p2_prev_stat
FROM
selected_match AS m
JOIN previous_stats AS prev_stat_p1 ON prev_stat_p1.id_ = m.id_
AND prev_stat_p1.player_id = m.p1_id
JOIN previous_stats AS prev_stat_p2 ON prev_stat_p2.id_ = m.id_
AND prev_stat_p2.player_id = m.p2_id
ORDER BY m.match_date
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论