英文:
Normalize multiple records into a single record looking at the enrollment period
问题
我有一个雪花表,其中包括学生及其入学期间、开始日期和结束日期,如下所示:
学生ID | 开始日期 | 结束日期 |
---|---|---|
mid | 2023-01-01 | 2023-04-30 |
mid | 2022-12-01 | 2022-12-31 |
mid | 2022-11-01 | 2022-11-30 |
mid | 2022-02-01 | 2022-10-31 |
mid | 2022-10-01 | 2022-10-31 |
我需要将其规范化为一行,其中显示学生的入学日期从2022-02-01开始,结束于2023-04-30。要将其视为一行,必须有一个连续的入学期间。中间的任何间断都将创建两行,以此类推。
我尝试使用Lag和Lead函数,但无法弄清楚。任何帮助将不胜感激。
英文:
I have a snowflake table which consists of students and their enrollment period, startdate and enddate as below:
studentid | startdate | enddate |
---|---|---|
mid | 2023-01-01 | 2023-04-30 |
mid | 2022-12-01 | 2022-12-31 |
mid | 2022-11-01 | 2022-11-30 |
mid | 2022-02-01 | 2022-10-31 |
mid | 2022-10-01 | 2022-10-31 |
I need to normalize this into one single row, where it says student enrollment starts on 2022-02-01 and ends on 2023-04-30, There should be one continuous enrollment to consider it as one row. One Break in between will create two rows likewise more breaks more rows.
I tried with Lag Lead Functions but couldn't figure it out. Any help would be appreciated.
答案1
得分: 0
我不确定它是否涵盖了所有用例,但请尝试以下操作:
- 创建一个测试表并填充一些测试数据:
CREATE OR REPLACE TABLE enrollment_intervals (studentid int, startdate date, enddate date);
INSERT INTO enrollment_intervals
VALUES
(1, '2023-01-01', '2023-04-30'),
(1, '2022-12-01', '2022-12-31'),
(1, '2022-11-01', '2022-11-30'),
(1, '2022-02-01', '2022-10-31'),
(1, '2022-10-01', '2022-10-31'),
(2, '2023-01-01', '2023-02-01'),
(3, '2024-01-01', '2024-05-01');
- 实际查询:
WITH ordered_intervals AS (
SELECT studentid,
startdate,
enddate,
LEAD(startdate) OVER (PARTITION BY studentid ORDER BY startdate) next_startdate
FROM enrollment_intervals
ORDER BY studentid, startdate
),
merged_intervals AS (
SELECT studentid,
startdate,
COALESCE(
(CASE WHEN enddate < DATEADD(day, -1, next_startdate) THEN enddate ELSE NULL END),
MAX(enddate) OVER (PARTITION BY studentid ORDER BY startdate ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
) enddate
FROM ordered_intervals
),
merged_intervals_with_lag AS (
SELECT *,
LAG(enddate) OVER (PARTITION BY studentid ORDER BY startdate) as prev_enddate
FROM merged_intervals
),
grouped_intervals AS (
SELECT *,
SUM(CASE WHEN startdate > DATEADD(day, 1, prev_enddate) THEN 1 ELSE 0 END)
OVER (PARTITION BY studentid ORDER BY startdate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as group_id
FROM merged_intervals_with_lag
)
SELECT studentid,
MIN(startdate) as continuous_start,
MAX(enddate) as continuous_end
FROM grouped_intervals
GROUP BY studentid, group_id
ORDER BY studentid, continuous_start;
这里的想法是使用公共表达式(CTE)来合并时间段,检查是否没有间隙,并确保不会遗漏重叠和包含。希望能帮助您。
英文:
I am not sure if it covers all use-cases but try this:
- Create a test table and fill it with some test data:
CREATE OR REPLACE TABLE enrollment_intervals (studentid int, startdate date, enddate date);
INSERT INTO enrollment_intervals
VALUES
(1, '2023-01-01', '2023-04-30'),
(1, '2022-12-01', '2022-12-31'),
(1, '2022-11-01', '2022-11-30'),
(1, '2022-02-01', '2022-10-31'),
(1, '2022-10-01', '2022-10-31'),
(2, '2023-01-01', '2023-02-01'),
(3, '2024-01-01', '2024-05-01');
- Actual query:
WITH ordered_intervals AS (
SELECT studentid,
startdate,
enddate,
LEAD(startdate) OVER (PARTITION BY studentid ORDER BY startdate) next_startdate
FROM enrollment_intervals
ORDER BY studentid, startdate
),
merged_intervals AS (
SELECT studentid,
startdate,
COALESCE(
(CASE WHEN enddate < DATEADD(day, -1, next_startdate) THEN enddate ELSE NULL END),
MAX(enddate) OVER (PARTITION BY studentid ORDER BY startdate ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
) enddate
FROM ordered_intervals
),
merged_intervals_with_lag AS (
SELECT *,
LAG(enddate) OVER (PARTITION BY studentid ORDER BY startdate) as prev_enddate
FROM merged_intervals
),
grouped_intervals AS (
SELECT *,
SUM(CASE WHEN startdate > DATEADD(day, 1, prev_enddate) THEN 1 ELSE 0 END)
OVER (PARTITION BY studentid ORDER BY startdate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as group_id
FROM merged_intervals_with_lag
)
SELECT studentid,
MIN(startdate) as continuous_start,
MAX(enddate) as continuous_end
FROM grouped_intervals
GROUP BY studentid, group_id
ORDER BY studentid, continuous_start;
The idea here that you use CTE to merge intervals checking if there are no gaps and ensure that you are not missing overlaps and inclusions. I hope it helps.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论