将多个记录归一化为一个记录,查看报名期。

huangapple go评论62阅读模式
英文:

Normalize multiple records into a single record looking at the enrollment period

问题

我有一个雪花表,其中包括学生及其入学期间、开始日期和结束日期,如下所示:

学生ID 开始日期 结束日期
mid 2023-01-01 2023-04-30
mid 2022-12-01 2022-12-31
mid 2022-11-01 2022-11-30
mid 2022-02-01 2022-10-31
mid 2022-10-01 2022-10-31

我需要将其规范化为一行,其中显示学生的入学日期从2022-02-01开始,结束于2023-04-30。要将其视为一行,必须有一个连续的入学期间。中间的任何间断都将创建两行,以此类推。

我尝试使用Lag和Lead函数,但无法弄清楚。任何帮助将不胜感激。

英文:

I have a snowflake table which consists of students and their enrollment period, startdate and enddate as below:

studentid startdate enddate
mid 2023-01-01 2023-04-30
mid 2022-12-01 2022-12-31
mid 2022-11-01 2022-11-30
mid 2022-02-01 2022-10-31
mid 2022-10-01 2022-10-31

I need to normalize this into one single row, where it says student enrollment starts on 2022-02-01 and ends on 2023-04-30, There should be one continuous enrollment to consider it as one row. One Break in between will create two rows likewise more breaks more rows.

I tried with Lag Lead Functions but couldn't figure it out. Any help would be appreciated.

答案1

得分: 0

我不确定它是否涵盖了所有用例,但请尝试以下操作:

  1. 创建一个测试表并填充一些测试数据:
CREATE OR REPLACE TABLE enrollment_intervals (studentid int, startdate date, enddate date);

INSERT INTO enrollment_intervals  
VALUES
    (1, '2023-01-01', '2023-04-30'),
    (1, '2022-12-01', '2022-12-31'),
    (1, '2022-11-01', '2022-11-30'),
    (1, '2022-02-01', '2022-10-31'),
    (1, '2022-10-01', '2022-10-31'),
    (2, '2023-01-01', '2023-02-01'),
    (3, '2024-01-01', '2024-05-01');
  1. 实际查询:
WITH ordered_intervals AS (
    SELECT studentid,
           startdate,
           enddate,
           LEAD(startdate) OVER (PARTITION BY studentid ORDER BY startdate) next_startdate
    FROM enrollment_intervals
    ORDER BY studentid, startdate
),
merged_intervals AS (
    SELECT studentid,
           startdate,
           COALESCE(
               (CASE WHEN enddate < DATEADD(day, -1, next_startdate) THEN enddate ELSE NULL END),
               MAX(enddate) OVER (PARTITION BY studentid ORDER BY startdate ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
           ) enddate
    FROM ordered_intervals
),
merged_intervals_with_lag AS (
    SELECT *,
           LAG(enddate) OVER (PARTITION BY studentid ORDER BY startdate) as prev_enddate
    FROM merged_intervals
),
grouped_intervals AS (
    SELECT *,
           SUM(CASE WHEN startdate > DATEADD(day, 1, prev_enddate) THEN 1 ELSE 0 END)
           OVER (PARTITION BY studentid ORDER BY startdate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as group_id
    FROM merged_intervals_with_lag
)
SELECT studentid,
       MIN(startdate) as continuous_start,
       MAX(enddate) as continuous_end
FROM grouped_intervals
GROUP BY studentid, group_id
ORDER BY studentid, continuous_start;

这里的想法是使用公共表达式(CTE)来合并时间段,检查是否没有间隙,并确保不会遗漏重叠和包含。希望能帮助您。

英文:

I am not sure if it covers all use-cases but try this:

  1. Create a test table and fill it with some test data:
CREATE OR REPLACE TABLE enrollment_intervals (studentid int, startdate date, enddate date);

INSERT INTO enrollment_intervals  
VALUES
    (1, &#39;2023-01-01&#39;, &#39;2023-04-30&#39;),
    (1, &#39;2022-12-01&#39;, &#39;2022-12-31&#39;),
    (1, &#39;2022-11-01&#39;, &#39;2022-11-30&#39;),
    (1, &#39;2022-02-01&#39;, &#39;2022-10-31&#39;),
    (1, &#39;2022-10-01&#39;, &#39;2022-10-31&#39;),
    (2, &#39;2023-01-01&#39;, &#39;2023-02-01&#39;),
    (3, &#39;2024-01-01&#39;, &#39;2024-05-01&#39;);

  1. Actual query:
WITH ordered_intervals AS (
    SELECT studentid,
           startdate,
           enddate,
           LEAD(startdate) OVER (PARTITION BY studentid ORDER BY startdate) next_startdate
    FROM enrollment_intervals
    ORDER BY studentid, startdate
),
merged_intervals AS (
    SELECT studentid,
           startdate,
           COALESCE(
               (CASE WHEN enddate &lt; DATEADD(day, -1, next_startdate) THEN enddate ELSE NULL END),
               MAX(enddate) OVER (PARTITION BY studentid ORDER BY startdate ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
           ) enddate
    FROM ordered_intervals
),
merged_intervals_with_lag AS (
    SELECT *,
           LAG(enddate) OVER (PARTITION BY studentid ORDER BY startdate) as prev_enddate
    FROM merged_intervals
),
grouped_intervals AS (
    SELECT *,
           SUM(CASE WHEN startdate &gt; DATEADD(day, 1, prev_enddate) THEN 1 ELSE 0 END)
           OVER (PARTITION BY studentid ORDER BY startdate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as group_id
    FROM merged_intervals_with_lag
)
SELECT studentid,
       MIN(startdate) as continuous_start,
       MAX(enddate) as continuous_end
FROM grouped_intervals
GROUP BY studentid, group_id
ORDER BY studentid, continuous_start;

The idea here that you use CTE to merge intervals checking if there are no gaps and ensure that you are not missing overlaps and inclusions. I hope it helps.

huangapple
  • 本文由 发表于 2023年7月24日 18:59:55
  • 转载请务必保留本文链接:https://go.coder-hub.com/76753814.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定