SQL: Finding Out All Possible Combinations of Years Per ID per Group

Question

I am working with Netezza SQL.

I have the following dataset ("my_table") about students (over the years 2010-2015): the degree major they are currently enrolled in, the year in which an exam was taken, and the exam result:

    student_id current_major year exam_result
1            1       Science 2010           0
2            1          Arts 2013           1
3            1          Arts 2013           0
4            2       Science 2010           1
5            2          Arts 2011           1
6            2       Science 2013           1
7            3          Arts 2010           1
8            3          Arts 2015           1
9            4          Arts 2010           0
10           4       Science 2013           1
11           5          Arts 2010           0
12           5          Arts 2011           0
13           5       Science 2012           1

My Question: I want to find out whether the degree major a student started with affects how many years the student stayed at the university. (The period of analysis is 2010 to 2015 - a student could join the university in any year in that range.)

To answer this question, I first wanted to make a table that shows the number of students that attended the college for each combination of years:

select year_2010,
       year_2011,
       year_2012,
       year_2013,
       year_2014,
       year_2015,
       count(*)
from
(
    -- one row per student; each flag is 1 if the student has a record in that year
    select student_id,
           max(case when (year = 2010) then 1 else 0 end) as year_2010,
           max(case when (year = 2011) then 1 else 0 end) as year_2011,
           max(case when (year = 2012) then 1 else 0 end) as year_2012,
           max(case when (year = 2013) then 1 else 0 end) as year_2013,
           max(case when (year = 2014) then 1 else 0 end) as year_2014,
           max(case when (year = 2015) then 1 else 0 end) as year_2015
    from my_table
    group by student_id
) a
group by year_2010,
         year_2011,
         year_2012,
         year_2013,
         year_2014,
         year_2015;


   year_2010 year_2011 year_2012 year_2013 year_2014 year_2015 count_star()
1          0         1         0         1         0         0           24
2          0         0         0         0         1         1           17
3          0         0         1         1         0         0           22
4          1         1         1         0         0         0           12
5          0         1         1         1         1         0           12
6          1         0         0         0         1         0           23
7          0         0         1         0         0         0           49
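As a sanity check, the same attendance-pattern query can be run against the five sample students listed earlier, where every pattern can be verified by hand. The sketch below is only an illustration: it uses Python's standard-library sqlite3 module as a stand-in for Netezza (an assumption), and the names `PATTERN_SQL`, `patterns` and `n_students` are made up for this example. The counts it produces will not match the table above, which comes from the full dataset.

```python
import sqlite3

# Sample rows from the question: (student_id, current_major, year, exam_result).
ROWS = [
    (1, "Science", 2010, 0), (1, "Arts", 2013, 1), (1, "Arts", 2013, 0),
    (2, "Science", 2010, 1), (2, "Arts", 2011, 1), (2, "Science", 2013, 1),
    (3, "Arts", 2010, 1), (3, "Arts", 2015, 1),
    (4, "Arts", 2010, 0), (4, "Science", 2013, 1),
    (5, "Arts", 2010, 0), (5, "Arts", 2011, 0), (5, "Science", 2012, 1),
]

# Assumption: an in-memory SQLite database stands in for Netezza.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table "
    "(student_id INT, current_major TEXT, year INT, exam_result INT)"
)
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?)", ROWS)

# Inner query: one row of 0/1 flags per student.
# Outer query: number of students sharing each flag pattern.
PATTERN_SQL = """
SELECT year_2010, year_2011, year_2012, year_2013, year_2014, year_2015,
       COUNT(*) AS n_students
FROM (
    SELECT student_id,
           MAX(CASE WHEN year = 2010 THEN 1 ELSE 0 END) AS year_2010,
           MAX(CASE WHEN year = 2011 THEN 1 ELSE 0 END) AS year_2011,
           MAX(CASE WHEN year = 2012 THEN 1 ELSE 0 END) AS year_2012,
           MAX(CASE WHEN year = 2013 THEN 1 ELSE 0 END) AS year_2013,
           MAX(CASE WHEN year = 2014 THEN 1 ELSE 0 END) AS year_2014,
           MAX(CASE WHEN year = 2015 THEN 1 ELSE 0 END) AS year_2015
    FROM my_table
    GROUP BY student_id
) a
GROUP BY year_2010, year_2011, year_2012, year_2013, year_2014, year_2015
ORDER BY n_students DESC
"""

patterns = conn.execute(PATTERN_SQL).fetchall()
for row in patterns:
    print(row)
```

In the sample, students 1 and 4 both have records in 2010 and 2013 only, so they should share one pattern, while students 2, 3 and 5 each have a pattern of their own.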

My Problem: Now, I want to "group" this table by the earliest available major for each student (i.e. what the student was studying in their first recorded year at the university). For the sake of this problem, let's assume that a student cannot change their major in the first year and must wait until at least the second year to do so. From the second year onwards, however, a student can switch majors multiple times within the same year.

I would like to answer questions such as:

  • Of the students who enrolled in the university in 2010 and initially started studying sciences - how many of these students studied 5 consecutive years?

  • Of the students who enrolled in the university in 2011 and initially started studying arts - how many of these students studied at least 2 years between 2011 and 2015?
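Questions of this kind reduce to a per-student summary: first year, first major, number of distinct years with a record, and the length of the unbroken run of years starting at enrolment. The sketch below is one possible way to build that summary, not a definitive method. It assumes SQLite (via Python's sqlite3) as a stand-in for Netezza and uses a "gaps and islands" trick: within a run of consecutive years, `year - ROW_NUMBER()` is constant. The names `summary`, `count_students`, `initial_run` and `years_attended` are hypothetical.

```python
import sqlite3

# Sample rows from the question: (student_id, current_major, year, exam_result).
ROWS = [
    (1, "Science", 2010, 0), (1, "Arts", 2013, 1), (1, "Arts", 2013, 0),
    (2, "Science", 2010, 1), (2, "Arts", 2011, 1), (2, "Science", 2013, 1),
    (3, "Arts", 2010, 1), (3, "Arts", 2015, 1),
    (4, "Arts", 2010, 0), (4, "Science", 2013, 1),
    (5, "Arts", 2010, 0), (5, "Arts", 2011, 0), (5, "Science", 2012, 1),
]

# Assumption: an in-memory SQLite database stands in for Netezza.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table "
    "(student_id INT, current_major TEXT, year INT, exam_result INT)"
)
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?)", ROWS)

SUMMARY_SQL = """
WITH years AS (              -- one row per student and year attended
    SELECT DISTINCT student_id, year FROM my_table
),
attended AS (                -- number of distinct years with a record
    SELECT student_id, COUNT(*) AS years_attended
    FROM years
    GROUP BY student_id
),
islands AS (                 -- consecutive years share the same grp value
    SELECT student_id, year,
           year - ROW_NUMBER() OVER (PARTITION BY student_id ORDER BY year) AS grp
    FROM years
),
runs AS (                    -- start and length of every consecutive run
    SELECT student_id, MIN(year) AS run_start, COUNT(*) AS run_length
    FROM islands
    GROUP BY student_id, grp
),
first_rows AS (              -- major recorded in the first year
    SELECT student_id, current_major AS first_major, year AS first_year
    FROM (
        SELECT student_id, current_major, year,
               ROW_NUMBER() OVER (PARTITION BY student_id ORDER BY year) AS rn
        FROM my_table
    ) ranked
    WHERE rn = 1
)
SELECT f.student_id, f.first_major, f.first_year,
       a.years_attended, r.run_length AS initial_run
FROM first_rows f
JOIN attended a ON a.student_id = f.student_id
JOIN runs r ON r.student_id = f.student_id AND r.run_start = f.first_year
ORDER BY f.student_id
"""

summary = conn.execute(SUMMARY_SQL).fetchall()


def count_students(first_year, first_major, min_run=1, min_years=1):
    """Count students by entry year and first major, with minimum thresholds."""
    return sum(
        1
        for _, major, year, years_attended, initial_run in summary
        if year == first_year and major == first_major
        and initial_run >= min_run and years_attended >= min_years
    )


# Entered in 2010 with Science and studied 5 consecutive years from enrolment.
print(count_students(2010, "Science", min_run=5))
# Entered in 2010 with Arts and has records in at least 2 distinct years.
print(count_students(2010, "Arts", min_years=2))
```

All five sample students enrolled in 2010, so the second call uses 2010 rather than the 2011 of the question; on the real table the arguments would simply change.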

I think that such questions can be answered by:

  • Step 1: Using a window function (ROW_NUMBER with PARTITION BY) in a CTE to find each student's earliest year and the major recorded in that year
  • Step 2: Using the existing query
  • Step 3: Joining the results from Step 1 and Step 2 together

Here is my attempt to do this:

WITH earliest_major AS (
    SELECT student_id, current_major AS earliest_major
    FROM (
        SELECT student_id, current_major, year,
        ROW_NUMBER() OVER (PARTITION BY student_id ORDER BY year) AS rn
        FROM my_table
    ) sub
    WHERE rn = 1
)
SELECT em.earliest_major,
       year_2010,
       year_2011,
       year_2012,
       year_2013,
       year_2014,
       year_2015, 
       COUNT(*) 
FROM (
    SELECT student_id,
           MAX(CASE WHEN (year = 2010) THEN 1 ELSE 0 END) AS year_2010,
           MAX(CASE WHEN (year = 2011) THEN 1 ELSE 0 END) AS year_2011,
           MAX(CASE WHEN (year = 2012) THEN 1 ELSE 0 END) AS year_2012,
           MAX(CASE WHEN (year = 2013) THEN 1 ELSE 0 END) AS year_2013,
           MAX(CASE WHEN (year = 2014) THEN 1 ELSE 0 END) AS year_2014,
           MAX(CASE WHEN (year = 2015) THEN 1 ELSE 0 END) AS year_2015
    FROM my_table
    GROUP BY student_id
) a
JOIN earliest_major em ON a.student_id = em.student_id
GROUP BY em.earliest_major, 
         year_2010, 
         year_2011, 
         year_2012, 
         year_2013, 
         year_2014, 
         year_2015;

The query seems to run and produce results in the desired format:

  earliest_major year_2010 year_2011 year_2012 year_2013 year_2014 year_2015 count_star()
1         Science         0         1         0         1         0         0           15
2            Arts         0         0         0         0         1         1           11
3            Arts         0         0         1         1         0         0           13
4            Arts         1         1         1         0         0         0            8
5         Science         0         1         1         1         1         0            7
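One way to gain confidence in a query like this is to run it on the five sample students, where the expected result can be worked out by hand, and to check the assumption it silently depends on: `ROW_NUMBER() ... ORDER BY year` picks an arbitrary row if a student has more than one major in their first year. The sketch below does both. It assumes SQLite (through Python's sqlite3) as a stand-in for Netezza, and `ATTEMPT_SQL`, `TIE_CHECK_SQL`, `expected`, `actual` and `ambiguous` are illustrative names only.

```python
import sqlite3

# Sample rows from the question: (student_id, current_major, year, exam_result).
ROWS = [
    (1, "Science", 2010, 0), (1, "Arts", 2013, 1), (1, "Arts", 2013, 0),
    (2, "Science", 2010, 1), (2, "Arts", 2011, 1), (2, "Science", 2013, 1),
    (3, "Arts", 2010, 1), (3, "Arts", 2015, 1),
    (4, "Arts", 2010, 0), (4, "Science", 2013, 1),
    (5, "Arts", 2010, 0), (5, "Arts", 2011, 0), (5, "Science", 2012, 1),
]

# Assumption: an in-memory SQLite database stands in for Netezza.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table "
    "(student_id INT, current_major TEXT, year INT, exam_result INT)"
)
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?)", ROWS)

# The query from the question, unchanged apart from layout.
ATTEMPT_SQL = """
WITH earliest_major AS (
    SELECT student_id, current_major AS earliest_major
    FROM (
        SELECT student_id, current_major, year,
               ROW_NUMBER() OVER (PARTITION BY student_id ORDER BY year) AS rn
        FROM my_table
    ) sub
    WHERE rn = 1
)
SELECT em.earliest_major,
       year_2010, year_2011, year_2012, year_2013, year_2014, year_2015,
       COUNT(*)
FROM (
    SELECT student_id,
           MAX(CASE WHEN year = 2010 THEN 1 ELSE 0 END) AS year_2010,
           MAX(CASE WHEN year = 2011 THEN 1 ELSE 0 END) AS year_2011,
           MAX(CASE WHEN year = 2012 THEN 1 ELSE 0 END) AS year_2012,
           MAX(CASE WHEN year = 2013 THEN 1 ELSE 0 END) AS year_2013,
           MAX(CASE WHEN year = 2014 THEN 1 ELSE 0 END) AS year_2014,
           MAX(CASE WHEN year = 2015 THEN 1 ELSE 0 END) AS year_2015
    FROM my_table
    GROUP BY student_id
) a
JOIN earliest_major em ON a.student_id = em.student_id
GROUP BY em.earliest_major,
         year_2010, year_2011, year_2012, year_2013, year_2014, year_2015
"""

# Students whose first recorded year contains more than one distinct major;
# for these, rn = 1 would be an arbitrary choice.
TIE_CHECK_SQL = """
SELECT t.student_id
FROM my_table t
JOIN (
    SELECT student_id, MIN(year) AS first_year
    FROM my_table
    GROUP BY student_id
) f ON f.student_id = t.student_id AND f.first_year = t.year
GROUP BY t.student_id
HAVING COUNT(DISTINCT t.current_major) > 1
"""

# Hand-computed from the sample: (first major, six year flags) -> students.
expected = {
    ("Science", 1, 0, 0, 1, 0, 0): 1,  # student 1
    ("Science", 1, 1, 0, 1, 0, 0): 1,  # student 2
    ("Arts", 1, 0, 0, 0, 0, 1): 1,     # student 3
    ("Arts", 1, 0, 0, 1, 0, 0): 1,     # student 4
    ("Arts", 1, 1, 1, 0, 0, 0): 1,     # student 5
}

actual = {row[:-1]: row[-1] for row in conn.execute(ATTEMPT_SQL)}
ambiguous = conn.execute(TIE_CHECK_SQL).fetchall()

print("matches hand-computed result:", actual == expected)
print("students with an ambiguous first major:", ambiguous)
```

If the tie check returns any rows on the real table, the "no major change in the first year" assumption does not hold for those students and the `ORDER BY` in the window would need an extra tie-breaking column.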

But I am not sure if my logic is correct - can someone please help me with this?

Thanks!

Answer 1

Score: 1

I think you should drop your wide format (one column per year) and your CASE expressions. You effectively want a histogram. It's preferable for this to be in "long" format; the application layer can reformat that if necessary.

create table major_exams(
  student_id int not null,
  current_major text not null,
  year smallint not null check (year between 1900 and 2200),
  exam_result boolean not null
);

insert into major_exams(student_id, current_major, year, exam_result) values
( 1, 'Science', 2010, false),
( 1,    'Arts', 2013, true),
( 1,    'Arts', 2013, false),
( 2, 'Science', 2010, true),
( 2,    'Arts', 2011, true),
( 2, 'Science', 2013, true),
( 3,    'Arts', 2010, true),
( 3,    'Arts', 2015, true),
( 4,    'Arts', 2010, false),
( 4, 'Science', 2013, true),
( 5,    'Arts', 2010, false),
( 5,    'Arts', 2011, false),
( 5, 'Science', 2012, true);

select major, duration, count(*) as n
from (
    select first_majors.major,
        last_majors.year - first_majors.year + 1 as duration
    from (
        select distinct on (student_id) student_id, current_major as major, year
        from major_exams
        order by student_id, year
    ) first_majors
    join (
        select student_id, max(year) as year
        from major_exams
        group by student_id
    ) last_majors on last_majors.student_id = first_majors.student_id
) major_bounds
group by major, duration
order by major, duration;
major    duration  n
Arts     3         1
Arts     4         1
Arts     6         1
Science  4         2
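The query above relies on `distinct on`, which is a PostgreSQL extension and may not be available on Netezza. A possible alternative, sketched here under that assumption, gets the same histogram with the window functions the question already uses: `ROW_NUMBER()` picks each student's first row and `MAX(year) OVER (...)` supplies the last year. The sketch runs on SQLite through Python's sqlite3 purely so it can be executed anywhere; `HISTOGRAM_SQL` and `histogram` are illustrative names.

```python
import sqlite3

# Sample rows from the answer: (student_id, current_major, year, exam_result).
ROWS = [
    (1, "Science", 2010, 0), (1, "Arts", 2013, 1), (1, "Arts", 2013, 0),
    (2, "Science", 2010, 1), (2, "Arts", 2011, 1), (2, "Science", 2013, 1),
    (3, "Arts", 2010, 1), (3, "Arts", 2015, 1),
    (4, "Arts", 2010, 0), (4, "Science", 2013, 1),
    (5, "Arts", 2010, 0), (5, "Arts", 2011, 0), (5, "Science", 2012, 1),
]

# Assumption: an in-memory SQLite database stands in for the real server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE major_exams "
    "(student_id INT, current_major TEXT, year INT, exam_result INT)"
)
conn.executemany("INSERT INTO major_exams VALUES (?, ?, ?, ?)", ROWS)

# On the rn = 1 row, "year" is the student's first year, so
# MAX(year) OVER (...) - year + 1 is the span from first to last year.
HISTOGRAM_SQL = """
SELECT major, duration, COUNT(*) AS n
FROM (
    SELECT student_id,
           current_major AS major,
           MAX(year) OVER (PARTITION BY student_id) - year + 1 AS duration,
           ROW_NUMBER() OVER (PARTITION BY student_id ORDER BY year) AS rn
    FROM major_exams
) ranked
WHERE rn = 1
GROUP BY major, duration
ORDER BY major, duration
"""

histogram = conn.execute(HISTOGRAM_SQL).fetchall()
for major, duration, n in histogram:
    print(major, duration, n)
```

Note that `duration` here, as in the query above, is the span between the first and last recorded year, so years without any record in between are counted as well. If only years with a record should count, a `COUNT(DISTINCT year)` per student would be the measure to use instead.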

huangapple
  • Posted on July 18, 2023 at 07:28:33
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76708678.html