SQL:学习如何在SQL中使用百分位函数

huangapple go评论64阅读模式
英文:

SQL: Learning How to Use The Percentile Functions in SQL

问题

I understand that you want a translation of the code and problem statement into Chinese. Here's the translation:

我正在使用 **Netezza SQL**。

我有以下表格:

    CREATE TABLE MY_TABLE (
        country VARCHAR(50),
        gender CHAR(1),
        age INTEGER,
        height FLOAT,
        owns_bicycle VARCHAR(3)
    );
    
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES ('UK', 'M', 25, 175.99, 'No');
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES ('UK', 'F', 63, 163.65, 'No');
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES ('UK', 'F', 83, 166.01, 'Yes');
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES ('Canada', 'M', 94, 178.92, 'Yes');
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES ('UK', 'M', 63, 173.24, 'Yes');
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES ('Canada', 'M', 87, 156.27, 'Yes');
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES ('Canada', 'F', 89, 159.26, 'No');
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES ('Canada', 'M', 61, 179.14, 'Yes');
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES ('UK', 'F', 49, 167.55, 'No');
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES ('USA', 'F', 53, 172.82, 'No');
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES ('Canada', 'F', 62, 161.18, 'No');
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES ('Canada', 'F', 31, 173.08, 'No');
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES ('USA', 'F', 33, 166.13, 'Yes');
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES ('Canada', 'F', 50, 177.42, 'No');
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES ('UK', 'F', 57, 155.56, 'No');
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES ('USA', 'M', 57, 158.47, 'Yes');
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES ('UK', 'F', 27, 152.85, 'No');
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES ('UK', 'F', 89, 156.31, 'Yes');
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES ('USA', 'F', 99, 179.31, 'No');

**我的问题:**

- 首先,我想按身高值将身高分为3个相等大小的组(例如0%-33%,33%-66%,66%-99%)。
- 接下来,我想按年龄值将年龄分为5个相等大小的组(例如0%-20%,20%-40%等)。
- 然后,对于每个独特的国家、性别、年龄组和身高组的组合,我想找出拥有自行车的人的比例。

最终结果应该如下所示(即每个人只能分配到一个最终子集):

      country gender height_group    age_group   bicycle_proportion counts
      <chr>   <chr>  <fct>           <fct>                    <dbl>  <int>
    1 Canada  F      150.84 - 158.49 18 - 31                  0          2
    2 Canada  F      150.84 - 158.49 31 - 45.2                0.333      3
    3 Canada  F      150.84 - 158.49 62.4 - 78.4              0          2
    4 Canada  F      150.84 - 158.49 78.4 - 99                0          1
    5 Canada  F      158.49 - 169.33 18 - 31                  0          1
    6 Canada  F      158.49 - 169.33 31 - 45.2                1          1

**我知道如何使用 R 编程语言来做到这一点:**

    library(dplyr)

    set.seed(123)
    n <- 100
    country <- sample(c("USA", "Canada", "UK"), n, replace = TRUE)
    gender <- sample(c("M", "F"), n, replace = TRUE)
    age <- sample(18:100, n, replace = TRUE)
    height <- runif(n, min = 150, max = 180)
    owns_bicycle <- sample(c("Yes", "No"), n, replace = TRUE)

    df <- data.frame(country, gender, age, height, owns_bicycle)

    height_breaks <- quantile(df$height, probs = seq(0, 1, by = 1/3))
    age_breaks <- quantile(df$age, probs = seq(0, 1, by = 1/5))

    height_breaks <- round(height_breaks, 2)

    height_labels <- paste0(height_breaks[-length(height_breaks)], " - ", height_breaks[-1])
    age_labels <- paste0(age_breaks[-length(age_breaks)], " - ", age_breaks[-1])

    df$height_group <- cut(df$height, breaks = height_breaks, labels = height_labels, include.lowest = TRUE)
    df$age_group <- cut(df$age, breaks = age_breaks, labels = age_labels, include.lowest = TRUE)

    final = df %>%
      group_by(country, gender, height_group, age_group) %>%
      summarise(bicycle_proportion = mean(owns_bicycle == "Yes"),
                counts = n())

**现在,我正在尝试将其转换

<details>
<summary>英文:</summary>

I am working with **Netezza SQL.**

I have the following table:

    CREATE TABLE MY_TABLE (
        country VARCHAR(50),
        gender CHAR(1),
        age INTEGER,
        height FLOAT,
        owns_bicycle VARCHAR(3)
    );
    
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES (&#39;UK&#39;, &#39;M&#39;, 25, 175.99, &#39;No&#39;);
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES (&#39;UK&#39;, &#39;F&#39;, 63, 163.65, &#39;No&#39;);
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES (&#39;UK&#39;, &#39;F&#39;, 83, 166.01, &#39;Yes&#39;);
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES (&#39;Canada&#39;, &#39;M&#39;, 94, 178.92, &#39;Yes&#39;);
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES (&#39;UK&#39;, &#39;M&#39;, 63, 173.24, &#39;Yes&#39;);
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES (&#39;Canada&#39;, &#39;M&#39;, 87, 156.27, &#39;Yes&#39;);
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES (&#39;Canada&#39;, &#39;F&#39;, 89, 159.26, &#39;No&#39;);
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES (&#39;Canada&#39;, &#39;M&#39;, 61, 179.14, &#39;Yes&#39;);
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES (&#39;UK&#39;, &#39;F&#39;, 49, 167.55, &#39;No&#39;);
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES (&#39;USA&#39;, &#39;F&#39;, 53, 172.82, &#39;No&#39;);
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES (&#39;Canada&#39;, &#39;F&#39;, 62, 161.18, &#39;No&#39;);
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES (&#39;Canada&#39;, &#39;F&#39;, 31, 173.08, &#39;No&#39;);
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES (&#39;USA&#39;, &#39;F&#39;, 33, 166.13, &#39;Yes&#39;);
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES (&#39;Canada&#39;, &#39;F&#39;, 50, 177.42, &#39;No&#39;);
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES (&#39;UK&#39;, &#39;F&#39;, 57, 155.56, &#39;No&#39;);
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES (&#39;USA&#39;, &#39;M&#39;, 57, 158.47, &#39;Yes&#39;);
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES (&#39;UK&#39;, &#39;F&#39;, 27, 152.85, &#39;No&#39;);
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES (&#39;UK&#39;, &#39;F&#39;, 89, 156.31, &#39;Yes&#39;);
    INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
    VALUES (&#39;USA&#39;, &#39;F&#39;, 99, 179.31, &#39;No&#39;);



**My Problem:**

- First, I want to break height into 3 equal sized groups by value of their height (e.g. 0%-33%, 33%-66%,66%-99%). 
- Next, I want to break age into 5 equal sized groups by value of their age (e.g. 0%-20%, 20%-40%, etc.)
- Then, for each unique combination of country, gender, age_group and height_group, I want to find out the percent of who own a bicycle.

The final result should look something like this (i.e. each person should only be assigned to one final subset):

      country gender height_group    age_group   bicycle_proportion counts
      &lt;chr&gt;   &lt;chr&gt;  &lt;fct&gt;           &lt;fct&gt;                    &lt;dbl&gt;  &lt;int&gt;
    1 Canada  F      150.84 - 158.49 18 - 31                  0          2
    2 Canada  F      150.84 - 158.49 31 - 45.2                0.333      3
    3 Canada  F      150.84 - 158.49 62.4 - 78.4              0          2
    4 Canada  F      150.84 - 158.49 78.4 - 99                0          1
    5 Canada  F      158.49 - 169.33 18 - 31                  0          1
    6 Canada  F      158.49 - 169.33 31 - 45.2                1          1

**I know how to do this using the  R programming language:**

    library(dplyr)



    set.seed(123)
    n &lt;- 100
    country &lt;- sample(c(&quot;USA&quot;, &quot;Canada&quot;, &quot;UK&quot;), n, replace = TRUE)
    gender &lt;- sample(c(&quot;M&quot;, &quot;F&quot;), n, replace = TRUE)
    age &lt;- sample(18:100, n, replace = TRUE)
    height &lt;- runif(n, min = 150, max = 180)
    owns_bicycle &lt;- sample(c(&quot;Yes&quot;, &quot;No&quot;), n, replace = TRUE)

    df &lt;- data.frame(country, gender, age, height, owns_bicycle)

    
    height_breaks &lt;- quantile(df$height, probs = seq(0, 1, by = 1/3))
    age_breaks &lt;- quantile(df$age, probs = seq(0, 1, by = 1/5))
    
    height_breaks &lt;- round(height_breaks, 2)
    
    height_labels &lt;- paste0(height_breaks[-length(height_breaks)], &quot; - &quot;, height_breaks[-1])
    age_labels &lt;- paste0(age_breaks[-length(age_breaks)], &quot; - &quot;, age_breaks[-1])
    
    
    df$height_group &lt;- cut(df$height, breaks = height_breaks, labels = height_labels, include.lowest = TRUE)
    df$age_group &lt;- cut(df$age, breaks = age_breaks, labels = age_labels, include.lowest = TRUE)
    
    final = df %&gt;%
      group_by(country, gender, height_group, age_group) %&gt;%
      summarise(bicycle_proportion = mean(owns_bicycle == &quot;Yes&quot;),
                counts = n())


**Now, I am trying to convert this into Netezza SQL.** 

I am not sure how to do this:

    CREATE TABLE height_groups AS
      SELECT
        NTILE(3) OVER (ORDER BY height) AS height_group,
        MIN(height) AS min_height,
        MAX(height) AS max_height
      FROM MY_TABLE;
    
    CREATE TABLE age_groups AS
      SELECT
        NTILE(5) OVER (ORDER BY age) AS age_group,
        MIN(age) AS min_age,
        MAX(age) AS max_age
      FROM MY_TABLE;

**But I don&#39;t think I am doing this correctly.
Can someone please show me how to do this?**



Thanks!


</details>


# 答案1
**得分**: 2

这是一段代码示例,用于实现与您的R代码等效的功能。它使用`percentile_cont`来计算分位数之间的两个边界值,然后可以在条件表达式中使用这些值来确定身高或年龄标签。"交叉连接"只是将这些值添加到每个表行,以便在条件表达式中更容易使用。(这在下面引用的fiddle中进一步显示。)

SELECT
t.country
, t.gender
, CASE WHEN t.height <= height_quantiles.q1 THEN 'short'
WHEN t.height <= height_quantiles.q2 THEN 'medium'
ELSE 'tall'
END AS height_group
, CASE WHEN t.age <= age_quantiles.q1 THEN 'young'
WHEN t.age <= age_quantiles.q2 THEN 'middle_age'
ELSE 'old'
END AS age_group
, AVG(CASE WHEN t.owns_bicycle = 'Yes' THEN 1 ELSE 0 END) AS bicycle_proportion
, COUNT(*) AS counts
FROM MY_TABLE t
CROSS JOIN (
SELECT
PERCENTILE_CONT(0.33) WITHIN GROUP (ORDER BY height) AS q1
, PERCENTILE_CONT(0.67) WITHIN GROUP (ORDER BY height) AS q2
FROM MY_TABLE t
) height_quantiles
CROSS JOIN (
SELECT
PERCENTILE_CONT(0.2) WITHIN GROUP (ORDER BY age) AS q1
, PERCENTILE_CONT(0.4) WITHIN GROUP (ORDER BY age) AS q2
FROM MY_TABLE t
) age_quantiles
GROUP BY
t.country
, t.gender
, height_group
, age_group
ORDER BY
t.country
, t.gender
, height_group
, age_group


这是一个SQL查询,用于根据身高和年龄计算分组,并且还包括其他统计信息。如果您需要更多帮助或有其他问题,请告诉我。
<details>
<summary>英文:</summary>
Here is an example use of some functions to arrive at the equivalent of your R code. It uses `percentile_cont` to arrive at the 2 boundary values between quantiles which can then be used in case expressions to arrive at height or age labels. The &quot;cross join&quot; simply adds these values to each table row for ease of use in the case expressions. (This is displayed further in the referenced fiddle below.)
SELECT
t.country
, t.gender
, CASE WHEN t.height &lt;= height_quantiles.q1 THEN &#39;short&#39; 
WHEN t.height &lt;= height_quantiles.q2 THEN &#39;medium&#39; 
ELSE &#39;tall&#39; 
END AS height_group
, CASE WHEN t.age &lt;= age_quantiles.q1 THEN &#39;young&#39; 
WHEN t.age &lt;= age_quantiles.q2 THEN &#39;middle_age&#39; 
ELSE &#39;old&#39; 
END AS age_group
, AVG(CASE WHEN t.owns_bicycle = &#39;Yes&#39; THEN 1 ELSE 0 END) AS bicycle_proportion
, COUNT(*) AS counts
FROM MY_TABLE t
CROSS JOIN (
SELECT
PERCENTILE_CONT(0.33) WITHIN GROUP (ORDER BY height) AS q1
, PERCENTILE_CONT(0.67) WITHIN GROUP (ORDER BY height) AS q2
FROM MY_TABLE t
) height_quantiles
CROSS JOIN (
SELECT
PERCENTILE_CONT(0.2) WITHIN GROUP (ORDER BY age) AS q1
, PERCENTILE_CONT(0.4) WITHIN GROUP (ORDER BY age) AS q2
FROM MY_TABLE t
) age_quantiles
GROUP BY
t.country
, t.gender
, height_group
, age_group
ORDER BY
t.country
, t.gender
, height_group
, age_group
| country | gender | height\_group | age\_group | bicycle\_proportion | counts |
|:--------|:-------|:-------------|:----------|-------------------:|-------:|
| Canada | F | medium | old | 0.00000000000000000000 | 1 |
| Canada | F | medium | young | 0.00000000000000000000 | 1 |
| Canada | F | short | old | 0.00000000000000000000 | 1 |
| Canada | F | tall | middle\_age | 0.00000000000000000000 | 1 |
| Canada | M | short | old | 1.00000000000000000000 | 1 |
| Canada | M | tall | old | 1.00000000000000000000 | 2 |
| UK | F | medium | middle\_age | 0.00000000000000000000 | 1 |
| UK | F | medium | old | 0.50000000000000000000 | 2 |
| UK | F | short | middle\_age | 0.00000000000000000000 | 1 |
| UK | F | short | old | 1.00000000000000000000 | 1 |
| UK | F | short | young | 0.00000000000000000000 | 1 |
| UK | M | tall | old | 1.00000000000000000000 | 1 |
| UK | M | tall | young | 0.00000000000000000000 | 1 |
| USA | F | medium | middle\_age | 0.00000000000000000000 | 1 |
| USA | F | medium | young | 1.00000000000000000000 | 1 |
| USA | F | tall | old | 0.00000000000000000000 | 1 |
| USA | M | short | middle\_age | 1.00000000000000000000 | 1 |
[fiddle](https://dbfiddle.uk/zimobgtF)
Re: An &quot;adaptable&quot; variant. SQL is not a programming language, it deals with sets of data, so mimicking the R code may be possible if you use a  &quot;procedural extension&quot; to SQL (e.g. plsql in Postgres) but as I&#39;m not able to run anything in Netazza the best I can suggest is something along these lines. 
with CTE as (
SELECT
*
, NTILE(4) OVER (ORDER BY height) AS height_quantile
, NTILE(5) OVER (ORDER BY age) AS age_quantile
FROM my_table
)
, height_labels as (
select
height_quantile hq
, concat(floor(min(round(height::decimal,2))) , &#39; to &#39;
, max(ceiling(round(height::decimal,2)) ) ) height_label
from CTE
group by
height_quantile
)
, age_labels as (
select
age_quantile aq
, concat(min(age ) , &#39; to &#39;
, max(age ) ) age_label
from CTE
group by
age_quantile
)
select
*
from CTE 
inner join height_labels h on cte.height_quantile=h.hq
inner join age_labels    a on cte.age_quantile=a.aq
order by
height, age
| country | gender | age | height | owns\_bicycle | height\_quantile | age\_quantile | hq | height\_label | aq | age\_label |
|:--------|:-------|----:|:-------|:-------------|----------------:|-------------:|---:|:-------------|---:|:----------|
| UK | F | 27 | 152.85 | No | 1 | 1 | 1 | 152 to 159 | 1 | 25 to 33 |
| UK | F | 57 | 155.56 | No | 1 | 3 | 1 | 152 to 159 | 3 | 57 to 63 |
| Canada | M | 87 | 156.27 | Yes | 1 | 4 | 1 | 152 to 159 | 4 | 63 to 89 |
| UK | F | 89 | 156.31 | Yes | 1 | 5 | 1 | 152 to 159 | 5 | 89 to 99 |
| USA | M | 57 | 158.47 | Yes | 1 | 2 | 1 | 152 to 159 | 2 | 49 to 57 |
| Canada | F | 89 | 159.26 | No | 2 | 4 | 2 | 159 to 167 | 4 | 63 to 89 |
| Canada | F | 62 | 161.18 | No | 2 | 3 | 2 | 159 to 167 | 3 | 57 to 63 |
| UK | F | 63 | 163.65 | No | 2 | 3 | 2 | 159 to 167 | 3 | 57 to 63 |
| UK | F | 83 | 166.01 | Yes | 2 | 4 | 2 | 159 to 167 | 4 | 63 to 89 |
| USA | F | 33 | 166.13 | Yes | 2 | 1 | 2 | 159 to 167 | 1 | 25 to 33 |
| UK | F | 49 | 167.55 | No | 3 | 2 | 3 | 167 to 176 | 2 | 49 to 57 |
| USA | F | 53 | 172.82 | No | 3 | 2 | 3 | 167 to 176 | 2 | 49 to 57 |
| Canada | F | 31 | 173.08 | No | 3 | 1 | 3 | 167 to 176 | 1 | 25 to 33 |
| UK | M | 63 | 173.24 | Yes | 3 | 4 | 3 | 167 to 176 | 4 | 63 to 89 |
| UK | M | 25 | 175.99 | No | 3 | 1 | 3 | 167 to 176 | 1 | 25 to 33 |
| Canada | F | 50 | 177.42 | No | 4 | 2 | 4 | 177 to 180 | 2 | 49 to 57 |
| Canada | M | 94 | 178.92 | Yes | 4 | 5 | 4 | 177 to 180 | 5 | 89 to 99 |
| Canada | M | 61 | 179.14 | Yes | 4 | 3 | 4 | 177 to 180 | 3 | 57 to 63 |
| USA | F | 99 | 179.31 | No | 4 | 5 | 4 | 177 to 180 | 5 | 89 to 99 |
[The third fiddle](https://dbfiddle.uk/lnk4XzdJ)
In this approach the labels are generated from the min/max pairs of each NTILE, and then those used in the final output. To simplify the label creation I introduced the functions floor and ceiling.
</details>
# 答案2
**得分**: 1
以下是您要翻译的内容:
这是一个简单的示例
创建表格学生(
ID INT,
名字 VARCHAR(255),
标记 INT
);
插入学生值
(1,'约翰',90),
(2,'简',80),
(3,'迈克',70),
(4,'彼得',60);
选择
ID,
名字,
标记,
PERCENTILE_CONT(标记,0.25)作为第25百分位数,
PERCENTILE_CONT(标记,0.50)作为第50百分位数,
PERCENTILE_CONT(标记,0.75)作为第75百分位数
从学生;
<details>
<summary>英文:</summary>
Heres a simple example
CREATE TABLE students (
id INT,
name VARCHAR(255),
marks INT
);
INSERT INTO students VALUES
(1, &#39;John&#39;, 90),
(2, &#39;Jane&#39;, 80),
(3, &#39;Mike&#39;, 70),
(4, &#39;Peter&#39;, 60);
SELECT
id,
name,
marks,
PERCENTILE_CONT(marks, 0.25) AS 25th_percentile,
PERCENTILE_CONT(marks, 0.50) AS 50th_percentile,
PERCENTILE_CONT(marks, 0.75) AS 75th_percentile
FROM students;
</details>
# 答案3
**得分**: 0
以下是翻译好的部分:
OP在这里 - 这是我自己尝试解决自己问题的方法。
我认为可以分为三个步骤来完成:
- 步骤1:首先,根据身高和年龄计算必要的百分位类别(例如,3组,5组)
- 步骤2:接下来,将原始表中的身高和年龄的值替换为这些类别
- 步骤3:最后,创建一个包含每个类别的最小值/最大值的图例/查找表
我不确定这是否正确(尽管看起来是正确的?)。
**能有人提供对此的意见吗?**
创建了一个名为MY_TABLE的表,其中包括以下列:
- 国家(country)VARCHAR(50)
- 性别(gender)CHAR(1)
- 年龄(age)INTEGER
- 身高(height)FLOAT
- 拥有自行车(owns_bicycle)VARCHAR(3)
插入了一些数据行到MY_TABLE表中。
创建了一个名为height_groups的表,其中包括身高分组信息,如下:
- 国家(country)
- 性别(gender)
- 身高分组(height_group)
- 最小身高(min_height)
- 最大身高(max_height)
创建了一个名为age_groups的表,其中包括年龄分组信息,如下:
- 国家(country)
- 性别(gender)
- 年龄分组(age_group)
- 最小年龄(min_age)
- 最大年龄(max_age)
创建了一个名为final的表,其中包括了以下信息:
- 国家(country)
- 性别(gender)
- 身高分组(height_group)
- 年龄分组(age_group)
- 拥有自行车的比例(bicycle_proportion)
- 计数(counts)
这个表的数据是从height_groups、age_groups和MY_TABLE表中联接而来的,条件是国家、性别、身高和年龄都匹配,并且年龄在指定范围内。bicycle_proportion列计算了拥有自行车的比例。
最后,创建了一个名为lookup的表,其中包括了以下信息:
- 国家(country)
- 性别(gender)
- 身高分组(height_group)
- 年龄分组(age_group)
- 最小身高(min_height)
- 最大身高(max_height)
- 最小年龄(min_age)
- 最大年龄(max_age)
以上是对你的问题的翻译,不包含代码部分。
<details>
<summary>英文:</summary>
OP here - here is my own attempt at solving my own problem. 
I thought I can do this in 3 steps:
- Step 1: First, calculate the necessary percentile categories (e.g. 3 groups, 5 groups) based on height and age
- Step 2: Next, replace the values of height and age in the original table with these categories
- Step 3: Finally, make a legend/lookup table that contains the min/max of each category
I am not sure if this is correct (although it looks correct?). 
**Can someone please provide opinions on this?**
CREATE TABLE MY_TABLE (
country VARCHAR(50),
gender CHAR(1),
age INTEGER,
height FLOAT,
owns_bicycle VARCHAR(3)
);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;UK&#39;, &#39;M&#39;, 25, 175.99, &#39;No&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;UK&#39;, &#39;F&#39;, 63, 163.65, &#39;No&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;UK&#39;, &#39;F&#39;, 83, 166.01, &#39;Yes&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;Canada&#39;, &#39;M&#39;, 94, 178.92, &#39;Yes&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;UK&#39;, &#39;M&#39;, 63, 173.24, &#39;Yes&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;Canada&#39;, &#39;M&#39;, 87, 156.27, &#39;Yes&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;Canada&#39;, &#39;F&#39;, 89, 159.26, &#39;No&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;Canada&#39;, &#39;M&#39;, 61, 179.14, &#39;Yes&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;UK&#39;, &#39;F&#39;, 49, 167.55, &#39;No&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;USA&#39;, &#39;F&#39;, 53, 172.82, &#39;No&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;Canada&#39;, &#39;F&#39;, 62, 161.18, &#39;No&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;Canada&#39;, &#39;F&#39;, 31, 173.08, &#39;No&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;USA&#39;, &#39;F&#39;, 33, 166.13, &#39;Yes&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;Canada&#39;, &#39;F&#39;, 50, 177.42, &#39;No&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;UK&#39;, &#39;F&#39;, 57, 155.56, &#39;No&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;USA&#39;, &#39;M&#39;, 57, 158.47, &#39;Yes&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;UK&#39;, &#39;F&#39;, 27, 152.85, &#39;No&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;UK&#39;, &#39;F&#39;, 89, 156.31, &#39;Yes&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;USA&#39;, &#39;F&#39;, 99, 179.31, &#39;No&#39;);
CREATE TABLE height_groups AS
SELECT
country,
gender,
NTILE(3) OVER (ORDER BY height) AS height_group,
MIN(height) AS min_height,
MAX(height) AS max_height
FROM MY_TABLE
GROUP BY country, gender, height;
CREATE TABLE age_groups AS
SELECT
country,
gender,
NTILE(5) OVER (ORDER BY age) AS age_group,
MIN(age) AS min_age,
MAX(age) AS max_age
FROM MY_TABLE
GROUP BY country, gender, age;
CREATE TABLE final AS
SELECT
h.country,
h.gender,
h.height_group,
a.age_group,
AVG(CASE WHEN t.owns_bicycle = &#39;Yes&#39; THEN 1 ELSE 0 END) AS bicycle_proportion,
COUNT(*) AS counts
FROM height_groups h
JOIN age_groups a ON h.country = a.country AND h.gender = a.gender
JOIN MY_TABLE t ON h.country = t.country AND h.gender = t.gender AND t.height BETWEEN h.min_height AND h.max_height AND t.age BETWEEN a.min_age AND a.max_age
GROUP BY h.country, h.gender, h.height_group, a.age_group;
CREATE TABLE lookup AS
SELECT
h.country,
h.gender,
h.height_group,
a.age_group,
h.min_height,
h.max_height,
a.min_age,
a.max_age
FROM height_groups h
JOIN age_groups a ON h.country = a.country AND h.gender = a.gender;
**Is what I have done correct?**
</details>
# 答案4
**得分**: 0
这是你的代码的翻译:
```sql
OP 再次提问 - 我意识到我的先前答案中年龄组和身高组不一致(例如,有时age_group = 1的最小/最大身高不同)。
以下是我的第二次尝试 - 更加手动的尝试(这次是为5个年龄组和5个身高组):
CREATE TABLE MY_TABLE (
country VARCHAR(50),
gender CHAR(1),
age INTEGER,
height FLOAT,
owns_bicycle VARCHAR(3)
);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES ('UK', 'M', 25, 175.99, 'No');
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES ('UK', 'F', 63, 163.65, 'No');
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES ('UK', 'F', 83, 166.01, 'Yes');
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES ('Canada', 'M', 94, 178.92, 'Yes');
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES ('UK', 'M', 63, 173.24, 'Yes');
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES ('Canada', 'M', 87, 156.27, 'Yes');
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES ('Canada', 'F', 89, 159.26, 'No');
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES ('Canada', 'M', 61, 179.14, 'Yes');
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES ('UK', 'F', 49, 167.55, 'No');
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES ('USA', 'F', 53, 172.82, 'No');
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES ('Canada', 'F', 62, 161.18, 'No');
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES ('Canada', 'F', 31, 173.08, 'No');
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES ('USA', 'F', 33, 166.13, 'Yes');
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES ('Canada', 'F', 50, 177.42, 'No');
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES ('UK', 'F', 57, 155.56, 'No');
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES ('USA', 'M', 57, 158.47, 'Yes');
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES ('UK', 'F', 27, 152.85, 'No');
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES ('UK', 'F', 89, 156.31, 'Yes');
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES ('USA', 'F', 99, 179.31, 'No');
CREATE TABLE age_groups AS
SELECT country, gender, age, height, owns_bicycle,
NTILE(5) OVER (ORDER BY age) AS age_group
FROM MY_TABLE;
CREATE TABLE height_groups AS
SELECT country, gender, age, height, owns_bicycle,
NTILE(5) OVER (ORDER BY height) AS height_group
FROM MY_TABLE;
CREATE TABLE age_height_group_replacements AS
SELECT
a.country,
a.gender,
a.age,
a.height,
a.owns_bicycle,
CASE
WHEN a.age BETWEEN (SELECT MIN(age) FROM age_groups WHERE age_group = 1) AND (SELECT MAX(age) FROM age_groups WHERE age_group = 1) THEN 'Group 1'
WHEN a.age BETWEEN (SELECT MIN(age) FROM age_groups WHERE age_group = 2) AND (SELECT MAX(age) FROM age_groups WHERE age_group = 2) THEN 'Group 2'
WHEN a.age BETWEEN (SELECT MIN(age) FROM age_groups WHERE age_group = 3) AND (SELECT MAX(age) FROM age_groups WHERE age_group = 3) THEN 'Group 3'
WHEN a.age BETWEEN (SELECT MIN(age) FROM age_groups WHERE age_group = 4) AND (SELECT MAX(age) FROM age_groups WHERE age_group = 4) THEN 'Group 4'
ELSE 'Group 5'
END AS age_group_replacement,
CASE
WHEN a.height BETWEEN (SELECT MIN(height) FROM height_groups WHERE height_group = 1) AND (SELECT MAX(height) FROM height_groups WHERE height_group = 1) THEN 'Group 1'
WHEN a.height BETWEEN (SELECT MIN(height) FROM height_groups WHERE height_group = 2) AND (SELECT MAX(height) FROM height_groups WHERE height_group = 2) THEN 'Group 2'
WHEN a.height BETWEEN (SELECT MIN(height) FROM height_groups WHERE height_group = 3) AND (SELECT MAX(height) FROM height_groups WHERE height_group = 3) THEN 'Group 3'
WHEN a.height BETWEEN (SELECT MIN(height) FROM height_groups WHERE height_group = 4) AND (SELECT MAX(height) FROM height_groups WHERE height_group = 4) THEN 'Group 4'
ELSE 'Group 5'
END AS height_group_replacement,
(SELECT MIN(age) FROM age_groups WHERE age_group = (SELECT age_group FROM age_groups WHERE age = a.age)) AS min_age,
(SELECT MAX(age) FROM age_groups WHERE age_group = (SELECT age_group FROM age_groups WHERE age = a.age)) AS max_age,
(SELECT MIN(height) FROM height_groups WHERE height_group = (SELECT height_group FROM height_groups WHERE height = a.height)) AS min_height,
(SELECT MAX(height) FROM height_groups WHERE height_group = (SELECT height_group FROM height_groups WHERE height = a.height)) AS max_height
FROM
MY_TABLE a;
CREATE TABLE BIKE_OWNERSHIP AS SELECT
min_age,
max_age,
min_height,
max_height,
country,
gender,
COUNT(CASE WHEN owns_bicycle = 'Yes' THEN 1 END) * 100.0 / COUNT(*) AS percentage_owns_bicycle
FROM
age_height_group_replacements
GROUP BY
min_age,
max_age,
min_height,
max_height,
country,
gender;
有关此代码的任何评论吗?
谢谢!

请注意,这

英文:

OP again - I realized the age groups and height groups are not consistent in my previous answer (e.g. sometimes age_group = 1 will have different min/max heights).

Here is my second attempt - a more manual attempt (this time for 5 age groups and 5 height groups):

CREATE TABLE MY_TABLE (
country VARCHAR(50),
gender CHAR(1),
age INTEGER,
height FLOAT,
owns_bicycle VARCHAR(3)
);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;UK&#39;, &#39;M&#39;, 25, 175.99, &#39;No&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;UK&#39;, &#39;F&#39;, 63, 163.65, &#39;No&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;UK&#39;, &#39;F&#39;, 83, 166.01, &#39;Yes&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;Canada&#39;, &#39;M&#39;, 94, 178.92, &#39;Yes&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;UK&#39;, &#39;M&#39;, 63, 173.24, &#39;Yes&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;Canada&#39;, &#39;M&#39;, 87, 156.27, &#39;Yes&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;Canada&#39;, &#39;F&#39;, 89, 159.26, &#39;No&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;Canada&#39;, &#39;M&#39;, 61, 179.14, &#39;Yes&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;UK&#39;, &#39;F&#39;, 49, 167.55, &#39;No&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;USA&#39;, &#39;F&#39;, 53, 172.82, &#39;No&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;Canada&#39;, &#39;F&#39;, 62, 161.18, &#39;No&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;Canada&#39;, &#39;F&#39;, 31, 173.08, &#39;No&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;USA&#39;, &#39;F&#39;, 33, 166.13, &#39;Yes&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;Canada&#39;, &#39;F&#39;, 50, 177.42, &#39;No&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;UK&#39;, &#39;F&#39;, 57, 155.56, &#39;No&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;USA&#39;, &#39;M&#39;, 57, 158.47, &#39;Yes&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;UK&#39;, &#39;F&#39;, 27, 152.85, &#39;No&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;UK&#39;, &#39;F&#39;, 89, 156.31, &#39;Yes&#39;);
INSERT INTO MY_TABLE (country, gender, age, height, owns_bicycle)
VALUES (&#39;USA&#39;, &#39;F&#39;, 99, 179.31, &#39;No&#39;);
CREATE TABLE age_groups AS
SELECT country, gender, age, height, owns_bicycle,
NTILE(5) OVER (ORDER BY age) AS age_group
FROM MY_TABLE;
CREATE TABLE height_groups AS
SELECT country, gender, age, height, owns_bicycle,
NTILE(5) OVER (ORDER BY height) AS height_group
FROM MY_TABLE;
CREATE TABLE age_height_group_replacements AS
SELECT
a.country,
a.gender,
a.age,
a.height,
a.owns_bicycle,
CASE
WHEN a.age BETWEEN (SELECT MIN(age) FROM age_groups WHERE age_group = 1) AND (SELECT MAX(age) FROM age_groups WHERE age_group = 1) THEN &#39;Group 1&#39;
WHEN a.age BETWEEN (SELECT MIN(age) FROM age_groups WHERE age_group = 2) AND (SELECT MAX(age) FROM age_groups WHERE age_group = 2) THEN &#39;Group 2&#39;
WHEN a.age BETWEEN (SELECT MIN(age) FROM age_groups WHERE age_group = 3) AND (SELECT MAX(age) FROM age_groups WHERE age_group = 3) THEN &#39;Group 3&#39;
WHEN a.age BETWEEN (SELECT MIN(age) FROM age_groups WHERE age_group = 4) AND (SELECT MAX(age) FROM age_groups WHERE age_group = 4) THEN &#39;Group 4&#39;
ELSE &#39;Group 5&#39;
END AS age_group_replacement,
CASE
WHEN a.height BETWEEN (SELECT MIN(height) FROM height_groups WHERE height_group = 1) AND (SELECT MAX(height) FROM height_groups WHERE height_group = 1) THEN &#39;Group 1&#39;
WHEN a.height BETWEEN (SELECT MIN(height) FROM height_groups WHERE height_group = 2) AND (SELECT MAX(height) FROM height_groups WHERE height_group = 2) THEN &#39;Group 2&#39;
WHEN a.height BETWEEN (SELECT MIN(height) FROM height_groups WHERE height_group = 3) AND (SELECT MAX(height) FROM height_groups WHERE height_group = 3) THEN &#39;Group 3&#39;
WHEN a.height BETWEEN (SELECT MIN(height) FROM height_groups WHERE height_group = 4) AND (SELECT MAX(height) FROM height_groups WHERE height_group = 4) THEN &#39;Group 4&#39;
ELSE &#39;Group 5&#39;
END AS height_group_replacement,
(SELECT MIN(age) FROM age_groups WHERE age_group = (SELECT age_group FROM age_groups WHERE age = a.age)) AS min_age,
(SELECT MAX(age) FROM age_groups WHERE age_group = (SELECT age_group FROM age_groups WHERE age = a.age)) AS max_age,
(SELECT MIN(height) FROM height_groups WHERE height_group = (SELECT height_group FROM height_groups WHERE height = a.height)) AS min_height,
(SELECT MAX(height) FROM height_groups WHERE height_group = (SELECT height_group FROM height_groups WHERE height = a.height)) AS max_height
FROM
MY_TABLE a;
CREATE TABLE BIKE_OWNERSHIP AS SELECT
min_age,
max_age,
min_height,
max_height,
country,
gender,
COUNT(CASE WHEN owns_bicycle = &#39;Yes&#39; THEN 1 END) * 100.0 / COUNT(*) AS percentage_owns_bicycle
FROM
age_height_group_replacements
GROUP BY
min_age,
max_age,
min_height,
max_height,
country,
gender;

Anyone have any comments on this?

Thanks!

Note: you can try to execute this SQL code here: https://sqliteonline.com/

huangapple
  • 本文由 发表于 2023年6月9日 09:50:06
  • 转载请务必保留本文链接:https://go.coder-hub.com/76436707.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定