How to group by multiple dimensions when `group by` is written only once in Spark SQL?

Question

I have a table named 'example' like this:

date num
2023-01-01 10
2023-01-01 10
2023-01-02 20
2023-01-03 30
2023-01-03 30
2023-01-04 40
2023-01-05 50
2023-01-06 60
2023-01-07 70
2023-01-07 70
2023-01-08 80
2023-01-09 90
2023-01-10 100

I want to group the date field by day, week, and month.

My Spark SQL code:

(
    SELECT date, 'day' AS dateType, sum(num) AS sum
    FROM example
    GROUP BY date
    ORDER BY date
)
UNION ALL
(
    SELECT trunc(date, 'WEEK') AS date, 'week' AS dateType, sum(num) AS sum
    FROM example
    GROUP BY trunc(date, 'WEEK')
    ORDER BY trunc(date, 'WEEK')
)
UNION ALL
(
    SELECT trunc(date, 'MONTH') AS date, 'month' AS dateType, sum(num) AS sum
    FROM example
    GROUP BY trunc(date, 'MONTH')
    ORDER BY trunc(date, 'MONTH')
);

Result:

date dateType sum
2023-01-01 day 20
2023-01-02 day 20
2023-01-03 day 60
2023-01-04 day 40
2023-01-05 day 50
2023-01-06 day 60
2023-01-07 day 140
2023-01-08 day 80
2023-01-09 day 90
2023-01-10 day 100
2022-12-26 week 20
2023-01-02 week 450
2023-01-09 week 190
2023-01-01 month 660
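
As a sanity check on the week and month keys in the table above, the truncation can be mimicked in plain Python. This is only a sketch: `trunc_week` and `trunc_month` are hypothetical helper names, and the week is assumed to start on Monday, which matches the `2022-12-26` key shown for `2023-01-01`.

```python
from datetime import date, timedelta


def trunc_week(d: date) -> date:
    """Return the Monday of the week containing d (assumed week start)."""
    return d - timedelta(days=d.weekday())


def trunc_month(d: date) -> date:
    """Return the first day of the month containing d."""
    return d.replace(day=1)


print(trunc_week(date(2023, 1, 1)))    # 2022-12-26 (a Sunday maps to the previous Monday)
print(trunc_week(date(2023, 1, 8)))    # 2023-01-02
print(trunc_month(date(2023, 1, 10)))  # 2023-01-01
```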

> Duplicate code is the root of all evil in software design. When a system is littered with many snippets of identical, or nearly identical code, it is indicative of sloppiness, carelessness, and sheer unprofessionalism. It is the guilt-edged responsibility of all software developers to root out and eliminate duplication whenever they find it.

I wrote almost the same code three times, changing only the date expression, which does not conform to the principles of clean code. So I want to remove the UNION ALL statements and use GROUP BY only once.

Question: Is there a way to write GROUP BY only once, minimizing duplicated code while still producing the result above?

Ideally, the final code would look like this:

SELECT xxx AS date, 'xxx' AS dateType, sum(num) AS sum
FROM example
GROUP BY xxx
ORDER BY xxx;

Both Spark SQL code and PySpark code are acceptable. Thank you in advance.

Answer 1

Score: 1

I think your code actually works fine. You can also do this:

select date, dateType, sum(num) AS sum from
(
    SELECT date, 'day' AS dateType, num
    FROM example

    UNION ALL

    SELECT trunc(date, 'WEEK') AS date, 'week' AS dateType, num
    FROM example

    UNION ALL

    SELECT trunc(date, 'MONTH') AS date, 'month' AS dateType, num
    FROM example
) t 
GROUP BY date, dateType
ORDER BY date, dateType

In this case you have only one GROUP BY, but you cannot get rid of the UNION ALL.
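
The shape of this query (tag every row once per granularity, then aggregate once) can be sketched in plain Python. This is not Spark code. The `KEY_FUNCS` mapping and the `aggregate` function are hypothetical names that stand in for the three `SELECT` branches and the single `GROUP BY`, and the week is assumed to start on Monday.

```python
from collections import defaultdict
from datetime import date, timedelta

# Sample rows from the question: (date, num).
rows = [
    (date(2023, 1, 1), 10), (date(2023, 1, 1), 10), (date(2023, 1, 2), 20),
    (date(2023, 1, 3), 30), (date(2023, 1, 3), 30), (date(2023, 1, 4), 40),
    (date(2023, 1, 5), 50), (date(2023, 1, 6), 60), (date(2023, 1, 7), 70),
    (date(2023, 1, 7), 70), (date(2023, 1, 8), 80), (date(2023, 1, 9), 90),
    (date(2023, 1, 10), 100),
]

# One key function per granularity; each stands in for one SELECT branch.
KEY_FUNCS = {
    "day": lambda d: d,
    "week": lambda d: d - timedelta(days=d.weekday()),  # Monday of that week
    "month": lambda d: d.replace(day=1),
}


def aggregate(rows):
    """Tag each row once per granularity (the UNION ALL), then sum once (the GROUP BY)."""
    totals = defaultdict(int)
    for d, num in rows:
        for date_type, key_func in KEY_FUNCS.items():
            totals[(key_func(d), date_type)] += num
    return dict(totals)


totals = aggregate(rows)
# Sorting by (date, dateType) mirrors ORDER BY date, dateType.
for (d, date_type), total in sorted(totals.items()):
    print(d, date_type, total)
```

Adding another granularity then means adding one entry to `KEY_FUNCS`, much as the SQL version needs one more `UNION ALL` branch.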

Answer 2

Score: 0

> Groups the rows for each grouping set specified after GROUPING SETS. For example, GROUP BY GROUPING SETS ((warehouse), (product)) is semantically equivalent to union of results of GROUP BY warehouse and GROUP BY product. This clause is a shorthand for a UNION ALL where each leg of the UNION ALL operator performs aggregation of each grouping set specified in the GROUPING SETS clause. Similarly, GROUP BY GROUPING SETS ((warehouse, product), (product), ()) is semantically equivalent to the union of results of GROUP BY warehouse, product, GROUP BY product and global aggregate.

(Quoted from the documentation of the grouping set parameter in the GROUP BY clause.)

I used the GROUPING SETS syntax to reduce the redundant code, although the result is not perfect. Note that GROUP BY GROUPING SETS is semantically equivalent to a union of the individual GROUP BY results, so this is not a performance optimization. It only reduces duplication, and the outer SELECT still needs some awkward coalesce and CASE logic.

SELECT
    coalesce(date, week, month) AS date,
    CASE
        WHEN date IS NOT NULL THEN 'day'
        WHEN week IS NOT NULL THEN 'week'
        WHEN month IS NOT NULL THEN 'month'
    END AS dateType,
    sum
FROM (
    SELECT date, trunc(date, 'WEEK') AS week, trunc(date, 'MONTH') AS month, sum(num) AS sum
    FROM example
    GROUP BY GROUPING SETS(date, trunc(date, 'WEEK'), trunc(date, 'MONTH'))
)
ORDER BY (
    CASE
        WHEN dateType = 'day' THEN 1
        WHEN dateType = 'week' THEN 2
        WHEN dateType = 'month' THEN 3
    END
), date;
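
To see why the outer `coalesce` and `CASE` work, the grouping-set semantics can be emulated in plain Python. This is a conceptual sketch only, not how Spark executes the query. `grouping_sets_sum` and `label` are hypothetical names, the week is assumed to start on Monday, and the source dates are assumed to contain no real NULLs. Each input row contributes to three groups, and in each group only one of the three keys is non-null.

```python
from collections import defaultdict
from datetime import date, timedelta

# Sample rows from the question: (date, num).
rows = [
    (date(2023, 1, 1), 10), (date(2023, 1, 1), 10), (date(2023, 1, 2), 20),
    (date(2023, 1, 3), 30), (date(2023, 1, 3), 30), (date(2023, 1, 4), 40),
    (date(2023, 1, 5), 50), (date(2023, 1, 6), 60), (date(2023, 1, 7), 70),
    (date(2023, 1, 7), 70), (date(2023, 1, 8), 80), (date(2023, 1, 9), 90),
    (date(2023, 1, 10), 100),
]


def grouping_sets_sum(rows):
    """Emulate the inner query: group by (date), (week), and (month) separately."""
    totals = defaultdict(int)
    for d, num in rows:
        week = d - timedelta(days=d.weekday())
        month = d.replace(day=1)
        totals[(d, None, None)] += num      # grouping set (date)
        totals[(None, week, None)] += num   # grouping set (week)
        totals[(None, None, month)] += num  # grouping set (month)
    return totals


def label(day, week, month):
    """Emulate the outer SELECT: coalesce the keys and derive dateType."""
    if day is not None:
        return day, "day"
    if week is not None:
        return week, "week"
    return month, "month"


ORDER = {"day": 1, "week": 2, "month": 3}  # mirrors the CASE in ORDER BY
result = sorted(
    ((*label(*key), total) for key, total in grouping_sets_sum(rows).items()),
    key=lambda r: (ORDER[r[1]], r[0]),
)
for r in result:
    print(*r)
```

The sketch also shows the limitation of this labeling: if the date column could hold genuine NULL values, a NULL produced by a grouping set would be indistinguishable from a NULL in the data.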

huangapple
  • Posted on January 9, 2023, 14:48:20
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75053936.html