2023-01-09 14:48:20

How to group by multiple dimensions when `group by` is written only once in Spark SQL?

Question

I have a table named 'example' like this:

date	num
2023-01-01	10
2023-01-01	10
2023-01-02	20
2023-01-03	30
2023-01-03	30
2023-01-04	40
2023-01-05	50
2023-01-06	60
2023-01-07	70
2023-01-07	70
2023-01-08	80
2023-01-09	90
2023-01-10	100

I want to group the date field by day, week, and month.

My Spark SQL code:

(
    SELECT date, 'day' AS dateType, sum(num) AS sum
    FROM example
    GROUP BY date
    ORDER BY date
)
UNION ALL
(
    SELECT trunc(date, 'WEEK') AS date, 'week' AS dateType, sum(num) AS sum
    FROM example
    GROUP BY trunc(date, 'WEEK')
    ORDER BY trunc(date, 'WEEK')
)
UNION ALL
(
    SELECT trunc(date, 'MONTH') AS date, 'month' AS dateType, sum(num) AS sum
    FROM example
    GROUP BY trunc(date, 'MONTH')
    ORDER BY trunc(date, 'MONTH')
);

Result:

date	dateType	sum
2023-01-01	day	20
2023-01-02	day	20
2023-01-03	day	60
2023-01-04	day	40
2023-01-05	day	50
2023-01-06	day	60
2023-01-07	day	140
2023-01-08	day	80
2023-01-09	day	90
2023-01-10	day	100
2022-12-26	week	20
2023-01-02	week	450
2023-01-09	week	190
2023-01-01	month	660
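
The week rows can look surprising at first: 2023-01-01 is a Sunday, so it lands in the week that starts on 2022-12-26. Below is a minimal Python sketch of the bucketing, assuming `trunc(date, 'WEEK')` rounds down to the Monday of the week (which is what the result above shows) and `trunc(date, 'MONTH')` rounds down to the first of the month. The helper names `week_start` and `month_start` are illustrative only; they are not Spark functions.

```python
from datetime import date, timedelta


def week_start(d: date) -> date:
    # date.weekday() is 0 for Monday, so this steps back to the Monday on or before d.
    return d - timedelta(days=d.weekday())


def month_start(d: date) -> date:
    # First day of the month that contains d.
    return d.replace(day=1)


print(week_start(date(2023, 1, 1)))    # 2022-12-26
print(week_start(date(2023, 1, 8)))    # 2023-01-02
print(month_start(date(2023, 1, 10)))  # 2023-01-01
```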

> Duplicate code is the root of all evil in software design. When a system is littered with many snippets of identical, or nearly identical code, it is indicative of sloppiness, carelessness, and sheer unprofessionalism. It is the guilt-edged responsibility of all software developers to root out and eliminate duplication whenever they find it.

I wrote almost the same code three times, with only the date expression changing, which goes against the principles of clean code. So I want to remove the UNION ALL statements and use GROUP BY only once.

Question: Is there a way to write `GROUP BY` only once, minimize the duplicated code, and still produce the result above?

The final code would look something like this:

SELECT xxx AS date, 'xxx' AS dateType, sum(num) AS sum
FROM example
GROUP BY xxx
ORDER BY xxx;

Both Spark SQL code and PySpark code are acceptable. Thank you in advance.

Answer 1

Score: 1

I think your code actually works fine. You can also do this:

SELECT date, dateType, sum(num) AS sum
FROM (
    SELECT date, 'day' AS dateType, num
    FROM example

    UNION ALL

    SELECT trunc(date, 'WEEK') AS date, 'week' AS dateType, num
    FROM example

    UNION ALL

    SELECT trunc(date, 'MONTH') AS date, 'month' AS dateType, num
    FROM example
) t
GROUP BY date, dateType
ORDER BY date, dateType

In this case you have only one GROUP BY, but you cannot get rid of the UNION ALL.
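
The shape of this query (tag every row once per granularity, then aggregate a single time) can be modelled in plain Python. This is only a sketch of the idea on the sample data from the question, not Spark code; `BUCKETS` and `aggregate` are hypothetical names, and the week rule assumes weeks start on Monday, as the result in the question suggests.

```python
from collections import defaultdict
from datetime import date, timedelta

# Sample rows from the question: (date, num).
ROWS = [
    (date(2023, 1, 1), 10), (date(2023, 1, 1), 10), (date(2023, 1, 2), 20),
    (date(2023, 1, 3), 30), (date(2023, 1, 3), 30), (date(2023, 1, 4), 40),
    (date(2023, 1, 5), 50), (date(2023, 1, 6), 60), (date(2023, 1, 7), 70),
    (date(2023, 1, 7), 70), (date(2023, 1, 8), 80), (date(2023, 1, 9), 90),
    (date(2023, 1, 10), 100),
]

# One entry per granularity; adding a granularity means adding one line here.
BUCKETS = {
    "day": lambda d: d,
    "week": lambda d: d - timedelta(days=d.weekday()),  # Monday of the week (assumed)
    "month": lambda d: d.replace(day=1),
}


def aggregate(rows):
    totals = defaultdict(int)
    for d, num in rows:
        # Counterpart of the UNION ALL: each row is emitted once per granularity.
        for date_type, bucket in BUCKETS.items():
            # Counterpart of the single GROUP BY date, dateType.
            totals[(bucket(d), date_type)] += num
    return dict(totals)


result = aggregate(ROWS)
# Same ordering as ORDER BY date, dateType.
for (bucket_date, date_type), total in sorted(result.items()):
    print(bucket_date, date_type, total)
```

Note that the rows are still expanded three times, which is the point the answer makes: the duplication moves from three aggregations to one, but the per-granularity expansion does not disappear.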

Answer 2

Score: 0

> Groups the rows for each grouping set specified after GROUPING SETS. For example, GROUP BY GROUPING SETS ((warehouse), (product)) is semantically equivalent to union of results of GROUP BY warehouse and GROUP BY product. This clause is a shorthand for a UNION ALL where each leg of the UNION ALL operator performs aggregation of each grouping set specified in the GROUPING SETS clause. Similarly, GROUP BY GROUPING SETS ((warehouse, product), (product), ()) is semantically equivalent to the union of results of GROUP BY warehouse, product, GROUP BY product and global aggregate.

Source: the grouping set parameter of the GROUP BY clause, Spark SQL documentation.

I used the GROUPING SETS syntax to reduce the redundant code, although the result isn't perfect. Note that GROUP BY GROUPING SETS is semantically equivalent to a union of the individual GROUP BY results, so this is not a performance optimization. It only reduces the duplicated code, and the outer SELECT still needs some awkward coalesce and CASE logic to rebuild the date and its label.

SELECT
    coalesce(date, week, month) AS date,
    CASE
        WHEN date IS NOT NULL THEN 'day'
        WHEN week IS NOT NULL THEN 'week'
        WHEN month IS NOT NULL THEN 'month'
    END AS dateType,
    sum
FROM (
    SELECT date, trunc(date, 'WEEK') AS week, trunc(date, 'MONTH') AS month, sum(num) AS sum
    FROM example
    GROUP BY GROUPING SETS(date, trunc(date, 'WEEK'), trunc(date, 'MONTH'))
)
ORDER BY (
    CASE
        WHEN dateType = 'day' THEN 1
        WHEN dateType = 'week' THEN 2
        WHEN dateType = 'month' THEN 3
    END
), date;
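
To see why the outer `coalesce` and `CASE` are enough to recover the label, here is a plain-Python model of the semantics the quoted documentation describes: each grouping set is aggregated on its own, and the columns that are not part of that set come back as NULL (`None` here). It illustrates the semantics only, not how Spark executes the query; `grouping_sets` and `label` are hypothetical helpers, and the week rule assumes weeks start on Monday.

```python
from collections import defaultdict
from datetime import date, timedelta

# Sample rows from the question: (date, num).
ROWS = [
    (date(2023, 1, 1), 10), (date(2023, 1, 1), 10), (date(2023, 1, 2), 20),
    (date(2023, 1, 3), 30), (date(2023, 1, 3), 30), (date(2023, 1, 4), 40),
    (date(2023, 1, 5), 50), (date(2023, 1, 6), 60), (date(2023, 1, 7), 70),
    (date(2023, 1, 7), 70), (date(2023, 1, 8), 80), (date(2023, 1, 9), 90),
    (date(2023, 1, 10), 100),
]

# Grouping columns of the inner query, in the order they are selected.
COLUMNS = {
    "date": lambda d: d,
    "week": lambda d: d - timedelta(days=d.weekday()),  # Monday of the week (assumed)
    "month": lambda d: d.replace(day=1),
}


def grouping_sets(rows):
    """Aggregate once per single-column grouping set; the other columns are None."""
    out = []
    for col, fn in COLUMNS.items():
        totals = defaultdict(int)
        for d, num in rows:
            totals[fn(d)] += num
        for key, total in totals.items():
            row = {name: None for name in COLUMNS}
            row[col] = key
            row["sum"] = total
            out.append(row)
    return out


def label(row):
    # Mirrors the outer SELECT: the first non-null column supplies the value and the label.
    for name, date_type in (("date", "day"), ("week", "week"), ("month", "month")):
        if row[name] is not None:
            return (row[name], date_type, row["sum"])


# Mirrors the ORDER BY: day rows first, then week, then month, each sorted by date.
ORDER = {"day": 1, "week": 2, "month": 3}
labelled = sorted(
    (label(r) for r in grouping_sets(ROWS)),
    key=lambda t: (ORDER[t[1]], t[0]),
)
for row in labelled:
    print(*row)
```

Because exactly one grouping column is non-null in every output row, checking the columns in a fixed order is sufficient to tell the three granularities apart.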
