英文:
How to get min and max from a timeseries data in SQL/Snowflake?
问题
Here's the translated code portion:
我的数据如下,时间戳按升序排列。
INSERT INTO timeseries (timestamp, value)
VALUES
('2022-01-01 00:00:00', 0.89),
('2022-01-01 10:01:00', 6.89),
('2022-01-02 10:01:21', 10.99),
('2022-01-02 10:07:00', 11.89),
('2022-01-02 12:01:00', 0.89),
('2022-01-02 13:07:00', 6.39),
('2022-01-02 14:00:00', 0.69),
('2022-01-03 14:02:00', 5.39),
('2022-01-03 15:04:00', 6.89),
('2022-01-03 15:00:00', 7.3),
('2022-01-03 15:10:00', 1.89),
('2022-01-03 15:50:00', 0.8);
我要求获取最小值和最大值,以便在值超过5时生成分钟差异。在上述数据中,形成了三个数据集。
最小值 最大值
2022-01-01 10:01:00 2022-01-02 10:07:00
2022-01-02 13:07:00 2022-01-02 13:07:00
2022-01-03 14:02:00 2022-01-03 15:00:00
我尝试使用`row_number`来获取顺序,但由于有三个集合,我需要添加列以在分区子句中使用,但我对此没有任何想法?
WITH CTE AS (
SELECT CASE WHEN VALUE > 5 THEN 'ON' ELSE 'OFF' END STATUS, TIMESTAMP, VALUE
FROM TIMESERIES)
SELECT ROW_NUMBER() OVER(PARTITION BY STATUS ORDER BY TIMESTAMP) RN, TIMESTAMP, VALUE FROM CTE
ORDER BY TIMESTAMP;
这为我提供了所有大于5的行号,但我需要基于数据集来获取行号。
希望这可以帮助你开始或找到解决方案。顺便说一下,在Snowflake尝试此操作,通用SQL也适用。
<details>
<summary>英文:</summary>
my data is something like this, timestamp is ordered in asc.
INSERT INTO timeseries (timestamp, value)
VALUES
('2022-01-01 00:00:00', 0.89),
('2022-01-01 10:01:00', 6.89),
('2022-01-02 10:01:21', 10.99),
('2022-01-02 10:07:00', 11.89),
('2022-01-02 12:01:00', 0.89),
('2022-01-02 13:07:00', 6.39),
('2022-01-02 14:00:00', 0.69),
('2022-01-03 14:02:00', 5.39),
('2022-01-03 15:04:00', 6.89),
('2022-01-03 15:00:00', 7.3),
('2022-01-03 15:10:00', 1.89),
('2022-01-03 15:50:00', 0.8);
my ask is to get min and max, so i can produce difference in minutes whenever value is going above 5.
in above, there are three data sets formed
min max
2022-01-01 10:01:00 2022-01-02 10:07:00
2022-01-02 13:07:00 2022-01-02 13:07:00
2022-01-03 14:02:00 2022-01-03 15:00:00
I tried row_number to get sequence but since there are three sets there, i need to attribute column so i can use in partition clause but i am not getting any ideas for that?
WITH CTE AS (
SELECT CASE WHEN VALUE>5 THEN 'ON' ELSE 'OFF' END STATUS , TIMESTAMP, VALUE
FROM TIMESERIES)
SELECT ROW_NUMBER() OVER(PARTITION BY STATUS ORDER BY TIMESTAMP) RN,TIMESTAMP,VALUE FROM CTE
ORDER BY TIMESTAMP;
this is giving me row_number for all >5 but I need based on sets of data..
any help to start or a solution is highly appreciated.
BDW, trying this in snowflake but general SQL is also fine.
</details>
# 答案1
**得分**: 1
标准间隔和岛屿:
```sql
with flagged as (
select *,
case when
lag(val, 1, 0.0) over (order by ts) <= 5
and val > 5 then 1 end as flag
from T
), grouped as (
select *,
count(flag) over (order by ts) as grp
from flagged
)
select
min(ts),
max(case when val > 5 then ts end)
from grouped
group by grp
having max(val) > 5
order by grp;
```
https://dbfiddle.uk/aGE2n6nk
<details>
<summary>英文:</summary>
Standard gaps and islands:
with flagged as (
select *,
case when
lag(val, 1, 0.0) over (order by ts) <= 5
and val > 5 then 1 end as flag
from T
), grouped as (
select *,
count(flag) over (order by ts) as grp
from flagged
)
select
min(ts),
max(case when val > 5 then ts end)
from grouped
group by grp
having max(val) > 5
order by grp;
https://dbfiddle.uk/aGE2n6nk
</details>
# 答案2
**得分**: 0
以下是翻译好的代码部分:
```sql
使用 Snowflake 的半结构化功能(数组)提供的解决方案:
with ranges as (
select
case when lag(value,1,0.0) over (order by timestamp) < 5 and value >=5 then min(timestamp) end st_time,
case when lead(value,1,0.0) over (order by timestamp) < 5 and value >=5 then max(timestamp) end end_time
from timeseries
group by value, timestamp),
ranges_arrays as (
select
array_agg(st_time) st_time_arr,
array_agg(end_time) end_time_arr,
array_size(st_time_arr) row_count,
array_generate_range( 0 , row_count ) idx_arr
from ranges)
select
st_time_arr[idx]::timestamp st_time,
end_time_arr[idx]::timestamp end_time
from ranges_arrays, (select value::integer idx from ranges_arrays, lateral flatten(input=>idx_arr))
;
```
<details>
<summary>英文:</summary>
Heres a solution using Snowflakes semi-structured functionality (arrays):
with ranges as (
select
case when lag(value,1,0.0) over (order by timestamp) < 5 and value >=5 then min(timestamp) end st_time,
case when lead(value,1,0.0) over (order by timestamp) < 5 and value >=5 then max(timestamp) end end_time
from timeseries
group by value, timestamp),
ranges_arrays as (
select
array_agg(st_time) st_time_arr,
array_agg(end_time) end_time_arr,
array_size(st_time_arr) row_count,
array_generate_range( 0 , row_count ) idx_arr
from ranges)
select
st_time_arr[idx]::timestamp st_time,
end_time_arr[idx]::timestamp end_time
from ranges_arrays, (select value::integer idx from ranges_arrays, lateral flatten(input=>idx_arr))
;
</details>
# 答案3
**得分**: 0
Here is the translated code:
```sql
如果我们使用shawnt00提供的良好VALUES,在CTE中:
with T(ts, val) as (
select * from values
('2022-01-01 00:00:00'::datetime, 0.89),
('2022-01-01 10:01:00'::datetime, 6.89),
('2022-01-02 10:01:21'::datetime, 10.99),
('2022-01-02 10:07:00'::datetime, 11.89),
('2022-01-02 12:01:00'::datetime, 0.89),
('2022-01-02 13:07:00'::datetime, 6.39),
('2022-01-02 14:00:00'::datetime, 0.69),
('2022-01-03 14:02:00'::datetime, 5.39),
('2022-01-03 15:04:00'::datetime, 6.89),
('2022-01-03 15:06:00'::datetime, 7.3),
('2022-01-03 15:10:00'::datetime, 1.89),
('2022-01-03 15:50:00'::datetime, 0.8)
)
然后,我们可以使用[CONDITIONAL_CHANGE_EVENT][1]来查找条件“大于5”何时切换状态:
select
ts,
val > 5 as gt_five,
CONDITIONAL_CHANGE_EVENT(gt_five) over(order by ts) as ct
from T
这将产生以下结果:
TS| GT_FIVE| CT
--|--|--
2022-01-01 00:00:00.000| FALSE| 0
2022-01-01 10:01:00.000| TRUE| 1
2022-01-02 10:01:21.000| TRUE| 1
2022-01-02 10:07:00.000| TRUE| 1
2022-01-02 12:01:00.000| FALSE| 2
2022-01-02 13:07:00.000| TRUE| 3
2022-01-02 14:00:00.000| FALSE| 4
2022-01-03 14:02:00.000| TRUE| 5
2022-01-03 15:04:00.000| TRUE| 5
2022-01-03 15:06:00.000| TRUE| 5
2022-01-03 15:10:00.000| FALSE| 6
2022-01-03 15:50:00.000| FALSE| 6
我们可以使用`gt_five`来过滤掉不需要的间隙,然后通过CONDITIONAL_CHANGE_EVENT的输出进行分组以获取最小/最大值。因此,SQL可以是:
WITH flagged_and_stripped as (
select
ts,
val > 5 as gt_five,
CONDITIONAL_CHANGE_EVENT(gt_five) over(order by ts) as ct
from T
QUALIFY gt_five
)
select
min(ts) as min
,max(ts) as max
from flagged_and_stripped
group by ct
order by 1;
产生以下结果:
MIN|MAX
--|--
2022-01-01 10:01:00.000| 2022-01-02 10:07:00.000
2022-01-02 13:07:00.000| 2022-01-02 13:07:00.000
2022-01-03 14:02:00.000| 2022-01-03 15:06:00.000
因此,复杂示例代码如下:
```sql
with T(ts, val) as (
select * from values
('2022-01-01 00:00:00'::datetime, 0.89),
('2022-01-01 10:01:00'::datetime, 6.89),
('2022-01-02 10:01:21'::datetime, 10.99),
('2022-01-02 10:07:00'::datetime, 11.89),
('2022-01-02 12:01:00'::datetime, 0.89),
('2022-01-02 13:07:00'::datetime, 6.39),
('2022-01-02 14:00:00'::datetime, 0.69),
('2022-01-03 14:02:00'::datetime, 5.39),
('2022-01-03 15:04:00'::datetime, 6.89),
('2022-01-03 15:06:00'::datetime, 7.3),
('2022-01-03 15:10:00'::datetime, 1.89),
('2022-01-03 15:50:00'::datetime, 0.8)
), flagged_and_stripped as (
select
ts,
val > 5 as gt_five,
CONDITIONAL_CHANGE_EVENT(gt_five) over(order by ts) as ct
from T
QUALIFY gt_five
)
select
min(ts) as min
,max(ts) as max
from flagged_and_stripped
group by ct
order by 1;
```
[1]: https://docs.snowflake.com/en/sql-reference/functions/conditional_change_event
```
<details>
<summary>英文:</summary>
If we use the nice VALUES provided by shawnt00, in a CTE
```
with T(ts, val) as (
select * from values
('2022-01-01 00:00:00'::datetime, 0.89),
('2022-01-01 10:01:00'::datetime, 6.89),
('2022-01-02 10:01:21'::datetime, 10.99),
('2022-01-02 10:07:00'::datetime, 11.89),
('2022-01-02 12:01:00'::datetime, 0.89),
('2022-01-02 13:07:00'::datetime, 6.39),
('2022-01-02 14:00:00'::datetime, 0.69),
('2022-01-03 14:02:00'::datetime, 5.39),
('2022-01-03 15:04:00'::datetime, 6.89),
('2022-01-03 15:06:00'::datetime, 7.3),
('2022-01-03 15:10:00'::datetime, 1.89),
('2022-01-03 15:50:00'::datetime, 0.8)
)
```
we can then use [CONDITIONAL_CHANGE_EVENT][1] to find when the conditional "greater than 5" swaps state:
```
select
ts,
val > 5 as gt_five,
CONDITIONAL_CHANGE_EVENT(gt_five) over(order by ts) as ct
from T
```
this gives:
TS| GT_FIVE| CT
--|--|--
2022-01-01 00:00:00.000| FALSE| 0
2022-01-01 10:01:00.000| TRUE| 1
2022-01-02 10:01:21.000| TRUE| 1
2022-01-02 10:07:00.000| TRUE| 1
2022-01-02 12:01:00.000| FALSE| 2
2022-01-02 13:07:00.000| TRUE| 3
2022-01-02 14:00:00.000| FALSE| 4
2022-01-03 14:02:00.000| TRUE| 5
2022-01-03 15:04:00.000| TRUE| 5
2022-01-03 15:06:00.000| TRUE| 5
2022-01-03 15:10:00.000| FALSE| 6
2022-01-03 15:50:00.000| FALSE| 6
we can use the `gt_five` to filter out the unwanted gaps with a QUALIFY, and then group via the output of the CONDITIONAL_CHANGE_EVENT to get the min/max. Thus the sql can be:
```
WITH flagged_and_stripped as (
select
ts,
val > 5 as gt_five,
CONDITIONAL_CHANGE_EVENT(gt_five) over(order by ts) as ct
from T
QUALIFY gt_five
)
select
min(ts) as min
,max(ts) as max
from flagged_and_stripped
group by ct
order by 1;
```
giving:
MIN |MAX
--|--
2022-01-01 10:01:00.000| 2022-01-02 10:07:00.000
2022-01-02 13:07:00.000 |2022-01-02 13:07:00.000
2022-01-03 14:02:00.000| 2022-01-03 15:06:00.000
thus the complex example code is:
```
with T(ts, val) as (
select * from values
('2022-01-01 00:00:00'::datetime, 0.89),
('2022-01-01 10:01:00'::datetime, 6.89),
('2022-01-02 10:01:21'::datetime, 10.99),
('2022-01-02 10:07:00'::datetime, 11.89),
('2022-01-02 12:01:00'::datetime, 0.89),
('2022-01-02 13:07:00'::datetime, 6.39),
('2022-01-02 14:00:00'::datetime, 0.69),
('2022-01-03 14:02:00'::datetime, 5.39),
('2022-01-03 15:04:00'::datetime, 6.89),
('2022-01-03 15:06:00'::datetime, 7.3),
('2022-01-03 15:10:00'::datetime, 1.89),
('2022-01-03 15:50:00'::datetime, 0.8)
), flagged_and_stripped as (
select
ts,
val > 5 as gt_five,
CONDITIONAL_CHANGE_EVENT(gt_five) over(order by ts) as ct
from T
QUALIFY gt_five
)
select
min(ts) as min
,max(ts) as max
from flagged_and_stripped
group by ct
order by 1;
```
[1]: https://docs.snowflake.com/en/sql-reference/functions/conditional_change_event
</details>
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论