在Athena或Spark中的”FIRST_VALUE”

huangapple go评论65阅读模式
英文:

FIRST_VALUE in Athena or Spark

问题

以下是你要的翻译内容:

select id
      ,id2
      ,FIRST_VALUE(CASE WHEN app THEN date0 ELSE NULL END) IGNORE NULLS  OVER (PARTITION BY id ORDER BY date0) as date_result
from (
select 1 id, 22 as id2, false app, Date'2019-03-13' as date0
union
select 1 id, 23 as id2, true app, Date'2019-03-14' as date0
union
select 1 id, 23 as id2, true app, Date'2019-03-15' as date0
)

上述查询在Athena中返回如下:

id id2 date_result
1 22
1 23 2019-03-14
1 23 2019-03-14

但我期望的结果如下,因为我们忽略了null值并按id对date_result进行分区:

id id2 date_result
1 22 2019-03-14
1 23 2019-03-14
1 23 2019-03-14

请问我在FIRST_VALUE中做错了什么?在Athena和Spark中实现这个结果的最佳方法是什么?谢谢。

英文:
select id
      ,id2
      ,FIRST_VALUE(CASE WHEN app THEN date0 ELSE NULL END) IGNORE NULLS  OVER (PARTITION BY id ORDER BY date0) as date_result
from (
select 1 id, 22 as id2, false app, Date'2019-03-13' as date0
union
select 1 id, 23 as id2, true app, Date'2019-03-14' as date0
union
select 1 id, 23 as id2, true app, Date'2019-03-15' as date0
)

Above query is returning like below in Athena

id id2 date_result
1 22
1 23 2019-03-14
1 23 2019-03-14

But I was expecting like below since we do ignore nulls and partition by id for date_result

id id2 date_result
1 22 2019-03-14
1 23 2019-03-14
1 23 2019-03-14

Could you please let me know what I am doing wrong in first_value? what is the best way to achieve this result in both Athena and spark? Thanks

I have added it in the description

答案1

得分: 0

> 你能告诉我在 first_value 函数中我做错了什么吗?

默认的窗口函数1的窗口范围是从无限制的前导行到当前行:

>> 如果没有指定 frame_end,将使用默认值 CURRENT ROW。

>> 如果未指定窗口范围,则使用默认窗口范围 RANGE UNBOUNDED PRECEDING。

如果你想要在整个分区内查找数值,你需要指定窗口范围,例如:

with dataset(id, id2, app, date0) as (
    values (1, 22, false, Date'2019-03-13'),
    (1, 23, true ,Date'2019-03-14'),
    (1, 23, true ,Date'2019-03-15')
)

select id
     , id2
     , FIRST_VALUE(if(app, date0)) IGNORE NULLS
        OVER (PARTITION BY id ORDER BY date0 RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as date_result
from dataset;
英文:

> Could you please let me know what I am doing wrong in first_value?

default frame for windows functions is unbounded preceding - current row:

>> If frame_end is not specified, a default value of CURRENT ROW is used.

>> If no frame is specified, a default frame of RANGE UNBOUNDED PRECEDING is used.

If you want to find value across the whole partition you need to specify the frame, for example:

with dataset(id, id2, app, date0) as (
    values (1, 22, false, Date'2019-03-13'),
    (1, 23, true ,Date'2019-03-14'),
    (1, 23, true ,Date'2019-03-15')
)

select id
     , id2
     , FIRST_VALUE(if(app, date0)) IGNORE NULLS
        OVER (PARTITION BY id ORDER BY date0 RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as date_result
from dataset;

huangapple
  • 本文由 发表于 2023年2月10日 04:48:04
  • 转载请务必保留本文链接:https://go.coder-hub.com/75404283.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定