count (distinct …) VS count from (select distinct …) VS group by

huangapple go评论56阅读模式
英文:

count (distinct ...) VS count from (select distinct ...) VS group by

问题

在Snowflake上,我在从相同的表中计算不同值的唯一值时,会根据计算方法得到不同的结果。我曾经认为它们是等效的。鉴于存在差异,首先我想知道在哪些情况下这些策略不能互换,其次,如何判断哪种策略返回了正确的数量。

我包括了我用来测试的查询:

select 'count_distinct_subquery' as strat, count(*) from (
  select distinct 
  plan_code,
  fis_we_dt,
  sku_no,
  pog_segment_name,
  shelf_no,
  position_id
  from src 
)
union all
select 'count_distinct' as strat, count(
  distinct 
  plan_code,
  fis_we_dt,
  sku_no,
  pog_segment_name,
  shelf_no,
  position_id
)
from src 
union all 
select 'group_by_subquery' as strat, count(*) from (
  select *
  from src 
  group by 
  plan_code,
  fis_we_dt,
  sku_no,
  pog_segment_name,
  shelf_no,
  position_id
)

输出结果如图所示。

英文:

On snowflake, I'm getting different results depending on how I count the distinct values from the same table. I used to think them as equivalent. Given the discrepancies, first I'd like to know in which scenarios these strategies can not be interchanged, and second, how to tell what strategy is returning the right number.

I include the query I'm using to test this

select 'count_distinct_subquery' as strat,count(*) from (
  select distinct 
  plan_code,
  fis_we_dt,
  sku_no,
  pog_segment_name,
  shelf_no,
  position_id
  from src 
)
union all
select 'count_distinct' as strat,count(
  distinct 
  plan_code,
  fis_we_dt,
  sku_no,
  pog_segment_name,
  shelf_no,
  position_id
)
from src 
union all 
select 'group_by_subquery' as strat, count(*) from (
  select *
  from src 
  group by 
  plan_code,
  fis_we_dt,
  sku_no,
  pog_segment_name,
  shelf_no,
  position_id
)

The output is as in the image
count (distinct …) VS count from (select distinct …) VS group by

答案1

得分: 1

第二个版本的 count (distinct expr1, ...) 会跳过 NULL 值。

CREATE OR REPLACE TABLE src
AS 
SELECT NULL AS plan_code, 1 AS fis_we_dt;

select 'count_distinct_subquery' as strat,count(*) from (
  select distinct 
  plan_code,
  fis_we_dt
  from src 
)
union all
select 'count_distinct' as strat,count(
  distinct 
  plan_code,
  fis_we_dt
)
from src 
union all 
select 'group_by_subquery' as strat, count(*) from (
  select *
  from src 
  group by 
  plan_code,
  fis_we_dt
);

COUNT(*),表示函数应返回不包含任何 NULL 值的行数。请参阅 COUNT(*)聚合函数和 NULL 值 以获取示例和更多信息。

count (distinct …) VS count from (select distinct …) VS group by

英文:

The second version count (distinct expr1, ...) skips NULLs.

CREATE OR REPLACE TABLE src
AS 
SELECT NULL AS plan_code, 1 AS fis_we_dt;

select 'count_distinct_subquery' as strat,count(*) from (
  select distinct 
  plan_code,
  fis_we_dt
  from src 
)
union all
select 'count_distinct' as strat,count(
  distinct 
  plan_code,
  fis_we_dt
)
from src 
union all 
select 'group_by_subquery' as strat, count(*) from (
  select *
  from src 
  group by 
  plan_code,
  fis_we_dt
);

count (distinct …) VS count from (select distinct …) VS group by

>COUNT(*)
>
> COUNT( [ DISTINCT ] <expr1> [ , <expr2> ... ] )
>
> Alias.*, which indicates that the function should return the number of rows that do not contain any NULLs. See Examples for an example.

> Aggregate Functions and NULL Values
>
>Some aggregate functions can be passed more than one column. For example:
>
> SELECT COUNT(col1, col2) FROM table1;
>
> In these instances, the aggregate function ignores a row if any individual column is NULL.

答案2

得分: 0

根据Lukasz的回答:

with data(col1, col2) as (
    select * from values
    (1, 10),
    (1, 10),
    (1, null),
    (null, null),
    (null, null)
), unions as (
    select 
        'count_distinct_subquery' as strat,
        count(*) as count
    from (
        select distinct col1, col2
        from data 
    )
    
    union all
    
    select 
        'count_distinct' as strat,
        count(distinct col1, col2)
    from data
    
    union all 
    
    select 
        'group_by_subquery' as strat, 
        count(*) 
    from (
        select *
        from data
        group by 1,2
    )
)
select * from unions;

结果如下:

STRAT COUNT
count_distinct_subquery 3
count_distinct 1
group_by_subquery 3

第一个和最后一个是相同的。子选择的DISTINCT和GROUP BY会找出排列组合,并将NULL视为一个独立的元素,然后计算出现的行数,即3。

中间的部分是要求没有NULL的行数,即1。

expr1
这应该是要么:

列名,可以是合格的名称(例如 database.schema.table.column_name)。

Alias.*,表示函数应返回不包含任何NULL的行数。请参考示例以了解示例。

通过从第一个和第三个查询中提取子查询,然后将其包装在count(distinct col1, col2)而不是count(*)中,可以进一步看出这一点:

    select 
        'group_by_subquery_count_cols' as strat, 
        count(distinct col1, col2)
    from (
        select *
        from data
        group by 1,2
    )

    union all

    select 
        'count_distinct_subquery_count_cols' as strat,
        count(distinct col1, col2)
    from (
        select distinct col1, col2
        from data 
    )

现在我们再次得到1,而不是3,来自相同的数据。

STRAT COUNT
count_distinct_subquery 3
count_distinct 1
group_by_subquery 3
group_by_subquery_count_cols 1
count_distinct_subquery_count_cols 1
英文:

As per Lukasz answer:

with data(col1, col2) as (
    select * from values
    (1, 10),
    (1, 10),
    (1, null),
    (null, null),
    (null, null)
), unions as (
    select 
        &#39;count_distinct_subquery&#39; as strat,
        count(*) as count
    from (
        select distinct col1, col2
        from data 
    )
    
    union all
    
    select 
        &#39;count_distinct&#39; as strat,
        count(distinct col1, col2)
    from data
    
    union all 
    
    select 
        &#39;group_by_subquery&#39; as strat, 
        count(*) 
    from (
        select *
        from data
        group by 1,2
    )
)
select * from unions;

gives:

STRAT COUNT
count_distinct_subquery 3
count_distinct 1
group_by_subquery 3

the first and last are the same thing. The sub-select DISTINCT and GROUP BY are find the permutations, AND they respect NULLs as things, and then you count the number of rows present, which is 3.

The middle is asking for the count of row with no-nulls, which is 1.

> expr1
> This should be either:
>
> A column name, which can be a qualified name (e.g. > database.schema.table.column_name).
>
> Alias.*, which indicates that the function should return the number of rows that do not contain any NULLs. See Examples for an example.

This can be further seen by taking the sub-selects from the first and third queries, and wrap those in a count(distinct col1, col2) instead of the count(*)

    select 
        &#39;group_by_subquery_count_cols&#39; as strat, 
        count(distinct col1, col2)
    from (
        select *
        from data
        group by 1,2
    )

    union all

    select 
        &#39;count_distinct_subquery_count_cols&#39; as strat,
        count(distinct col1, col2)
    from (
        select distinct col1, col2
        from data 
    )

and now we get 1's again not three's, from the same data.

STRAT COUNT
count_distinct_subquery 3
count_distinct 1
group_by_subquery 3
group_by_subquery_count_cols 1
count_distinct_subquery_count_cols 1

huangapple
  • 本文由 发表于 2023年2月14日 04:16:48
  • 转载请务必保留本文链接:https://go.coder-hub.com/75440766.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定