英文:
count (distinct ...) VS count from (select distinct ...) VS group by
问题
在Snowflake上,我在从相同的表中计算不同值的唯一值时,会根据计算方法得到不同的结果。我曾经认为它们是等效的。鉴于存在差异,首先我想知道在哪些情况下这些策略不能互换,其次,如何判断哪种策略返回了正确的数量。
我包括了我用来测试的查询:
select 'count_distinct_subquery' as strat, count(*) from (
select distinct
plan_code,
fis_we_dt,
sku_no,
pog_segment_name,
shelf_no,
position_id
from src
)
union all
select 'count_distinct' as strat, count(
distinct
plan_code,
fis_we_dt,
sku_no,
pog_segment_name,
shelf_no,
position_id
)
from src
union all
select 'group_by_subquery' as strat, count(*) from (
select *
from src
group by
plan_code,
fis_we_dt,
sku_no,
pog_segment_name,
shelf_no,
position_id
)
输出结果如图所示。
英文:
On snowflake, I'm getting different results depending on how I count the distinct values from the same table. I used to think them as equivalent. Given the discrepancies, first I'd like to know in which scenarios these strategies can not be interchanged, and second, how to tell what strategy is returning the right number.
I include the query I'm using to test this
select 'count_distinct_subquery' as strat,count(*) from (
select distinct
plan_code,
fis_we_dt,
sku_no,
pog_segment_name,
shelf_no,
position_id
from src
)
union all
select 'count_distinct' as strat,count(
distinct
plan_code,
fis_we_dt,
sku_no,
pog_segment_name,
shelf_no,
position_id
)
from src
union all
select 'group_by_subquery' as strat, count(*) from (
select *
from src
group by
plan_code,
fis_we_dt,
sku_no,
pog_segment_name,
shelf_no,
position_id
)
答案1
得分: 1
第二个版本的 count (distinct expr1, ...)
会跳过 NULL 值。
CREATE OR REPLACE TABLE src
AS
SELECT NULL AS plan_code, 1 AS fis_we_dt;
select 'count_distinct_subquery' as strat,count(*) from (
select distinct
plan_code,
fis_we_dt
from src
)
union all
select 'count_distinct' as strat,count(
distinct
plan_code,
fis_we_dt
)
from src
union all
select 'group_by_subquery' as strat, count(*) from (
select *
from src
group by
plan_code,
fis_we_dt
);
COUNT(*),表示函数应返回不包含任何 NULL 值的行数。请参阅 COUNT(*) 和 聚合函数和 NULL 值 以获取示例和更多信息。
英文:
The second version count (distinct expr1, ...)
skips NULLs.
CREATE OR REPLACE TABLE src
AS
SELECT NULL AS plan_code, 1 AS fis_we_dt;
select 'count_distinct_subquery' as strat,count(*) from (
select distinct
plan_code,
fis_we_dt
from src
)
union all
select 'count_distinct' as strat,count(
distinct
plan_code,
fis_we_dt
)
from src
union all
select 'group_by_subquery' as strat, count(*) from (
select *
from src
group by
plan_code,
fis_we_dt
);
>COUNT(*)
>
> COUNT( [ DISTINCT ] <expr1> [ , <expr2> ... ] )
>
> Alias.*, which indicates that the function should return the number of rows that do not contain any NULLs. See Examples for an example.
> Aggregate Functions and NULL Values
>
>Some aggregate functions can be passed more than one column. For example:
>
> SELECT COUNT(col1, col2) FROM table1;
>
> In these instances, the aggregate function ignores a row if any individual column is NULL.
答案2
得分: 0
根据Lukasz的回答:
with data(col1, col2) as (
select * from values
(1, 10),
(1, 10),
(1, null),
(null, null),
(null, null)
), unions as (
select
'count_distinct_subquery' as strat,
count(*) as count
from (
select distinct col1, col2
from data
)
union all
select
'count_distinct' as strat,
count(distinct col1, col2)
from data
union all
select
'group_by_subquery' as strat,
count(*)
from (
select *
from data
group by 1,2
)
)
select * from unions;
结果如下:
STRAT | COUNT |
---|---|
count_distinct_subquery | 3 |
count_distinct | 1 |
group_by_subquery | 3 |
第一个和最后一个是相同的。子选择的DISTINCT和GROUP BY会找出排列组合,并将NULL视为一个独立的元素,然后计算出现的行数,即3。
中间的部分是要求没有NULL的行数,即1。
expr1
这应该是要么:列名,可以是合格的名称(例如 database.schema.table.column_name)。
Alias.*,表示函数应返回不包含任何NULL的行数。请参考示例以了解示例。
通过从第一个和第三个查询中提取子查询,然后将其包装在count(distinct col1, col2)
而不是count(*)
中,可以进一步看出这一点:
select
'group_by_subquery_count_cols' as strat,
count(distinct col1, col2)
from (
select *
from data
group by 1,2
)
union all
select
'count_distinct_subquery_count_cols' as strat,
count(distinct col1, col2)
from (
select distinct col1, col2
from data
)
现在我们再次得到1,而不是3,来自相同的数据。
STRAT | COUNT |
---|---|
count_distinct_subquery | 3 |
count_distinct | 1 |
group_by_subquery | 3 |
group_by_subquery_count_cols | 1 |
count_distinct_subquery_count_cols | 1 |
英文:
As per Lukasz answer:
with data(col1, col2) as (
select * from values
(1, 10),
(1, 10),
(1, null),
(null, null),
(null, null)
), unions as (
select
'count_distinct_subquery' as strat,
count(*) as count
from (
select distinct col1, col2
from data
)
union all
select
'count_distinct' as strat,
count(distinct col1, col2)
from data
union all
select
'group_by_subquery' as strat,
count(*)
from (
select *
from data
group by 1,2
)
)
select * from unions;
gives:
STRAT | COUNT |
---|---|
count_distinct_subquery | 3 |
count_distinct | 1 |
group_by_subquery | 3 |
the first and last are the same thing. The sub-select DISTINCT and GROUP BY are find the permutations, AND they respect NULLs as things, and then you count the number of rows present, which is 3.
The middle is asking for the count of row with no-nulls, which is 1.
> expr1
> This should be either:
>
> A column name, which can be a qualified name (e.g. > database.schema.table.column_name).
>
> Alias.*, which indicates that the function should return the number of rows that do not contain any NULLs. See Examples for an example.
This can be further seen by taking the sub-selects from the first and third queries, and wrap those in a count(distinct col1, col2)
instead of the count(*)
select
'group_by_subquery_count_cols' as strat,
count(distinct col1, col2)
from (
select *
from data
group by 1,2
)
union all
select
'count_distinct_subquery_count_cols' as strat,
count(distinct col1, col2)
from (
select distinct col1, col2
from data
)
and now we get 1's again not three's, from the same data.
STRAT | COUNT |
---|---|
count_distinct_subquery | 3 |
count_distinct | 1 |
group_by_subquery | 3 |
group_by_subquery_count_cols | 1 |
count_distinct_subquery_count_cols | 1 |
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论