Check which columns have different values after grouping by

Question

I have a dataset like this:

host_id binary_id service binary_date region binary_path ......
1       2         hello   05-06-2012  US     /bin/usr
1       2         hello   08-09-2022  US     /bin/usr
2       3         another 08-09-2026  UK     /bin/java

host_id, binary_id, service is the group I'm interested in; binary_date, region, binary_path are a subset of the columns in this table, which has approximately 50 columns.
I would like to check which of the 50 columns have different values within each group after grouping by host_id, binary_id, service. In this example I would get the binary_date column, since it is the only column whose values differ within a group.

Expected result would be:

host_id binary_id service binary_date region binary_path ......
1       2         hello   05-06-2012  null   null
1       2         hello   08-09-2022  null   null

You can see that in this case only binary_date contains different values; the other columns, region and binary_path, are null because their values are exactly the same across the rows in the group.

Answer 1

Score: 1

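This is one way to do it:

row_number() numbers the rows within a partition, starting at 1. Here each column is added to the group's partition key, so a row gets number 1 only when it is the first row with that value in its group. If the sum of these row numbers over the group equals the group's row count, every row number is 1, meaning each row in the group has a different value in that column, and the value is returned; otherwise null is returned. Note that a group in which only some rows differ produces a larger sum, so that column comes back as null as well.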

with cte as (
   -- count_: number of rows in the group
   -- rn_*:   row number within each (group, column value) partition
   select *, count(1) over (partition by host_id, binary_id, service) as count_,
             row_number() over (partition by host_id, binary_id, service, binary_date) as rn_binary_date,
             row_number() over (partition by host_id, binary_id, service, region) as rn_region,
             row_number() over (partition by host_id, binary_id, service, binary_path) as rn_binary_path
   from mytable
)
-- sum(rn_*) equals count_ only when every rn_* in the group is 1,
-- i.e. when each row in the group has a different value in that column
select host_id, binary_id, service, 
      case 
        when count_ = sum(rn_binary_date) over(partition by host_id, binary_id, service) 
        then binary_date 
      end as binary_date,
      case 
        when count_ = sum(rn_region) over(partition by host_id, binary_id, service)
        then region 
      end as region,
      case 
        when count_ = sum(rn_binary_path) over(partition by host_id, binary_id, service)
        then binary_path 
      end as binary_path
from cte
where count_ > 1  -- skip groups that contain a single row

Result:

host_id	binary_id	service	    binary_date	region	binary_path
1	    2	        hello	    2012-06-05	null	null
1	    2	        hello	    2022-09-08	null	null

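A possible alternative, not part of the original answer: comparing min() and max() per group flags a column as soon as the group holds at least two distinct non-null values, which also covers groups where only some of the rows differ. The following is a minimal sketch under assumptions: the table name mytable is taken from the query above, while the column types and the ISO date literals are guesses made only to have something runnable. min() and max() ignore nulls, so a column that mixes null with a single repeated value is treated as unchanged here.

```sql
-- Sample data from the question; the column types are assumptions.
create table mytable (
  host_id     int,
  binary_id   int,
  service     varchar(50),
  binary_date date,
  region      varchar(50),
  binary_path varchar(200)
);

insert into mytable values
  (1, 2, 'hello',   '2012-06-05', 'US', '/bin/usr'),
  (1, 2, 'hello',   '2022-09-08', 'US', '/bin/usr'),
  (2, 3, 'another', '2026-09-08', 'UK', '/bin/java');

with cte as (
  select t.*,
         -- number of rows in the group
         count(*) over (partition by host_id, binary_id, service) as count_,
         -- smallest and largest value of each column within the group
         min(binary_date) over (partition by host_id, binary_id, service) as min_binary_date,
         max(binary_date) over (partition by host_id, binary_id, service) as max_binary_date,
         min(region)      over (partition by host_id, binary_id, service) as min_region,
         max(region)      over (partition by host_id, binary_id, service) as max_region,
         min(binary_path) over (partition by host_id, binary_id, service) as min_binary_path,
         max(binary_path) over (partition by host_id, binary_id, service) as max_binary_path
  from mytable t
)
select host_id, binary_id, service,
       -- keep the value only when the group contains more than one distinct value
       case when min_binary_date <> max_binary_date then binary_date end as binary_date,
       case when min_region      <> max_region      then region      end as region,
       case when min_binary_path <> max_binary_path then binary_path end as binary_path
from cte
where count_ > 1;  -- skip groups that contain a single row
```

On the three sample rows this returns the same two rows as the result above: binary_date is kept for the (1, 2, hello) group, and region and binary_path are null. With about 50 columns the min/max pair and the case expression have to be repeated per column, so generating the statement from the table's column list may be more practical than writing it by hand.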

huangapple
  • Posted on June 8, 2023 at 10:37:13
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76428238.html