选择随机行,使得一组列的组合是唯一的。

huangapple go评论82阅读模式
英文:

Select random rows such that combination of a subgroup of columns is unique

问题

my_table 表中的列 abcde 中,我想要随机选择 N 行,以确保列 abc 的组合是唯一的。

我尝试使用分组解决这个问题:

select a, b, c, min(d), min(e) from my_table
group by a, b, c

这种方法不起作用,因为它不会返回实际存在于 my_table 表中的行,因为列 de 的最小值可能分布在不同的行中。此外,min 不是一个适当的函数,因为它不会返回随机行。

我该如何实现这个目标?

我正在使用构建在 Presto 上并运行标准 ANSI SQL 的 Amazon Athena。

英文:

From my_table with columns a,b,c,d and e I want to select N rows randomly such that the combinations of columns a,b and c are unique.

I tried solving this with a group by:

select a, b, c, min(d), min(e) from my_table
group by a, b, c

This approach cannot work because it does not return actual existing rows from my_table since the minimum can be in different rows for the columns d and e. Furthermore min is not an appropriate function because it does not return random rows.

How could I achieve it?

I am working with amazon athena which is built on Presto and runs standard ANSI SQL.

答案1

得分: 1

你没有告诉我你正在使用哪个数据库管理系统,所以无法提供完整的解决方案。

但策略将是:

  1. 使用窗口函数对具有相同a、b、c的行进行分组。在Oracle中,这可能是这样的:SELECT a, b, c, (row_number() over (partition by a, b, c order by a, b, c)) id_row
  2. 使用另一个选择语句从第一个选择中选择行,使用random()函数——但你必须查看你的数据库管理系统文档,找出要使用哪个函数。
英文:

You didn't tell which DBMS you are working with, so it's not possible to give a complete solution.

But the strategy would be to

  1. Use windowing functions for grouping rows having the same a, b, c. In Oracle this would be something like ``SELECT a, b, c, (row_number() over (partition by a, b, c order by a, b, c)) id_row
  2. Use another select that picks rows from the first select, using a random() function -- but you have to look at your DBMS documentation to find out which function to use

答案2

得分: 1

我没有安装Presto,只有Trino(但Athena引擎v3应该基于Trino),在Trino中,你可以尝试使用min_by/max_by函数,结合random进行“排序”,以及一些ROW魔法:

select a, b, c, 
    min_by((d, e), random()).* as (d, e)
from my_table
group by a, b, c

另一种适用于v2和v3引擎的方法是使用row_number窗口函数进行分区,再次使用random进行排序:

select a, b, c, d, e 
from(
    select *,
       row_number() over(partition by a, b, c order by random()) rn
    from dataset)
where rn = 1
英文:

I do not have Presto installed, only Trino (but Athena engine v3 should be based on Trino) and in Trino you can try using min_by/max_by functions combining them with random for "ordering" and some ROW magic:

select a, b, c, 
    min_by((d, e), random()).* as (d, e)
from my_table
group by a, b, c

Another approach which should work with v2 and v3 engines is using row_number window function over partition and again use random for ordering:

select a, b, c, d, e 
from(
    select *,
       row_number() over(partition by a, b, c order by random()) rn
    from dataset)
where rn = 1

huangapple
  • 本文由 发表于 2023年3月7日 16:28:22
  • 转载请务必保留本文链接:https://go.coder-hub.com/75659538.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定