英文:
Select random rows such that combination of a subgroup of columns is unique
问题
从 my_table 表中的列 a,b,c,d 和 e 中,我想要随机选择 N 行,以确保列 a,b 和 c 的组合是唯一的。
我尝试使用分组解决这个问题:
select a, b, c, min(d), min(e) from my_table
group by a, b, c
这种方法不起作用,因为它不会返回实际存在于 my_table 表中的行,因为列 d 和 e 的最小值可能分布在不同的行中。此外,min 不是一个适当的函数,因为它不会返回随机行。
我该如何实现这个目标?
我正在使用构建在 Presto 上并运行标准 ANSI SQL 的 Amazon Athena。
英文:
From my_table with columns a,b,c,d and e I want to select N rows randomly such that the combinations of columns a,b and c are unique.
I tried solving this with a group by:
select a, b, c, min(d), min(e) from my_table
group by a, b, c
This approach cannot work because it does not return actual existing rows from my_table since the minimum can be in different rows for the columns d and e. Furthermore min is not an appropriate function because it does not return random rows.
How could I achieve it?
I am working with amazon athena which is built on Presto and runs standard ANSI SQL.
答案1
得分: 1
你没有告诉我你正在使用哪个数据库管理系统,所以无法提供完整的解决方案。
但策略将是:
- 使用窗口函数对具有相同a、b、c的行进行分组。在Oracle中,这可能是这样的:
SELECT a, b, c, (row_number() over (partition by a, b, c order by a, b, c)) id_row。 - 使用另一个选择语句从第一个选择中选择行,使用random()函数——但你必须查看你的数据库管理系统文档,找出要使用哪个函数。
英文:
You didn't tell which DBMS you are working with, so it's not possible to give a complete solution.
But the strategy would be to
- Use windowing functions for grouping rows having the same a, b, c. In Oracle this would be something like ``SELECT a, b, c, (row_number() over (partition by a, b, c order by a, b, c)) id_row
- Use another select that picks rows from the first select, using a random() function -- but you have to look at your DBMS documentation to find out which function to use
答案2
得分: 1
我没有安装Presto,只有Trino(但Athena引擎v3应该基于Trino),在Trino中,你可以尝试使用min_by/max_by函数,结合random进行“排序”,以及一些ROW魔法:
select a, b, c,
min_by((d, e), random()).* as (d, e)
from my_table
group by a, b, c
另一种适用于v2和v3引擎的方法是使用row_number窗口函数进行分区,再次使用random进行排序:
select a, b, c, d, e
from(
select *,
row_number() over(partition by a, b, c order by random()) rn
from dataset)
where rn = 1
英文:
I do not have Presto installed, only Trino (but Athena engine v3 should be based on Trino) and in Trino you can try using min_by/max_by functions combining them with random for "ordering" and some ROW magic:
select a, b, c,
min_by((d, e), random()).* as (d, e)
from my_table
group by a, b, c
Another approach which should work with v2 and v3 engines is using row_number window function over partition and again use random for ordering:
select a, b, c, d, e
from(
select *,
row_number() over(partition by a, b, c order by random()) rn
from dataset)
where rn = 1
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。


评论