英文:
Select random rows such that combination of a subgroup of columns is unique
问题
从 my_table
表中的列 a
,b
,c
,d
和 e
中,我想要随机选择 N
行,以确保列 a
,b
和 c
的组合是唯一的。
我尝试使用分组解决这个问题:
select a, b, c, min(d), min(e) from my_table
group by a, b, c
这种方法不起作用,因为它不会返回实际存在于 my_table
表中的行,因为列 d
和 e
的最小值可能分布在不同的行中。此外,min
不是一个适当的函数,因为它不会返回随机行。
我该如何实现这个目标?
我正在使用构建在 Presto 上并运行标准 ANSI SQL 的 Amazon Athena。
英文:
From my_table
with columns a
,b
,c
,d
and e
I want to select N
rows randomly such that the combinations of columns a
,b
and c
are unique.
I tried solving this with a group by:
select a, b, c, min(d), min(e) from my_table
group by a, b, c
This approach cannot work because it does not return actual existing rows from my_table
since the minimum can be in different rows for the columns d
and e
. Furthermore min
is not an appropriate function because it does not return random rows.
How could I achieve it?
I am working with amazon athena which is built on Presto and runs standard ANSI SQL.
答案1
得分: 1
你没有告诉我你正在使用哪个数据库管理系统,所以无法提供完整的解决方案。
但策略将是:
- 使用窗口函数对具有相同a、b、c的行进行分组。在Oracle中,这可能是这样的:
SELECT a, b, c, (row_number() over (partition by a, b, c order by a, b, c)) id_row
。 - 使用另一个选择语句从第一个选择中选择行,使用random()函数——但你必须查看你的数据库管理系统文档,找出要使用哪个函数。
英文:
You didn't tell which DBMS you are working with, so it's not possible to give a complete solution.
But the strategy would be to
- Use windowing functions for grouping rows having the same a, b, c. In Oracle this would be something like ``SELECT a, b, c, (row_number() over (partition by a, b, c order by a, b, c)) id_row
- Use another select that picks rows from the first select, using a random() function -- but you have to look at your DBMS documentation to find out which function to use
答案2
得分: 1
我没有安装Presto,只有Trino(但Athena引擎v3应该基于Trino),在Trino中,你可以尝试使用min_by
/max_by
函数,结合random
进行“排序”,以及一些ROW
魔法:
select a, b, c,
min_by((d, e), random()).* as (d, e)
from my_table
group by a, b, c
另一种适用于v2和v3引擎的方法是使用row_number
窗口函数进行分区,再次使用random
进行排序:
select a, b, c, d, e
from(
select *,
row_number() over(partition by a, b, c order by random()) rn
from dataset)
where rn = 1
英文:
I do not have Presto installed, only Trino (but Athena engine v3 should be based on Trino) and in Trino you can try using min_by
/max_by
functions combining them with random
for "ordering" and some ROW
magic:
select a, b, c,
min_by((d, e), random()).* as (d, e)
from my_table
group by a, b, c
Another approach which should work with v2 and v3 engines is using row_number
window function over partition and again use random
for ordering:
select a, b, c, d, e
from(
select *,
row_number() over(partition by a, b, c order by random()) rn
from dataset)
where rn = 1
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论