英文:
BigQuery maximum overlap of array elements across rows
问题
我有以下数据:
row_id | user | data |
---|---|---|
1 | Mark | [1,2,3] |
2 | Mark | [1,2,4] |
3 | Mark | [1] |
4 | Mark | [3,4] |
5 | Ben | [1,2,3] |
6 | Ben | [2] |
我想要以下输出:
user | max_intersection |
---|---|
Mark | [1,2] |
Ben | [2] |
换句话说,我希望每个用户只有一行数据,该行数据包含至少在2行中存在的数组中的最大元素数量。
例如,对于Mark来说,[1,2]在第1行和第2行都出现。第3行中没有出现[2]并不重要。[3]不是答案,因为该特定行只有一个元素重叠,而[1,2]包含2个元素。[1,2,3]也不是答案,因为虽然每个元素在用户的某个地方至少出现两次,但重叠不是在两行之间发生,而是分散在更多行中。
如果Mark在第3行上有[1,3],使得[1,2]和[1,3]都是正确的,我会尝试获取[1,2],因为id号较低。但这不是最重要的。
我尝试根据这个问题中的答案进行调整,但在最大重叠方面遇到了困难。
英文:
I have the following data:
row_id | user | data |
---|---|---|
1 | Mark | [1,2,3] |
2 | Mark | [1,2,4] |
3 | Mark | [1] |
4 | Mark | [3,4] |
5 | Ben | [1,2,3] |
6 | Ben | [2] |
I want the following output
user | max_intersection |
---|---|
Mark | [1,2] |
Ben | [2] |
In other words, I want one row for each user which contains the maximum number of elements in an array that exist in at least 2 rows.
I.e. for Mark [1,2] occur both in row 1 and 2. That [2] doesn't occur in row 3 doesn't matter. [3] is not the answer because that particular row only has an intersection on [3], which is only 1 element overlap, whereas [1,2] contains 2 elements. [1,2,3] is not the answer because though each element occurs at least twice somewhere for the user, the overlap does not occur between two rows but is spread out over more.
If Mark had [1,3] on row 3, making both [1,2] and [1,3] correct, I'd try to get [1,2] because the id number is lower. But it's not the most important
I've tried to adapt the answer from this question but struggling with the maximum overlap.
答案1
得分: 2
考虑以下方法(BigQuery标准SQL)
with temp as (
select row_id, user, el
from your_table, unnest(data) el
)
select t1.user, array_agg(t1.el) as max_intersection
from temp t1
join temp t2
on t1.user = t2.user
and t1.row_id < t2.row_id
and t1.el = t2.el
group by user, t1.row_id, t2.row_id
qualify row_number() over(partition by user order by count(*) desc) = 1
如果应用于你问题中的示例数据,输出结果如下:
英文:
Consider below approach (BigQuery Standard SQL)
with temp as (
select row_id, user, el
from your_table, unnest(data) el
)
select t1.user, array_agg(t1.el) as max_intersection
from temp t1
join temp t2
on t1.user = t2.user
and t1.row_id < t2.row_id
and t1.el = t2.el
group by user, t1.row_id, t2.row_id
qualify row_number() over(partition by user order by count(*) desc) = 1
if to apply to sample data as in your question - output is
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论