BigQuery跨行数组元素的最大重叠

huangapple go评论90阅读模式
英文:

BigQuery maximum overlap of array elements across rows

问题

我有以下数据:

row_id user data
1 Mark [1,2,3]
2 Mark [1,2,4]
3 Mark [1]
4 Mark [3,4]
5 Ben [1,2,3]
6 Ben [2]

我想要以下输出:

user max_intersection
Mark [1,2]
Ben [2]

换句话说,我希望每个用户只有一行数据,该行数据包含至少在2行中存在的数组中的最大元素数量。

例如,对于Mark来说,[1,2]在第1行和第2行都出现。第3行中没有出现[2]并不重要。[3]不是答案,因为该特定行只有一个元素重叠,而[1,2]包含2个元素。[1,2,3]也不是答案,因为虽然每个元素在用户的某个地方至少出现两次,但重叠不是在两行之间发生,而是分散在更多行中。

如果Mark在第3行上有[1,3],使得[1,2]和[1,3]都是正确的,我会尝试获取[1,2],因为id号较低。但这不是最重要的。

我尝试根据这个问题中的答案进行调整,但在最大重叠方面遇到了困难。

英文:

I have the following data:

row_id user data
1 Mark [1,2,3]
2 Mark [1,2,4]
3 Mark [1]
4 Mark [3,4]
5 Ben [1,2,3]
6 Ben [2]

I want the following output

user max_intersection
Mark [1,2]
Ben [2]

In other words, I want one row for each user which contains the maximum number of elements in an array that exist in at least 2 rows.

I.e. for Mark [1,2] occur both in row 1 and 2. That [2] doesn't occur in row 3 doesn't matter. [3] is not the answer because that particular row only has an intersection on [3], which is only 1 element overlap, whereas [1,2] contains 2 elements. [1,2,3] is not the answer because though each element occurs at least twice somewhere for the user, the overlap does not occur between two rows but is spread out over more.

If Mark had [1,3] on row 3, making both [1,2] and [1,3] correct, I'd try to get [1,2] because the id number is lower. But it's not the most important

I've tried to adapt the answer from this question but struggling with the maximum overlap.

答案1

得分: 2

考虑以下方法(BigQuery标准SQL)

with temp as (
  select row_id, user, el
  from your_table, unnest(data) el
)
select t1.user, array_agg(t1.el) as max_intersection
from temp t1
join temp t2
on t1.user = t2.user
and t1.row_id < t2.row_id
and t1.el = t2.el
group by user, t1.row_id, t2.row_id
qualify row_number() over(partition by user order by count(*) desc) = 1

如果应用于你问题中的示例数据,输出结果如下:

BigQuery跨行数组元素的最大重叠

英文:

Consider below approach (BigQuery Standard SQL)

with temp as (
  select row_id, user, el
  from your_table, unnest(data) el
)
select t1.user, array_agg(t1.el) as max_intersection
from temp t1
join temp t2
on t1.user = t2.user
and t1.row_id &lt; t2.row_id
and t1.el = t2.el
group by user, t1.row_id, t2.row_id
qualify row_number() over(partition by user order by count(*) desc) = 1    

if to apply to sample data as in your question - output is

BigQuery跨行数组元素的最大重叠

huangapple
  • 本文由 发表于 2023年8月8日 23:33:14
  • 转载请务必保留本文链接:https://go.coder-hub.com/76861074.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定