2023年8月8日 23:33:14go评论145阅读模式

英文:

BigQuery maximum overlap of array elements across rows

问题

我有以下数据：

row_id	user	data
1	Mark	`[1,2,3]`
2	Mark	`[1,2,4]`
3	Mark	`[1]`
4	Mark	`[3,4]`
5	Ben	`[1,2,3]`
6	Ben	`[2]`

我想要以下输出：

user	max_intersection
Mark	[1,2]
Ben	[2]

换句话说，我希望每个用户只有一行数据，该行数据包含至少在2行中存在的数组中的最大元素数量。

例如，对于Mark来说，[1,2]在第1行和第2行都出现。第3行中没有出现[2]并不重要。[3]不是答案，因为该特定行只有一个元素重叠，而[1,2]包含2个元素。[1,2,3]也不是答案，因为虽然每个元素在用户的某个地方至少出现两次，但重叠不是在两行之间发生，而是分散在更多行中。

如果Mark在第3行上有[1,3]，使得[1,2]和[1,3]都是正确的，我会尝试获取[1,2]，因为id号较低。但这不是最重要的。

我尝试根据这个问题中的答案进行调整，但在最大重叠方面遇到了困难。

英文:

I have the following data:

row_id	user	data
1	Mark	`[1,2,3]`
2	Mark	`[1,2,4]`
3	Mark	`[1]`
4	Mark	`[3,4]`
5	Ben	`[1,2,3]`
6	Ben	`[2]`

I want the following output

user	max_intersection
Mark	[1,2]
Ben	[2]

In other words, I want one row for each user which contains the maximum number of elements in an array that exist in at least 2 rows.

I.e. for Mark [1,2] occur both in row 1 and 2. That [2] doesn't occur in row 3 doesn't matter. [3] is not the answer because that particular row only has an intersection on [3], which is only 1 element overlap, whereas [1,2] contains 2 elements. [1,2,3] is not the answer because though each element occurs at least twice somewhere for the user, the overlap does not occur between two rows but is spread out over more.

If Mark had [1,3] on row 3, making both [1,2] and [1,3] correct, I'd try to get [1,2] because the id number is lower. But it's not the most important

I've tried to adapt the answer from this question but struggling with the maximum overlap.

答案1

得分: 2

考虑以下方法（BigQuery标准SQL）

with temp as (
  select row_id, user, el
  from your_table, unnest(data) el
)
select t1.user, array_agg(t1.el) as max_intersection
from temp t1
join temp t2
on t1.user = t2.user
and t1.row_id < t2.row_id
and t1.el = t2.el
group by user, t1.row_id, t2.row_id
qualify row_number() over(partition by user order by count(*) desc) = 1

如果应用于你问题中的示例数据，输出结果如下：

英文:

Consider below approach (BigQuery Standard SQL)

with temp as (
  select row_id, user, el
  from your_table, unnest(data) el
)
select t1.user, array_agg(t1.el) as max_intersection
from temp t1
join temp t2
on t1.user = t2.user
and t1.row_id &lt; t2.row_id
and t1.el = t2.el
group by user, t1.row_id, t2.row_id
qualify row_number() over(partition by user order by count(*) desc) = 1

if to apply to sample data as in your question - output is

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

BigQuery跨行数组元素的最大重叠

问题

答案1

如何在SQLite表中从数组中的每个位置删除字符串元素？

基于列中的数值计算增加值

春季启动查询DTO

一个包含1个元素的数组占用的内存是否与该1个元素的内存相同？

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。