Calculations on a Postgres jsonb array of objects
Question
I have a Postgres table that has a column of type jsonb
and that looks like:
id | array_obj
--- ---
id_0 | [ {"Key": "k1", "Value": v1 }, {"Key": "k2", "Value": v2 }, ... ]
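For concreteness, a minimal setup matching this description might look like the following (table name, ids, and values are illustrative):

```sql
-- Hypothetical reproduction of the layout described above.
CREATE TABLE my_table (
    id        text PRIMARY KEY,
    array_obj jsonb
);

INSERT INTO my_table (id, array_obj) VALUES
    ('id_0', '[{"Key": "k1", "Value": 10}, {"Key": "k2", "Value": 20}, {"Key": "k3", "Value": 30}]');
```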
My goal is to compute something like "value associated with key k1 + value associated with key k3". I managed to do this by expanding the array with jsonb_to_recordset
and then using a GROUP BY to sum what I needed, but this is not very satisfying: I feel there should be no need for this expand/aggregate pattern (after all, I just want to compute something "row-wise"). The current query looks like:
SELECT
id,
sum("Value") filter (where "Key" in ('k1', 'k3')) as result,
count(*) filter (where "Key" in ('k1', 'k3')) as countkeys
FROM
my_table, jsonb_to_recordset(array_obj) as expanded("Key" text, "Value" int)
GROUP BY
id
(countkeys
is here to address cases where 'k1' is present but 'k3' is not, in which case the sum is not what is expected; I then check that countkeys equals the number of keys I sum over.)
The execution plan is the following (with LIMIT 50):
Limit (cost=0.56..256.99 rows=50 width=104) (actual time=0.217..4.965 rows=50 loops=1)
Buffers: shared hit=166
-> GroupAggregate (cost=0.56..17512338.31 rows=3414645 width=104) (actual time=0.216..4.959 rows=50 loops=1)
Group Key: my_table.id
Buffers: shared hit=166
-> Nested Loop (cost=0.56..11485489.89 rows=341464500 width=72) (actual time=0.104..3.396 rows=5301 loops=1)
Buffers: shared hit=166
-> Index Scan using xxx on my_table (cost=0.56..4656199.88 rows=3414645 width=498) (actual time=0.011..0.159 rows=51 loops=1)
Filter: (col = 0)
Rows Removed by Filter: 64
Buffers: shared hit=119
-> Function Scan on jsonb_to_recordset expanded (cost=0.00..1.00 rows=100 width=40) (actual time=0.042..0.047 rows=104 loops=51)
Buffers: shared hit=47
Planning:
Buffers: shared hit=198
Planning Time: 0.552 ms
Execution Time: 5.099 ms
So basically I'm just trying to build some kind of "operator" that I could use to "find the value associated with key k" for the kind of data I have. I've tried the operators listed in the documentation but haven't figured it out yet...
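As a sketch of such an operator, a SQL/JSON path expression can pull a single value out of the array per row, with no expansion or GROUP BY (assumes Postgres 12+ for jsonb_path_query_first; untested against the original table):

```sql
-- Extract the "Value" of the first array element whose "Key" matches,
-- then cast the jsonb scalar to int. A missing key yields NULL, which
-- propagates through the addition.
SELECT
    id,
    (jsonb_path_query_first(array_obj, '$[*] ? (@.Key == "k1").Value'))::int
  + (jsonb_path_query_first(array_obj, '$[*] ? (@.Key == "k3").Value'))::int AS result
FROM my_table;
```

Because a missing key makes the whole sum NULL, this also gives the "skip rows lacking a key" behavior for free.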
Note that this is a first step; my end goal is to compute several aggregates (say result_1 = value(key_1) + value(key_3)
, result_2 = value(key_7) + value(key_124)
, etc.), so I'm looking for a solution that does not compute the results independently.
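Staying within the expansion approach, several aggregates can at least share a single jsonb_to_recordset call, so the results are not computed independently (key names below are placeholders):

```sql
-- One expansion, several filtered aggregates over it.
SELECT
    id,
    sum("Value") FILTER (WHERE "Key" IN ('key_1', 'key_3'))   AS result_1,
    sum("Value") FILTER (WHERE "Key" IN ('key_7', 'key_124')) AS result_2
FROM my_table,
     jsonb_to_recordset(array_obj) AS expanded("Key" text, "Value" int)
GROUP BY id;
```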
In response to the comment of @SelVazi, here is a typical line:
> id | [{"Key":"af_m1","Value":9772},{"Key":"af_m2","Value":7413},{"Key":"af_m3","Value":2359}]
And say I'd like to compute af_m1 + af_m3 = 9772 + 2359 = 12131
. Note that some rows may not have any object with key af_m1
or af_m3
; in that case we should skip the row (but that's OK, I think I can handle that later, once I know how to compute what I want).
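For that presence check, jsonb containment can pre-filter rows before any computation; a sketch (not benchmarked against the real table):

```sql
-- @> keeps only rows whose array contains an element with
-- Key "af_m1" and one with Key "af_m3"; this operator can
-- use a GIN index on array_obj.
SELECT id
FROM my_table
WHERE array_obj @> '[{"Key": "af_m1"}]'
  AND array_obj @> '[{"Key": "af_m3"}]';
```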
Answer 1
Score: 1
I can save some time, about 30%, by doing the summation in a subquery. That way it doesn't need to be broken apart, mingled, and then resegregated with a GROUP BY. The original query uses parallel query which my variant cannot do, so the original is even using more CPU while it is underperforming.
explain (analyze, buffers)
SELECT
id,
(select sum("Value") filter (where "Key" in ('k1', 'k3')) as result from
jsonb_to_recordset(array_obj) as expanded("Key" text, "Value" double precision)
) as result
from my_table;
It would be much faster still to just store the data in a sensible format, either up-front, in a materialized view, or as a transformed copy.
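As one illustration of such a transformed copy (the view name is made up): collapsing each array of {"Key", "Value"} pairs into a plain jsonb object makes every per-key lookup a single ->> access:

```sql
-- One-off transformation: {"k1": v1, "k3": v3, ...}
-- instead of [{"Key": "k1", "Value": v1}, ...].
CREATE MATERIALIZED VIEW my_table_flat AS
SELECT
    id,
    (SELECT jsonb_object_agg(e ->> 'Key', e -> 'Value')
     FROM jsonb_array_elements(array_obj) AS e) AS kv
FROM my_table;

-- Lookup is then direct; a missing key gives NULL,
-- which propagates through the sum.
SELECT id, (kv ->> 'k1')::int + (kv ->> 'k3')::int AS result
FROM my_table_flat;
```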