Postgres jsonb数组对象上的微积分

huangapple go评论63阅读模式
英文:

Calculus on Postgres jsonb array of objects

问题

我有一个包含 jsonb 类型列的Postgres表,看起来像这样:

id   | array_obj
---    ---
id_0 | [ {"Key": "k1", "Value": v1 }, {"Key": "k2", "Value": v2 }, ... ]

我的目标是计算类似于 "与键k1关联的值+与键k3关联的值" 这样的内容。我设法通过展开数组 jsonb_to_recordset 然后使用分组来总结所需的内容,但我觉得这不是很令人满意,因为我认为不应该需要展开/汇总模式(毕竟,我只想 "按行" 计算一些内容)。当前查询如下所示:

SELECT
  id,
  sum(val) filter (where k in ('k1', 'k3')) as result,
  sum(case when k in ('k1', 'k2') then 1 else 0 end) as countkeys
FROM 
  my_table, jsonb_to_recordset(array_obj) as expanded(k text, val int)
GROUP BY
  id

countkeys 用于处理存在'k1'但不存在'k2'的情况,因此总和不是预期值,因此我检查countkeys是否等于我对键进行求和的数量)。

执行计划如下(带有LIMIT 50):

Limit  (cost=0.56..256.99 rows=50 width=104) (actual time=0.217..4.965 rows=50 loops=1)
  Buffers: shared hit=166
  ->  GroupAggregate  (cost=0.56..17512338.31 rows=3414645 width=104) (actual time=0.216..4.959 rows=50 loops=1)
        Group Key: my_table.id
        Buffers: shared hit=166
        ->  Nested Loop  (cost=0.56..11485489.89 rows=341464500 width=72) (actual time=0.104..3.396 rows=5301 loops=1)
              Buffers: shared hit=166
              ->  Index Scan using xxx on my_table (cost=0.56..4656199.88 rows=3414645 width=498) (actual time=0.011..0.159 rows=51 loops=1)
                    Filter: (col = 0)
                    Rows Removed by Filter: 64
                    Buffers: shared hit=119
              ->  Function Scan on jsonb_to_recordset expanded  (cost=0.00..1.00 rows=100 width=40) (actual time=0.042..0.047 rows=104 loops=51)
                    Buffers: shared hit=47
Planning:
  Buffers: shared hit=198
Planning Time: 0.552 ms
Execution Time: 5.099 ms

所以基本上我只是尝试构建一种 "运算符",我可以使用它来 "查找与键k相关联的值",使用我拥有的数据类型。我已经尝试过文档中列出的运算符,但尚未找到解决方法...

请注意,这只是第一步,我的最终目标是计算多个聚合值(比如 result_1 = value(key_1) + value(key_3)result_2 = value(key_7) + value(key_124)等),所以我正在寻找一个不独立计算结果的解决方案。


响应 @SelVazi 的评论,这是一个典型的行:

id | [{"Key":"af_m1","Value":9772},{"Key":"af_m2","Value":7413},{"Key":"af_m3","Value":2359}]

假设我想计算 af_m1 + af_m3 = 9772 + 2359 = 12131。请注意,某些行可能没有具有键 af_m1af_m3 的对象,在这种情况下,我们应该跳过该行(但我认为我可以在了解如何计算我想要的内容后再处理这个问题)。

英文:

I have a Postgres table that has a column of type jsonb and that looks like:

id   | array_obj
---    ---
id_0 | [ {"Key": "k1", "Value": v1 }, {"Key": "k2", "Value": v2 }, ... ]

My goal is to compute something like "value associated to key k1 + value associated to key k3" for instance. I managed to do this by expanding the array jsonb_to_recordset and then using a group-by to sum what I needed but this is not very satisfying as I feel like there should be no need for this expansion / aggregation pattern (after all I just want to compute something "row-wise"). Currently the query looks like:

SELECT
  id,
  sum(val) filter (where k in ('k1', 'k3')) as result,
  sum(case when k in ('k1', 'k2') then 1 else 0 end) as countkeys
FROM 
  my_table, jsonb_to_recordset(array_obj) as expanded(k text, val int)
GROUP BY
  id

(countkeys is here to address cases when there is 'k1' but no 'k2', hence the sum is not really what's expected, so I then chekc that countkeys equals the number of keys I sum on).

The execution plan is the following (with LIMIT 50):

Limit  (cost=0.56..256.99 rows=50 width=104) (actual time=0.217..4.965 rows=50 loops=1)
  Buffers: shared hit=166
  ->  GroupAggregate  (cost=0.56..17512338.31 rows=3414645 width=104) (actual time=0.216..4.959 rows=50 loops=1)
        Group Key: my_table.id
        Buffers: shared hit=166
        ->  Nested Loop  (cost=0.56..11485489.89 rows=341464500 width=72) (actual time=0.104..3.396 rows=5301 loops=1)
              Buffers: shared hit=166
              ->  Index Scan using xxx on my_table (cost=0.56..4656199.88 rows=3414645 width=498) (actual time=0.011..0.159 rows=51 loops=1)
                    Filter: (col = 0)
                    Rows Removed by Filter: 64
                    Buffers: shared hit=119
              ->  Function Scan on jsonb_to_recordset expanded  (cost=0.00..1.00 rows=100 width=40) (actual time=0.042..0.047 rows=104 loops=51)
                    Buffers: shared hit=47
Planning:
  Buffers: shared hit=198
Planning Time: 0.552 ms
Execution Time: 5.099 ms

So basically I'm just trying to build some kind of "operator" that I could use for "find the value associated to key k" with the kind of data I have. I've tried with the operators listed in the documentation but haven't figured it out yet...

Note that this is a first step, my end goal is to compute several aggregates (say result_1 = value(key_1) + value(key_3), result_2 = value(key_7) + value(key_124), etc.) so I'm looking for a solution that does not compute results independantly.


In response to the comment of @SelVazi, here is a typical line:
> id | [{"Key":"af_m1","Value":9772},{"Key":"af_m2","Value":7413},{"Key":"af_m3","Value":2359}]

And say I'd like to compute af_m1 + af_m3 = 9772 + 2359 = 12131. Note that some line may not have any object with key af_m1 or af_m3 in that case we should skip the line (but that's OK I think I can handle it in a second time when I know how to compute what I want first).

答案1

得分: 1

I can save some time, about 30%, by doing the summation in a subquery. That way it doesn't need to be broken apart, mingled, and then resegregated with a GROUP BY. The original query uses parallel query which my variant cannot do, so the original is even using more CPU while it is underperforming.

explain (analyze, buffers)
SELECT
id,
(select sum("Value") filter (where "Key" in ('k1', 'k3')) as result from
jsonb_to_recordset(array_obj) as expanded("Key" text, "Value" double precision)
) as result
from my_table;

It would be much faster yet though to just store the data in a sensible format, either up-front or in a materialized view or just a transformed copy.

英文:

I can save some time, about 30%, by doing the summation in a subquery. That way it doesn't need to be broken apart, mingled, and then resegregated with a GROUP BY. The original query uses parallel query which my variant cannot do, so the original is even using more CPU while it is underperforming.

explain (analyze, buffers)
SELECT
  id, 
  (select sum("Value") filter (where "Key" in ('k1', 'k3')) as result from 
     jsonb_to_recordset(array_obj) as expanded("Key" text, "Value" double precision)
  ) as result
from my_table;

It would be much faster yet though to just store the data in a sensible format, either up-front or in a materialized view or just a transformed copy.

huangapple
  • 本文由 发表于 2023年2月24日 03:47:00
  • 转载请务必保留本文链接:https://go.coder-hub.com/75549647.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定