如何根据应用于两列的值范围条件来优化数据聚合。

huangapple go评论86阅读模式
英文:

how to optimize data aggregation based on value range conditions applied on two columns

问题

I have a particles table in PostgreSQL 10.19 that looks like this:

  1. CREATE TABLE particles (
  2. particle_diameter REAL,
  3. particle_speed REAL
  4. );
  5. INSERT INTO particles VALUES
  6. (0.35, 0.74),
  7. (0.57, 2.63),
  8. (0.27, 1.05),
  9. (0.65, 2.33);

What I want is to aggregate my data by diameter range (i.e. each particles that have a diameter between for example 0 and 0.2 mm, 0.2 and 0.4 mm, etc.) and speed range. I want to generate a series of diameter range, and a series of speed range and finally count the number of particles for each duo diameter range and speed range.

So far I managed to reach the desired result using this query:

  1. WITH speed_series AS (
  2. SELECT generate_series(-1, 19.8, 0.2) AS speed_from
  3. ), speed_range AS (
  4. SELECT speed_from, (speed_from + 0.2) AS speed_to FROM speed_series
  5. ), diameter_series AS (
  6. SELECT generate_series(0, 9.8, 0.2) AS diameter_from
  7. ), diameter_range AS (
  8. SELECT
  9. diameter_from, (diameter_from + 0.2) AS diameter_to, speed_from, speed_to
  10. FROM diameter_series, speed_range
  11. )
  12. SELECT
  13. diameter_from,
  14. diameter_to,
  15. speed_from,
  16. speed_to,
  17. (SELECT
  18. COUNT(particle_diameter)
  19. FROM particles
  20. WHERE particle_diameter BETWEEN diameter_from AND diameter_to
  21. AND particle_speed BETWEEN speed_from AND speed_to
  22. )
  23. FROM diameter_range;

On a relatively small dataset (~30k records) this query took more than a minute to execute. So my question is:

Is there a way to rewrite this query to be more efficient and less time-consuming?

英文:

I have a particles table in PostgreSQL 10.19 that looks like this:

  1. CREATE TABLE particles (
  2. particle_diameter REAL,
  3. particle_speed REAL
  4. );
  5. INSERT INTO particles VALUES
  6. (0.35, 0.74),
  7. (0.57, 2.63),
  8. (0.27, 1.05),
  9. (0.65, 2.33);

What I want is to aggregate my data by diameter range (i.e. each particles that have a diameter between for example 0 and 0.2 mm, 0.2 and 0.4 mm, etc.) and speed range. I want to generate a serie of diameter range, and a serie of speed range and finally count the number of particles for each duo diameter range and speed range.

So far I managed to reach the desired result using this query:

  1. WITH speed_series AS (
  2. SELECT generate_series(-1, 19.8, 0.2) AS speed_from
  3. ), speed_range AS (
  4. SELECT speed_from, (speed_from + 0.2) AS speed_to FROM speed_series
  5. ), diameter_series AS (
  6. SELECT generate_series(0, 9.8, 0.2) AS diameter_from
  7. ), diameter_range AS (
  8. SELECT
  9. diameter_from, (diameter_from + 0.2) AS diameter_to, speed_from, speed_to
  10. FROM diameter_series, speed_range
  11. )
  12. SELECT
  13. diameter_from,
  14. diameter_to,
  15. speed_from,
  16. speed_to,
  17. (SELECT
  18. COUNT(particle_diameter)
  19. FROM particles
  20. WHERE particle_diameter BETWEEN diameter_from AND diameter_to
  21. AND particle_speed BETWEEN speed_from AND speed_to
  22. )
  23. FROM diameter_range;

You can explore it on this :
db<>fiddle

On a relatively small dataset (~30k records) this query took more than a minute to execute. So my question is:

Is there a way to rewrite this query to be more efficient and less time consuming?

答案1

得分: 1

  1. 我会在这里尝试使用width_bucket()floor()函数:
  2. with b as (
  3. select width_bucket(particle_diameter, 0, 10, 50) pd,
  4. width_bucket(particle_speed, -1, 20, 105) ps
  5. from particles)
  6. select pd * .2 - .2 diam_from, pd * .2 diam_to,
  7. ps * .2 - 1.2 speed_from, ps * .2 - 1 speed_to,
  8. count(1) cnt
  9. from b group by pd, ps
  10. 当然,如果你需要这些零行,你可以与生成的序列连接:
  11. with b as (
  12. select width_bucket(particle_diameter, 0, 10, 50) pd,
  13. width_bucket(particle_speed, -1, 20, 105) ps
  14. from particles)
  15. select d diam_from, d+.2 diam_to, s speed_from, s+.2 speed_to, coalesce(t.cnt, 0) cnt
  16. from (select generate_series(0, 9.8, 0.2)) as dm(d)
  17. cross join (select generate_series(-1, 19.8, 0.2)) AS sp(s)
  18. left join (
  19. select pd * .2 - .2 diam_from, ps * .2 - 1.2 speed_from, count(1) cnt
  20. from b group by pd, ps) t
  21. on dm.d = t.diam_from and sp.s = t.speed_from
  22. [dbfiddle demo](https://dbfiddle.uk/k1zekvTd)
英文:

I would try width_bucket() or floor() here:

  1. with b as (
  2. select width_bucket(particle_diameter, 0, 10, 50) pd,
  3. width_bucket(particle_speed, -1, 20, 105) ps
  4. from particles)
  5. select pd * .2 - .2 diam_from, pd * .2 diam_to,
  6. ps * .2 - 1.2 speed_from, ps * .2 - 1 speed_to,
  7. count(1) cnt
  8. from b group by pd, ps

Of course you can join with generated series if you need these zero-rows:

  1. with b as (
  2. select width_bucket(particle_diameter, 0, 10, 50) pd,
  3. width_bucket(particle_speed, -1, 20, 105) ps
  4. from particles)
  5. select d diam_from, d+.2 diam_to, s speed_from, s+.2 speed_to, coalesce(t.cnt, 0) cnt
  6. from (select generate_series(0, 9.8, 0.2)) as dm(d)
  7. cross join (select generate_series(-1, 19.8, 0.2)) AS sp(s)
  8. left join (
  9. select pd * .2 - .2 diam_from, ps * .2 - 1.2 speed_from, count(1) cnt
  10. from b group by pd, ps) t
  11. on dm.d = t.diam_from and sp.s = t.speed_from

dbfiddle demo

huangapple
  • 本文由 发表于 2023年6月27日 17:42:10
  • 转载请务必保留本文链接:https://go.coder-hub.com/76563572.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定