Very slow query performance in AWS PostgreSQL for a table with 4 billion rows

The Problem

We have a relational table where we store user activity. A query like the following takes 77 seconds!

FROM "site_activity"
WHERE
    (
        NOT "site_activity"."is_deleted"
        AND "site_activity"."user_id" = 68812389
        AND NOT (
            "site_activity"."kind" IN (
                'updated',
                'duplicated',
                'reapplied'
            )
        )
        AND NOT (
            "site_activity"."content_type_id" = 14
            AND "site_activity"."kind" = 'created'
        )
    )
ORDER BY
    "site_activity"."created_at" DESC,
    "site_activity"."id" DESC
LIMIT  9;

The query plan looks like this:

                                     QUERY PLAN
--------------------------------------------------------------------------------------------
Limit
    (cost=17750.72..27225.75 rows=9 width=16)
    (actual time=199501.336..199501.338 rows=9 loops=1)
  Output: id, created_at
  Buffers: shared hit=4502362 read=693523 written=37273
  I/O Timings: read=190288.205 write=446.870
  ->  Incremental Sort
      (cost=17750.72..2003433582.97 rows=1902974 width=16)
      (actual time=199501.335..199501.336 rows=9 loops=1)
        Output: id, created_at
        Sort Key: site_activity.created_at DESC, site_activity.id DESC
        Presorted Key: site_activity.created_at
        Full-sort Groups: 1  Sort Method: quicksort  Average Memory: 25kB  Peak Memory: 25kB
        Buffers: shared hit=4502362 read=693523 written=37273
        I/O Timings: read=190288.205 write=446.870
        ->  Index Scan Backward using site_activity_created_at_company_id_idx on public.site_activity
            (cost=0.58..2003345645.30 rows=1902974 width=16)
            (actual time=198971.283..199501.285 rows=10 loops=1)
              Output: id, created_at
              Filter: (
                (NOT site_activity.is_deleted) AND (site_activity.user_id = 68812389)
                AND ((site_activity.kind)::text <> ALL ('{updated,duplicated,reapplied}'::text[]))
                AND ((site_activity.content_type_id <> 14) OR ((site_activity.kind)::text <> 'created'::text))
              )
              Rows Removed by Filter: 14735308
              Buffers: shared hit=4502353 read=693523 written=37273
              I/O Timings: read=190288.205 write=446.870
Settings: effective_cache_size = '261200880kB',
          effective_io_concurrency = '400',
          jit = 'off',
          max_parallel_workers = '24',
          random_page_cost = '1.5',
          work_mem = '64MB'
Planning:
  Buffers: shared hit=344
Planning Time: 6.429 ms
Execution Time: 199501.365 ms
(22 rows)

Time: 199691.997 ms (03:19.692)

Table Facts

  1. It contains a little more than 4 billion rows.

  2. The table structure is

                                                Table "public.site_activity"
    Column      |           Type           | Collation | Nullable |                   Default
    ----------------+--------------------------+-----------+----------+----------------------------------------------
    id              | bigint                   |           | not null | nextval('site_activity_id_seq'::regclass)
    created_at      | timestamp with time zone |           | not null |
    modified_at     | timestamp with time zone |           | not null |
    is_deleted      | boolean                  |           | not null |
    object_id       | bigint                   |           | not null |
    kind            | character varying(32)    |           | not null |
    context         | text                     |           | not null |
    company_id      | integer                  |           | not null |
    content_type_id | integer                  |           | not null |
    user_id         | integer                  |           |          |
    Indexes:
        "site_activity_pkey" PRIMARY KEY, btree (id)
        "site_activity_modified_at_idx" btree (modified_at)
        "site_activity_company_id_idx" btree (company_id)
        "site_activity_created_at_company_id_idx" btree (created_at, company_id)
        "site_activity_object_id_idx" btree (object_id)
        "site_activity_content_type_id_idx" btree (content_type_id)
        "site_activity_kind_idx" btree (kind)
        "site_activity_kind_idx1" btree (kind varchar_pattern_ops)
        "site_activity_user_id_idx" btree (user_id)
    Foreign-key constraints:
        "site_activity_company_id_fk_site_company_id" FOREIGN KEY (company_id)
            REFERENCES site_company(id) DEFERRABLE INITIALLY DEFERRED
        "site_activity_content_type_id_fk_django_co" FOREIGN KEY (content_type_id)
            REFERENCES django_content_type(id) DEFERRABLE INITIALLY DEFERRED
        "site_activity_user_id_fk_site_user_id" FOREIGN KEY (user_id)
            REFERENCES site_user(id) DEFERRABLE INITIALLY DEFERRED

    a. kind is treated as an enum. In the database we store it as varchar, but in the application (Python) we treat it as an enum, so the set of values is fixed. There are around 100 of them.

    b. content_type_id has around 80 values.

  3. This is the distribution of values:

    a. context is actually JSON with a max size of 8 MB.

    b. Three content_type_id values hold 92% of the rows; two of them are 14 and 19.

    c. Three kind values account for 75% of the rows: created, updated, and sent.

    d. The combination of kind and content_type_id yields 460 distinct values. One of those combinations alone contains 35% of the rows, and we always exclude it in this query.

  4. The replica instance is a db.r5.12xlarge: 24 cores, 48 vCPUs, 384 GB of memory, io1 storage.

Question

  1. How do we handle this if the table grows to 100 billion rows? On the current projection, that could happen within the next 3-5 years.
  2. Is NoSQL a good solution? Note that we are not accessing the documents by only id or kind.

Notes

  1. The facts presented might bias the solution toward replication on the same host and, later, sharding over multiple hosts. But if there is some other solution that can keep up to the 100-billion mark, we should be good.
  2. We don't have to use AWS, but it is preferred.

Answer 1

Score: 6

The current plan is to scan the rows already ordered by "created_at" (using an index) and then stop once it finds 10 (plus maybe a few rows to account for ties) passing the rest of the conditions. It thinks it will do this very quickly, after only about 1/73,000 of the table (27225.75 / 2003433582.97). But in fact it had to scan much more than that (14735308 / 4000000000, or about 1/270 of the table), so it grossly misestimated that part. I don't know if it misestimated because the number of rows meeting the conditions was estimated incorrectly (it thought there would be 1902974; we don't know how many there actually were, since it stopped early and so stopped counting them), or because it assumed the matching rows would be evenly dispersed over the index when they were not.

The best index for you will probably be on (user_id, created_at). That way you get to jump to just the part of the index which has the correct user_id (which I assume is where the vast majority of your selectivity comes from) and then still walk that part already in order by "created_at". And you can drop the original index just on (user_id), as the new one will be good for anything that the old one is good for. You could also add "is_deleted" between the other two columns in that index, as it will not spoil the ordering property and will provide some additional selectivity (but probably not much). Any other columns added there will spoil the ordering property, however.
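
A minimal DDL sketch of that suggestion (the index names here are illustrative, and CONCURRENTLY is assumed so that a table this size stays writable during the build):

-- Jump straight to one user's slice of the index, already ordered by created_at.
CREATE INDEX CONCURRENTLY site_activity_user_id_created_at_idx
    ON public.site_activity (user_id, created_at);

-- Optional variant with is_deleted between the two columns, as described above:
-- CREATE INDEX CONCURRENTLY site_activity_user_id_del_created_at_idx
--     ON public.site_activity (user_id, is_deleted, created_at);

-- The old single-column index on user_id then becomes redundant:
DROP INDEX CONCURRENTLY site_activity_user_id_idx;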


Answer 2

Score: 5

Query

Start by formatting the WHERE clause to make it easier to understand. It comes down to:

FROM   site_activity s
WHERE  s.user_id = 68812389
AND    NOT s.is_deleted
AND    s.kind <> ALL ('{updated,duplicated,reapplied}'::text[])
AND    (s.content_type_id <> 14 OR s.kind <> 'created')
ORDER  BY s.created_at DESC, s.id DESC
LIMIT  9;

Index

You commented that you always exclude rows matching these two conditions, so this partial multicolumn index would be the optimum:

CREATE INDEX ON public.site_activity (user_id, created_at, id)
WHERE  NOT is_deleted
AND    (content_type_id <> 14 OR kind <> 'created');

Adding id only makes sense if there are many rows with the same (user_id, created_at); otherwise, drop id from the index.

Excluding large, irrelevant parts of the table from the index can pay for such big indexes. (But you may prevent HOT updates for changes on any of the columns involved in the index, including the ones in the WHERE clause.)

The index can only be used when its filter conditions are an obvious subset of the filter conditions in the query.
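
For example (a sketch: the first statement mirrors the query from the question, the second is a hypothetical variant):

-- Can use the partial index: the WHERE clause repeats the index predicate
-- (NOT is_deleted AND the content_type_id/kind exclusion), so the planner
-- can prove the index covers every qualifying row.
SELECT id, created_at
FROM   site_activity
WHERE  user_id = 68812389
AND    NOT is_deleted
AND    kind <> ALL ('{updated,duplicated,reapplied}'::text[])
AND    (content_type_id <> 14 OR kind <> 'created')
ORDER  BY created_at DESC, id DESC
LIMIT  9;

-- Cannot use it: NOT is_deleted and the exclusion are missing from the
-- WHERE clause, so rows outside the index predicate might qualify.
SELECT id, created_at
FROM   site_activity
WHERE  user_id = 68812389
ORDER  BY created_at DESC, id DESC
LIMIT  9;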

Table definition

It would pay to optimize your table definition. Like:

                           Table "public.site_activity"
    Column      |     Type     | Nullable |                   Default
----------------+--------------+----------+----------------------------------------------
id              | bigint       | not null | nextval('site_activity_id_seq'::regclass)
user_id         | int          | not null |  -- NOT NULL ??
kind            | smallint     | not null |  -- "around 100 values"
content_type_id | smallint     | not null |  -- "around 80 values"
created_at      | timestamptz  | not null |
modified_at     | timestamptz  | not null |
object_id       | bigint       | not null |
company_id      | int          | not null |
is_deleted      | bool         | not null |
context         | text         | not null |

Most importantly, kind now occupies 2 bytes instead of 33 bytes or more.
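
One way to get there, sketched under the assumption that the roughly 100 fixed strings are mapped to smallint codes via a lookup table (all names here are hypothetical; a native enum type is another option, at 4 bytes per value):

-- Hypothetical lookup table for the ~100 fixed kind values.
CREATE TABLE site_activity_kind (
    id   smallint PRIMARY KEY,
    name text     NOT NULL UNIQUE
);

INSERT INTO site_activity_kind (id, name)
VALUES (1, 'created'), (2, 'updated'), (3, 'sent');  -- ... and so on

-- site_activity.kind then becomes smallint NOT NULL REFERENCES
-- site_activity_kind (id), and the application maps enum members to codes.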

Plus substantial savings from rearranging the column order so that alignment padding is minimized.

The big column context ("with a max 8 MB size") will typically be stored out-of-line in a TOAST table for most rows, so the tuples to work with shrink to half their size. This makes a difference for most operations.
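
To check how much of the table actually lives out-of-line, a diagnostic along these lines can help (standard size functions; TOAST is lumped together with the small free-space and visibility maps):

-- Heap vs. TOAST vs. index sizes for the table from the question.
SELECT pg_size_pretty(pg_relation_size('public.site_activity'))  AS main_heap,
       pg_size_pretty(pg_table_size('public.site_activity')
                    - pg_relation_size('public.site_activity'))   AS toast_and_maps,
       pg_size_pretty(pg_indexes_size('public.site_activity'))    AS indexes;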

And I suspect that some of your indexes may be expendable.
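
One way to check, using the standard statistics views:

-- Per-index usage counters; indexes with idx_scan = 0 since the last
-- statistics reset are candidates for removal (after also checking
-- replicas and infrequent jobs).
SELECT indexrelname,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM   pg_stat_user_indexes
WHERE  relname = 'site_activity'
ORDER  BY idx_scan, pg_relation_size(indexrelid) DESC;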
