Disable Postgres index updates temporarily and update the indexes manually later for insert statement performance
Question
I have about 1300 CSV files with almost 40k rows in each file, and I have written a Python script that reads a file and converts all 40k entries into a single INSERT statement to insert into a Postgres database.
The pseudocode is the following:
for file in tqdm(files, desc="Processing files"):
    rows = ReadFile(file)      # read all 40k rows from the file
    q = GenerateQuery(rows)    # convert all rows into a single bulk INSERT statement
    InsertData(q)              # execute the generated query to insert the data
The code works, but there are performance issues. When I start it on an empty table it runs at around 2 to 3 it/s, but after 15 to 20 files it slows to roughly 10 to 12 seconds per iteration, and performance keeps dropping with every further 10 to 15 files; the per-iteration time grows until it reaches 40 to 50 s/it, which is ridiculous. Based on my understanding I have developed the following hypothesis:
Since the table is empty at the start, updating its indexes is very cheap, so a 40k bulk insert spends almost no time on index maintenance; but as the records grow it becomes harder and harder to update the indexes of a table with 10M+ records.
My question is: can I temporarily disable index updates on the table, so that after the complete data dump I can rebuild the indexes manually by running some query in Postgres (which, for now, I don't even know exists)?
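For concreteness, here is a minimal runnable sketch of what the pseudocode above might look like with psycopg2's execute_values doing the bulk insert; the connection string, the table name measurements, and its columns are placeholders, not taken from the original post:

import csv
import glob

import psycopg2
from psycopg2.extras import execute_values
from tqdm import tqdm

files = sorted(glob.glob("data/*.csv"))                  # the ~1300 CSV files
conn = psycopg2.connect("dbname=mydb user=postgres")     # placeholder DSN
cur = conn.cursor()

for path in tqdm(files, desc="Processing files"):
    with open(path, newline="") as f:
        rows = list(csv.reader(f))                       # read all ~40k rows
    # one multi-row INSERT per file instead of 40k single-row statements
    execute_values(
        cur,
        "INSERT INTO measurements (col_a, col_b, col_c) VALUES %s",
        rows,
    )
    conn.commit()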
Answer 1
Score: 2
There are several options:

- If you have no restrictions, such as uniqueness constraints on the indexed columns, you can drop the indexes before inserting the tuples and easily recreate them after the insertion is completed (see the combined sketch after this list).

- If you have enough RAM, you can first read all the CSVs and then create a table using a SELECT statement. This process is performed in parallel, resulting in a much lower overall time. This approach can be helpful if you don't need to load the data into an existing table, and it may require the psycopg2 driver for Postgres.

- You can use Postgres's native method for loading tuples from a CSV file, COPY. If I am not mistaken, parallel execution of COPY is supported in Postgres versions 14-15.

- In any scenario, you can create an unlogged table (https://www.postgresql.org/docs/current/sql-createtable.html) or set an existing table to unlogged (ALTER TABLE foo SET UNLOGGED;), which can give a 50-200% speed increase even without parallel insertion. Be cautious, though: if there is existing data in the table and something goes wrong, you may lose all of it. If you are creating a new table this is not an issue, since you can simply repeat the copy operation. Either way, after the copy is completed it is recommended to make the table logged again (ALTER TABLE foo SET LOGGED;), which may take some time.
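Assuming a hypothetical table measurements with a single index on col_a (names are illustrative, not from the answer), a rough sketch of combining the first, third and fourth options from Python could look like this:

import glob

import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")         # placeholder DSN
cur = conn.cursor()

# Option 1: drop the index up front and recreate it once the load is done.
cur.execute("DROP INDEX IF EXISTS measurements_col_a_idx;")
# Option 4: skip WAL while loading (the data is lost if the server crashes mid-load).
cur.execute("ALTER TABLE measurements SET UNLOGGED;")
conn.commit()

# Option 3: let Postgres parse the CSV files itself via COPY.
for path in sorted(glob.glob("data/*.csv")):
    with open(path) as f:
        cur.copy_expert(
            "COPY measurements FROM STDIN WITH (FORMAT csv, HEADER true)", f
        )
    conn.commit()

# Restore the index and durability; both steps can take a while on tens of millions of rows.
cur.execute("CREATE INDEX measurements_col_a_idx ON measurements (col_a);")
cur.execute("ALTER TABLE measurements SET LOGGED;")
conn.commit()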
Answer 2
Score: 0
No, you cannot disable indexes. If the table is empty or almost empty when you start loading, you can win by dropping the indexes before you start and creating them again afterwards. However, INSERT performance should stay constant over time, so it is difficult to guess what the cause might be. You can try using auto_explain to capture EXPLAIN (ANALYZE, BUFFERS) output for a slow insert; that may give you a clue (a minimal setup is sketched below).
For bulk loads, you will get the best performance with COPY.
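As a hedged illustration, enabling auto_explain for the loading session could look roughly like this (LOAD and these settings require superuser privileges, and the threshold is only an example):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")      # placeholder DSN
cur = conn.cursor()
cur.execute("LOAD 'auto_explain';")                       # superuser only
cur.execute("SET auto_explain.log_min_duration = 1000;")  # log statements slower than 1s
cur.execute("SET auto_explain.log_analyze = on;")         # add EXPLAIN ANALYZE timings
cur.execute("SET auto_explain.log_buffers = on;")         # add BUFFERS counters
# Run the slow INSERTs on this connection; the plans appear in the server log.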
Answer 3
Score: 0
No, PostgreSQL does not have such a feature (although GIN indexes do have fastupdate, which is somewhat similar). But since the table starts out empty, simply not having the indexes at first and then creating them after the load would give the identical outcome (and this is what pg_dump arranges to happen).
Maintaining indexes generally involves jumping to arbitrary leaf pages and dirtying those pages. Once the combined size of all the indexes no longer fits in shared_buffers (and certainly once they no longer fit in RAM), this involves a lot of random IO, leading to a performance collapse. A quick way to check how close you are to that point is sketched below.
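A quick way to compare the combined index size with shared_buffers, assuming the hypothetical table name measurements used in the sketches above:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")      # placeholder DSN
cur = conn.cursor()
cur.execute("SELECT pg_size_pretty(pg_indexes_size('measurements'));")
print("combined index size:", cur.fetchone()[0])          # all indexes on the table
cur.execute("SHOW shared_buffers;")
print("shared_buffers:     ", cur.fetchone()[0])          # size of the buffer cache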