
PostgreSQL sending large XML values is too slow

Question


We have a table with around 10k rows, with the following schema:

  1. item_id: TEXT (Primary Key)
  2. xml_1: XML
  3. xml_2: XML
  4. country: TEXT

Running the following query takes around 9 to 10 seconds:

SELECT * FROM info_table
WHERE item_id IN ('item1','item2','...' -> 'item2000')

Each of our SELECT queries passes an array of roughly 2,000 item IDs (strings). The query itself is extremely simple, and we are looking to optimize it (if possible).
The XML in each row is around 100 KB.

If it helps, our query is issued from Node.js using Knex:

client.select('*').from('info_table').where('item_id','in',ids)

The server runs PostgreSQL 14 hosted on GCP Cloud SQL, with 2 vCPUs, 8 GB of memory, and a 100 GB SSD.

Results of EXPLAIN (ANALYZE, BUFFERS):

Seq Scan on epg_test  (cost=4.85..740.17 rows=1939 width=601) (actual time=0.168..3.432 rows=1837 loops=1)
  Filter: (epg_id = ANY (Array of 2000 IDs))
  Rows Removed by Filter: 6051
  Buffers: shared hit=617
Planning:
  Buffers: shared hit=130
Planning Time: 1.999 ms
Execution Time: 3.590 ms

Any ideas of what we can do?

Answer 1

Score: 1


From your question and comments, it's clear that the PostgreSQL side of your query is trivial: it executes in under 4 ms. Therefore, indexing or other SQL tuning isn't part of your solution.

It's also clear that you're returning a large result set: roughly 2,000 rows at ~100 KB of XML each, or about 0.2 GiB. And you're doing it in ten seconds or so. That means your throughput is about 20 MiB/s, which is excellent, especially if you're retrieving the data onto a machine on your premises from a server located somewhere in GCP. (Keep in mind that 20 megabytes per second is upwards of 160 megabits per second. That's a significant amount of bandwidth to push from one machine to another.)

How can you get this big data transfer to complete faster?

  1. More bandwidth. That's something to take up with your operations people, or you can move the machine running the query closer on the network to the database machine.

  2. Compressing the data in transit. XML is generally quite compressible (information-theoretically, it's almost pathologically verbose). The PostgreSQL driver for Node.js (and Knex) has a deprecated sslcompression connection-string flag that applies lossless compression to the client-server network traffic. That might help.

    Or, you may be able to tunnel your database connection through an ssh session set up with the -C (compression) flag, for example something like ssh -C -L 5433:localhost:5432 user@db-host, and then point the client at localhost:5433.

  3. Compressing the data at rest in your database. If you do this, make sure you store the compressed XML in columns with a binary data type such as bytea; see the sketch after this list.
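
To make option 3 concrete, here is a minimal sketch in Node.js with Knex, assuming a hypothetical bytea column named xml_1_gz (the column name and helper function are made up for illustration); it gzips the XML with Node's built-in zlib module at write time:

const zlib = require('zlib');

// Hypothetical schema: info_table(item_id TEXT PRIMARY KEY, xml_1_gz BYTEA, country TEXT)
// Compress the XML once, at insert time, and store the gzipped bytes.
async function insertItem(client, itemId, xml1, country) {
  await client('info_table').insert({
    item_id: itemId,
    xml_1_gz: zlib.gzipSync(Buffer.from(xml1, 'utf8')), // a Buffer maps to bytea
    country: country,
  });
}

Reading the data back is the mirror image: select the bytea column and gunzip it (see the sketch under Answer 2).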

All that being said, ten seconds to process that much data doesn't seem terribly unreasonable.

Answer 2

Score: 0


Thank you everyone for the input; it was very helpful.

We managed to get the query down to 2 seconds by compressing the XML data before inserting it into PostgreSQL, which massively reduced the overall data size.

We compressed the XML using Node's "zlib" module with gzip.
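
As a rough sketch of that approach (the column name xml_1_gz and the helper names are assumptions, not from the original answer), the compress/decompress round trip with Node's built-in zlib module looks like this:

const zlib = require('zlib');

// Write path: gzip the XML string into a Buffer suitable for a bytea column.
function compressXml(xml) {
  return zlib.gzipSync(Buffer.from(xml, 'utf8'));
}

// Read path: gunzip the bytea Buffer back into the original XML string.
function decompressXml(buf) {
  return zlib.gunzipSync(buf).toString('utf8');
}

// Example usage after the original Knex query:
// const rows = await client.select('*').from('info_table').whereIn('item_id', ids);
// const xml = decompressXml(rows[0].xml_1_gz);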
