Python (PyMySQL) getting random 2003 error, timing out on connection to remote MySQL server

Question

I have a weird one that I've spent a few days on without any luck. I'm coming here in the hope that someone can point me in a direction or has seen something similar.

So, what is happening is that I have two Debian VMs on GCP running MySQL 8; call them 'Processing Server' and 'History Server'. Processing Server processes ~1 million messages a day and inserts each message into one of ~6000 tables on History Server (using Python with PyMySQL). This ran fine for almost a month without an issue, and then suddenly, every 0-30 minutes, I get connection timeouts from History Server:

"OperationalError(2003, "Can't connect to MySQL server on 'history.server' (timed out)")".

I am using PMM and have dug through all my metrics/logs around this time and found nothing related to MySQL (no weird spikes or dips in anything obvious, nothing maxed out, etc.).

After digging through MySQL and all available logs looking for any correlations, I am thinking it is related to a network or server issue, but I have no clue where to look. I am using GCP virtual machines running Debian, and I don't think anything changed around the time this issue started. I also don't see anything obvious in the various logs I've looked at on my VMs. I don't see any resource spikes (disk, RAM, CPU). It's so weird, because it seems like my Python connector just randomly can't connect to the server occasionally. I know this is a long shot, but does anyone have a direction they might look in? Happy to provide any more information that might be helpful.
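
One way to start separating a network problem from a MySQL problem is to probe raw TCP connectivity to the MySQL port alongside the application. The sketch below is not from the original post; the hostname, port, and timings are placeholders:

    # Minimal TCP-level probe (sketch): time a bare socket connect to the MySQL
    # port in a loop, so pure network failures can be compared against the
    # PyMySQL 2003 errors. 'history.server' and the timings are placeholders.
    import socket
    import time
    from datetime import datetime

    HOST, PORT, TIMEOUT = "history.server", 3306, 10

    while True:
        start = time.monotonic()
        try:
            with socket.create_connection((HOST, PORT), timeout=TIMEOUT):
                pass  # connect succeeded; we only care about latency/failures
            print(f"{datetime.now().isoformat()} ok in {time.monotonic() - start:.3f}s")
        except OSError as exc:
            print(f"{datetime.now().isoformat()} FAIL after {time.monotonic() - start:.3f}s: {exc}")
        time.sleep(1)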

I created a test script on 'Processing Server' that hits History Server with the same query that is failing, and I can get it to give me this timeout connection error within 0-10 minutes just by spamming the query in a loop. If I make that same program open and close a connection to MySQL without executing the stored proc, I see it hang for a couple of seconds now and then, but generally not long enough to trigger the same timeout.
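
For reference, a loop along those lines might look like the sketch below; the host, credentials, and stored procedure name are hypothetical placeholders, not details from the original post:

    # Sketch of the reproduction loop: open a connection, call the stored
    # procedure, close, and log any OperationalError with a timestamp so it
    # can be correlated with other metrics. Connection details and the
    # procedure name are placeholders.
    import time
    from datetime import datetime

    import pymysql

    while True:
        try:
            conn = pymysql.connect(
                host="history.server",
                user="app_user",
                password="secret",
                database="history",
                connect_timeout=10,
            )
            try:
                with conn.cursor() as cur:
                    cur.execute("CALL insert_message(%s, %s)", ("test-key", "test-payload"))
                conn.commit()
            finally:
                conn.close()
        except pymysql.err.OperationalError as exc:
            print(f"{datetime.now().isoformat()} failed: {exc}")
        time.sleep(0.1)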

No clue where to look. I wish there were somewhere in MySQL where I could see rejected/failed connections and try to correlate them with ANYTHING. No graphs or logs available to me seem to show any meaningful patterns.
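
That said, MySQL does track some counters for failed connection attempts (the Aborted_connects and Aborted_clients status variables, the Connection_errors_% family, and the performance_schema.host_cache table), and polling them next to the application's timeout timestamps is cheap. A minimal sketch with placeholder credentials:

    # Sketch: dump MySQL's connection-failure counters so they can be lined up
    # against the timestamps of the 2003 errors. Credentials are placeholders.
    from datetime import datetime

    import pymysql

    conn = pymysql.connect(host="history.server", user="monitor", password="secret")
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL STATUS LIKE 'Aborted_%'")            # Aborted_clients, Aborted_connects
        rows = list(cur.fetchall())
        cur.execute("SHOW GLOBAL STATUS LIKE 'Connection_errors%'")   # per-cause connection error counters
        rows += list(cur.fetchall())
        cur.execute("SELECT * FROM performance_schema.host_cache")    # per-client-host error details
        host_cache_rows = cur.fetchall()
    conn.close()

    print(datetime.now().isoformat())
    for name, value in rows:
        print(f"{name} = {value}")
    print(f"host_cache rows: {len(host_cache_rows)}")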

Systems Data:

History Server

- OS
    - Debian GNU/Linux 11 (bullseye)
- RAM
    - 12GB
- Cores
    - 2vCPU
    - 2 Threads/Core
- DISK
    - 750GB "Balanced persistent disk"
    - Read IOPS per instance       : 3,000
    - Write IOPS per instance      : 3,000
    - Read throughput per instance : 140 MB/s
    - Write throughput per instance: 140 MB/s
- Host Server
    - GCP Compute Engine
- Data Dumps:
    - Table Info: https://justpaste.it/3qapn
    - Global Status: https://justpaste.it/dfm6j
    - Global Variables: https://justpaste.it/87sl8
    - Process List: https://justpaste.it/3xh3f
    - Status: https://justpaste.it/1oka4
    - Innodb Status: https://justpaste.it/3wdck

Processing Server

- OS
    - Debian GNU/Linux 10 (buster)
- RAM
    - 32GB
- Cores
    - 8vCPU
    - 2 Threads/Core
- DISK
    - 1,000GB "SSD persistent disk"
    - Read IOPS per instance       : 15,000
    - Write IOPS per instance      : 9,000
    - Read throughput per instance : 240 MB/s
    - Write throughput per instance: 204 MB/s
- Host Server
    - GCP Compute Engine
- Data Dumps:
    - Table Info: https://justpaste.it/8o85k
    - Global Status: https://justpaste.it/7eukf
    - Global Variables: https://justpaste.it/df3z4
    - Process List: https://justpaste.it/bl4u2
    - Status: https://justpaste.it/4b0dj
    - Innodb Status: https://justpaste.it/91z2g

I have pored through all available system & MySQL logs, used Percona Monitoring and Management (PMM) to look for spikes or dips and maxed- or min'd-out values, and watched resources using 'top' as well as GCP's Cloud Monitoring, all with no luck. I see no correlation between the timestamps of my query timeouts and anything significant from these sources. There are no obvious patterns. There is also no pattern in the timing between timeouts (anywhere between 0 and 30 minutes between timeouts, but nothing consistent).

Answer 1

Score: 1

Rate Per Second = RPS

Suggestions to consider for your History Server Database flags.

    connect_timeout=20               # raise from the 10-second default for more tolerance to complete a connection
    innodb_parallel_read_threads=2   # reduce from 4 because you only have 2 cores
    table_open_cache_instances=2     # reduce from 16 because you only have 2 cores
    innodb_lru_scan_depth=100        # reduce from 1024 to conserve CPU cycles used for this function
    innodb_log_buffer_size=67108864  # raise from 16M to reduce the innodb_os_log_written RPS of 30,938
    read_rnd_buffer_size=16384       # reduce from 256K to reduce the handler_read_rnd_next RPS of 505

There are other opportunities to improve performance with additional changes.
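
To see where these settings currently stand, and to watch the counters cited above move, the values can be read back over the same kind of PyMySQL connection the application already uses. This is only a sketch with placeholder credentials, not part of the original answer:

    # Sketch: read back the variables suggested above plus the status counters
    # cited in the comments, so the effect of any change can be measured.
    # Connection details are placeholders.
    import pymysql

    conn = pymysql.connect(host="history.server", user="monitor", password="secret")
    with conn.cursor() as cur:
        cur.execute(
            "SHOW GLOBAL VARIABLES WHERE Variable_name IN "
            "('connect_timeout', 'innodb_parallel_read_threads', 'table_open_cache_instances', "
            "'innodb_lru_scan_depth', 'innodb_log_buffer_size', 'read_rnd_buffer_size')"
        )
        for name, value in cur.fetchall():
            print(f"{name} = {value}")
        cur.execute(
            "SHOW GLOBAL STATUS WHERE Variable_name IN "
            "('Innodb_os_log_written', 'Handler_read_rnd_next', 'Com_create_table')"
        )
        for name, value in cur.fetchall():
            print(f"{name} = {value}")
    conn.close()

Note that at least one of these settings (table_open_cache_instances) is not dynamic, so it has to be set in the MySQL configuration file and picked up on a restart.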

Observations:

- innodb_flush_method is O_DIRECT_NO_FSYNC, whereas we normally observe O_DIRECT.
- For your workload, more cores would be helpful in completing tasks.
- com_create_table reported 2,995,059 tables created in 67 hours; about 12 per second seems extreme. No com_drop_tables were reported in these 2+ days.
