Compute Engine: CPU utilization and load average spike at midnight during the day switchover

Question

CPU utilization and load average on our Compute Engine VM instances spike at midnight, when the day switches over. We use 8/6-core VM instances running Ubuntu 20.04 LTS, and we don't have much traffic at midnight. For the last 8 months, CPU utilization and load average have regularly shot up to 100% and 150+ respectively, which takes the website down for 1 or 2 minutes until another VM spins up to handle the spike. The spike in CPU utilization and load average is over within 5 minutes. During the spike, disk throughput and disk IOPS also increase, but that looks manageable for the VM.

It is pretty annoying to regularly receive Website Down/High Traffic alerts (via SMS and email) at midnight. I have checked and confirmed the following:

  • It is not log rotation on Ubuntu at midnight: the VM console confirms that log rotation completes before the spike occurs.
  • MySQL stays responsive, and the queries per second that the VM sends to the MySQL cloud server remain normal.
  • We do not call any Google services through an API at midnight, e.g. the Translation API, which might take a long time to finish.
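
Checks like the ones above can be complemented by recording which processes are busy while the spike is happening. The following is only a minimal sketch, assuming a Linux VM where `/proc/loadavg` and the procps `ps` command are available; the function names, the threshold, and the sampling interval are hypothetical and not part of the original setup.

```python
import subprocess
import time
from pathlib import Path
from typing import Tuple


def parse_loadavg(text: str) -> Tuple[float, float, float]:
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def top_processes(limit: int = 5) -> str:
    """Return the ps header plus the `limit` processes with the highest %CPU.

    Note: ps reports %CPU averaged over each process's lifetime, so short
    bursts inside long-running processes may be understated.
    """
    out = subprocess.run(
        ["ps", "-eo", "pid,pcpu,pmem,comm", "--sort=-pcpu"],
        capture_output=True, text=True, check=True,
    ).stdout
    return "\n".join(out.splitlines()[: limit + 1])


def watch(threshold: float, interval_s: float = 5.0, samples: int = 120) -> None:
    """Print the busiest processes whenever the 1-minute load exceeds threshold."""
    for _ in range(samples):
        load1 = parse_loadavg(Path("/proc/loadavg").read_text())[0]
        if load1 > threshold:
            print(time.strftime("%H:%M:%S"), "load1 =", load1)
            print(top_processes())
        time.sleep(interval_s)
```

For example, calling `watch(threshold=8.0)` from a job started a few minutes before midnight would log the busiest processes during the spike; the value 8.0 is just an illustrative choice for an 8-core instance.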

I suspect it might be some housekeeping done by the Cloud team at the day switchover that causes the high load average. The console graph for the VM shows a spike in new connections under Google Services, from an average of 70/s to 200/s. I have attached a VM observability snapshot.

Looking for some help to resolve the issue.

Answer 1

Score: 2

Thanks to John Hanley for the lead! Calling the GA4 API was causing the spike in CPU utilization and load average. The GA4 API is not part of Google Cloud and doesn't provide any means to check usage, so it got skipped while I was checking for mysterious calls to Google Services APIs.

We register/log events in Google Analytics 4 through the GA4 API, and at the day switchover we have a high volume of them. The high volume of events by itself doesn't put any stress on the VM, but calling the GA4 API was exhausting all of the VM's resources. We disabled the GA4 API calls and the issue was fixed. A VM observability snapshot taken after the GA4 API calls were removed is attached below.
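
We fixed this by disabling the calls outright. For anyone who needs to keep logging events, one possible alternative (not something we tried) is to batch events and send them from a background worker so they never block request handling. The sketch below assumes the events go through the GA4 Measurement Protocol endpoint, which accepts up to 25 events per request; `chunk_events`, `build_request`, `EventSender`, and the queue size are hypothetical names and values chosen for illustration.

```python
import json
import queue
import threading
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterator, List

MP_ENDPOINT = "https://www.google-analytics.com/mp/collect"
MAX_EVENTS_PER_REQUEST = 25  # Measurement Protocol limit per request


def chunk_events(events: List[Dict], size: int = MAX_EVENTS_PER_REQUEST) -> Iterator[List[Dict]]:
    """Split one client's events into batches of at most `size` events."""
    for start in range(0, len(events), size):
        yield events[start:start + size]


def build_request(client_id: str, events: List[Dict],
                  measurement_id: str, api_secret: str) -> urllib.request.Request:
    """Build (but do not send) a Measurement Protocol POST for one batch."""
    query = urllib.parse.urlencode(
        {"measurement_id": measurement_id, "api_secret": api_secret})
    body = json.dumps({"client_id": client_id, "events": events}).encode("utf-8")
    return urllib.request.Request(
        f"{MP_ENDPOINT}?{query}", data=body,
        headers={"Content-Type": "application/json"}, method="POST")


def http_send(request: urllib.request.Request) -> None:
    """Send one request with a short timeout so a slow endpoint cannot pile up work."""
    urllib.request.urlopen(request, timeout=2).close()


class EventSender:
    """Send requests from a single background thread with a bounded queue."""

    def __init__(self, send: Callable[[urllib.request.Request], None] = http_send,
                 max_pending: int = 1000) -> None:
        self._queue: "queue.Queue" = queue.Queue(maxsize=max_pending)
        self._send = send
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, request: urllib.request.Request) -> bool:
        """Queue a request; return False (dropping it) if the queue is full."""
        try:
            self._queue.put_nowait(request)
            return True
        except queue.Full:
            return False

    def wait(self) -> None:
        """Block until every queued request has been processed."""
        self._queue.join()

    def _run(self) -> None:
        while True:
            request = self._queue.get()
            try:
                self._send(request)
            except Exception:
                pass  # analytics failures should never affect the website
            finally:
                self._queue.task_done()
```

The bounded queue is the main design choice here: when the midnight burst exceeds what the worker can send, events are dropped instead of consuming more CPU, threads, or connections on the VM.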

huangapple
  • Posted on May 18, 2023 at 01:52:58
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76274931.html