Intermittent Azure App Service "hang" issue - Thread Pool starvation?
Question
Update 2023-05-04
We created a logic app that pings a basic diagnostics endpoint every 30 seconds and logs the result. What we find is that the endpoint usually takes around 300-400ms to run, and then we see a sudden spike where it can take up to 50 seconds to return!
When we analyse the logs, we find ThreadPool.PendingWorkItemCount returns around 100 items. During "normal" operation, PendingWorkItemCount is always zero.
So it appears we're experiencing some form of thread pool exhaustion.
Is there any way to trace where these threads are coming from? For example, if there's some kind of background process or expired cache that gets updated periodically, how can we trace this?
The ThreadPool object provides very few public methods/properties that allow us to examine this in detail.
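One option that at least timestamps the backlog (a minimal sketch rather than anything from the original post; the listener class name is made up): the runtime publishes its thread pool counters on the "System.Runtime" EventSource, so an in-process EventListener can log the queue length and thread count every second. It only reports counts, not where the queued work originates, but it makes it easy to line spikes up against the slow requests.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Tracing;

// Illustrative listener: logs the runtime's thread pool counters once per second so
// spikes in queue length can be lined up against the slow requests.
public sealed class ThreadPoolCounterListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name == "System.Runtime")
        {
            EnableEvents(eventSource, EventLevel.Informational, EventKeywords.All,
                new Dictionary<string, string?> { ["EventCounterIntervalSec"] = "1" });
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        if (eventData.EventName != "EventCounters" || eventData.Payload is not { Count: > 0 })
            return;

        if (eventData.Payload[0] is IDictionary<string, object> counter &&
            counter.TryGetValue("Name", out var nameObj) && nameObj is string name &&
            (name == "threadpool-queue-length" || name == "threadpool-thread-count"))
        {
            counter.TryGetValue("Mean", out var mean); // both counters are gauges, reported as "Mean"
            Console.WriteLine($"{DateTime.UtcNow:O} {name}={mean}");
        }
    }
}
```

A single instance kept alive for the lifetime of the app (for example in a static field created at startup) is enough; the same counters can also be watched out-of-process with the dotnet-counters tool.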
Example diagnostics:
```json
{
  "start": "2023-05-04T12:17:03.0518943Z",
  "end": "2023-05-04T12:17:06.6382781Z",
  "threadCount": 8,
  "pendingWorkItemCount": 32,
  "workerThreads": 32762,
  "completionPortThreads": 1000,
  "maxWorkerThreads": 32767,
  "maxCompletionPortThreads": 1000
}
```
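For context, a snapshot like this only needs the handful of static members ThreadPool does expose. A minimal sketch of such a diagnostics action (controller and route names here are illustrative, not the original code):

```csharp
using System;
using System.Threading;
using Microsoft.AspNetCore.Mvc;

// Illustrative diagnostics action returning a thread pool snapshot shaped like the JSON above.
public class DiagnosticsController : Controller
{
    [HttpGet]
    public IActionResult ThreadPoolSnapshot()
    {
        var start = DateTime.UtcNow;

        // These are essentially the only public members the pool exposes for inspection.
        ThreadPool.GetAvailableThreads(out var workerThreads, out var completionPortThreads);
        ThreadPool.GetMaxThreads(out var maxWorkerThreads, out var maxCompletionPortThreads);

        return Json(new
        {
            start,
            end = DateTime.UtcNow,                                  // brackets whatever checks the endpoint runs
            threadCount = ThreadPool.ThreadCount,                   // threads currently in the pool
            pendingWorkItemCount = ThreadPool.PendingWorkItemCount, // work queued but not yet started
            workerThreads,                                          // currently available worker threads
            completionPortThreads,
            maxWorkerThreads,
            maxCompletionPortThreads
        });
    }
}
```

The gap between start and end in the sample above presumably covers whatever other checks the real endpoint performs.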
Original Issue
We're experiencing a strange issue with one of our Azure App Services. At various unpredictable points in the day the app will suddenly appear to hang for around 30-50 seconds, where no requests get serviced. It's as if we're waiting on a cold start.
It's an ASP.NET MVC .NET 7 application (C#) monolith. It has a DI service layer, but this isn't API based - all contained within one application. It uses Azure Redis extensively and has an Azure SQL backend. It also uses Azure Storage (Tables, Blobs and Queues) extensively.
The app uses the async-await pattern throughout. There should be virtually no synchronous calls or anything that obviously blocks a thread. We cannot find anything that 'locks' any resource for any period of time.
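For anyone auditing a codebase for the same symptom: the classic way an otherwise fully async application still starves the pool is a stray sync-over-async call hidden in a helper or library, because each such call parks a pool thread until the awaited work - which itself needs a pool thread - completes. A contrived illustration of the pattern to search for, using made-up types:

```csharp
using System.Threading.Tasks;

// ICacheClient is a made-up stand-in for whatever is being awaited (Redis, SQL, Storage).
public interface ICacheClient
{
    Task<string> GetStringAsync(string key);
}

public class ProfileReader
{
    // Anti-pattern: blocks a pool thread for the entire round trip (sync-over-async).
    // A burst of calls like this is enough to start queuing work items.
    public string GetProfileBlocking(ICacheClient cache, string key)
        => cache.GetStringAsync(key).GetAwaiter().GetResult();

    // Equivalent async form: the thread is returned to the pool while waiting.
    public async Task<string> GetProfileAsync(ICacheClient cache, string key)
        => await cache.GetStringAsync(key);
}
```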
It doesn't really need to call any third party APIs, and we don't tend to use external CDNs much. Everything we need is pretty much inside the architecture described.
The MVC app is running on P2V2 (2 vCPU, 7GB RAM) and scaled out to two instances (session affinity on).
Redis instance is P1 Premium (6 GB cache).
Azure SQL is Standard S4 (200 DTUs), geo-replicated between UK South (R/W) and UK West (R/O). In our application, we use both connection strings. Read-only queries are directed to UK West and Upsert/Deletion operations are directed to UK South, thereby "load-balancing" the SQL server.
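To make that split concrete, the selection can be as small as a factory that hands out one of the two connection strings. The classes below are a hypothetical sketch, not the application's actual code:

```csharp
using Microsoft.Data.SqlClient;

// Hypothetical settings/factory illustrating the R/O vs R/W split described above.
public class SqlEndpoints
{
    public string ReadWriteConnectionString { get; set; } = ""; // UK South (primary)
    public string ReadOnlyConnectionString { get; set; } = "";  // UK West (geo-replica)
}

public class SqlConnectionFactory
{
    private readonly SqlEndpoints _endpoints;

    public SqlConnectionFactory(SqlEndpoints endpoints) => _endpoints = endpoints;

    // Reads go to the secondary, upserts/deletes to the primary.
    public SqlConnection Create(bool readOnly) => new SqlConnection(
        readOnly ? _endpoints.ReadOnlyConnectionString
                 : _endpoints.ReadWriteConnectionString);
}
```

If the databases were in a failover group instead, the same split could be expressed through the group's read-only listener endpoint together with ApplicationIntent=ReadOnly in the connection string.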
During "normal" operation, the application is extremely quick, like in the low ms range. However, for no identifiable reason several times per day (perhaps 5 times) the application suddenly "hangs" on both instances for up to 50 seconds. During this time, the browser spins and nothing appears to be happening. Then all of a sudden the requests are serviced and it goes back to great performance. It's as if the app is "cold booting" but it's not - we were using it perfectly well seconds before.
During these periods, we check as many diagnostic sources as we can, but have found nothing to point towards this sudden hang, for example:
- App Service CPU metrics on both machines don't go above 15%
- No sudden spike in memory usage
- SQL server DTU% typically 5-15% during these periods on both R/W and R/O servers
- No spike in Redis memory usage, which sits in the region of only 200MB
- Redis server load typically 5-6%
- No spikes in Ingress or Egress in Azure Storage data
- Nothing of any interest in Application Insights
- No spikes in errors, warnings, etc.
- Nothing of interest in diagnostics event logs
- No timeouts or any other latency issues that we can find
- No background, scheduled or timed updates/CRON jobs running
- Database queries are optimised and well indexed
- Health checks remain at 100%
- Instances are not rebooting, according to Azure logs. Uptime remains at 100%
All the pieces of architecture are well over-specced for our requirements at this stage.
There are no other obvious pieces of architecture that we can put our finger on, such as firewalls, etc.
The issue feels "internal" to MVC, .NET or the App Service itself. We cannot replicate the issue locally in development and we cannot predict when it will happen on production.
We've considered GC collection or potential database connection pool recycling, etc. but cannot find any data to suggest these things are issues.
Is it possible that Application Insights itself is causing the issue? Does it periodically dump or flush data/caches? It feels like something in the platform, hosting or framework is causing this.
We're a bit stumped. It's frustrating because other than these momentary spikes throughout the day, the app is running really well and super quick.
I've raised an issue with Azure Support and await their feedback, but has anybody else had similar experiences? Do you have any suggestions we could look at, or any logs/diagnostics we could consider adding to trace where this issue may be coming from?
Answer 1
Score: 1
The problem has now been solved. We changed three things:
- SignalR - we hadn't activated the WebSockets option in App Service configuration. We were seeing a large number of requests until we did this. See this issue for more information: https://stackoverflow.com/questions/76175415/high-number-of-signalr-requests-compared-to-rest-of-application/76182424#76182424
- There was an imbalance in SQL server tiers on one of our replicas. Whilst UK South and UK West were both S4, we had a third replica set at S1. So we removed this third replica, as it was not needed.
- We decided to switch Application Insights off in the Azure blade.
As soon as these three changes were made, the issue resolved immediately.
Unfortunately, due to commercial constraints, we couldn't afford any more time to investigate the issue and isolate which of these changes fixed the problem. Hopefully this might give somebody with a similar problem something to look into further.
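For reference on the SignalR point: the App Service WebSockets switch is a platform setting, but on the application side a hub can also be pinned to the WebSockets transport so that a host with WebSockets disabled fails fast instead of quietly falling back to long polling and generating the flood of requests described above. This sketch assumes the .NET 7 minimal hosting model, and the hub name and route are placeholders:

```csharp
using Microsoft.AspNetCore.Http.Connections;
using Microsoft.AspNetCore.SignalR;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddSignalR();

var app = builder.Build();

// Pin the hub to WebSockets so a host with WebSockets disabled fails fast instead of
// silently degrading to long polling and generating a flood of negotiate/poll requests.
app.MapHub<NotificationsHub>("/hubs/notifications", options =>
{
    options.Transports = HttpTransportType.WebSockets;
});

app.Run();

// Placeholder hub for the sketch.
public class NotificationsHub : Hub { }
```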