What is the best way to avoid Azure App Service SNAT Port Exhaustion without NAT gateway

Question


Some of our App Services running on .NET 6 are having intermittent connectivity issues.
After working through the troubleshooting tool in the Azure portal, one specific instance of the App Service Plan (we have more than one instance due to scaling) is being capped at 128 SNAT ports, yet other instances can use 300 without issue.

How do I resolve the problem for this specific instance?

Furthermore, I understand that a NAT gateway can resolve the problem by providing more SNAT ports, but it incurs additional cost.

I would like to fix this with code changes. I have tried the common suggestions, such as limiting HttpClient (or even HttpMessageHandler) to a singleton, but we still see hundreds of ports in use.
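
For reference, a minimal sketch of the singleton pattern we tried (illustrative only; the handler is left on its default pooling settings):

```csharp
using System.Net.Http;

// One shared HttpClient so every call site reuses a single connection pool,
// instead of creating (and leaking) its own sockets per request.
public static class SharedHttp
{
    public static readonly HttpClient Client =
        new HttpClient(new SocketsHttpHandler());
}
```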

We suspect this is due to the fact that our application talks to many downstream applications that share the same load balancer (and therefore the same IP), but under many different custom domains. I would like to find a way to get all those requests to reuse ports if possible, or any other way to reduce the port usage.

Answer 1

Score: 0


Although this answer may appear to be about .NET only, a similar approach can in fact be applied to other languages/runtimes.

How to understand it better

Yes, you need to understand it before changing anything; don't blindly tweak things without fully understanding what you are dealing with!

The best troubleshooting guide starts from the Microsoft documentation:
https://learn.microsoft.com/en-us/azure/app-service/troubleshoot-intermittent-outbound-connection-errors
Within it, there is a link to this post, which I find is the best description of what SNAT is: https://4lowtherabbit.github.io/blogs/2019/10/SNAT/. Although written in 2019, it does not seem dated; it describes both the old and the new port allocation algorithms, where the old limit was 160 ports and the new one preallocates 128.

The troubleshooting guide does say that you get 128 preallocated ports and may run into issues beyond that, which is another way of saying that only 128 are guaranteed.

A high level summary

Basically, an App Service Plan uses an Azure Load Balancer for outbound network requests, and the algorithm for sharing SNAT ports is the same one detailed here: https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-outbound-connections
It says that when the VM pool size is 201-400, the load balancer preallocates 128 SNAT ports per VM.
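
For reference, the default preallocation table from that document (ports per instance, by backend pool size, as of the time of writing):

1-50 VMs: 1,024 ports
51-100 VMs: 512 ports
101-200 VMs: 256 ports
201-400 VMs: 128 ports
401-800 VMs: 64 ports
801-1,000 VMs: 32 ports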

This probably means Azure tries to pack each App Service stamp with up to nearly 400 plans sharing the same load balancer. So when you get unlucky and Azure places your App Service Plan in a busy 'box', you may not be able to use more than 128 SNAT ports; when you get lucky with an instance on a less busy 'box', your app may easily chew through more than 128. For us it was around 300, with no symptom other than the troubleshooting tool reporting that SNAT port exhaustion was detected, and very limited failures.

A short term workaround if you are trying to put out fires

So the short term workaround is to keep scaling your App Service out and in until the unlucky instance you got is destroyed; then you are out of the woods temporarily. Or you might land on even more unlucky instances and everything gets capped at 128.

Long term code level fix without NAT gateway

Firstly, fixing things at the code level may not always be possible, but you can always try analyzing the application's behaviour to see how connections are used. You do not necessarily need to check this in the cloud: TCPView (https://learn.microsoft.com/en-us/sysinternals/downloads/tcpview) is a good tool for understanding socket usage locally. It is not exactly the same thing, but if you manage to reduce socket usage, you will in turn reduce SNAT port usage.
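
If you prefer to do the same check in code, here is a minimal local diagnostic sketch using the standard System.Net.NetworkInformation APIs (the grouping is just illustrative):

```csharp
using System;
using System.Linq;
using System.Net.NetworkInformation;

// Count established outbound TCP connections per remote address.
// Locally each of these holds a socket; on App Service each would
// consume a SNAT port.
var groups = IPGlobalProperties.GetIPGlobalProperties()
    .GetActiveTcpConnections()
    .Where(c => c.State == TcpState.Established)
    .GroupBy(c => c.RemoteEndPoint.Address)
    .OrderByDescending(g => g.Count());

foreach (var g in groups)
    Console.WriteLine($"{g.Key}: {g.Count()} connections");
```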

Trick one: Tweaking the connection pool

HttpClient itself actually matters less, because all connection pooling is done inside the HttpMessageHandler/HttpClientHandler. Furthermore, .NET Framework and .NET Core handle HTTP connection pooling differently. A very good article explaining this in detail can be found here: https://www.stevejgordon.co.uk/httpclient-connection-pooling-in-dotnet-core

Either way, you have some options to control the connection pooling by changing settings such as PooledConnectionLifetime, PooledConnectionIdleTimeout and MaxConnectionsPerServer. Be very careful when changing MaxConnectionsPerServer, as it can end up choking your code's performance. Worth mentioning: the default value in .NET Framework is 2, while in Core it is unlimited.

I personally found PooledConnectionIdleTimeout the most useful and the least risky to change. The idea is to reduce re-establishing HTTP connections where you can; but whatever the default value is (again, mind the difference between Full Framework and Core), it was not chosen with 128 SNAT ports in mind. It was chosen for code running on an OS with 65,536 sockets available. So when your available pool has shrunk to below 2% of that (128 / 65536), it does not seem a terrible idea to set this to a smaller value, since the default may be far too generous. Check your traffic with a good observability tool, look at the outbound traffic, and work out a value (in our case, I picked 5 seconds).
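
Putting that together, a minimal .NET 6 sketch of the handler configuration (the 5-second idle timeout is the value that suited our traffic; the lifetime and per-server cap shown here are illustrative assumptions you should tune yourself):

```csharp
using System;
using System.Net.Http;

var handler = new SocketsHttpHandler
{
    // Recycle pooled connections periodically (also helps pick up DNS changes).
    PooledConnectionLifetime = TimeSpan.FromMinutes(10),

    // Close idle connections quickly so SNAT ports are released sooner;
    // the default is 1 minute, we settled on 5 seconds for our traffic.
    PooledConnectionIdleTimeout = TimeSpan.FromSeconds(5),

    // Optional hard cap per host; the .NET Core/.NET 6 default is unlimited.
    // Lower it with care: requests queue once the cap is reached.
    MaxConnectionsPerServer = 50
};

// Keep this client as a singleton so the tuned pool is actually shared.
var client = new HttpClient(handler);
```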

I have only checked the open source dotnet runtime: the code managing the connection pool uses the domain as part of the connection pool key. So even if your downstream services are in fact behind the same load balancer with the same IP, there is no way to change your code to make them share the same pool. I have not checked the Full Framework implementation, but I would imagine it is very similar.

Trick two: Use HTTP/2 by default if the downstream supports it

I saved this for last, but it is the most powerful change I have seen. When googling around, I found no information whatsoever on the internet connecting HTTP/2 and SNAT ports. That is the main reason I chose to type all of this out and answer my own question here, in the hope of helping folks wondering about the same thing in the future.

Use HTTP/2! Seriously, use it and check the result!

It took me some time to connect the dots, because no one on the internet mentions this clearly. But if you look closer, skipping all the pep talk about how great HTTP/2 is at binary transport, header compression, server push and all that (not saying those aren't great), the biggest advantage in our context is that HTTP/2 lets you send concurrent requests to the server over a single TCP connection. There is no longer a need to create a new connection just because something else is using the port; as long as it is the same server, your request can be multiplexed onto the existing connection, reusing the socket/SNAT port.

For Full Framework you can do this too, though I have not tried it; some docs from MS: https://learn.microsoft.com/en-us/dotnet/api/system.net.http.httpclient?view=netframework-4.8

For .NET Core it is as easy as setting a property on the HttpClient, or changing specific request messages; an example article is here: https://www.siakabaro.com/use-http-2-with-httpclient-in-net-6-0/
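
A minimal .NET 6 sketch of that opt-in (standard BCL properties; the URL is a placeholder, and whether you allow fallback to HTTP/1.1 is your call):

```csharp
using System;
using System.Net;
using System.Net.Http;

var client = new HttpClient(new SocketsHttpHandler())
{
    // Request HTTP/2 for every call, falling back to HTTP/1.1
    // when the downstream does not support it.
    DefaultRequestVersion = HttpVersion.Version20,
    DefaultVersionPolicy = HttpVersionPolicy.RequestVersionOrLower
};

// Or opt in per request:
var request = new HttpRequestMessage(HttpMethod.Get, "https://example.com/api")
{
    Version = HttpVersion.Version20,
    VersionPolicy = HttpVersionPolicy.RequestVersionOrLower
};

var response = await client.SendAsync(request);
Console.WriteLine($"Negotiated HTTP version: {response.Version}");
```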

In our specific case, SNAT port usage dropped from 300 to around 30, on top of my tweaking the idle timeout to 5 seconds. We have 60-ish domains pointing at that load balancer, but only 30-ish of them carry heavy traffic, so as a result the app barely uses more than 30 SNAT ports. I also simulated extreme cases: the busier your traffic, the greater the improvement you will observe, because no matter how horrible things were on HTTP/1 (I drove more than 10k SNAT-port-hungry requests that all failed), usage easily shrinks down to one port per domain.

Something worth mentioning, though I have not tested it: we have multiple (5) App Services running in the single service plan, so one instance hosts many App Services. When you consider 60-ish domains in each of the 5, you get 300-ish SNAT ports. I suspect that once opted in to HTTP/2, TCP connections might then be shared between those App Services on the same instance, hence it could really come down to 30. But I have not validated this, so don't take my word for it.

Lastly, the great thing after all this: if SNAT ports were your bottleneck on how many App Services you can put into a single service plan, then after the HTTP/2 code change you can likely fit a few more in, which is a wonderful cost-saving trick!

I hope this helps someone. If you have gone through all of this, you should realize by now that these approaches do not just apply to the .NET world: anything you use that supports the same ideas, letting you change pooling behaviour and opt in to HTTP/2, will benefit. Thanks for reading to the end!
