Put AWS Lambda function into a VPC and then "IOException: Connection reset by peer" started happening, but only occasionally

Question

I have a Java AWS Lambda function serving as an API via API Gateway. For the past few months, it's been running 24/7 and hasn't had this particular error before.

Today, I did an update to add Elasticache, which required me to put the Lambda into the same VPC as the Elasticache. Before this, the Lambda was not assigned to any VPC, just running as normal.

After lots of config adjustments, it seemed like I finally got it working - the Lambda JAR is now able to connect to Elasticache while still having connectivity to the other things it needs.

But, a few minutes after deployment, I started getting this error from an Algorithmia call:

java.util.concurrent.ExecutionException: java.io.IOException: Connection reset by peer
at org.apache.http.concurrent.BasicFuture.getResult(BasicFuture.java:71)
at org.apache.http.concurrent.BasicFuture.get(BasicFuture.java:102)
at com.algorithmia.algo.FutureAlgoResponse.get(FutureAlgoResponse.java:41)
at <place that we invoke it>

The invoking code where the error occurs is very straightforward:

        // Kick off the Algorithmia call asynchronously...
        FutureAlgoResponse futureAlgoResponse = algo.pipeAsync(<stuff>);
        // ...then wait up to 3 seconds for the result.
        AlgoResponse result = futureAlgoResponse.get(3L, TimeUnit.SECONDS);

And more importantly, it has been in production for nearly a year without ever having this error.

So I guess it must have something to do with the VPC! But, it works most of the time. We're running that code every few seconds, and it only fails every few minutes. When it fails, it usually fails for 1-3 requests in a row.

Our Lambda has a 15-second timeout, and the failing requests come back after ~1 second. To reiterate: we never saw this error until we moved the Lambda into a VPC today.
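
In the meantime, a retry could probably paper over the bursts, since failures only last 1-3 requests. Here's a rough sketch of that idea (hypothetical; it assumes the Algorithmia client types and signatures from the snippet above, and we haven't actually shipped this):

    import java.io.IOException;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.TimeUnit;

    import com.algorithmia.algo.AlgoResponse;
    import com.algorithmia.algo.Algorithm;
    import com.algorithmia.algo.FutureAlgoResponse;

    public final class AlgoRetry {

        // Retry only the intermittent connection resets, which for us come in
        // bursts of 1-3 requests; rethrow everything else immediately.
        static AlgoResponse pipeWithRetry(Algorithm algo, Object input) throws Exception {
            ExecutionException lastReset = null;
            for (int attempt = 1; attempt <= 3; attempt++) {
                FutureAlgoResponse future = algo.pipeAsync(input);
                try {
                    return future.get(3L, TimeUnit.SECONDS); // same 3-second budget as before
                } catch (ExecutionException e) {
                    if (!(e.getCause() instanceof IOException)) {
                        throw e; // not a connection reset; don't mask real errors
                    }
                    lastReset = e;
                }
            }
            throw lastReset;
        }
    }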

The Lambda VPC configuration felt fairly messy and involved, so I'm sure I messed up something somewhere. But the fact that it only happens a few times every few minutes makes it hard for me to debug with my limited AWS knowledge. I'm hoping someone can share some possible causes!

Here is how I did the setup (a rough code sketch of the same steps follows the list):

  • Create a new VPC
  • Create 2 subnets (and corresponding route tables) in the VPC, one public and one private
  • Create an internet gateway for the VPC and a NAT gateway for the public subnet.
  • Assign an elastic IP to the NAT gateway.
  • Enable all incoming and outgoing for the security group (incoming might not be needed but we'll go back and fix that)
  • Spin up an Elasticache in that VPC
  • Assign the Lambda to that VPC - specifically the private subnet + aforementioned security group
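
To make that concrete, here is roughly the same sequence as an AWS SDK for Java v2 sketch. I actually did all of this through the console, so this is purely illustrative; the CIDR blocks are made up, and note that the NAT gateway belongs in the public subnet:

    import software.amazon.awssdk.services.ec2.Ec2Client;
    import software.amazon.awssdk.services.ec2.model.CreateInternetGatewayRequest;
    import software.amazon.awssdk.services.ec2.model.DomainType;

    public final class VpcSetupSketch {
        public static void main(String[] args) {
            Ec2Client ec2 = Ec2Client.create();

            // 1. A new VPC.
            String vpcId = ec2.createVpc(r -> r.cidrBlock("10.0.0.0/16")).vpc().vpcId();

            // 2. One public and one private subnet.
            String publicSubnetId = ec2.createSubnet(r -> r.vpcId(vpcId).cidrBlock("10.0.0.0/24"))
                    .subnet().subnetId();
            String privateSubnetId = ec2.createSubnet(r -> r.vpcId(vpcId).cidrBlock("10.0.1.0/24"))
                    .subnet().subnetId();

            // 3. An internet gateway attached to the VPC...
            String igwId = ec2.createInternetGateway(CreateInternetGatewayRequest.builder().build())
                    .internetGateway().internetGatewayId();
            ec2.attachInternetGateway(r -> r.internetGatewayId(igwId).vpcId(vpcId));

            // ...and a NAT gateway with an Elastic IP, placed in the PUBLIC subnet.
            String allocationId = ec2.allocateAddress(r -> r.domain(DomainType.VPC)).allocationId();
            String natId = ec2.createNatGateway(r -> r.subnetId(publicSubnetId).allocationId(allocationId))
                    .natGateway().natGatewayId();
            // (Real code must wait for the NAT gateway to become "available" before routing through it.)

            // 4. Route tables: public subnet defaults to the IGW, private subnet to the NAT gateway.
            String publicRtId = ec2.createRouteTable(r -> r.vpcId(vpcId)).routeTable().routeTableId();
            ec2.createRoute(r -> r.routeTableId(publicRtId).destinationCidrBlock("0.0.0.0/0").gatewayId(igwId));
            ec2.associateRouteTable(r -> r.routeTableId(publicRtId).subnetId(publicSubnetId));

            String privateRtId = ec2.createRouteTable(r -> r.vpcId(vpcId)).routeTable().routeTableId();
            ec2.createRoute(r -> r.routeTableId(privateRtId).destinationCidrBlock("0.0.0.0/0").natGatewayId(natId));
            ec2.associateRouteTable(r -> r.routeTableId(privateRtId).subnetId(privateSubnetId));

            // The Lambda is then attached to privateSubnetId plus the allow-all security group.
        }
    }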

I honestly haven't the slightest clue how to investigate this further, so I'm really hoping someone just knows "oh yeah connections can time out in a VPC because _____". Alternatively, would appreciate any tips on how to debug this better.

Edit: Some more searching suggests it may have to do with the NAT setup? I basically just did a default "Create NAT gateway" and threw it onto the private subnet.

Answer 1

Score: 2

Amazon support comes through with a diagnosis and solution!

tl;dr Yes, timeouts were the issue. The suggested fix is to implement TCP keep-alives to make sure the 350-second idle timeout isn't reached (or just push more traffic over the connection, which doesn't really work for us).
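
We never implemented the keep-alive ourselves (see below), but for anyone who has to stay in the VPC: if you control the construction of an Apache HttpClient, a sketch of the mitigation could look like this. Whether you can inject such a client into the Algorithmia SDK is a separate question, and the keep-alive probe interval itself is an OS setting, not something HttpClient controls:

    import java.util.concurrent.TimeUnit;

    import org.apache.http.config.SocketConfig;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    public final class KeepAliveClient {

        static CloseableHttpClient build() {
            // Ask the OS to send TCP keep-alive probes on pooled connections.
            // Note: the probe interval (e.g. tcp_keepalive_time) is an OS-level
            // setting and must be below the NAT gateway's 350-second idle timeout.
            SocketConfig socketConfig = SocketConfig.custom()
                    .setSoKeepAlive(true)
                    .build();

            return HttpClients.custom()
                    .setDefaultSocketConfig(socketConfig)
                    // Belt and braces: never reuse a pooled connection older than
                    // 300 seconds, so we stop re-using connections the NAT gateway
                    // may have already silently dropped as stale.
                    .setConnectionTimeToLive(300, TimeUnit.SECONDS)
                    .build();
        }
    }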

What we actually did in the end is just move off of Elasticache. That was the only reason we needed to put our Lambda in a VPC, and after thinking about it, we decided it's going to be a while before our traffic reaches levels where Elasticache's benefits are really tangible to us (vs. a simple EC2-hosted Redis instance). So now our cache is just a regular Redis instance running on EC2.
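
The client side stayed trivial after the switch; just for illustration (with Jedis, and a placeholder hostname), the EC2-hosted Redis is reached like any other Redis instance:

    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.JedisPool;
    import redis.clients.jedis.JedisPoolConfig;

    public final class CacheClient {

        // Placeholder endpoint: the EC2 instance running Redis.
        private static final JedisPool POOL =
                new JedisPool(new JedisPoolConfig(), "redis.internal.example.com", 6379);

        static void cache(String key, String value) {
            try (Jedis jedis = POOL.getResource()) {
                // Plain Redis SET with a one-hour TTL; nothing Elasticache-specific.
                jedis.setex(key, 3600, value);
            }
        }
    }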

Here's the full response:

"<first talking through each step of my setup and how those appear to be correct>... However, for the past two days, I do see a number of NAT gateway idle timeouts, which you suspect could be the issue. Please refer to the NAT gateway metrics below.

With this said, the IdleTimeoutCount metric counts the number of connections that transitioned from the active state to the idle state. An active connection transitions to idle if it was not closed gracefully and there was no activity for the last 350 seconds. A value greater than zero indicates that there are connections that have been moved to an idle state. If the value for IdleTimeoutCount increases, it may indicate that clients behind the NAT gateway are re-using stale connections.

As mentioned in the troubleshooting documentation, to prevent the connection from being dropped, you can initiate more traffic over the connection. Alternatively, you can also enable TCP keepalive on the instance with a value less than 350 seconds, if possible. Sending keepalive probes at a fixed interval will ensure there is some traffic going through the connection between the NAT gateway and the remote end server. The keepalive packets will reset the 350 seconds idle timeout counters, causing the connection to stay alive for as long as needed by the application.

To answer your question: “Is this what's going on here?”

Answer: After verifying that everything from a VPC perspective is in order for the Lambda functions (SG, NACLs, route tables), the NAT gateway idle timeouts are a definite possibility here. This is also confirmed by the IdleTimeoutCount metric provided above showing that connections are timing out due to inactivity."
