Mongo connection count creeping up by one every 10 seconds with the mgo driver

Question

We monitor our mongoDB connection count using this:

http://godoc.org/labix.org/v2/mgo#GetStats

However, we have been facing a strange connection leak: the connection count creeps up consistently by one open connection every 10 seconds. This happens regardless of whether there are any requests. I can spin up a server on localhost, leave it there and do nothing, and the connection count will still creep up. It eventually reaches a few thousand, at which point it kills the app/db and we have to restart the app.

This might not be enough information for you to debug. Does anyone have any ideas, or connection leaks that you have dealt with in the past? How did you debug them? What are some ways that I can debug this?

We have tried a few things. We scanned our code base for any code that could open a connection and put counters/debugging statements there, and so far we have found no leak. It is almost as if there is a leak in a library somewhere.

This is a bug in a branch that we have been working on, and there have been a few hundred commits to it. We have done a diff between this branch and master and couldn't find why there is a connection leak in this branch.

As an example, here is the dataset that I am referring to:

Clusters:      1
MasterConns:   9936      <-- creeps up 1 per 10 seconds
SlaveConns:    -7359     <-- why is this negative?
SentOps:       42091780
ReceivedOps:   38684525
ReceivedDocs:  39466143
SocketsAlive:  78        <-- what is the difference between the socket count and the master conns count?
SocketsInUse:  1231
SocketRefs:    1231

MasterConns is the number that creeps up by one every 10 seconds. I am not entirely sure what the other numbers mean.

Answer 1

Score: 14

MasterConns cannot tell you whether there's a leak or not, because it does not decrease. The field indicates the number of connections made since the last statistics reset, not the number of sockets that are currently in use. The latter is indicated by the SocketsAlive field.
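To make the distinction concrete, here is a minimal sketch that compares two statistics snapshots. It assumes a local `Stats` struct that mirrors a few of the counters from the question, so that it runs without the driver. The `AliveGrowth` helper and the snapshot values (other than 9936 and 78, which come from the question) are hypothetical.

```go
package main

import "fmt"

// Stats is a local stand-in mirroring some of the counters shown in the
// question. It is an assumption made so the sketch runs without the driver.
type Stats struct {
	MasterConns  int // cumulative: connections made since the last stats reset
	SocketsAlive int // current: live TCP connections right now
}

// AliveGrowth (hypothetical helper) reports how many more live sockets exist
// in cur than in prev. MasterConns is deliberately ignored because it is
// cumulative and never decreases.
func AliveGrowth(prev, cur Stats) int {
	return cur.SocketsAlive - prev.SocketsAlive
}

func main() {
	// Hypothetical snapshots taken 10 seconds apart on an idle server.
	prev := Stats{MasterConns: 9935, SocketsAlive: 78}
	cur := Stats{MasterConns: 9936, SocketsAlive: 78}

	fmt.Println("MasterConns delta: ", cur.MasterConns-prev.MasterConns)
	fmt.Println("SocketsAlive delta:", AliveGrowth(prev, cur))
}
```

In this made-up case MasterConns has gone up while SocketsAlive is flat, which is the pattern described above: connections are being made, but they are not accumulating.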

To give you some additional relief on the subject, every single test in the mgo suite is wrapped in logic that ensures that the statistics show sane values after the test finishes, so that potential leaks don't go unnoticed. That's the main reason why such a statistics collection system was introduced.

Then, the reason why you see this number increasing every 10 seconds or so is the internal activity that takes place to learn the status of the cluster. That said, this behavior was recently changed so that it doesn't establish new connections and instead picks existing sockets from the pool, so I believe you're not using the latest release.

Having SlaveConns negative looks like a bug. There's a small edge case about statistics collection for connections made, because we cannot tell whether a given server is a master or a slave before we've talked to it, so there might be an uncovered path. If you still see that behavior after you upgrade, please report the issue and I'll be happy to look at it.

SocketsInUse is the number of sockets that are still being referenced by one or more sessions, whether they are alive (the connection is established) or not. SocketsAlive is, again, the real number of live TCP connections. The delta between the two indicates that a number of sessions were not closed. This may be okay if they are still being held in memory by the application and will eventually be closed, or it may be a leak if the application missed a session.Close call.
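As a rough illustration of that delta, the sketch below applies it to the numbers from the question. The `Stats` struct is again a local stand-in rather than the driver's own type, and `ReferencedButNotAlive` is a hypothetical helper name.

```go
package main

import "fmt"

// Stats is a local stand-in for the two counters discussed above. It is an
// assumption made so the sketch runs without the driver.
type Stats struct {
	SocketsAlive int // live TCP connections
	SocketsInUse int // sockets still referenced by at least one session
}

// ReferencedButNotAlive (hypothetical helper) returns the delta between the
// sockets still referenced by sessions and the sockets that are alive.
// A large, growing value suggests sessions that were never closed.
func ReferencedButNotAlive(s Stats) int {
	return s.SocketsInUse - s.SocketsAlive
}

func main() {
	// Values taken from the dataset in the question.
	s := Stats{SocketsAlive: 78, SocketsInUse: 1231}
	fmt.Println("sockets referenced but not alive:", ReferencedButNotAlive(s))
}
```

If this delta keeps growing, a reasonable next step is to check that every session the application obtains (for example via `session.Copy()`) has a matching `Close()`, typically through `defer`.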

huangapple
  • Posted on 2013-10-19 01:35:53
  • When reposting, please be sure to keep the link to this article: https://go.coder-hub.com/19455787.html