Mongo connection count creeping up by one every 10 seconds with mgo driver
Question
We monitor our MongoDB connection count using mgo's GetStats:
http://godoc.org/labix.org/v2/mgo#GetStats
However, we have been facing a strange connection leak: the connection count creeps up steadily by one open connection every 10 seconds, regardless of whether there are any requests. I can spin up a server on localhost, leave it idle, and the connection count still creeps up. It eventually reaches a few thousand and kills the app/db, at which point we have to restart the app.
This might not be enough information to debug with. Does anyone have ideas, or has anyone dealt with a similar connection leak in the past? How did you debug it? What are some ways I can debug this?
We have tried a few things. We scanned our code base for any code that could open a connection and put counters and debugging statements there, and so far we have found no leak. It is almost as if there is a leak in a library somewhere.
This bug is in a branch we have been working on, which has a few hundred commits. We diffed it against master and couldn't find why this branch leaks connections.
As an example, here is the dataset I am referring to:
Clusters: 1
MasterConns: 9936 <-- creeps up 1 per 10 seconds
SlaveConns: -7359 <-- why is this negative?
SentOps: 42091780
ReceivedOps: 38684525
ReceivedDocs: 39466143
SocketsAlive: 78 <-- what is the difference between the socket count and the master conns count?
SocketsInUse: 1231
SocketRefs: 1231
MasterConns is the number that creeps up by one every 10 seconds. I am not entirely sure what the other numbers mean.
Answer 1
Score: 14
MasterConns cannot tell you whether there's a leak or not, because it does not decrease. The field indicates the number of connections made since the last statistics reset, not the number of sockets that are currently in use. The latter is indicated by the SocketsAlive field.
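As a rough sketch of what that means for monitoring, the check below looks at how SocketsAlive changes across samples and ignores MasterConns entirely. The Stats struct here is a local stand-in that mirrors two of the field names shown in the question; in a real program the values would come from mgo.GetStats(). The aliveGrowth helper and the sample numbers are assumptions made up for illustration.

```go
package main

import "fmt"

// Stats mirrors two fields from the output in the question. This local copy
// is an assumption that keeps the sketch free of external dependencies; the
// real values would come from mgo.GetStats().
type Stats struct {
	MasterConns  int // cumulative: connections made since the last stats reset
	SocketsAlive int // gauge: TCP connections open right now
}

// aliveGrowth (hypothetical helper) reports how much SocketsAlive changed
// between the first and last sample. MasterConns is deliberately ignored
// because it never decreases.
func aliveGrowth(samples []Stats) int {
	if len(samples) < 2 {
		return 0
	}
	return samples[len(samples)-1].SocketsAlive - samples[0].SocketsAlive
}

func main() {
	// Made-up samples taken 10 seconds apart on an idle server.
	samples := []Stats{
		{MasterConns: 9934, SocketsAlive: 78},
		{MasterConns: 9935, SocketsAlive: 78},
		{MasterConns: 9936, SocketsAlive: 78},
	}
	fmt.Println("SocketsAlive growth:", aliveGrowth(samples))
}
```

If SocketsAlive stays flat while MasterConns climbs, as in these made-up samples, the climbing counter alone is not evidence of a leak.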
To give you some additional relief on the subject, every single test in the mgo suite is wrapped in logic that ensures that statistics show sane values after the test finishes, so that potential leaks don't go unnoticed. That's the main reason why such a statistics collection system was introduced.
Then, the reason why you see this number increasing every 10 seconds or so is the internal activity that happens to learn the status of the cluster. That said, this behavior was recently changed so that it doesn't establish new connections and instead picks existing sockets from the pool, so I believe you're not using the latest release.
Having SlaveConns negative looks like a bug. There's a small edge case about statistics collection for connections made, because we cannot tell whether a given server is a master or a slave before we've talked to it, so there might be an uncovered path. If you still see that behavior after you upgrade, please report the issue and I'll be happy to look at it.
SocketsInUse is the number of sockets that are still being referenced by one or more sessions, whether they are alive (the connection is established) or not. SocketsAlive is, again, the real number of live TCP connections. The delta between the two indicates that a number of sessions were not closed. This may be okay, if they are still being held in memory by the application and will eventually be closed, or it may be a leak if a session.Close operation was missed by the application.