Serilog.Sinks.Async is spawning thousands of threads
Question
We are using the Serilog file sink and the Elasticsearch sink, each wrapped in a Serilog async sink to make sure everything is handled on a background thread.
Over the last few days we have had server-wide issues, and all applications slowed down.
The system engineers took a full user dump (with the Debug Diagnostic Tool), and upon inspection of that dump it looks like we have an issue with Serilog.Sinks.Async.BackgroundWorkerSink.
This is one of the ~3000 threads in the clrstack output:
OS Thread Id: 0x30b7c
Child SP IP Call Site
00000025AAA8F048 00007ffd91720bb4 [HelperMethodFrame_1OBJ: 00000025aaa8f048] System.Threading.Monitor.ObjWait(Int32, System.Object)
00000025AAA8F170 00007ffce4454638 System.Threading.SemaphoreSlim.WaitUntilCountOrTimeout(Int32, UInt32, System.Threading.CancellationToken) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/SemaphoreSlim.cs @ 462]
00000025AAA8F1C0 00007ffce43f1b4a System.Threading.SemaphoreSlim.Wait(Int32, System.Threading.CancellationToken) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/SemaphoreSlim.cs @ 365]
00000025AAA8F270 00007ffce44b1e27 System.Collections.Concurrent.BlockingCollection`1[[System.__Canon, System.Private.CoreLib]].TryTakeWithNoTimeValidation(System.__Canon ByRef, Int32, System.Threading.CancellationToken, System.Threading.CancellationTokenSource)
00000025AAA8F2F0 00007ffce44b1ce4 System.Collections.Concurrent.BlockingCollection`1+d__68[[System.__Canon, System.Private.CoreLib]].MoveNext()
00000025AAA8F340 00007ffce2ede22d Serilog.Sinks.Async.BackgroundWorkerSink.Pump()
00000025AAA8F390 00007ffce43d6617 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs @ 183]
00000025AAA8F400 00007ffce44124fe System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs @ 2333]
00000025AAA8F700 00007ffd421eaed3 [DebuggerU2MCatchHandlerFrame: 00000025aaa8f700]
It looks like thousands of threads are sitting in SemaphoreSlim.Wait, just waiting on a thread to become available.
Is there anything we misconfigured? Or is there anything we can do to optimise?
We believe the issue most likely has something to do with the file sink...
Things we are wondering about:
- Is it wise to have "shared: true" for the file sink?
- Is it wise to combine the Elasticsearch sink with the async sink?
- Isn't Serilog supposed to use the thread pool? Why then do we have ~3000 threads?
This is our (simplified) configuration:
    var logCfg =
        new LoggerConfiguration()
            .Enrich.WithProperty("machine", System.Environment.MachineName)
            // One Async wrapper (and so one worker) is created per mapped sink.
            .WriteTo.Map(keyPropertyName: "$filename", defaultKey: "fallback",
                configure: (fileName, wt) =>
                    wt.Async(c =>
                        c.File(formatter: formatter,
                               path: logOptions.AuditPath,
                               shared: true,
                               fileSizeLimitBytes: fileSizeLimitBytes ?? 41943040,
                               rollingInterval: RollingInterval.Day,
                               rollOnFileSizeLimit: true)))
            // A second, separate Async wrapper for Elasticsearch.
            .WriteTo.Async(c =>
                c.Elasticsearch(new Serilog.Sinks.Elasticsearch.ElasticsearchSinkOptions(new Uri(elasticUrl))
                {
                    IndexFormat = "my-audit-" + DateTime.Now.Year,
                    ModifyConnectionSettings = x => x.MyAuthentication(elasticCreds[0], elasticCreds[1])
                }));
If it matters, these are our NuGet versions:
- Serilog.Sinks.Async 1.5.0.0
- Serilog.Sinks.Elasticsearch 8.4.1
- Serilog.Sinks.File 5.0.0
Answer 1
Score: 2
The most important bit is that you want the Async sink around the Map, not the inverse as you have shown.
The current implementation of the Async sink uses GetConsumingEnumerable, which consumes a thread. That works great for perf but is of course not optimal (I think there is a GitHub issue about switching it to Channels, which would alleviate that but might make its perf worse). Regardless of that, you only want one of them.
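To make the "consumes a thread" point concrete, here is a minimal sketch of the queue-plus-worker pattern. It is not the library's actual source: QueueWorker and its members are hypothetical names, and the use of a long-running task is an assumption (the stack trace in the question does show Pump running inside a Task). An idle worker of this shape parks inside GetConsumingEnumerable, which is consistent with the SemaphoreSlim.Wait frames in the dump.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical stand-in for an async sink: one bounded queue, one dedicated worker.
sealed class QueueWorker : IDisposable
{
    private readonly BlockingCollection<string> _queue =
        new BlockingCollection<string>(boundedCapacity: 10000);
    private readonly Task _worker;

    public QueueWorker(Action<string> write)
    {
        // Assumption: LongRunning, which asks for a dedicated thread
        // instead of borrowing one from the thread pool.
        _worker = Task.Factory.StartNew(() =>
        {
            // Blocks while the queue is empty; the thread stays alive and idle.
            foreach (var item in _queue.GetConsumingEnumerable())
                write(item);
        }, TaskCreationOptions.LongRunning);
    }

    // Producers return immediately; the item is dropped if the queue is full.
    public bool Enqueue(string item) => _queue.TryAdd(item);

    public void Dispose()
    {
        _queue.CompleteAdding(); // lets the foreach loop end
        _worker.Wait();
        _queue.Dispose();
    }
}
```

Every instance of such a wrapper holds one thread for its whole lifetime, so if a wrapper is created per mapped key, the thread count grows with the number of distinct keys.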
The high-level view of this is:
- Capturing values etc. happens on the actual execution thread; that produces a LogEvent structure. That can't be (and doesn't need to be) parallelised.
- The sink wireup then defines the routing of LogEvents from across all logging threads and dispatches them. If you use Async, it stuffs them into a queue for the consumer to grab as and when it can, instead of doing the work immediately. (Go read the source; it's really neat and short and will render the rest obvious.)
So having two Async sinks like you have is costing more. You want one outer Async, and inside it do the Elastic followed by the File (normally something like Elastic will do internal buffering and hence will take the data quickly).
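A sketch of that layout, assuming the same packages as in the question plus Serilog.Sinks.Map: a single outer Async, with Elasticsearch first and the mapped File sink second. LoggingSetup, Build, and the auditPath parameter are illustrative names, and the question's custom MyAuthentication call is left out because its definition is not shown.

```csharp
using System;
using Serilog;
using Serilog.Formatting;
using Serilog.Sinks.Elasticsearch;

static class LoggingSetup
{
    // Hypothetical helper; parameters stand in for the question's variables.
    public static LoggerConfiguration Build(
        ITextFormatter formatter, string auditPath, string elasticUrl)
    {
        return new LoggerConfiguration()
            .Enrich.WithProperty("machine", Environment.MachineName)
            // One Async wrapper for everything: one queue, one worker.
            .WriteTo.Async(a =>
            {
                a.Elasticsearch(new ElasticsearchSinkOptions(new Uri(elasticUrl))
                {
                    IndexFormat = "my-audit-" + DateTime.Now.Year
                    // ModifyConnectionSettings from the question omitted here.
                });

                // Map now sits inside Async and creates plain File sinks.
                a.Map(keyPropertyName: "$filename", defaultKey: "fallback",
                    configure: (fileName, wt) =>
                        wt.File(formatter: formatter,
                                path: auditPath,
                                shared: true,
                                fileSizeLimitBytes: 41943040,
                                rollingInterval: RollingInterval.Day,
                                rollOnFileSizeLimit: true));
            });
    }
}
```

With this shape the number of background workers no longer depends on how many distinct "$filename" values appear. Map also accepts a sinkMapCountLimit argument, which may be worth setting if the set of keys is unbounded.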