Fluentbit error "cannot adjust chunk size" on GKE
My services run on GKE, and I am using the EFK stack for logging. Each node has a fluentbit pod created by a DaemonSet, and there is a fluentd aggregator pod. This setup worked well at first, but the fluentbit pods are now failing: each one keeps erroring and restarting.
What is the cause of this error, and how can I solve it?
Logs from fluentbit:
[2023/07/18 08:08:22] [error] [storage] cannot adjust chunk size '/fluent-bit/buffer//emitter.3/1-1689072144.132487045.flb' to 4096 bytes
[lib/chunkio/src/cio_file_unix.c:528 errno=28] No space left on device
[lib/chunkio/src/cio_file.c:1116 errno=28] No space left on device
[2023/07/18 08:08:22] [error] [storage] cannot adjust chunk size '/fluent-bit/buffer//emitter.3/1-1689072143.865639031.flb' to 4096 bytes
[lib/chunkio/src/cio_file_unix.c:528 errno=28] No space left on device
[lib/chunkio/src/cio_file.c:1116 errno=28] No space left on device
[2023/07/18 08:08:22] [error] [storage] cannot adjust chunk size '/fluent-bit/buffer//emitter.3/1-1689072143.703709663.flb' to 4096 bytes
[2023/07/18 08:08:22] [ info] [storage] ver=1.3.0, type=memory+filesystem, sync=full, checksum=off, max_chunks_up=128
[2023/07/18 08:08:22] [ info] [storage] backlog input plugin: storage_backlog.1
[2023/07/18 08:08:22] [ info] [cmetrics] version=0.5.7
[2023/07/18 08:08:22] [ info] [ctraces ] version=0.2.5
[2023/07/18 08:08:22] [ info] [input:tail:tail.0] initializing
[2023/07/18 08:08:22] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
[2023/07/18 08:08:22] [error] [sqldb] error=disk I/O error
[2023/07/18 08:08:22] [error] [input:tail:tail.0] db: could not create 'in_tail_files' table
[2023/07/18 08:08:22] [error] [input:tail:tail.0] could not open/create database
[2023/07/18 08:08:22] [error] failed to initialize input tail.0
[2023/07/18 08:08:22] [error] [engine] input initialization failed
[2023/07/18 08:08:22] [error] [lib] backend failed
Events of fluentbit:
> kubectl describe po fluent-bit-xmkj6
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulling 51m (x1718 over 6d3h) kubelet Pulling image "cr.fluentbit.io/fluent/fluent-bit:2.0.5"
Warning BackOff 96s (x43323 over 6d3h) kubelet Back-off restarting failed container
fluent-bit.conf:
[SERVICE]
Daemon Off
Flush 1
Log_Level info
storage.path /fluent-bit/buffer/
storage.sync full
storage.checksum off
Parsers_File parsers.conf
Parsers_File custom_parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
Health_Check On
[INPUT]
Name tail
Path /var/log/containers/*.log
db /fluent-bit/buffer/logs.db
multiline.parser docker, cri
Tag kube.*
Skip_Long_Lines On
Skip_Empty_lines On
[FILTER]
Name kubernetes
Match kube.**
Kube_URL https://kubernetes.default.svc.cluster.local:443
Kube_Tag_Prefix kube.var.log.containers.
Merge_Log On
Keep_Log Off
Annotations Off
K8S-Logging.Parser On
K8S-Logging.Exclude On
[FILTER]
Name rewrite_tag
Log_Level debug
Match kube.**
Rule $kubernetes['labels']['type'] ^(.*)$ dev.service.$kubernetes['namespace_name'].$kubernetes['labels']['type'].$kubernetes['container_name'] false
Emitter_Name re_emitted_type
Emitter_Storage.type filesystem
[FILTER]
Name rewrite_tag
Log_Level debug
Match kube.**
Rule $kubernetes['container_name'] ^(.*)$ dev.service.$kubernetes['namespace_name'].$kubernetes['container_name'] false
Emitter_Name re_emitted_no_type
Emitter_Storage.type filesystem
[OUTPUT]
Name forward
Match *
Retry_Limit False
Workers 1
Host 172.32.20.10
Port 30006
Answer 1
Score: 0
This error has two possible causes.
- The disk space is actually exhausted.
- inotify resources are exhausted.
The disk space is actually exhausted:
You can check whether enough disk space is left on the node by running the df command on the node.
# Check disk usage
df -h
# Check inode usage
df -ih
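If you need the number in a script (for example, to alert before the buffer volume fills up), the usage percentage can be extracted from the df output. A sketch — the path "/" is a placeholder; for fluent-bit you would check the volume backing storage.path, e.g. /fluent-bit/buffer:

```shell
# Print the usage percentage (0-100) of the filesystem backing a path.
# "/" is a placeholder; substitute the mount behind storage.path.
df -P / | awk 'NR==2 { gsub("%", "", $5); print $5 }'
```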
If you find that disk space is running low:
- Remove unused files from the node.
- Create a node pool with a larger disk size.
inotify resources are exhausted:
If you have enough disk space left on your node but are still getting the "No space left on device" error, it is highly likely that inotify resources are exhausted.
kubectl logs -f uses inotify to monitor changes to files, and it consumes inotify watches.
In Linux, there is a limit on the number of inotify watches. You can check the current limit by looking at the fs.inotify.max_user_watches kernel parameter:
$ sudo sysctl fs.inotify.max_user_watches
You can check how many inotify watches are consumed by each process on the node with the following one-liner:
echo -e "COUNT\tPID\tUSER\tCOMMAND" ; sudo find /proc/[0-9]*/fdinfo -type f 2>/dev/null | sudo xargs grep ^inotify 2>/dev/null | cut -d/ -f 3 | uniq -c | sort -nr | { while read -rs COUNT PID; do echo -en "$COUNT\t$PID\t" ; ps -p $PID -o user=,command=; done }
The above command will find large consumers of inotify watches.
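A coarser but quicker check is to count inotify instances (open file descriptors) rather than individual watches; every /proc/PID/fd entry that is an inotify descriptor shows up as a symlink to anon_inode:inotify:

```shell
# Count open inotify instances across all processes.
# Without root this only sees your own processes; run with sudo to see everything.
find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l
```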
There are a few options to mitigate the issue.
- You can change the application so that it does not consume so many inotify watches.
- Or you can increase the fs.inotify.max_user_watches kernel parameter (e.g. sudo sysctl fs.inotify.max_user_watches=24576). Note that each inotify watch consumes some memory, so use this option with caution.
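Note that sysctl -w does not survive a reboot. On nodes where you manage the OS image directly, the setting can be persisted with a sysctl drop-in file (on GKE, where nodes are recreated, the DaemonSet approach described next is usually more practical). The filename and value here are illustrative:

```
# /etc/sysctl.d/90-inotify.conf  -- filename and value are examples
fs.inotify.max_user_watches=524288
```

After creating the file, `sudo sysctl --system` reloads all sysctl configuration without a reboot.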
You can deploy a DaemonSet to raise the limit for inotify watches on your cluster's nodes. This should be safe from a node-stability perspective.
command:
  - /bin/sh
  - -c
  - |
    while true; do
      sysctl -w fs.inotify.max_user_watches=524288
      sleep 10
    done
imagePullPolicy: IfNotPresent
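The snippet above is only the container's command; to be deployable it needs the surrounding DaemonSet manifest. A sketch, in which the name, namespace, labels, and busybox image are placeholder choices — note the container must run privileged to write node-level sysctls:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: set-inotify-max-user-watches   # placeholder name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: set-inotify-max-user-watches
  template:
    metadata:
      labels:
        app: set-inotify-max-user-watches
    spec:
      containers:
        - name: sysctl
          image: busybox:1.36          # any minimal image with /bin/sh works
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true           # required to write node-level sysctls
          command:
            - /bin/sh
            - -c
            - |
              while true; do
                sysctl -w fs.inotify.max_user_watches=524288
                sleep 10
              done
```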