Fluentbit error "cannot adjust chunk size" on GKE

Question

My services run on GKE, and I use the EFK stack for logging. Each node has a Fluent Bit pod created by a DaemonSet, and there is one Fluentd aggregator pod. This setup worked well at first, but the Fluent Bit pods now keep failing with errors and restarting.

What causes this error, and how can I fix it?

Logs from fluentbit:

[2023/07/18 08:08:22] [error] [storage] cannot adjust chunk size '/fluent-bit/buffer//emitter.3/1-1689072144.132487045.flb' to 4096 bytes
[lib/chunkio/src/cio_file_unix.c:528 errno=28] No space left on device
[lib/chunkio/src/cio_file.c:1116 errno=28] No space left on device
[2023/07/18 08:08:22] [error] [storage] cannot adjust chunk size '/fluent-bit/buffer//emitter.3/1-1689072143.865639031.flb' to 4096 bytes
[lib/chunkio/src/cio_file_unix.c:528 errno=28] No space left on device
[lib/chunkio/src/cio_file.c:1116 errno=28] No space left on device
[2023/07/18 08:08:22] [error] [storage] cannot adjust chunk size '/fluent-bit/buffer//emitter.3/1-1689072143.703709663.flb' to 4096 bytes
[2023/07/18 08:08:22] [ info] [storage] ver=1.3.0, type=memory+filesystem, sync=full, checksum=off, max_chunks_up=128
[2023/07/18 08:08:22] [ info] [storage] backlog input plugin: storage_backlog.1
[2023/07/18 08:08:22] [ info] [cmetrics] version=0.5.7
[2023/07/18 08:08:22] [ info] [ctraces ] version=0.2.5
[2023/07/18 08:08:22] [ info] [input:tail:tail.0] initializing
[2023/07/18 08:08:22] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
[2023/07/18 08:08:22] [error] [sqldb] error=disk I/O error
[2023/07/18 08:08:22] [error] [input:tail:tail.0] db: could not create 'in_tail_files' table
[2023/07/18 08:08:22] [error] [input:tail:tail.0] could not open/create database
[2023/07/18 08:08:22] [error] failed initialize input tail.0
[2023/07/18 08:08:22] [error] [engine] input initialization failed
[2023/07/18 08:08:22] [error] [lib] backend failed

Events for the Fluent Bit pod:

> kubectl describe po fluent-bit-xmkj6
...

Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Normal   Pulling  51m (x1718 over 6d3h)   kubelet  Pulling image "cr.fluentbit.io/fluent/fluent-bit:2.0.5"
  Warning  BackOff  96s (x43323 over 6d3h)  kubelet  Back-off restarting failed container

fluent-bit.conf:

[SERVICE]
    Daemon Off
    Flush 1
    Log_Level info
    storage.path /fluent-bit/buffer/
    storage.sync full
    storage.checksum off        
    Parsers_File parsers.conf
    Parsers_File custom_parsers.conf
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port 2020
    Health_Check On

[INPUT]
    Name tail
    Path /var/log/containers/*.log
    db /fluent-bit/buffer/logs.db
    multiline.parser docker, cri
    Tag kube.*        
    Skip_Long_Lines On
    Skip_Empty_lines On

[FILTER]
    Name kubernetes
    Match kube.**
    Kube_URL https://kubernetes.default.svc.cluster.local:443
    Kube_Tag_Prefix kube.var.log.containers.
    Merge_Log On
    Keep_Log Off
    Annotations Off
    K8S-Logging.Parser On
    K8S-Logging.Exclude On

[FILTER]
    Name rewrite_tag
    Log_Level debug
    Match kube.**
    Rule $kubernetes['labels']['type'] ^(.*)$ dev.service.$kubernetes['namespace_name'].$kubernetes['labels']['type'].$kubernetes['container_name'] false
    Emitter_Name re_emitted_type
    Emitter_Storage.type filesystem

[FILTER]
    Name rewrite_tag
    Log_Level debug
    Match kube.**
    Rule $kubernetes['container_name'] ^(.*)$ dev.service.$kubernetes['namespace_name'].$kubernetes['container_name'] false
    Emitter_Name re_emitted_no_type
    Emitter_Storage.type filesystem

[OUTPUT]
    Name forward
    Match *
    Retry_Limit False
    Workers 1
    Host 172.32.20.10
    Port 30006

Answer 1

Score: 0

This error has two possible causes.

  1. The disk space is actually exhausted.
  2. inotify resources are exhausted.

The disk space is actually exhausted:

You can check whether the node has enough disk space left by running the df command on the node.

# Check disk usage
df -h

# Check inode usage
df -ih

If disk space is running low:

  • Remove unused files from the node.
  • Create a node pool with a larger disk size.
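
The df check above can be scripted so that nearly full filesystems stand out. This is a minimal sketch, assuming POSIX-format `df -P` output and mount points without spaces; the function name `flag_full_filesystems` and the threshold value are illustrative and not part of the original answer.

```shell
#!/bin/sh
# flag_full_filesystems THRESHOLD   (hypothetical helper name)
# Reads `df -P`-style output on stdin and prints the mount point and
# use percentage of every filesystem at or above THRESHOLD percent.
flag_full_filesystems() {
    awk -v limit="$1" '
        NR > 1 {                 # skip the header line
            use = $5             # fifth column is the use percentage
            sub(/%/, "", use)    # strip the trailing percent sign
            if (use + 0 >= limit + 0) print $6 " " $5
        }'
}
```

On a node, `df -P | flag_full_filesystems 90` would list filesystems that are at least 90% full. Where the local df supports combining the two flags, `df -Pi | flag_full_filesystems 90` does the same for inode usage.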

inotify resources are exhausted:

If the node still has enough disk space but you keep getting the "No space left on device" error, inotify resources are very likely exhausted.

kubectl logs -f uses inotify to watch log files for changes, and each watch counts against the inotify watch limit.

Linux limits the number of inotify watches per user. You can check the current limit by reading the fs.inotify.max_user_watches kernel parameter:

$ sudo sysctl fs.inotify.max_user_watches

You can check how many inotify watches each process on the node consumes with the following one-liner:

echo -e "COUNT\tPID\tUSER\tCOMMAND" ; sudo find /proc/[0-9]*/fdinfo -type f 2>/dev/null | sudo xargs grep ^inotify 2>/dev/null | cut -d/ -f 3 | uniq -c | sort -nr | { while read -rs COUNT PID; do echo -en "$COUNT\t$PID\t" ; ps -p $PID -o user=,command=; done }

The above command will find large consumers of inotify watches.
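
To see how close the node is to the limit, the total number of watches can be compared with fs.inotify.max_user_watches. This is a rough sketch, assuming the Linux fdinfo format in which each watch appears as a line starting with "inotify"; the function names are illustrative. Because the limit applies per user, the node-wide total is only an upper bound on any single user's consumption.

```shell
#!/bin/sh
# count_inotify_watches [PROC_DIR]   (hypothetical helper name)
# Counts the "inotify ..." lines in all fdinfo files under PROC_DIR
# (default /proc). Each such line corresponds to one watch.
count_inotify_watches() {
    proc_dir=${1:-/proc}
    find "$proc_dir"/[0-9]*/fdinfo -type f -exec cat {} + 2>/dev/null \
        | awk '/^inotify/ { n++ } END { print n + 0 }'
}

# watch_usage_percent USED LIMIT   (hypothetical helper name)
# Prints USED as a whole-number percentage of LIMIT (LIMIT must be > 0).
watch_usage_percent() {
    awk -v used="$1" -v limit="$2" \
        'BEGIN { printf "%d\n", (used * 100) / limit }'
}
```

Run as root so that the fdinfo files of other processes are readable, for example: `watch_usage_percent "$(count_inotify_watches)" "$(cat /proc/sys/fs/inotify/max_user_watches)"`.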

There are a few options to mitigate the issue.

  • Change the application so that it uses fewer inotify watches.
  • Increase the fs.inotify.max_user_watches kernel parameter (e.g. sudo sysctl fs.inotify.max_user_watches=24576). Each inotify watch consumes some kernel memory, so raise the limit with caution.

You can deploy a DaemonSet to raise the inotify watch limit on your cluster's nodes. This should be safe from a node stability perspective.

command:
- /bin/sh
- -c
- |
  while true; do
    sysctl -w fs.inotify.max_user_watches=524288
    sleep 10
  done
imagePullPolicy: IfNotPresent

huangapple
  • Posted on July 18, 2023, 16:22:18
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76710807.html