Fluentbit error "cannot adjust chunk size" on GKE

Question

My services are running on GKE, and I am using the EFK stack for logging. Each node has a fluentbit pod created by a DaemonSet, and there is a fluentd aggregator pod. This setup worked well at first, but the fluentbit pods are now failing: they keep producing errors and restarting.

What is the cause of this error, and how can I fix it?

Logs from fluentbit:

    [2023/07/18 08:08:22] [error] [storage] cannot adjust chunk size '/fluent-bit/buffer//emitter.3/1-1689072144.132487045.flb' to 4096 bytes
    [lib/chunkio/src/cio_file_unix.c:528 errno=28] No space left on device
    [lib/chunkio/src/cio_file.c:1116 errno=28] No space left on device
    [2023/07/18 08:08:22] [error] [storage] cannot adjust chunk size '/fluent-bit/buffer//emitter.3/1-1689072143.865639031.flb' to 4096 bytes
    [lib/chunkio/src/cio_file_unix.c:528 errno=28] No space left on device
    [lib/chunkio/src/cio_file.c:1116 errno=28] No space left on device
    [2023/07/18 08:08:22] [error] [storage] cannot adjust chunk size '/fluent-bit/buffer//emitter.3/1-1689072143.703709663.flb' to 4096 bytes
    [2023/07/18 08:08:22] [ info] [storage] ver=1.3.0, type=memory+filesystem, sync=full, checksum=off, max_chunks_up=128
    [2023/07/18 08:08:22] [ info] [storage] backlog input plugin: storage_backlog.1
    [2023/07/18 08:08:22] [ info] [cmetrics] version=0.5.7
    [2023/07/18 08:08:22] [ info] [ctraces ] version=0.2.5
    [2023/07/18 08:08:22] [ info] [input:tail:tail.0] initializing
    [2023/07/18 08:08:22] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
    [2023/07/18 08:08:22] [error] [sqldb] error=disk I/O error
    [2023/07/18 08:08:22] [error] [input:tail:tail.0] db: could not create 'in_tail_files' table
    [2023/07/18 08:08:22] [error] [input:tail:tail.0] could not open/create database
    [2023/07/18 08:08:22] [error] failed initialize input tail.0
    [2023/07/18 08:08:22] [error] [engine] input initialization failed
    [2023/07/18 08:08:22] [error] [lib] backend failed

Events of fluent-bit:

    > kubectl describe po fluent-bit-xmkj6
    ...
    Events:
      Type     Reason   Age                     From     Message
      ----     ------   ----                    ----     -------
      Normal   Pulling  51m (x1718 over 6d3h)   kubelet  Pulling image "cr.fluentbit.io/fluent/fluent-bit:2.0.5"
      Warning  BackOff  96s (x43323 over 6d3h)  kubelet  Back-off restarting failed container

fluent-bit.conf:

    [SERVICE]
        Daemon                Off
        Flush                 1
        Log_Level             info
        storage.path          /fluent-bit/buffer/
        storage.sync          full
        storage.checksum      off
        Parsers_File          parsers.conf
        Parsers_File          custom_parsers.conf
        HTTP_Server           On
        HTTP_Listen           0.0.0.0
        HTTP_Port             2020
        Health_Check          On

    [INPUT]
        Name                  tail
        Path                  /var/log/containers/*.log
        db                    /fluent-bit/buffer/logs.db
        multiline.parser      docker, cri
        Tag                   kube.*
        Skip_Long_Lines       On
        Skip_Empty_lines      On

    [FILTER]
        Name                  kubernetes
        Match                 kube.**
        Kube_URL              https://kubernetes.default.svc.cluster.local:443
        Kube_Tag_Prefix       kube.var.log.containers.
        Merge_Log             On
        Keep_Log              Off
        Annotations           Off
        K8S-Logging.Parser    On
        K8S-Logging.Exclude   On

    [FILTER]
        Name                  rewrite_tag
        Log_Level             debug
        Match                 kube.**
        Rule                  $kubernetes['labels']['type'] ^(.*)$ dev.service.$kubernetes['namespace_name'].$kubernetes['labels']['type'].$kubernetes['container_name'] false
        Emitter_Name          re_emitted_type
        Emitter_Storage.type  filesystem

    [FILTER]
        Name                  rewrite_tag
        Log_Level             debug
        Match                 kube.**
        Rule                  $kubernetes['container_name'] ^(.*)$ dev.service.$kubernetes['namespace_name'].$kubernetes['container_name'] false
        Emitter_Name          re_emitted_no_type
        Emitter_Storage.type  filesystem

    [OUTPUT]
        Name                  forward
        Match                 *
        Retry_Limit           False
        Workers               1
        Host                  172.32.20.10
        Port                  30006

Answer 1

Score: 0

This error has two possible causes.

  1. The disk space is actually exhausted.
  2. inotify resources are exhausted.

The disk space is actually exhausted:

You can check whether enough disk space is left on the node by running the df command on the node:

    # Check disk usage
    df -h

    # Check inode usage
    df -ih
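
If you cannot get a shell on the node directly, one way to run these checks is an ephemeral debug pod on the affected node. A minimal sketch, assuming `kubectl debug` is available in your cluster (the node name is a placeholder; the node's root filesystem is mounted at /host inside the debug container):

    # Start a throwaway debug pod on the node (node name is a placeholder)
    kubectl debug node/gke-example-node -it --image=busybox

    # Inside the debug pod, the node's filesystem is available under /host
    df -h /host
    df -ih /host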

If you find that disk space is under pressure:

  • Remove unused files from the node.
  • Create a node pool with a larger disk size (see the sketch after this list).
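
For the second option, a new node pool with a bigger boot disk can be created with gcloud and the workloads then moved onto it. A minimal sketch, assuming gcloud access to the cluster (pool name, cluster name, zone and disk size are placeholders):

    # Create a node pool with a larger boot disk (all values are placeholders)
    gcloud container node-pools create larger-disk-pool \
        --cluster my-cluster \
        --zone us-central1-a \
        --disk-size 200 \
        --num-nodes 3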

inotify resources are exhausted:

If there is enough disk space left on your node but you are still getting the "No space left on device" error, it is highly likely that inotify resources are exhausted.

kubectl logs -f uses inotify to watch files for changes, and it consumes inotify watches.

In Linux there is a limit on the number of inotify watches. You can check the current limit by looking at the fs.inotify.max_user_watches kernel parameter:

    $ sudo sysctl fs.inotify.max_user_watches

You can see how many inotify watches each process on the node consumes with the following one-liner:

    echo -e "COUNT\tPID\tUSER\tCOMMAND" ; sudo find /proc/[0-9]*/fdinfo -type f 2>/dev/null | sudo xargs grep ^inotify 2>/dev/null | cut -d/ -f 3 | uniq -c | sort -nr | { while read -rs COUNT PID; do echo -en "$COUNT\t$PID\t" ; ps -p $PID -o user=,command=; done }

The command above will find the large consumers of inotify watches.

There are a few options to mitigate the issue:

  • You can change the application so that it does not consume a large number of inotify watches.
  • Or you can increase the fs.inotify.max_user_watches kernel parameter (e.g. sudo sysctl fs.inotify.max_user_watches=24576). Note that each inotify watch consumes some memory, so use this solution with caution.

You can deploy a DaemonSet to raise the inotify watch limit on the cluster's nodes. This should be safe from a node-stability perspective.

    command:
    - /bin/sh
    - -c
    - |
      while true; do
        sysctl -w fs.inotify.max_user_watches=524288
        sleep 10
      done
    imagePullPolicy: IfNotPresent
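
For reference, a minimal DaemonSet manifest that this snippet could sit in might look like the following sketch; the name, namespace and busybox image are placeholders, and the container must run privileged so that the sysctl write actually reaches the node:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: inotify-max-watches          # placeholder name
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          app: inotify-max-watches
      template:
        metadata:
          labels:
            app: inotify-max-watches
        spec:
          containers:
          - name: sysctl
            image: busybox:1.36          # any small image that ships sysctl works
            imagePullPolicy: IfNotPresent
            securityContext:
              privileged: true           # required to write node-level sysctls
            command:
            - /bin/sh
            - -c
            - |
              while true; do
                sysctl -w fs.inotify.max_user_watches=524288
                sleep 10
              done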

