Fluentbit error "cannot adjust chunk size" on GKE

Question

My services are running on GKE, and I am using the EFK stack for logging. Each node has a fluentbit pod created by a DaemonSet, and there is a fluentd aggregator pod. This setup worked well at first, but the fluentbit pods are now failing: they keep producing errors and restarting.

What is the cause of this error, and how can I fix it?

Logs from fluentbit:

    [2023/07/18 08:08:22] [error] [storage] cannot adjust chunk size '/fluent-bit/buffer//emitter.3/1-1689072144.132487045.flb' to 4096 bytes
    [lib/chunkio/src/cio_file_unix.c:528 errno=28] No space left on device
    [lib/chunkio/src/cio_file.c:1116 errno=28] No space left on device
    [2023/07/18 08:08:22] [error] [storage] cannot adjust chunk size '/fluent-bit/buffer//emitter.3/1-1689072143.865639031.flb' to 4096 bytes
    [lib/chunkio/src/cio_file_unix.c:528 errno=28] No space left on device
    [lib/chunkio/src/cio_file.c:1116 errno=28] No space left on device
    [2023/07/18 08:08:22] [error] [storage] cannot adjust chunk size '/fluent-bit/buffer//emitter.3/1-1689072143.703709663.flb' to 4096 bytes
    [2023/07/18 08:08:22] [ info] [storage] ver=1.3.0, type=memory+filesystem, sync=full, checksum=off, max_chunks_up=128
    [2023/07/18 08:08:22] [ info] [storage] backlog input plugin: storage_backlog.1
    [2023/07/18 08:08:22] [ info] [cmetrics] version=0.5.7
    [2023/07/18 08:08:22] [ info] [ctraces ] version=0.2.5
    [2023/07/18 08:08:22] [ info] [input:tail:tail.0] initializing
    [2023/07/18 08:08:22] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
    [2023/07/18 08:08:22] [error] [sqldb] error=disk I/O error
    [2023/07/18 08:08:22] [error] [input:tail:tail.0] db: could not create 'in_tail_files' table
    [2023/07/18 08:08:22] [error] [input:tail:tail.0] could not open/create database
    [2023/07/18 08:08:22] [error] failed initialize input tail.0
    [2023/07/18 08:08:22] [error] [engine] input initialization failed
    [2023/07/18 08:08:22] [error] [lib] backend failed

Events of fluent-bit:

    > kubectl describe po fluent-bit-xmkj6
    ...
    Events:
      Type     Reason   Age                     From     Message
      ----     ------   ----                    ----     -------
      Normal   Pulling  51m (x1718 over 6d3h)   kubelet  Pulling image "cr.fluentbit.io/fluent/fluent-bit:2.0.5"
      Warning  BackOff  96s (x43323 over 6d3h)  kubelet  Back-off restarting failed container

fluent-bit.conf:

    [SERVICE]
        Daemon                Off
        Flush                 1
        Log_Level             info
        storage.path          /fluent-bit/buffer/
        storage.sync          full
        storage.checksum      off
        Parsers_File          parsers.conf
        Parsers_File          custom_parsers.conf
        HTTP_Server           On
        HTTP_Listen           0.0.0.0
        HTTP_Port             2020
        Health_Check          On

    [INPUT]
        Name                  tail
        Path                  /var/log/containers/*.log
        db                    /fluent-bit/buffer/logs.db
        multiline.parser      docker, cri
        Tag                   kube.*
        Skip_Long_Lines       On
        Skip_Empty_lines      On

    [FILTER]
        Name                  kubernetes
        Match                 kube.**
        Kube_URL              https://kubernetes.default.svc.cluster.local:443
        Kube_Tag_Prefix       kube.var.log.containers.
        Merge_Log             On
        Keep_Log              Off
        Annotations           Off
        K8S-Logging.Parser    On
        K8S-Logging.Exclude   On

    [FILTER]
        Name                  rewrite_tag
        Log_Level             debug
        Match                 kube.**
        Rule                  $kubernetes['labels']['type'] ^(.*)$ dev.service.$kubernetes['namespace_name'].$kubernetes['labels']['type'].$kubernetes['container_name'] false
        Emitter_Name          re_emitted_type
        Emitter_Storage.type  filesystem

    [FILTER]
        Name                  rewrite_tag
        Log_Level             debug
        Match                 kube.**
        Rule                  $kubernetes['container_name'] ^(.*)$ dev.service.$kubernetes['namespace_name'].$kubernetes['container_name'] false
        Emitter_Name          re_emitted_no_type
        Emitter_Storage.type  filesystem

    [OUTPUT]
        Name                  forward
        Match                 *
        Retry_Limit           False
        Workers               1
        Host                  172.32.20.10
        Port                  30006

Answer 1

Score: 0

This error has two possible causes.

  1. The disk space is actually exhausted.
  2. inotify resources are exhausted.

The disk space is actually exhausted:

You can check whether enough disk space is left on the node by running the df command on the node:

    # Check disk usage
    df -h

    # Check inode usage
    df -ih
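
If you cannot get a shell on the node directly, one way to run these checks is an ephemeral debug pod on the affected node. A minimal sketch, assuming `kubectl debug` is available in your cluster (the node name is a placeholder; the node's root filesystem is mounted at /host inside the debug container):

    # Start a throwaway debug pod on the node (node name is a placeholder)
    kubectl debug node/gke-example-node -it --image=busybox

    # Inside the debug pod, the node's filesystem is available under /host
    df -h /host
    df -ih /host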

If you find that disk space is under pressure:

  • Remove unused files from the node.
  • Create a node pool with a larger disk size (see the sketch after this list).
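
For the second option, a new node pool with a bigger boot disk can be created with gcloud and the workloads then moved onto it. A minimal sketch, assuming gcloud access to the cluster (pool name, cluster name, zone and disk size are placeholders):

    # Create a node pool with a larger boot disk (all values are placeholders)
    gcloud container node-pools create larger-disk-pool \
        --cluster my-cluster \
        --zone us-central1-a \
        --disk-size 200 \
        --num-nodes 3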

inotify resources are exhausted:

If there is enough disk space left on your node but you are still getting the "No space left on device" error, it is highly likely that inotify resources are exhausted.

kubectl logs -f uses inotify to watch files for changes, and it consumes inotify watches.

In Linux there is a limit on the number of inotify watches. You can check the current limit by looking at the fs.inotify.max_user_watches kernel parameter:

    $ sudo sysctl fs.inotify.max_user_watches

You can see how many inotify watches each process on the node consumes with the following one-liner:

    echo -e "COUNT\tPID\tUSER\tCOMMAND" ; sudo find /proc/[0-9]*/fdinfo -type f 2>/dev/null | sudo xargs grep ^inotify 2>/dev/null | cut -d/ -f 3 | uniq -c | sort -nr | { while read -rs COUNT PID; do echo -en "$COUNT\t$PID\t" ; ps -p $PID -o user=,command=; done }

The command above will find the large consumers of inotify watches.

There are a few options to mitigate the issue:

  • You can change the application so that it does not consume a large number of inotify watches.
  • Or you can increase the fs.inotify.max_user_watches kernel parameter (e.g. sudo sysctl fs.inotify.max_user_watches=24576). Note that each inotify watch consumes some memory, so use this solution with caution.

You can deploy a DaemonSet to raise the inotify watch limit on the cluster's nodes. This should be safe from a node-stability perspective.

    command:
    - /bin/sh
    - -c
    - |
      while true; do
        sysctl -w fs.inotify.max_user_watches=524288
        sleep 10
      done
    imagePullPolicy: IfNotPresent
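
For reference, a minimal DaemonSet manifest that this snippet could sit in might look like the following sketch; the name, namespace and busybox image are placeholders, and the container must run privileged so that the sysctl write actually reaches the node:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: inotify-max-watches          # placeholder name
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          app: inotify-max-watches
      template:
        metadata:
          labels:
            app: inotify-max-watches
        spec:
          containers:
          - name: sysctl
            image: busybox:1.36          # any small image that ships sysctl works
            imagePullPolicy: IfNotPresent
            securityContext:
              privileged: true           # required to write node-level sysctls
            command:
            - /bin/sh
            - -c
            - |
              while true; do
                sysctl -w fs.inotify.max_user_watches=524288
                sleep 10
              done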

