Fluent-bit hanging and stopping operation #9927

Open
nikitamishagin opened this issue Feb 7, 2025 · 9 comments

Comments

@nikitamishagin

Bug Report

Describe the bug
From time to time, the fluent-bit application hangs and stops collecting logs. It performs no activity, emits no logs of its own, and stops generating metrics. The Kubernetes pod does not crash or restart when this happens; it remains in the Ready state. The problem can only be detected manually, by noticing that application logs have stopped being delivered.

To Reproduce

  • Deploy the Helm chart to a Kubernetes cluster
  • Start collecting logs from an application in the cluster
  • Wait for the problem to occur

Expected behavior
Stable operation and collection of all logs without hangs.

Your Environment

  • Version used: chart version 0.47.9, app version 3.1.7
  • Configuration:
    [SERVICE]
        Daemon           Off
        Flush            {{ .Values.flush }}
        Log_Level        {{ .Values.logLevel }}
        Parsers_File     /fluent-bit/etc/parsers.conf
        Parsers_File     /fluent-bit/etc/conf/custom_parsers.conf
        HTTP_Server      On
        HTTP_Listen      0.0.0.0
        HTTP_Port        {{ .Values.metricsPort }}
        Health_Check     On
        storage.sync     normal
        storage.checksum off
        storage.path     /var/log/flb-storage/

    [INPUT]
        Name           systemd
        Tag            host.*
        Read_From_Tail On
        storage.type   filesystem

    [INPUT]
        Name            tail
        Tag             kube.*
        Path            /var/log/containers/*.log
        Parser          cri
        Skip_Long_Lines On
        storage.type    filesystem

    [INPUT]
        Name          tail
        Tag           audit.*
        Path          /var/log/kube-audit/kube-audit.log
        storage.type  filesystem

    [FILTER]
        Name modify
        Match host.*
        Rename _HOSTNAME @HOSTNAME
        Rename _MACHINE_ID @MACHINE_ID
        Rename _BOOT_ID @BOOT_ID
        Rename _CMDLINE @CMDLINE
        Rename _COMM @COMM
        Rename _EXE @EXE
        Rename _PID @PID
        Rename _UID @UID
        Rename _GID @GID
        Rename _CAP_EFFECTIVE @CAP_EFFECTIVE
        Rename _SELINUX_CONTEXT @SELINUX_CONTEXT
        Rename _AUDIT_LOGINUID @AUDIT_LOGINUID
        Rename _AUDIT_SESSION @AUDIT_SESSION
        Rename _SYSTEMD_CGROUP @SYSTEMD_CGROUP
        Rename _SYSTEMD_SLICE @SYSTEMD_SLICE
        Rename _SYSTEMD_UNIT @SYSTEMD_UNIT
        Rename _SYSTEMD_SESSION @SYSTEMD_SESSION
        Rename _SYSTEMD_OWNER_UID @SYSTEMD_OWNER_UID
        Rename _TRANSPORT @TRANSPORT
        Rename _KERNEL_DEVICE @KERNEL_DEVICE
        Rename _KERNEL_SUBSYSTEM @KERNEL_SUBSYSTEM
        Rename _UDEV_SYSNAME @UDEV_SYSNAME
        Rename _UDEV_DEVNODE @UDEV_DEVNODE
        Rename _UDEV_DEVLINK @UDEV_DEVLINK
        Remove _SOURCE_REALTIME_TIMESTAMP

    [FILTER]
        Name         parser
        Match        kube.*
        Key_Name     message
        Parser       json_custom
        Reserve_Data True

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Merge_Log           On
        Keep_Log            Off
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On

    [FILTER]
        Name   lua
        Match  kube.*
        script /fluent-bit/scripts/filters.lua
        Call   modify_log

    [FILTER]
        Name  rewrite_tag
        Match kube.*
        Rule  $es_index_date ^(daily)$ daily.$TAG false

    [FILTER]
        Name  rewrite_tag
        Match kube.*
        Rule  $es_index_date ^(weekly)$ weekly.$TAG false

    [FILTER]
        Name  rewrite_tag
        Match kube.*
        Rule  $es_index_date ^(monthly)$ monthly.$TAG false

    [FILTER]
        Name   record_modifier
        Match  audit.*
        Record cluster <cluster-name>

    [OUTPUT]
        Name                     es
        Match                    host.*
        Host                     ${ELASTICSEARCH_HOST}
        Port                     ${ELASTICSEARCH_PORT}
        HTTP_User                ${ELASTICSEARCH_USER}
        HTTP_Passwd              ${ELASTICSEARCH_PASSWORD}
        Logstash_Format          On
        Logstash_Prefix          infra=journal
        Logstash_DateFormat      %Y-%m-%d
        Time_Key                 @SOURCE_REALTIME_TIMESTAMP
        Time_Key_Format          %Y-%m-%dT%H:%M:%S
        storage.total_limit_size 1G

    [OUTPUT]
        Name                     es
        Match                    daily.kube.*
        Host                     ${ELASTICSEARCH_HOST}
        Port                     ${ELASTICSEARCH_PORT}
        HTTP_User                ${ELASTICSEARCH_USER}
        HTTP_Passwd              ${ELASTICSEARCH_PASSWORD}
        Logstash_Format          On
        Logstash_Prefix_Key      pre_index
        Logstash_DateFormat      %Y-%m-%d
        Time_Key_Nanos           On
        storage.total_limit_size 1G

    [OUTPUT]
        Name                     es
        Match                    weekly.kube.*
        Host                     ${ELASTICSEARCH_HOST}
        Port                     ${ELASTICSEARCH_PORT}
        HTTP_User                ${ELASTICSEARCH_USER}
        HTTP_Passwd              ${ELASTICSEARCH_PASSWORD}
        Logstash_Format          On
        Logstash_Prefix_Key      pre_index
        Logstash_DateFormat      %Y-w%W
        Time_Key_Nanos           On
        storage.total_limit_size 1G

    [OUTPUT]
        Name                     es
        Match                    monthly.kube.*
        Host                     ${ELASTICSEARCH_HOST}
        Port                     ${ELASTICSEARCH_PORT}
        HTTP_User                ${ELASTICSEARCH_USER}
        HTTP_Passwd              ${ELASTICSEARCH_PASSWORD}
        Logstash_Format          On
        Logstash_Prefix_Key      pre_index
        Logstash_DateFormat      %Y-m%m
        Time_Key_Nanos           On
        storage.total_limit_size 1G

    [OUTPUT]
        name                     syslog
        match                    audit.*
        host                     <private-host>
        port                     15110
        mode                     udp
        syslog_format            rfc3164
        syslog_maxsize           2048
        syslog_hostname_key      cluster
        syslog_message_key       log
        storage.total_limit_size 4G

    [PARSER]
        Name   json_custom
        Format json

  filters.lua: |
    function modify_log(tag, timestamp, record)
        -- Add cluster name.
        record.kubernetes.cluster = "{{ .Environment.Name }}"
        -- Check that annotations exist.
        if record["kubernetes"]["annotations"] then
            -- Check that es-index-project annotation exists.
            if record["kubernetes"]["annotations"]["example.com/es-index-project"] then
                -- Check that es-index-project annotation contains a value
                if string.len(record["kubernetes"]["annotations"]["example.com/es-index-project"]) > 0 then
                    es_index_project = record["kubernetes"]["annotations"]["example.com/es-index-project"]
                else
                    -- If there is no value, assign "fixme" so that the logs are not lost.
                    es_index_project = "fixme"
                end
                -- Check that es-index-date annotation exists.
                if record["kubernetes"]["annotations"]["example.com/es-index-date"] then
                    es_index_date = record["kubernetes"]["annotations"]["example.com/es-index-date"]
                    if es_index_date == "daily" then
                        record.es_index_date = es_index_date
                    elseif es_index_date == "weekly" then
                        record.es_index_date = es_index_date
                    elseif es_index_date == "monthly" then
                        record.es_index_date = es_index_date
                    else
                        -- Discard the log if es-index-date does not equal daily|weekly|monthly.
                        return -1, 0, 0
                    end
                    -- Check that es-index-ext annotation exists.
                    if record["kubernetes"]["annotations"]["example.com/es-index-ext"] then
                        es_index_ext = record["kubernetes"]["annotations"]["example.com/es-index-ext"]
                        -- Check that es-index-ext is not empty.
                        if string.len(es_index_ext) > 0 then
                            -- Create a prefix with es_index_ext key for the index.
                            record.pre_index = es_index_project .. "=k8s-" .. es_index_ext
                        end
                    else
                        -- Create a prefix without es_index_ext key for the index.
                        record.pre_index = es_index_project .. "=k8s"
                    end
                else
                    -- Discard the log if es-index-date does not exist.
                    return -1, 0, 0
                end
            else
                -- Discard the log if es-index-project does not exist.
                return -1, 0, 0
            end
        else
            -- Discard the log if there are no annotations.
            return -1, 0, 0
        end
        -- Return the modified log.
        return 2, timestamp, record
    end

  • Environment name and version: vanilla Kubernetes v1.30.6
  • Server type and version: VM on libvirt
  • Operating System and version: CentOS Stream 9, kernel 6.10.10-1.el9.elrepo.x86_64
  • Filters and plugins: see the configuration above.

Additional context
The issue does not occur frequently: it happens a few times per week, each time on one of more than 400 instances. It may occur more often than that, but we do not notice every occurrence because the problem is hard to detect.
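
For reference, a minimal sketch of how a stalled instance might be detected through the built-in HTTP server enabled in the configuration above. This is only an illustration: it assumes the metrics port is 2020 (the real value comes from metricsPort), that the endpoint is reachable on localhost, and it relies on the fluentbit_output_proc_records_total counter exposed by the Prometheus endpoint.

# Hypothetical watchdog sketch: scrape the metrics endpoint twice, five minutes
# apart, and compare the total number of records flushed by all outputs. A
# counter that does not advance on a node that is still producing logs points
# at a stalled instance.
PORT=2020   # assumed value of metricsPort
count() {
  curl -s "http://127.0.0.1:${PORT}/api/v1/metrics/prometheus" \
    | awk '/^fluentbit_output_proc_records_total/ {sum += $2} END {print sum + 0}'
}
before=$(count)
sleep 300
after=$(count)
if [ "$after" = "$before" ]; then
  echo "fluent-bit appears to be stalled: no output records flushed in 5 minutes"
fi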

@Snak3GMS
Copy link

Snak3GMS commented Feb 7, 2025

I can confirm that I am experiencing the same issue with the same configuration and versions. The behavior is identical to what has been reported.

@pawel-lmcb

@Snak3GMS @nikitamishagin any logs that you can attach?

The behavior isn't exactly the same, but these might all be related:
#9917
#9906

These all exhibit hanging behavior.

@Snak3GMS

Snak3GMS commented Feb 8, 2025

@pawel-lmcb

  1. strace
[pid    18] 00:12:40.898547 <... restart_syscall resumed>) = 0 <0.891804>
[pid    18] 00:12:40.898699 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid     1] 00:12:41.064008 <... clock_nanosleep resumed>0x7fff3e6c4b10) = 0 <1.000104>
[pid     1] 00:12:41.064069 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid    18] 00:12:41.898923 <... clock_nanosleep resumed>0x7f2a635fb760) = 0 <1.000146>
[pid    18] 00:12:41.899006 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid     1] 00:12:42.064434 <... clock_nanosleep resumed>0x7fff3e6c4b10) = 0 <1.000231>
[pid     1] 00:12:42.064566 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid    16] 00:12:42.152983 <... epoll_wait resumed>[{EPOLLIN, {u32=1684037792, u64=139820049391776}}], 64, -1) = 1 <1.498995>
[pid    15] 00:12:42.153083 <... epoll_wait resumed>[{EPOLLIN, {u32=1688232096, u64=139820053586080}}], 64, -1) = 1 <1.499113>
[pid    14] 00:12:42.153209 <... epoll_wait resumed>[{EPOLLIN, {u32=1692426400, u64=139820057780384}}], 64, -1) = 1 <1.499217>
[pid    13] 00:12:42.153285 <... epoll_wait resumed>[{EPOLLIN, {u32=1696620704, u64=139820061974688}}], 64, -1) = 1 <1.499286>
[pid    12] 00:12:42.153319 <... epoll_wait resumed>[{EPOLLIN, {u32=1742758048, u64=139820108112032}}], 64, -1) = 1 <1.499237>
[pid    11] 00:12:42.153364 <... epoll_wait resumed>[{EPOLLIN, {u32=1746952352, u64=139820112306336}}], 64, -1) = 1 <1.499269>
[pid    10] 00:12:42.153389 <... epoll_wait resumed>[{EPOLLIN, {u32=1751146656, u64=139820116500640}}], 64, -1) = 1 <1.499238>
[pid     9] 00:12:42.153414 <... epoll_wait resumed>[{EPOLLIN, {u32=1755340960, u64=139820120694944}}], 64, -1) = 1 <1.499250>
[pid    16] 00:12:42.153447 read(184,  <unfinished ...>
[pid    15] 00:12:42.153461 read(180,  <unfinished ...>
[pid    14] 00:12:42.153474 read(176,  <unfinished ...>
[pid    13] 00:12:42.153501 read(172,  <unfinished ...>
[pid    12] 00:12:42.153515 read(135,  <unfinished ...>
[pid    11] 00:12:42.153534 read(137,  <unfinished ...>
[pid    10] 00:12:42.153548 read(136,  <unfinished ...>
[pid    16] 00:12:42.153574 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000111>
[pid    15] 00:12:42.153629 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000158>
[pid    14] 00:12:42.153650 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000167>
[pid    13] 00:12:42.153669 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000160>
[pid    12] 00:12:42.153690 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000166>
[pid    11] 00:12:42.153712 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000168>
[pid     9] 00:12:42.153734 read(134,  <unfinished ...>
[pid    16] 00:12:42.153754 epoll_wait(161,  <unfinished ...>
[pid    15] 00:12:42.153803 epoll_wait(158,  <unfinished ...>
[pid    14] 00:12:42.153828 epoll_wait(149,  <unfinished ...>
[pid    13] 00:12:42.153851 epoll_wait(140,  <unfinished ...>
[pid    12] 00:12:42.153871 epoll_wait(121,  <unfinished ...>
[pid    11] 00:12:42.153893 epoll_wait(118,  <unfinished ...>
[pid    10] 00:12:42.153908 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000349>
[pid    16] 00:12:42.153966 <... epoll_wait resumed>[], 64, 0) = 0 <0.000189>
[pid    15] 00:12:42.153986 <... epoll_wait resumed>[], 64, 0) = 0 <0.000168>
[pid    14] 00:12:42.154003 <... epoll_wait resumed>[], 64, 0) = 0 <0.000159>
[pid    13] 00:12:42.154020 <... epoll_wait resumed>[], 64, 0) = 0 <0.000157>
[pid    12] 00:12:42.154037 <... epoll_wait resumed>[], 64, 0) = 0 <0.000152>
[pid    11] 00:12:42.154055 <... epoll_wait resumed>[], 64, 0) = 0 <0.000153>
[pid     9] 00:12:42.154072 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000329>
[pid    16] 00:12:42.154126 epoll_wait(161,  <unfinished ...>
[pid    15] 00:12:42.154161 epoll_wait(158,  <unfinished ...>
[pid    14] 00:12:42.154176 epoll_wait(149,  <unfinished ...>
[pid    13] 00:12:42.154188 epoll_wait(140,  <unfinished ...>
[pid    12] 00:12:42.154203 epoll_wait(121,  <unfinished ...>
[pid    11] 00:12:42.154216 epoll_wait(118,  <unfinished ...>
[pid    10] 00:12:42.154229 epoll_wait(109,  <unfinished ...>
[pid     9] 00:12:42.154242 epoll_wait(106,  <unfinished ...>
[pid    10] 00:12:42.154258 <... epoll_wait resumed>[], 64, 0) = 0 <0.000020>
[pid     9] 00:12:42.154278 <... epoll_wait resumed>[], 64, 0) = 0 <0.000027>
[pid    10] 00:12:42.154297 epoll_wait(109,  <unfinished ...>
[pid     9] 00:12:42.154313 epoll_wait(106,  <unfinished ...>
[pid    19] 00:12:42.652866 <... epoll_wait resumed>[{EPOLLIN, {u32=1644290128, u64=139820009644112}}], 256, -1) = 1 <2.646141>
[pid    19] 00:12:42.652957 read(202, "\1\0\0\0\0\0\0\0", 8) = 8 <0.000025>
[pid    19] 00:12:42.653034 epoll_wait(196,  <unfinished ...>
[pid    18] 00:12:42.899288 <... clock_nanosleep resumed>0x7f2a635fb760) = 0 <1.000156>
[pid    18] 00:12:42.899357 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid     1] 00:12:43.064814 <... clock_nanosleep resumed>0x7fff3e6c4b10) = 0 <1.000198>
[pid     1] 00:12:43.064876 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid    15] 00:12:43.652891 <... epoll_wait resumed>[{EPOLLIN, {u32=1688232096, u64=139820053586080}}], 64, -1) = 1 <1.498676>
[pid    13] 00:12:43.653018 <... epoll_wait resumed>[{EPOLLIN, {u32=1696620704, u64=139820061974688}}], 64, -1) = 1 <1.498813>
[pid    11] 00:12:43.653115 <... epoll_wait resumed>[{EPOLLIN, {u32=1746952352, u64=139820112306336}}], 64, -1) = 1 <1.498886>
[pid     9] 00:12:43.653176 <... epoll_wait resumed>[{EPOLLIN, {u32=1755340960, u64=139820120694944}}], 64, -1) = 1 <1.498851>
[pid    16] 00:12:43.653209 <... epoll_wait resumed>[{EPOLLIN, {u32=1684037792, u64=139820049391776}}], 64, -1) = 1 <1.499074>
[pid    15] 00:12:43.653234 read(180,  <unfinished ...>
[pid    14] 00:12:43.653252 <... epoll_wait resumed>[{EPOLLIN, {u32=1692426400, u64=139820057780384}}], 64, -1) = 1 <1.499068>
[pid    13] 00:12:43.653276 read(172,  <unfinished ...>
[pid    12] 00:12:43.653292 <... epoll_wait resumed>[{EPOLLIN, {u32=1742758048, u64=139820108112032}}], 64, -1) = 1 <1.499079>
[pid    11] 00:12:43.653315 read(137,  <unfinished ...>
[pid    10] 00:12:43.653328 <... epoll_wait resumed>[{EPOLLIN, {u32=1751146656, u64=139820116500640}}], 64, -1) = 1 <1.499021>
[pid    16] 00:12:43.653356 read(184,  <unfinished ...>
[pid    15] 00:12:43.653372 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000127>
[pid    14] 00:12:43.653395 read(176,  <unfinished ...>
[pid    13] 00:12:43.653409 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000124>
[pid    11] 00:12:43.653429 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000105>
[pid     9] 00:12:43.653450 read(134,  <unfinished ...>
[pid    16] 00:12:43.653474 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000108>
[pid    15] 00:12:43.653504 epoll_wait(158,  <unfinished ...>
[pid    14] 00:12:43.653567 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000161>
[pid    13] 00:12:43.653590 epoll_wait(140,  <unfinished ...>
[pid    12] 00:12:43.653603 read(135,  <unfinished ...>
[pid    11] 00:12:43.653617 epoll_wait(118,  <unfinished ...>
[pid    10] 00:12:43.653631 read(136,  <unfinished ...>
[pid     9] 00:12:43.653641 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000182>
[pid    16] 00:12:43.653672 epoll_wait(161,  <unfinished ...>
[pid    15] 00:12:43.653685 <... epoll_wait resumed>[], 64, 0) = 0 <0.000171>
[pid    14] 00:12:43.653702 epoll_wait(149,  <unfinished ...>
[pid    13] 00:12:43.653717 <... epoll_wait resumed>[], 64, 0) = 0 <0.000118>
[pid    12] 00:12:43.653732 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000121>
[pid    11] 00:12:43.653754 <... epoll_wait resumed>[], 64, 0) = 0 <0.000127>
[pid    10] 00:12:43.653801 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000162>
[pid     9] 00:12:43.653825 epoll_wait(106,  <unfinished ...>
[pid    16] 00:12:43.653848 <... epoll_wait resumed>[], 64, 0) = 0 <0.000167>
[pid    15] 00:12:43.653865 epoll_wait(158,  <unfinished ...>
[pid    14] 00:12:43.653878 <... epoll_wait resumed>[], 64, 0) = 0 <0.000166>
[pid    13] 00:12:43.653894 epoll_wait(140,  <unfinished ...>
[pid    12] 00:12:43.653908 epoll_wait(121,  <unfinished ...>
[pid    11] 00:12:43.653922 epoll_wait(118,  <unfinished ...>
[pid    10] 00:12:43.653938 epoll_wait(109,  <unfinished ...>
[pid     9] 00:12:43.653952 <... epoll_wait resumed>[], 64, 0) = 0 <0.000118>
[pid    16] 00:12:43.653974 epoll_wait(161,  <unfinished ...>
[pid    14] 00:12:43.653986 epoll_wait(149,  <unfinished ...>
[pid    12] 00:12:43.653998 <... epoll_wait resumed>[], 64, 0) = 0 <0.000081>
[pid    10] 00:12:43.654015 <... epoll_wait resumed>[], 64, 0) = 0 <0.000069>
[pid    12] 00:12:43.654038 epoll_wait(121,  <unfinished ...>
[pid     9] 00:12:43.654051 epoll_wait(106,  <unfinished ...>
[pid    10] 00:12:43.654067 epoll_wait(109,  <unfinished ...>
[pid    18] 00:12:43.899534 <... clock_nanosleep resumed>0x7f2a635fb760) = 0 <1.000131>
[pid    18] 00:12:43.899609 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid     1] 00:12:44.065226 <... clock_nanosleep resumed>0x7fff3e6c4b10) = 0 <1.000236>
[pid     1] 00:12:44.065374 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid    19] 00:12:44.241955 <... epoll_wait resumed>[{EPOLLIN, {u32=1644212352, u64=139820009566336}}], 256, -1) = 1 <1.588873>
[pid    19] 00:12:44.242051 accept4(199, {sa_family=AF_INET, sin_port=htons(33386), sin_addr=inet_addr("10.99.10.35")}, [16], SOCK_CLOEXEC|SOCK_NONBLOCK) = 46 <0.000034>
[pid    19] 00:12:44.242162 epoll_ctl(196, EPOLL_CTL_ADD, 46, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=1644781568, u64=139820010135552}}) = 0 <0.000017>
[pid    19] 00:12:44.242232 epoll_wait(196, [{EPOLLIN, {u32=1644781568, u64=139820010135552}}], 256, -1) = 1 <0.000014>
[pid    19] 00:12:44.242305 recvfrom(46, "GET /api/v1/metrics/prometheus H"..., 4096, 0, NULL, NULL) = 1733 <0.000017>
[pid    19] 00:12:44.242478 newfstatat(AT_FDCWD, "/api/v1/metrics/prometheus", 0x7f2a62bfafd0, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory) <0.000034>
[pid    19] 00:12:44.242809 writev(46, [{iov_base="HTTP/1.1 200 OK\r\n", iov_len=17}, {iov_base="Server: Monkey/1.7.2\r\nDate: Sat,"..., iov_len=59}, {iov_base="Transfer-Encoding: chunked\r\n", iov_len=28}], 3) = 104 <0.000050>
[pid    19] 00:12:44.242926 writev(46, [{iov_base="Content-Type: text/plain; versio"..., iov_len=41}, {iov_base="\r\n", iov_len=2}], 2) = 43 <0.000034>
[pid    19] 00:12:44.243018 sendto(46, "2255\r\n", 6, 0, NULL, 0) = 6 <0.000054>
[pid    19] 00:12:44.243136 epoll_ctl(196, EPOLL_CTL_DEL, 46, NULL) = 0 <0.000019>
[pid    19] 00:12:44.243221 sendto(46, "# HELP fluentbit_filter_add_reco"..., 8789, 0, NULL, 0) = 8789 <0.000075>
[pid    19] 00:12:44.243357 sendto(46, "\r\n", 2, 0, NULL, 0) = 2 <0.000019>
[pid    19] 00:12:44.243430 epoll_ctl(196, EPOLL_CTL_ADD, 46, {EPOLLOUT|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=1644781568, u64=139820010135552}}) = 0 <0.000014>
[pid    19] 00:12:44.243504 epoll_wait(196, [{EPOLLOUT, {u32=1644781568, u64=139820010135552}}], 256, -1) = 1 <0.000016>
[pid    19] 00:12:44.243588 epoll_ctl(196, EPOLL_CTL_DEL, 46, NULL) = 0 <0.000045>
[pid    19] 00:12:44.243699 sendto(46, "0\r\n\r\n", 5, 0, NULL, 0) = 5 <0.000015>
[pid    19] 00:12:44.243811 epoll_ctl(196, EPOLL_CTL_ADD, 46, {EPOLLOUT|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=1644781568, u64=139820010135552}}) = 0 <0.000021>
[pid    19] 00:12:44.243883 epoll_wait(196, [{EPOLLOUT, {u32=1644781568, u64=139820010135552}}], 256, -1) = 1 <0.000029>
[pid    19] 00:12:44.243958 epoll_ctl(196, EPOLL_CTL_DEL, 46, NULL) = 0 <0.000015>
[pid    19] 00:12:44.244051 epoll_ctl(196, EPOLL_CTL_ADD, 46, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=1644781568, u64=139820010135552}}) = 0 <0.000015>
[pid    19] 00:12:44.244122 epoll_wait(196,  <unfinished ...>
[pid    18] 00:12:44.899854 <... clock_nanosleep resumed>0x7f2a635fb760) = 0 <1.000158>
[pid    18] 00:12:44.899931 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid     1] 00:12:45.065666 <... clock_nanosleep resumed>0x7fff3e6c4b10) = 0 <1.000178>
[pid     1] 00:12:45.065735 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid    16] 00:12:45.154860 <... epoll_wait resumed>[{EPOLLIN, {u32=1684037792, u64=139820049391776}}], 64, -1) = 1 <1.500826>
[pid    15] 00:12:45.154910 <... epoll_wait resumed>[{EPOLLIN, {u32=1688232096, u64=139820053586080}}], 64, -1) = 1 <1.501036>
[pid    13] 00:12:45.154936 <... epoll_wait resumed>[{EPOLLIN, {u32=1696620704, u64=139820061974688}}], 64, -1) = 1 <1.501033>
[pid    12] 00:12:45.154972 <... epoll_wait resumed>[{EPOLLIN, {u32=1742758048, u64=139820108112032}}], 64, -1) = 1 <1.500919>
[pid    11] 00:12:45.154999 <... epoll_wait resumed>[{EPOLLIN, {u32=1746952352, u64=139820112306336}}], 64, -1) = 1 <1.501068>
[pid    10] 00:12:45.155025 <... epoll_wait resumed>[{EPOLLIN, {u32=1751146656, u64=139820116500640}}], 64, -1) = 1 <1.500950>
[pid     9] 00:12:45.155050 <... epoll_wait resumed>[{EPOLLIN, {u32=1755340960, u64=139820120694944}}], 64, -1) = 1 <1.500989>
  2. Curl request to '/api/v1/storage'
{
  "storage_layer": {
    "chunks": {
      "total_chunks": 4,
      "mem_chunks": 4,
      "fs_chunks": 0,
      "fs_chunks_up": 0,
      "fs_chunks_down": 0
    }
  },
  "input_chunks": {
    "systemd.0": {
      "status": {
        "overlimit": false,
        "mem_size": "0b",
        "mem_limit": "0b"
      },
      "chunks": {
        "total": 0,
        "up": 0,
        "down": 0,
        "busy": 0,
        "busy_size": "0b"
      }
    },
    "tail.1": {
      "status": {
        "overlimit": false,
        "mem_size": "0b",
        "mem_limit": "0b"
      },
      "chunks": {
        "total": 0,
        "up": 0,
        "down": 0,
        "busy": 0,
        "busy_size": "0b"
      }
    },
    "tail.2": {
      "status": {
        "overlimit": false,
        "mem_size": "0b",
        "mem_limit": "0b"
      },
      "chunks": {
        "total": 0,
        "up": 0,
        "down": 0,
        "busy": 0,
        "busy_size": "0b"
      }
    },
    "storage_backlog.3": {
      "status": {
        "overlimit": false,
        "mem_size": "0b",
        "mem_limit": "0b"
      },
      "chunks": {
        "total": 0,
        "up": 0,
        "down": 0,
        "busy": 0,
        "busy_size": "0b"
      }
    },
    "emitter_for_rewrite_tag.4": {
      "status": {
        "overlimit": false,
        "mem_size": "0b",
        "mem_limit": "9.5M"
      },
      "chunks": {
        "total": 0,
        "up": 0,
        "down": 0,
        "busy": 0,
        "busy_size": "0b"
      }
    },
    "emitter_for_rewrite_tag.5": {
      "status": {
        "overlimit": false,
        "mem_size": "22.2K",
        "mem_limit": "9.5M"
      },
      "chunks": {
        "total": 4,
        "up": 4,
        "down": 0,
        "busy": 0,
        "busy_size": "0b"
      }
    },
    "emitter_for_rewrite_tag.6": {
      "status": {
        "overlimit": false,
        "mem_size": "0b",
        "mem_limit": "9.5M"
      },
      "chunks": {
        "total": 0,
        "up": 0,
        "down": 0,
        "busy": 0,
        "busy_size": "0b"
      }
    }
  }
}
  3. Pod describe (current state)
    State:          Running
    Started:        Mon, 30 Dec 2024 03:39:27 +0300
    Ready:          True
    Restart Count:  2
    Limits:
        cpu:        200m
        memory:     128Mi
    Requests:
        cpu:        200m
        memory:     128Mi
    Conditions:
        Type                        Status
        PodReadyToStartContainers   True
        Initialized                 True
        Ready                       True
        ContainersReady             True
        PodScheduled                True
  4. Last logs (stuck about a month ago)
[2025/01/09 15:24:27] [ info] [input:tail:tail.1] inotify_fs_add(): inode=337644498 watch_fd=4787 name=/var/log/containers/xxx-5784f8cbcd-2txq5_xxx_migrations-4b50a78e51a4e07aa11245e624fc4e67bfd0d4b34edb965388eb0eaf05c2dca5.log
[2025/01/09 15:24:28] [ warn] [engine] failed to flush chunk '1-1736436267.677796148.flb', retry in 9 seconds: task_id=7, input=emitter_for_rewrite_tag.6 > output=es.3 (out_id=3)
[2025/01/09 15:24:37] [error] [engine] chunk '1-1736436267.677796148.flb' cannot be retried: task_id=7, input=emitter_for_rewrite_tag.6 > output=es.3
[2025/01/09 15:25:26] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=433188932 watch_fd=4506
[2025/01/09 15:25:27] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=370864192 watch_fd=4775
[2025/01/09 15:25:28] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=332235753 watch_fd=4509
[2025/01/09 15:25:29] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=296558453 watch_fd=4756
[2025/01/09 15:25:29] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=449752668 watch_fd=4500
[2025/01/09 15:25:29] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=386694775 watch_fd=4748
[2025/01/09 15:25:29] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=72887180 watch_fd=4542
[2025/01/09 15:25:29] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=5697947 watch_fd=4383
[2025/01/09 15:27:27] [ info] [filter:kubernetes:kubernetes.2]  token updated
[2025/01/09 15:27:27] [ info] [input:tail:tail.1] inotify_fs_add(): inode=391506156 watch_fd=4788 name=/var/log/containers/xxx-b57647789-gfbvk_xxx-static-dc5b651c143f9e36af9c19da6cde65e55d252417df909c551aad241f4aef5a64.log
[2025/01/09 15:27:27] [ info] [input:tail:tail.1] inotify_fs_add(): inode=223083475 watch_fd=4789 name=/var/log/containers/xxx-7b8c7bd55f-bn7ch_xxx-server-6520de96062c8f54116157f179ab0c8460c31ac6da042a33fd5f0cb2c59b57f7.log
[2025/01/09 15:27:27] [ info] [input:tail:tail.1] inotify_fs_add(): inode=400043213 watch_fd=4790 name=/var/log/containers/xxx-bob-7b8c7bd55f-bn7ch_xxx-xstatic-2a429bca73887c93233634ca80462272fc2669bb6cdf95da84c687d6dbab4ff9.log
[2025/01/09 15:27:27] [ info] [input:tail:tail.1] inotify_fs_add(): inode=244744155 watch_fd=4791 name=/var/log/containers/xxx-bob-7b8c7bd55f-rqxh2_xxx-xnews-server-5c8fc7ae42b6924bb0c2bfe9ed9ea4c4f82d46907696bfd9262f99f042b0f0bf.log
[2025/01/09 15:27:27] [ info] [input:tail:tail.1] inotify_fs_add(): inode=386711596 watch_fd=4792 name=/var/log/containers/xxx-bob-7b8c7bd55f-rqxh2_xxx-xstatic-e0ba58f5aca4b17063df13b4efd2018ecad2f896fdda82a612fb5875b6f01b22.log
[2025/01/09 15:27:27] [ info] [input:tail:tail.1] inotify_fs_add(): inode=213421478 watch_fd=4793 name=/var/log/containers/xxx-bob-b57647789-gfbvk_xxx-xserver-1c25bac499bfecc754d112857ec94805ced64f35db604af02c1d6a8c5c9b6423.log
[2025/01/09 15:27:28] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=391506156 watch_fd=4788
[2025/01/09 15:27:28] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=213421478 watch_fd=4793

@AzureLeMoon

These seem similar to #9903 as well. Could you also check the CPU usage when fluent-bit hangs?

@nikitamishagin
Author

@AzureLeMoon, OK. I will record the CPU usage and send it when the problem recurs.

@nikitamishagin
Author

@pawel-lmcb, there is nothing unusual in the logs. They look something like this:

[2025/02/10 07:28:51] [ warn] [engine] failed to flush chunk '1-1739172530.835782756.flb', retry in 7 seconds: task_id=3, input=emitter_for_rewrite_tag.5 > output=es.2 (out_id=2)
[2025/02/10 07:28:58] [ info] [engine] flush chunk '1-1739172530.835782756.flb' succeeded at retry 1: task_id=3, input=emitter_for_rewrite_tag.5 > output=es.2 (out_id=2)
[2025/02/10 08:00:46] [ warn] [engine] failed to flush chunk '1-1739174446.210853217.flb', retry in 8 seconds: task_id=0, input=emitter_for_rewrite_tag.5 > output=es.2 (out_id=2)
[2025/02/10 08:00:54] [ info] [engine] flush chunk '1-1739174446.210853217.flb' succeeded at retry 1: task_id=0, input=emitter_for_rewrite_tag.5 > output=es.2 (out_id=2)
[2025/02/10 08:04:53] [ warn] [engine] failed to flush chunk '1-1739174693.685714426.flb', retry in 10 seconds: task_id=3, input=emitter_for_rewrite_tag.5 > output=es.2 (out_id=2)
[2025/02/10 08:05:03] [ info] [engine] flush chunk '1-1739174693.685714426.flb' succeeded at retry 1: task_id=3, input=emitter_for_rewrite_tag.5 > output=es.2 (out_id=2)

And when the problem occurs, the logging just stops.

@Snak3GMS

Snak3GMS commented Feb 10, 2025

@AzureLeMoon
Below is the CPU usage from the same host where the fluent-bit pod logs were taken. The fluent-bit pod on this host has been stuck since January 9th.

[xxx@xxx containers]$ mpstat 1 5
Linux 6.10.10-1.el9.elrepo.x86_64 (xxx) 	02/10/2025 	_x86_64_	(16 CPU)

12:02:14 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
12:02:15 PM  all    8.98    0.00    1.32    0.00    0.00    0.06    0.00    0.00    0.00   89.64
12:02:16 PM  all    7.76    0.00    0.94    0.00    0.00    0.25    0.00    0.00    0.00   91.05
12:02:17 PM  all    5.47    0.00    0.82    0.00    0.00    0.06    0.00    0.00    0.00   93.65
12:02:18 PM  all    6.03    0.00    1.38    0.00    0.00    0.06    0.00    0.00    0.00   92.52
12:02:19 PM  all    7.83    0.00    2.19    0.00    0.00    0.06    0.00    0.00    0.00   89.91
Average:     all    7.22    0.00    1.33    0.00    0.00    0.10    0.00    0.00    0.00   91.35

kubectl top pod fluent-bit-5gx94 --namespace kube-system
NAME               CPU(cores)   MEMORY(bytes)
fluent-bit-5gx94   1m           106Mi
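
If useful for the next occurrence, a possible way to see where the threads are blocked, run directly on the node. This is only a sketch: it assumes gdb is installed on the host and that a single fluent-bit process is visible there.

# Hypothetical debugging sketch: dump a backtrace of every thread of the stuck
# fluent-bit process so the blocked call site becomes visible.
PID=$(pidof fluent-bit)   # assumes exactly one fluent-bit process on the node
gdb -p "$PID" -batch -ex "thread apply all bt" > "fluent-bit-${PID}-backtrace.txt"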

@uristernik
Contributor

Bumping this one as well, it looks similar:
#9557

@AzureLeMoon

> Bumping this one as well, it looks similar:
> #9557

This appears to be the exact problem I have, as described in #9903.
