"[ Error] timerfd: Too many open files, errno=24 at /tmp/fluent-bit/lib/monkey/mk_core/mk_event_epoll.c:221" #9966

Open
duj4 opened this issue Feb 20, 2025 · 3 comments


duj4 commented Feb 20, 2025

Bug Report

Describe the bug
We are trying to understand how much data Fluent Bit can store when the downstream (Loki) is out of service. Below is our configuration:

service:
  storage.metrics: on
  storage.sync: normal
  storage.checksum: on
  storage.path: <path_to_data>
  storage.max_chunks_up: 256
  storage.backlog.mem_limit: 1G
  storage.delete_irrecoverable_chunks: on
  scheduler.base: 2
  scheduler.cap: 30

pipeline:
  inputs:
    - name: tail
      path: <path_to_file1>
      ...
      storage.type: filesystem
    - name: tail
      path: <path_to_file2>
      ...
      storage.type: filesystem

  outputs:
    - name: loki
      ...
      storage.total_limit_size: 5G
      retry_limit: no_limits

Then the logs below started to appear:

[log] error opening log file /d/d1/monitoring/fluent-bit/log/fluent-bit.log. Using stderr.
[2025/02/20 17:37:10] [  Error] timerfd: Too many open files, errno=24 at /tmp/fluent-bit/lib/monkey/mk_core/mk_event_epoll.c:221
[2025/02/20 17:37:10] [error] [sched] a 'retry request' could not be scheduled. the system might be running out of memory or file descriptors. The scheduler will do a retry later.
[2025/02/20 17:37:10] [  Error] timerfd: Too many open files, errno=24 at /tmp/fluent-bit/lib/monkey/mk_core/mk_event_epoll.c:221
[log] error opening log file /d/d1/monitoring/fluent-bit/log/fluent-bit.log. Using stderr.
[2025/02/20 17:37:10] [error] [sched] a 'retry request' could not be scheduled. the system might be running out of memory or file descriptors. The scheduler will do a retry later.
[log] error opening log file /d/d1/monitoring/fluent-bit/log/fluent-bit.log. Using stderr.
[2025/02/20 17:37:10] [error] [sched] a 'retry request' could not be scheduled. the system might be running out of memory or file descriptors. The scheduler will do a retry later.
[log] error opening log file /d/d1/monitoring/fluent-bit/log/fluent-bit.log. Using stderr.
[2025/02/20 17:37:10] [error] [storage] [cio file] cannot open chunk: tail.0/2852447-1740040285.907365846.flb
[log] error opening log file /d/d1/monitoring/fluent-bit/log/fluent-bit.log. Using stderr.
[2025/02/20 17:37:10] [  Error] timerfd: Too many open files, errno=24 at /tmp/fluent-bit/lib/monkey/mk_core/mk_event_epoll.c:221
[2025/02/20 17:37:10] [  Error] timerfd: Too many open files, errno=24 at /tmp/fluent-bit/lib/monkey/mk_core/mk_event_epoll.c:221
[2025/02/20 17:37:10] [  Error] timerfd: Too many open files, errno=24 at /tmp/fluent-bit/lib/monkey/mk_core/mk_event_epoll.c:221
[log] error opening log file /d/d1/monitoring/fluent-bit/log/fluent-bit.log. Using stderr.
[2025/02/20 17:37:10] [  Error] timerfd: Too many open files, errno=24 at /tmp/fluent-bit/lib/monkey/mk_core/mk_event_epoll.c:221
[log] error opening log file /d/d1/monitoring/fluent-bit/log/fluent-bit.log. Using stderr.
[log] error opening log file /d/d1/monitoring/fluent-bit/log/fluent-bit.log. Using stderr.
[log] error opening log file /d/d1/monitoring/fluent-bit/log/fluent-bit.log. Using stderr.
[log] error opening log file /d/d1/monitoring/fluent-bit/log/fluent-bit.log. Using stderr.
[2025/02/20 17:37:12] [  Error] timerfd: Too many open files, errno=24 at /tmp/fluent-bit/lib/monkey/mk_core/mk_event_epoll.c:221
[log] error opening log file /d/d1/monitoring/fluent-bit/log/fluent-bit.log. Using stderr.
[log] error opening log file /d/d1/monitoring/fluent-bit/log/fluent-bit.log. Using stderr.
[log] error opening log file /d/d1/monitoring/fluent-bit/log/fluent-bit.log. Using stderr.

Output of storage api:

$ curl -s http://127.0.0.1:9108/api/v1/storage | jq
{
  "storage_layer": {
    "chunks": {
      "total_chunks": 2050,
      "mem_chunks": 0,
      "fs_chunks": 2050,
      "fs_chunks_up": 54,
      "fs_chunks_down": 1996
    }
  },
  "input_chunks": {
    "tail.0": {
      "status": {
        "overlimit": false,
        "mem_size": "814b",
        "mem_limit": "0b"
      },
      "chunks": {
        "total": 113,
        "up": 6,
        "down": 107,
        "busy": 112,
        "busy_size": "1.2K"
      }
    },
    "tail.1": {
      "status": {
        "overlimit": false,
        "mem_size": "3.3K",
        "mem_limit": "190.7M"
      },
      "chunks": {
        "total": 1937,
        "up": 48,
        "down": 1889,
        "busy": 1936,
        "busy_size": "51.1K"
      }
    },
    "storage_backlog.2": {
      "status": {
        "overlimit": false,
        "mem_size": "0b",
        "mem_limit": "0b"
      },
      "chunks": {
        "total": 0,
        "up": 0,
        "down": 0,
        "busy": 0,
        "busy_size": "0b"
      }
    }
  }
}

Below is the relevant kernel and process-limit configuration:

/proc/sys/fs/inotify/max_user_watches 181268
/proc/sys/fs/inotify/max_user_instances 128
LimitNOFILE=262144
LimitNOFILESoft=1024
ulimit -n -> 1024
ulimit -u -> 95148
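
For reference, here is a minimal standalone C sketch (not part of Fluent Bit, just an illustration) that prints the soft and hard open-file limits the process actually runs with; the soft limit (1024 here) is the one that open(), pipe(), and timerfd_create() calls are bounded by:

/* nofile_check.c - print the process's RLIMIT_NOFILE soft and hard limits.
 * Standalone illustration only; build with: cc nofile_check.c -o nofile_check */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }

    /* rlim_cur is the soft limit (what ulimit -n reports and what EMFILE,
     * errno=24, is raised against); rlim_max is the hard limit. */
    printf("soft limit: %llu\n", (unsigned long long) rl.rlim_cur);
    printf("hard limit: %llu\n", (unsigned long long) rl.rlim_max);
    return 0;
}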

Per the suggestions in the links below, it seems I need to tune the kernel configuration mentioned above:
#1777 (comment)
#9151 (comment)

Your Environment

  • Version used: 3.2.4
  • Environment name and version (e.g. Kubernetes? What version?): Linux
  • Server type and version: Linux
  • Operating System and version: RHEL 8.0

And here are my questions:

  1. I also noticed that the task_id never goes higher than 2047. Is this by design? And is there any mapping between a task_id and the chunk files stored in the filesystem?
  2. There are hundreds of Linux servers in our environment, so adjusting the kernel/ulimit settings on each one is not realistic. Could Fluent Bit address this issue itself?
nuclearpidgeon (Contributor) commented Feb 20, 2025

I ran into this recently. I think on RHEL8, services default to running with a max-open-files ulimit of 1024. I had to add LimitNOFILE=12345 (or whatever big number) to the systemd unit file to get around it.

Fluent Bit doesn't have any code to try to change its own (soft) limit of open files - it just crashes out if open() syscalls etc. fail. So the LimitNOFILESoft=1024 setting might be causing your problems.

Each input plugin spins up one or more POSIX pipes for internal signalling, and each pipe consumes 2 file descriptors (one for the read end, one for the write end). epoll and inotify instances are created too, which all adds up to consuming more file descriptors than you might first think.
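
For illustration only (to be clear, this is not something Fluent Bit currently does): a process can raise its own soft limit up to the hard limit with setrlimit(), roughly like this sketch:

/* raise_nofile.c - sketch of raising the soft RLIMIT_NOFILE up to the hard
 * limit. Illustrative only; Fluent Bit has no such code today and relies on
 * whatever limits the service manager hands it. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }

    rl.rlim_cur = rl.rlim_max;  /* soft limit may be raised up to the hard limit */
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }

    printf("soft open-file limit is now %llu\n",
           (unsigned long long) rl.rlim_cur);
    return 0;
}

Until something like that exists, the limit has to be raised from outside (systemd LimitNOFILE=, ulimit, etc.).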


pkqsun commented Feb 21, 2025

For the same scenario in my test case, adding LimitNOFILE=65536 to the systemd fluent-bit.service file works well: no "Too many open files" errors were reported after more than 2 hours of buffering data without any output being delivered.

[Service]
Type=xxxx
ExecStart=xxxx
WorkingDirectory=xxxx
Restart=xxxx
LimitNOFILE=65536

However, I still have some confusion regarding the case where the output (Loki) is down but Fluent Bit keeps running and buffering data:

1> I noticed that the maximum task_id for Fluent Bit is still 2048 (0-2047). Even though there is no "Too many open files" error this time, I am not sure whether this value is fixed, as the result of "ulimit -n" for a regular user is still 1024.

2> With the memory + filesystem storage strategy, the number of local chunk files tops out at 2050 (total_chunks), which matches what I can list in the data/tail.xx/ folders.

$ curl -s http://127.0.0.1:2020/api/v1/storage | jq | head
{
  "storage_layer": {
    "chunks": {
      "total_chunks": 2050,
      "mem_chunks": 0,
      "fs_chunks": 2050,
      "fs_chunks_up": N,  # my max_chunks_up is set to 256
      "fs_chunks_down": 2050 - N
    }
  },

Basically, I assumed one task_id deals with one fs_chunk for the flush action (I set retry_limit to no_limits), but the two numbers differ. Does the number of total_chunks have some relationship with task_id or "LimitNOFILE"?
Could someone give me a brief explanation?

3> Still in the case above, once total_chunks reaches 2050, only the numbers of up and down fs_chunks change, while their sum stays at 2050.
From that point, I guess new logs stop being buffered, as I noticed these file names (format: <fluentbit_pid>-<epoch_time>.flb) no longer change. Am I right?
If so, how can log loss be avoided in this case, and can the number of total_chunks be set via a specific parameter?

Any comments would be appreciated.

duj4 (Author) commented Feb 21, 2025

(quoting @pkqsun's comment above in full)

Hi @nuclearpidgeon, after trying the suggestion you offered, FLB is working as expected for now, but it raises several other questions, as my colleague mentioned above. It would be much appreciated if you could help look into these at your convenience.
