Pending tasks fail to flush during hot reload when using external Go output plugins #9733

Closed
imankurpatel000 opened this issue Dec 16, 2024 · 0 comments · Fixed by #9734

Contributor

imankurpatel000 commented Dec 16, 2024

Bug Report

Describe the bug
Pending tasks fail to flush during a hot reload when external Go output plugins are in use. If any chunks are waiting to be flushed when a hot reload starts, those chunks fail to flush if the output is a Go plugin. The error message is:

[2024/12/16 12:15:29] [ warn] [engine] failed to flush chunk '1-1734351323.147762458.flb', retry in 1 seconds: task_id=0, input=fluent-tail-input > output=new_alias_3

Impact of the issue

  • With external Go output plugins, any chunks still pending when a hot reload starts fail on every flush attempt until the retries are exhausted. The chunks are then dropped, so all pending chunks are ultimately lost.
  • Because Fluent Bit keeps retrying and keeps failing to flush these chunks, the hot reload is delayed by several minutes, depending on how many chunks were pending and on the value of Retry_Limit. This can in turn delay log processing, or even lose logs, because the new configuration is not loaded for several minutes.
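To give a feel for how the delay grows with Retry_Limit, here is a rough back-of-the-envelope sketch. It assumes a simplified backoff model (the wait starts at a base value, doubles after each failed attempt, and is capped). The function name and parameters are hypothetical, and Fluent Bit's real scheduler may use different parameters and add jitter, so treat the numbers as illustrative only.

```c
#include <assert.h>

/*
 * Hypothetical model, NOT Fluent Bit's actual scheduler:
 * the wait starts at base_s seconds, doubles after every failed
 * attempt, and is capped at cap_s seconds. Returns the total time
 * spent waiting for a single chunk that fails every retry.
 */
static long worst_case_delay_s(int retry_limit, long base_s, long cap_s)
{
    long total = 0;
    long wait = base_s;

    assert(retry_limit >= 0 && base_s > 0 && cap_s >= base_s);

    for (int attempt = 0; attempt < retry_limit; attempt++) {
        total += wait;                                  /* wait before this retry */
        wait = (wait * 2 > cap_s) ? cap_s : wait * 2;   /* double, then cap */
    }
    return total;
}
```

Under these assumptions, a chunk with 10 retries, a 1-second base and a 60-second cap would hold up the reload for about five minutes (303 seconds) before being dropped.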

To Reproduce

  • Steps to reproduce the problem:
    • I prepared a repository with all the required files and steps to reproduce the problem. Please clone https://github.com/imankurpatel000/fluent-bit-hot-reload-issue/ and follow the steps in the readme. It only requires running a few commands to replicate the issue. Let me know if you need further help reproducing it.

Expected behavior
When Fluent Bit is hot reloaded, it should allow Go plugins to flush the pending chunks without any errors.

Your Environment

Additional context
This issue is caused by the changes made in #7997, specifically commit 25b470d, which starts returning FLB_RETRY during a hot reload and does not actually call the plugin to flush the remaining chunks. Because it returns FLB_RETRY every time, Fluent Bit keeps retrying all the chunks with exponential backoff, which delays the hot reload and eventually drops the pending chunks. The comment in this code says:

    /* To prevent flush callback executions, we need to check the
     * status of hot-reloading. The actual problem is: we don't have
     * pause procedure/mechanism for output plugin. For now, we just halt the
     * flush callback here during hot-reloading is in progress. */

So there may be a reason behind it, but I can't see what it is. Fluent Bit already pauses all the inputs during a hot reload, so no new logs are being ingested. Why not let the output plugin flush the pending chunks and then continue with the actual reload? I also don't understand why this check was added for external Go plugins but not for internal output plugins, which continue to flush pending chunks during a reload.
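The behavior described above can be modeled with a small self-contained sketch. This is not the Fluent Bit source: the struct, function names, and return codes below are hypothetical stand-ins, and the retry loop assumes one initial attempt plus Retry_Limit retries. It only illustrates why a guard that always returns a retry status during a reload means the plugin is never called and the chunk is eventually dropped.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the engine's return codes (values are arbitrary). */
enum { SKETCH_OK, SKETCH_RETRY };

/* Hypothetical state for this sketch only. */
struct sketch_ctx {
    bool hot_reloading;   /* a hot reload is in progress */
    bool guard_enabled;   /* the "halt flush during reload" check is present */
    int  plugin_calls;    /* how often the plugin's flush was really invoked */
};

/* Models the flush path: with the guard, the plugin is skipped on reload. */
static int sketch_flush(struct sketch_ctx *ctx)
{
    if (ctx->guard_enabled && ctx->hot_reloading) {
        return SKETCH_RETRY;          /* plugin is never called */
    }
    ctx->plugin_calls++;              /* plugin flushes the chunk */
    return SKETCH_OK;
}

/* Returns true if the chunk was delivered, false if it was dropped. */
static bool sketch_deliver(struct sketch_ctx *ctx, int retry_limit, int *attempts)
{
    *attempts = 0;
    /* Assumption: one initial attempt plus retry_limit retries. */
    for (int i = 0; i <= retry_limit; i++) {
        (*attempts)++;
        if (sketch_flush(ctx) == SKETCH_OK) {
            return true;
        }
    }
    return false;                     /* retries exhausted: chunk is dropped */
}
```

In this model, with the guard enabled and a reload in progress, every attempt is consumed without the plugin being called once. With the guard removed, the first attempt succeeds.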

I have tested removing this code, and with it removed I don't see any problems with Go output plugins during hot reloading, so I am going to raise a PR to remove it. If this code is actually required, please let me know, along with how else this problem could be solved.
