Pending tasks fails to flush during hot reload when using external go output plugins #9733

imankurpatel000 · 2024-12-16T16:09:18Z

Bug Report

Describe the bug
Pending tasks fail to flush during hot reload when using external Go output plugins. If any chunks are still pending when a hot reload starts, those chunks fail to flush whenever a Go output plugin is in use. The error message is:

[2024/12/16 12:15:29] [ warn] [engine] failed to flush chunk '1-1734351323.147762458.flb', retry in 1 seconds: task_id=0, input=fluent-tail-input > output=new_alias_3

Impact of the issue

When using external Go output plugins, any chunks still pending at the time of a hot reload fail repeatedly until their retries are exhausted, at which point they are dropped. Ultimately, all of those pending chunks are lost.
Because Fluent Bit keeps retrying and keeps failing to flush these chunks, the hot reload is delayed by several minutes, depending on how many pending chunks were waiting to be flushed and on the value of Retry_Limit. This in turn can delay log processing or even lose logs, because the new config is not loaded for several minutes.
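
To give a rough sense of how the delay grows with Retry_Limit, here is a minimal sketch that sums the waits for a single chunk under a simplified backoff model. The model (the wait doubles on each retry, starting from a base value and never exceeding a cap, with no jitter) and the names `worst_case_retry_delay`, `base_seconds`, and `cap_seconds` are assumptions for illustration only. This is not Fluent Bit's scheduler code, which also randomizes each wait.

```c
#include <stdint.h>

/*
 * Hypothetical helper: total seconds spent waiting before one chunk
 * exhausts its retries, under a simplified model where the first wait
 * is base_seconds, each later wait doubles, and no single wait exceeds
 * cap_seconds. Jitter is deliberately ignored.
 */
uint64_t worst_case_retry_delay(unsigned retry_limit,
                                uint64_t base_seconds,
                                uint64_t cap_seconds)
{
    uint64_t total = 0;
    uint64_t wait = base_seconds;

    for (unsigned attempt = 0; attempt < retry_limit; attempt++) {
        if (wait > cap_seconds) {
            wait = cap_seconds;
        }
        total += wait;

        /* Double the wait for the next attempt, clamped to the cap
         * (written this way to avoid overflowing on multiplication). */
        wait = (wait > cap_seconds / 2) ? cap_seconds : wait * 2;
    }
    return total;
}
```

Under this model, a retry limit of 3 with a 1 second base gives 1 + 2 + 4 = 7 seconds of waiting for one chunk before it is dropped. The real figure depends on the scheduler settings and jitter in use.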

To Reproduce

Steps to reproduce the problem:
- I prepared a repository with all the required files and steps to reproduce the problem. Clone https://github.com/imankurpatel000/fluent-bit-hot-reload-issue/ and follow the steps in the README. It only requires running a few commands to replicate the issue. Let me know if you need further help replicating it.

Expected behavior
When Fluent Bit is hot reloaded, it should allow Go plugins to flush the pending chunks without any errors.

Your Environment

Version used: 3.2.2
Configuration: All the details are provided in https://github.com/imankurpatel000/fluent-bit-hot-reload-issue/
Environment name and version (e.g. Kubernetes? What version?):
Operating System and version: Apple M3 Pro (macOS 15.0)
Filters and plugins: All the details are provided in https://github.com/imankurpatel000/fluent-bit-hot-reload-issue/

Additional context
This issue is caused by the changes made in #7997, specifically commit 25b470d, which starts returning FLB_RETRY during a hot reload and does not actually call the plugin to flush the remaining chunks. Because it returns FLB_RETRY every time, Fluent Bit keeps retrying all the chunks with exponential backoff, which delays the hot reload and also loses the pending chunks. The comment in this code says:

    /* To prevent flush callback executions, we need to check the
     * status of hot-reloading. The actual problem is: we don't have
     * pause procedure/mechanism for output plugin. For now, we just halt the
     * flush callback here during hot-reloading is in progress. */

So maybe there is some reason behind it, but I can't see what it is. Fluent Bit already pauses all the inputs, so no new logs are being ingested anyway. Why not let the output plugin flush the pending chunks and then continue with the actual reload process? I also don't understand why this was added for external Go plugins but not for internal output plugins, which continue to flush pending chunks.

I tested removing this code, and after that I no longer see any problem with Go output plugins during hot reload. So I am going to raise a PR to remove it. Please let me know if this code is actually required and, if so, how else we can solve this problem.
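
The behavior described above can be sketched roughly as follows. This is a simplified stand-in, not the actual code touched by commit 25b470d: `struct proxy_ctx`, `proxy_flush`, `guard_enabled`, and `always_ok_plugin` are hypothetical names, and the `FLB_*` result codes are defined locally so the sketch compiles on its own.

```c
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins for the flush result codes named in this issue. */
enum flush_result { FLB_ERROR = 0, FLB_OK = 1, FLB_RETRY = 2 };

/* Hypothetical signature for the external plugin's flush callback. */
typedef int (*plugin_flush_fn)(const void *data, size_t size);

/* Hypothetical proxy state, reduced to what the sketch needs. */
struct proxy_ctx {
    bool hot_reloading;           /* a hot reload is in progress */
    bool guard_enabled;           /* true models the guard described in this issue */
    plugin_flush_fn plugin_flush; /* the external plugin's flush callback */
    int plugin_calls;             /* how many times the plugin was really called */
};

/* With the guard on, every flush during a hot reload returns FLB_RETRY
 * without reaching the plugin, so pending chunks can never drain.
 * With the guard off, the plugin is called and can flush them. */
int proxy_flush(struct proxy_ctx *ctx, const void *data, size_t size)
{
    if (ctx->guard_enabled && ctx->hot_reloading) {
        return FLB_RETRY;
    }
    ctx->plugin_calls++;
    return ctx->plugin_flush(data, size);
}

/* Hypothetical plugin callback that always succeeds. */
int always_ok_plugin(const void *data, size_t size)
{
    (void) data;
    (void) size;
    return FLB_OK;
}
```

In this model, calling `proxy_flush` while `hot_reloading` and `guard_enabled` are both true never increments `plugin_calls`, which matches the endless retries seen in the log. Clearing `guard_enabled` lets the same call reach the plugin and return its result.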

imankurpatel000 added the status: waiting-for-triage label Dec 16, 2024

imankurpatel000 mentioned this issue Dec 16, 2024

plugin_proxy: Allow to execute flush callback on Golang side during hot-reloading #9734

Merged

3 tasks

niedbalski closed this as completed in #9734 Dec 19, 2024
