Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework fluent-bit-watcher and make use of the hot-reload mechanism #1051

Merged
merged 4 commits into from
Feb 18, 2024

Conversation

markusthoemmes
Copy link
Collaborator

@markusthoemmes markusthoemmes commented Jan 30, 2024

Note: I recognize that I haven't proposed this change per-se and that it's not necessarily necessary to do this rework in order to implement hot-reload. I'm happy to change things around and I don't want to come across as "here's a rework, plz merge" 😅 . This is more of a discussion-base for now.

What this PR does / why we need it:

This reworks the fluent-bit-watcher to be "single-child-process" only. Since we can now use hot-reload, we don't need to account for potential fluent-bit restarts "under the hood" and can instead rely on a single process to live "forever". Whenever that one fluent-bit process exits, the watcher should exit as well and Kubernetes' CrashLoopBackoff take care of the rest.

That fundamental change in assumption allows us to vastly simplify the lifecycle in the watcher, which this does. It gets rid of all, sometimes unsafely shared, globals in favor of a single process being started, watched, kicked and eventually stopped.

SIGTERM is still forwarded to the child process. There is no need for a SIGKILL timeout anymore since Kubernetes will eventually send a SIGKILL itself if the process isn't exiting as expected.

Does this PR introduced a user-facing change?

The fluent-bit config is now reloaded through the hot-reload mechanism and the underlying process is no longer restarted to force a config reload.

ACTION REQUIRED: Due to the usage of the hot-reload feature, this version is no longer suitable for fluent-bit versions below 2.1.
ACTION REQUIRED: Due to the usage of the hot-reload feature, the `--exit-on-failure` and `--flb-timeout` flags have become deprecated and have no effect. `--exit-on-failure` is no longer necessary as the fluent-bit pod will exit whenever fluent-bit exits (error or not) and `--flb-timeout` can be recreated by adding a `terminationGracePeriod` to the `FluentBit` resource.

Additional documentation, usage docs, etc.:


@markusthoemmes markusthoemmes force-pushed the rework-fluent-bit-watcher branch from 963f00e to 4c1fb3a Compare January 30, 2024 15:48
This reworks the fluent-bit-watcher to be "single-child-process" only. Since we can now use hot-reload, we don't need to account for potential fluent-bit restarts "under the hood" and can instead rely on a single process to live "forever". Whenever that one fluent-bit process exits, the watcher should exit as well and Kubernetes' `CrashLookBackoff` take care of the rest.

That fundamental change in assumption allows us to vastly simplify the lifecycle in the watcher, which this does. It gets rid of all, sometimes unsafely shared, globals in favor of a single process being started, watched, kicked and eventually stopped.

SIGTERM is still forwarded to the child process. There is no need for a SIGKILL timeout anymore since Kubernetes will eventually send a SIGKILL itself if the process isn't exiting as expected.

Signed-off-by: Markus Thömmes <[email protected]>
@benjaminhuo
Copy link
Member

@markusthoemmes Thanks very much for this change!
@wanjunlei @wenchajun Please help to review this

wenchajun
wenchajun previously approved these changes Jan 31, 2024
@wenchajun
Copy link
Member

Here is a note: Due to the use of the hot-reload feature in the new version of Fluent Bit, the Fluent Operator image created based on this is not suitable for image versions below Fluent Bit v2.1.

@markusthoemmes
Copy link
Collaborator Author

Quick check wrt. the --exit-on-failure and --flb-timeout flags: I've just dropped them in the current state. I wonder though if we should keep them for backwards compatibility and log a deprecation?

@wenchajun
Copy link
Member

I think we can retain them and leave documentation notes. Of course, this is a point worth discussing, as both options are feasible.

@markusthoemmes
Copy link
Collaborator Author

Hmm. In terms of feasibility, I think --exit-on-failure is moot since we're exiting regardless (the lifecycle has changed). --flb-timeout actually becomes terminationGracePeriod on the respective DaemonSet/StatefulSet, I think. That'll instruct Kubernetes to send a SIGKILL after the respective time, which makes it behave exactly the same. I'll have another look.

@markusthoemmes
Copy link
Collaborator Author

I've reflected my thoughts on how to deal with the flags in the current state.

I'm dealing with a bit of an issue of the reload though due to the parsers.conf file. I'll have to go check if that's due to my very old fluent-operator version. I'll report back.

Signed-off-by: Markus Thömmes <[email protected]>
@markusthoemmes markusthoemmes force-pushed the rework-fluent-bit-watcher branch from d0b52b9 to 468acdc Compare January 31, 2024 11:16
@benjaminhuo
Copy link
Member

@markusthoemmes Thanks for all the thoughts.
@wanjunlei what do you think?

@benjaminhuo
Copy link
Member

benjaminhuo commented Feb 18, 2024

@markusthoemmes Thanks for this big change. We're on the way to implement a new fluentbit/fluentd config generation mechanism by watching CRDs in the watcher instead of putting the configs in a secret, which is more scalable.

I'm going to invite you as fluent-operator maintainer @markusthoemmes
cc @wanjunlei @wenchajun @Gentleelephant @patrick-stephens

@benjaminhuo benjaminhuo merged commit 2f254cc into fluent:master Feb 18, 2024
6 checks passed
@benjaminhuo
Copy link
Member

I'm going to invite you as fluent-operator maintainer @markusthoemmes

@patrick-stephens @agup006 would you help to invite @markusthoemmes as a member of the fluent org as well?

@markusthoemmes
Copy link
Collaborator Author

Thanks for the invite. Do note that this change has a bug in that config reloads can be swallowed if they happen in too quick succession. I‘ve opened a bug against fluent-bit wrt that, see fluent/fluent-bit#8457

There‘s a workaround by fetching the hot reload count before the kick and then waiting for the count to converge afterwards but that ain‘t implemented in this change. I don‘t know when and if I‘ll find the time to contribute that.

@benjaminhuo
Copy link
Member

benjaminhuo commented Feb 18, 2024

Thanks for the invite. Do note that this change has a bug in that config reloads can be swallowed if they happen in too quick succession. I‘ve opened a bug against fluent-bit wrt that, see fluent/fluent-bit#8457

There‘s a workaround by fetching the hot reload count before the kick and then waiting for the count to converge afterwards but that ain‘t implemented in this change. I don‘t know when and if I‘ll find the time to contribute that.

Looks like we need to wait for this fluentbit PR to be merged: fluent/fluent-bit#8461
And when a new fluentbit release is out after this PR, we can add a retry mechanism to the reload?

@benjaminhuo
Copy link
Member

Thanks for the invite. Do note that this change has a bug in that config reloads can be swallowed if they happen in too quick succession. I‘ve opened a bug against fluent-bit wrt that, see fluent/fluent-bit#8457

There‘s a workaround by fetching the hot reload count before the kick and then waiting for the count to converge afterwards but that ain‘t implemented in this change. I don‘t know when and if I‘ll find the time to contribute that.

@patrick-stephens @agup006 now fluentbit watcher use fluentbit's hot reload feature instead of killing the fluentbit process, but looks like the hot reload has some issues, would you please take a look?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants