-
Notifications
You must be signed in to change notification settings - Fork 252
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework fluent-bit-watcher and make use of the hot-reload mechanism #1051
Rework fluent-bit-watcher and make use of the hot-reload mechanism #1051
Conversation
963f00e
to
4c1fb3a
Compare
This reworks the fluent-bit-watcher to be "single-child-process" only. Since we can now use hot-reload, we don't need to account for potential fluent-bit restarts "under the hood" and can instead rely on a single process to live "forever". Whenever that one fluent-bit process exits, the watcher should exit as well and Kubernetes' `CrashLookBackoff` take care of the rest. That fundamental change in assumption allows us to vastly simplify the lifecycle in the watcher, which this does. It gets rid of all, sometimes unsafely shared, globals in favor of a single process being started, watched, kicked and eventually stopped. SIGTERM is still forwarded to the child process. There is no need for a SIGKILL timeout anymore since Kubernetes will eventually send a SIGKILL itself if the process isn't exiting as expected. Signed-off-by: Markus Thömmes <[email protected]>
4c1fb3a
to
90d364b
Compare
@markusthoemmes Thanks very much for this change! |
Here is a note: Due to the use of the hot-reload feature in the new version of Fluent Bit, the Fluent Operator image created based on this is not suitable for image versions below Fluent Bit v2.1. |
Quick check wrt. the |
I think we can retain them and leave documentation notes. Of course, this is a point worth discussing, as both options are feasible. |
Hmm. In terms of feasibility, I think |
Signed-off-by: Markus Thömmes <[email protected]>
Signed-off-by: Markus Thömmes <[email protected]>
I've reflected my thoughts on how to deal with the flags in the current state. I'm dealing with a bit of an issue of the reload though due to the parsers.conf file. I'll have to go check if that's due to my very old fluent-operator version. I'll report back. |
Signed-off-by: Markus Thömmes <[email protected]>
d0b52b9
to
468acdc
Compare
@markusthoemmes Thanks for all the thoughts. |
@markusthoemmes Thanks for this big change. We're on the way to implement a new fluentbit/fluentd config generation mechanism by watching CRDs in the watcher instead of putting the configs in a secret, which is more scalable. I'm going to invite you as fluent-operator maintainer @markusthoemmes |
@patrick-stephens @agup006 would you help to invite @markusthoemmes as a member of the fluent org as well? |
Thanks for the invite. Do note that this change has a bug in that config reloads can be swallowed if they happen in too quick succession. I‘ve opened a bug against fluent-bit wrt that, see fluent/fluent-bit#8457 There‘s a workaround by fetching the hot reload count before the kick and then waiting for the count to converge afterwards but that ain‘t implemented in this change. I don‘t know when and if I‘ll find the time to contribute that. |
Looks like we need to wait for this fluentbit PR to be merged: fluent/fluent-bit#8461 |
@patrick-stephens @agup006 now fluentbit watcher use fluentbit's hot reload feature instead of killing the fluentbit process, but looks like the hot reload has some issues, would you please take a look? |
Note: I recognize that I haven't proposed this change per-se and that it's not necessarily necessary to do this rework in order to implement hot-reload. I'm happy to change things around and I don't want to come across as "here's a rework, plz merge" 😅 . This is more of a discussion-base for now.
What this PR does / why we need it:
This reworks the fluent-bit-watcher to be "single-child-process" only. Since we can now use hot-reload, we don't need to account for potential fluent-bit restarts "under the hood" and can instead rely on a single process to live "forever". Whenever that one fluent-bit process exits, the watcher should exit as well and Kubernetes'
CrashLoopBackoff
take care of the rest.That fundamental change in assumption allows us to vastly simplify the lifecycle in the watcher, which this does. It gets rid of all, sometimes unsafely shared, globals in favor of a single process being started, watched, kicked and eventually stopped.
SIGTERM is still forwarded to the child process. There is no need for a SIGKILL timeout anymore since Kubernetes will eventually send a SIGKILL itself if the process isn't exiting as expected.
Does this PR introduced a user-facing change?
Additional documentation, usage docs, etc.: