Non-blocking event recorder #425
Labels: area/runtime, enhancement, help wanted
The event recorder currently runs in a blocking fashion, so it can stall a reconciler that emits an event while the event webhook server is slow to respond. The HTTP client waits and retries on failure, which makes the reconciler wait until the event is posted. Sometimes reconcilers time out waiting for the event to be emitted, resulting in failed reconciliation. A non-blocking event recorder would keep failures in event recording from affecting the reconcilers.
Following are some ideas to address this issue:
Non-blocking request to the event webhook server
A quick and simple change to solve the immediate issue would be to make the HTTP request to the event webhook server non-blocking by running it in a goroutine. This would unblock the reconciler but would not guarantee the ordering of the requests. If the webhook server is offline and reconciliation is failing, each reconciler retry may create another goroutine, and each goroutine keeps retrying to post its event with back-off. Because the back-off durations vary, the events would be posted out of order once the webhook server becomes available. Since there is no de-duplication at the event source, the webhook server would have to serve all the accumulated event requests and de-duplicate them on its own. This approach may also create too many goroutines that are all trying to do the same thing, although the goroutines can be configured to give up after a certain number of attempts so that they get cleaned up.
Per-controller event processor
Another approach would be to introduce an event processor per controller. The events package can provide an API to run an event processor, typically in the main.go file before setting up the reconcilers. The event recorder in the reconcilers would send events to the event processor through a buffered channel. The event processor would collect the events to be posted to the event webhook server, categorize them, and process them according to configurable strategies. Since all the events go through a central event processor, it can be used to add more functionality at the event source. The order of the events can be maintained. Events can be de-duplicated at the source, which avoids spamming the event server. If the event server isn't ready, the event processor can perform a single health check and hold all event processing. If the event buffer fills up, it can drop events according to a chosen strategy. More interesting things can be done centrally at the source of events, at the controller level.
This may be similar to the event notification broadcaster in Kubernetes apimachinery: https://github.com/kubernetes/apimachinery/blob/fd8a60496be5e4ce19d586738adb48ac6fa66ef9/pkg/watch/mux.go#L43 .
Another variation would be to run an event processor per reconciler, or even per object, which would create opportunities to handle events in different ways. For example, a tenant could configure the events related to their objects to be sent to an event server that they manage themselves.