controller: reconcile func returns an error #31
Conversation
Makes it possible to abort reconciliation workflows by making a precondition task or (alternatively) any of the concurrent tasks return an error. Signed-off-by: Guilherme Cassolato <[email protected]>
I am wondering what we are calling an error here and whether we need to give guidelines on what an error is.

For example, I don't think a merge conflict on the resource being reconciled is an error. In that case there should already be an event for the newer version of that resource in the queue. The resource status should be updated, and if that is what the error handler does then okay, but we should document that. When it comes to cluster resources, we should be getting those events, and if the reconcile tries to modify those resources before it can, I don't think that can be called an error.

What I do think can be classed as an error is a failure within the user code, i.e. unmarshaling fails. For this I wonder whether the user should be forced to handle that error at the source. This idea of a generic error handler seems like a catch-all. In the case of a code failure this may require requeuing the event.

Lastly, there is the error when external resources fail. The example here can be requests to Route 53 being rate limited. For this I also think the user should be forced to try to handle the failure at the source with a limited number of retries. I am aware this would block the workflow from processing the next event. So again, the resource status should also be updated when the limited retries fail, and the event re-queued. I am even starting to wonder if updating the status of a resource can be used as the requeue trigger.

I wonder if we are using error in the correct fashion for what we are trying to achieve, or if we are using error because that is the way controller-runtime does it. I will agree there needs to be a way to short-circuit the workflow to get to the status updater. But I am not sure the status updater should even be a different part of the workflow for a given resource. One approach could be to define a status updater function that is deferred at the top of the reconcile and is passed a status pointer (see the sketch below). Once the reconcile function gets into an unrecoverable state, the status pointer is updated and the function exits, causing the status to be updated. This would cut the workflow short, provide information to the user via the resource status, and also create an event in the queue due to the update.
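For illustration, a minimal sketch of that deferred status-updater idea. All the names here (MyResource, Client, UpdateStatus) are hypothetical stand-ins, not part of policy-machinery:

```go
// Sketch of a deferred status updater: the reconcile mutates a status pointer
// and a single deferred call persists whatever state it ended up in.
package main

import (
	"context"
	"fmt"
)

type MyResourceStatus struct {
	Ready   bool
	Message string
}

type MyResource struct {
	Name   string
	Status MyResourceStatus
}

// Client is a placeholder for whatever API client persists the status.
type Client struct{}

func (c *Client) UpdateStatus(ctx context.Context, res *MyResource) error {
	fmt.Printf("status of %s updated: %+v\n", res.Name, res.Status)
	return nil
}

func reconcile(ctx context.Context, c *Client, res *MyResource) {
	status := &res.Status

	// Deferred status updater: persisted exactly once on the way out,
	// including on early returns from unrecoverable states.
	defer func() {
		if err := c.UpdateStatus(ctx, res); err != nil {
			fmt.Println("failed to update status:", err)
		}
	}()

	// ... do work; on an unrecoverable condition, record it and return early.
	if err := doSomething(); err != nil {
		status.Ready = false
		status.Message = fmt.Sprintf("reconciliation aborted: %v", err)
		return
	}

	status.Ready = true
	status.Message = "reconciled"
}

func doSomething() error { return fmt.Errorf("unmarshaling failed") }

func main() {
	reconcile(context.Background(), &Client{}, &MyResource{Name: "example"})
}
```

The early return cuts the workflow short, and the status update itself would generate the next event in the queue, as described above.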
Ignore me if I'm missing the point, but rather than the semantics of what constitutes an error and what doesn't, isn't it more important that we are able to log errors as specifically as we can in order to help the user find a way forward? Any reliable hint on what might be going wrong is better than silently failing (from a user perspective).
For the purpose of the library, I guess my perspective is that we're not calling anything an error per se. Rather, we're just providing the user with the option to tag something as an error. Without this option, I guess the message we send is that either errors are in general unacceptable and must be handled at the source (as you say), or – what I believe is likely going to happen in many cases – that an error to be carried over (to be handled or simply logged at another level) must be injected into the shared sync.Map state.
Maybe, via documentation, as "good practices". I think the examples you mentioned are a good start. General rule of thumb, I guess:
Examples of errors that typically suggest events in the queue or not:
Although I would probably prefer handling the error at the source in most cases, I still see 2 use cases for raising the error. The first use case is to not repeat yourself: in a workflow with multiple tasks, there might be specific error types that you do not want to handle in every task. The second use case is to abort a workflow. Without this PR, I currently don't know how one could abort a workflow other than by panicking.
It can, but remember that updating the status may also fail and/or be the error being handled in the first place.
IMO, it's up to the user of the library to decide what to do with an error that is raised. However, if nothing suppresses an error that is triggered by a reconciliation function, the top level of the reconciliation, which is the controller provided by the library, will log the error for the user (policy-machinery/controller/controller.go, line 260 in e445f50).
My 2 cents on error handling
Thanks for the input, @eguzki.
TBH, maybe I did not understand this part. Maybe you have clear semantics for what an error is (or should be), in your opinion. IMO, e.g., trying to update a resource during reconciliation and failing due to being rate-limited by the API server is an error that induces a new state for the cluster. In this case, the new state of the cluster is that reconciliation could not be completed, and perhaps a policy previously flagged as Accepted/Enforced no longer is. It is not the update error itself that goes to the status of the policy, but a reflection of the new state of the cluster regarding this policy, i.e., that it is no longer Accepted/Enforced. It all started with an error though.
Here, returning an error is mainly a means to an end (e.g. to abort the workflow, to gather handling at an aggregation level, etc.) The error has a meaning of its own (i.e. an exception happened), but the decision to raise this error (instead of handling it at the source) or not is a prerogative of the user of the lib, to achieve some logic in the code.
In the Policy Machinery, there is zero promise of retrying.
I think the main benefits are 1) structuring the logic in a series of nested/chained reconcile funcs, and 2) the possibility to abort a workflow. Still, it's up to the user of the lib to decide when to use this feature, when to handle errors at the source, and when to simply let errors propagate all the way.
I guess we don't need an extensive list. A typology of errors should be fine.
Let me try to summarise the problems we want to solve here and the options I can see to tackle them.

Problem 1: There must be a way to halt a workflow or even the entire reconciliation without panicking.

Problem 2: There are common patterns for dealing with errors that users will not want to repeat across multiple tasks of a workflow but rather handle at a single point in the code.

I see 2 options to solve these problems. (There may be a 3rd, 4th option that I do not see.)

We don't need this PR to solve either of the problems described. This is OPTION A that I see. For both problems, users today can rely on the shared sync.Map state. To address Problem 1, all subsequent tasks in a possibly complex reconciliation workflow need to implement a safe-guard clause that makes the task skip in the presence of that piece of data. To address Problem 2, it's the opposite; a subsequent task actually triggers if a particular piece of data is present in the shared map. (See the sketch after this comment.)

OPTION A works. However, it requires that the source and all other relevant tasks in a workflow agree on the specific keys to read from the shared map. With regards to Problem 1 specifically, I happen to see OPTION A as additionally cumbersome, due to the number of safe-guard clauses that need to be spread across the tasks only for the sake of supporting halting a workflow.

The other option I see is this PR (OPTION B). Here, raising an error that is not handled by any error handler set at any of a series of nested workflows causes the reconciliation process to halt (solves Problem 1). Setting an error handler on a workflow, on the other hand, makes it catch errors that are raised within that workflow. This task then becomes a single point in the code that handles errors for that particular set or subset of tasks wrapped within the same workflow (solves Problem 2).

Regarding OPTION B and Problem 2, it's worth noting that it's still up to each source task to decide which errors to raise and which ones to handle itself. So it's not necessarily all errors that will end up in the error handler. It's also worth noting that an error handler belongs to a particular workflow, which can be nested inside an outer workflow. The nested (inner) error handler only catches errors raised inside the nested (inner) workflow it belongs to, and it can decide whether to suppress these errors (and thus let the outer workflow continue) or to raise them/some of them again (thus becoming a problem of the outer workflow now). This last explanation helps visualise Problems 1 and 2 combined (i.e. partially caught by a common "error handler" task but still raised afterwards to halt the workflow).

Both options described above, A and B, could be employed.
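A toy version of the OPTION A pattern, for illustration only. The task signature and the "abort" key are made up for this sketch; the real tasks in policy-machinery take more arguments:

```go
// OPTION A: tasks communicate through a shared *sync.Map; a source task
// records a failure, and every subsequent task repeats a guard clause.
package main

import (
	"fmt"
	"sync"
)

type task func(state *sync.Map)

func source(state *sync.Map) {
	// Something went wrong; record it in the shared state instead of returning it.
	state.Store("abort", fmt.Errorf("rate limited by the API server"))
}

func subsequent(state *sync.Map) {
	// Safe-guard clause every later task must repeat to honour the halt.
	if _, aborted := state.Load("abort"); aborted {
		return
	}
	fmt.Println("doing more work")
}

func main() {
	state := &sync.Map{}
	for _, t := range []task{source, subsequent} {
		t(state)
	}
	if err, ok := state.Load("abort"); ok {
		fmt.Println("workflow halted:", err)
	}
}
```

The repeated guard clause in every later task is exactly the boilerplate that makes OPTION A cumbersome for Problem 1.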
I am mostly happy with this. One comment on the ErrorHandler, but I don't want to hold this up because of that.
type Workflow struct {
	Precondition  ReconcileFunc
	Tasks         []ReconcileFunc
	Postcondition ReconcileFunc
	ErrorHandler  ReconcileFunc
}
I am not sure I like this. Currently it is a gut feeling, but it seems complicated from the user perspective. Not sure what would be better, or even worse.
What exactly don't you like, @Boomatang? Is it that ErrorHandler is typed as a ReconcileFunc? Or is it the whole idea of having an ErrorHandler in the first place?
If we don't offer the option to catch an error that happens within a workflow, then raising an error always equals halting the reconciliation call altogether, because no matter how many levels of nested reconciliation functions you have, the error would propagate all the way.

The idea behind this is to give users the option to decide, in a complex workflow with multiple nested workflows inside, which layers to propagate errors from and which layers to suppress an error and interrupt propagation (if any).
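To make the nesting/suppression idea concrete, here is a deliberately simplified model. The Workflow and ReconcileFunc types below only mimic the shape discussed in this PR; the real ReconcileFunc signature in policy-machinery is different, and the run semantics here are assumptions for illustration:

```go
// OPTION B, simplified: an inner workflow's ErrorHandler decides whether to
// suppress an error (outer workflow continues) or re-raise it (propagation).
package main

import (
	"errors"
	"fmt"
)

type ReconcileFunc func(err error) error

type Workflow struct {
	Tasks        []ReconcileFunc
	ErrorHandler ReconcileFunc
}

func (w *Workflow) Run() error {
	for _, t := range w.Tasks {
		if err := t(nil); err != nil {
			if w.ErrorHandler != nil {
				// The handler decides: return nil to suppress the error,
				// or return it (or another error) to re-raise it.
				return w.ErrorHandler(err)
			}
			return err // no handler: propagate all the way up
		}
	}
	return nil
}

var errTransient = errors.New("transient failure")

func main() {
	inner := &Workflow{
		Tasks: []ReconcileFunc{func(error) error { return errTransient }},
		ErrorHandler: func(err error) error {
			if errors.Is(err, errTransient) {
				fmt.Println("inner handler suppressed:", err)
				return nil // swallowed: the outer workflow keeps going
			}
			return err // anything else still halts the outer workflow
		},
	}

	outer := &Workflow{Tasks: []ReconcileFunc{func(error) error { return inner.Run() }}}
	fmt.Println("outer workflow error:", outer.Run())
}
```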
Having the ErrorHandler in the first place. To me it feels like a try/catch block, and as it is in the workflow, these error handlers can be nested. I know we need a way to escape the workflow, but there is something about this that feels off. I understand the reasoning for wanting this, but I think it is going to be too much power for the user, and then they will start offloading the errors to upper layers instead of handling errors themselves.

Hopefully I am wrong. I think with time we should see how this holds up.
I referred to the use case for this as Problem 2 in this comment.
In fact, users "offloading the errors to upper layers" is indeed the capability I wanted to enable with this feature. I guess I don't see this as inherently a bad thing. There are cases where you want to do it and cases where you don't.
Do you think there is no case where this is a good thing, @Boomatang? I'm honestly asking.
Of course there will be cases where this is a good thing. This is not about it being good or bad, it's about it being right. Every example that I can think of that would use this, I can see being solved without using it. Yes, some would require other work to be completed, but that is neither here nor there. The short circuit of a workflow if there is an error in the Precondition is the only part that I can't seem to see an alternative to. If there is an error in one routine in the Tasks list, I would still expect the Postcondition to run.
I am happy to accept this and see how it works in practice. The library is nowhere near v1 and there are not a lot of users yet. I would like to see some other alternatives explored, and seeing this being used within the Kuadrant operator will help narrow down what is needed.
Looks good, left some minor nits/comments
Happy with the changes
Makes the ReconcileFunc return an error.

Alters the Workflow reconciliation to abort the workflow if a precondition task or any of the concurrent tasks returns an error.

Alters the Workflow reconciler to accept an optional ErrorHandler ReconcileFunc.

Note: I'm not sure this is a good idea. Arguably, reconciliation tasks should never fail. Rather, when an error is detected, the task should move the resources to a state that reflects the inconsistency and wait for the next event to trigger, when possibly the error will not happen again.

The main reason for the change was making it possible (or easier) for a workflow reconciliation func to abort. Another option could be letting the workflow continue and carry the error over to the subsequent tasks, postcondition, etc. – using the sync.Map state (#29) or the error arguments. These subsequent tasks, in turn, would add safe-guard clauses that check for errors carried over from previous steps.
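As a rough illustration of the abort semantics described above (simplified stand-in types, not the library's actual API; in particular the argument-free ReconcileFunc and the error collection are assumptions for this sketch):

```go
// Abort behaviour: if the precondition or any of the concurrently-run tasks
// returns an error, the workflow stops before the postcondition runs.
package main

import (
	"errors"
	"fmt"
	"sync"
)

type ReconcileFunc func() error

type Workflow struct {
	Precondition  ReconcileFunc
	Tasks         []ReconcileFunc
	Postcondition ReconcileFunc
}

func (w *Workflow) Run() error {
	if w.Precondition != nil {
		if err := w.Precondition(); err != nil {
			return err // abort before any task runs
		}
	}

	// Run tasks concurrently and collect their errors.
	errs := make([]error, len(w.Tasks))
	var wg sync.WaitGroup
	for i, t := range w.Tasks {
		wg.Add(1)
		go func(i int, t ReconcileFunc) {
			defer wg.Done()
			errs[i] = t()
		}(i, t)
	}
	wg.Wait()
	if err := errors.Join(errs...); err != nil {
		return err // abort before the postcondition
	}

	if w.Postcondition != nil {
		return w.Postcondition()
	}
	return nil
}

func main() {
	w := &Workflow{
		Precondition:  func() error { return nil },
		Tasks:         []ReconcileFunc{func() error { return errors.New("task failed") }},
		Postcondition: func() error { fmt.Println("postcondition"); return nil },
	}
	fmt.Println("workflow error:", w.Run())
}
```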