Nakadi clients resilience to partial outage and partial success #415
Comments
Current situation

In case of a partial success (or in cases like validation errors, which are complete failures), Fahrschein will throw an EventPublishingException carrying the BatchItemResponses (as returned from Nakadi) for the failed items. Each of these responses identifies its event by eid. If the application sets the eids itself (i.e. doesn't let Nakadi do it) and keeps track of them, this allows it to resend only the failed items later. (This is what nakadi-producer is doing.) But it needs some work on the application side.

Possible improvements

Instead of only the plain response from Nakadi, it could help if the actual event object which was submitted were also included in this exception. This would make it easier for an application to retry just these events. (It would require Fahrschein to invent eids where the application doesn't provide them, though.)

Splitting the failed events by failure type (validation/partitioning vs. enriching/publishing) might also help, but one needs to be careful not to mess up event ordering if that's important, so maybe that's better left to the application.

It might also be an option to have a "retry" method in the exception, but this might be difficult to combine with a back-off strategy.
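To illustrate the "resend only the failed items" idea from the comment above, here is a minimal sketch of the application-side bookkeeping. It deliberately does not use Fahrschein's classes: `Event`, `ItemResponse` and `selectFailed` are hypothetical names invented for this example, and it assumes the application assigned the eids itself and still holds the original batch. It needs Java 16 or newer for records.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class FailedEventSelector {

    /** Hypothetical event type; the application assigns the eid itself. */
    public record Event(String eid, String payload) {}

    /** Hypothetical stand-in for one per-item response of a failed batch. */
    public record ItemResponse(String eid, String publishingStatus) {}

    /**
     * Returns the events of the original batch whose eid appears in the
     * failed item responses, keeping the original batch order.
     */
    public static List<Event> selectFailed(List<Event> batch, List<ItemResponse> failedItems) {
        Set<String> failedEids = failedItems.stream()
                .map(ItemResponse::eid)
                .collect(Collectors.toSet());
        return batch.stream()
                .filter(event -> failedEids.contains(event.eid()))
                .collect(Collectors.toList());
    }
}
```

Filtering the original batch (rather than iterating over the responses) keeps the failed events in their submission order, which matters if the application cares about ordering among the events it resends.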
I see there is documentation about a retry strategy. Maybe it can be documented further so that teams can use it: https://github.com/zalando-nakadi/fahrschein#fahrschein-compared-to-other-nakadi-client-libraries
As I understand it, that's for the consumption part. Looking at the code, there is no retry logic for publishing, just the exception with the failed events, based on which the application can build its own retry logic. I'm also not sure an automated retry on publishing with back-off logic would fit into the philosophy of Fahrschein, but that's something for the maintainers to decide.
Breaking this into two PRs:
Related old ticket: #219
Done except for the retry, which we track in #219. Closing this.
Is your feature request related to a problem? Please describe.
The Nakadi publishing API accepts events in batches. It can fail to publish some events from a batch to the underlying storage (Apache Kafka). In that case the API returns an error indicating that the batch was only partially successful.
This can create the following problems, depending on how the Nakadi client and the publishing application deal with this partial success response:
Describe the solution you'd like
The Nakadi client should contain a note to developers that publishing can result in partial success. This should be in the client documentation and ideally also in the code's own documentation (e.g. docstrings), to raise awareness among users.
An optional retry method can be provided for the whole batch, but its default strategy must include a backoff in case publishing to Nakadi keeps failing.
An optional retry method can be provided that re-publishes only the unsuccessful events to Nakadi. This retry must also support a backoff strategy by default.
Clients must expose the result of a publishing request in a way that developers can understand that there is the possibility of a partial success for batch publishing.
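The last two points could look roughly like the following sketch: a result type that makes partial success visible to the caller, and a retry loop that resubmits only the unsuccessful events with exponential backoff. None of this is Fahrschein API. `PublishResult`, `publishWithRetry`, the `publisher` function and the injected `sleeper` are hypothetical names chosen for illustration, and the doubling delay is just one possible backoff strategy. It needs Java 16 or newer for records.

```java
import java.time.Duration;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

final class PartialRetry {

    /** Hypothetical outcome of one publish call: the events that were not accepted. */
    public record PublishResult<E>(List<E> failed) {
        public boolean isComplete() {
            return failed.isEmpty();
        }
    }

    /** Blocking sleeper for production use; tests can pass a recording stub instead. */
    public static void realSleep(Duration delay) {
        try {
            Thread.sleep(delay.toMillis());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while backing off", e);
        }
    }

    /**
     * Publishes the batch, then resubmits only the events reported as failed,
     * doubling the delay before each further attempt. Returns the result of
     * the last attempt, so the caller still sees any remaining failures.
     */
    public static <E> PublishResult<E> publishWithRetry(
            List<E> batch,
            Function<List<E>, PublishResult<E>> publisher,
            int maxAttempts,
            Duration initialDelay,
            Consumer<Duration> sleeper) {
        List<E> pending = batch;
        Duration delay = initialDelay;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            PublishResult<E> result = publisher.apply(pending);
            if (result.isComplete() || attempt == maxAttempts) {
                return result;
            }
            pending = result.failed();   // retry only what was not accepted
            sleeper.accept(delay);       // back off before the next attempt
            delay = delay.multipliedBy(2);
        }
        // Only reached when maxAttempts < 1: nothing was attempted.
        return new PublishResult<>(pending);
    }
}
```

A caller would pass `PartialRetry::realSleep` as the sleeper and wrap its actual publishing call in the `publisher` function. Note that resubmitting only the failed events can change their position relative to events that were already accepted, so this is only suitable where that ordering does not matter.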