Nakadi clients resilience to partial outage and partial success #415

adyach · 2023-09-08T12:12:57Z

Is your feature request related to a problem? Please describe.

The Nakadi publishing API accepts events in batches. It can fail to publish some events from a batch to the underlying storage (Apache Kafka). In that case, the Nakadi publishing API returns an error indicating that the batch was only partially successful.
This can cause the following problems, depending on how the Nakadi client and the publishing application deal with this partial success response:

increased traffic on the Nakadi publishing API, because Nakadi clients retry the whole batch over and over
the application keeps retrying identical batches, which prevents it from making progress

Describe the solution you'd like

The Nakadi client should contain a note to developers that publishing can result in partial success. This should be in the client documentation and ideally also in the in-code documentation (e.g. docstrings), to raise awareness among users.
An optional retry method at batch level can be provided for the whole batch, but the default strategy must include a backoff in case publishing to Nakadi keeps failing.
An optional retry method can be provided that re-publishes only the unsuccessful events to Nakadi. This retry must also support a backoff strategy by default.
Clients must expose the result of a publishing request in a way that makes it clear to developers that batch publishing can partially succeed.
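
To make the request more concrete, here is a minimal sketch of "re-publish only the unsuccessful events, with backoff". It is not Fahrschein code: `BackoffRetrySketch`, `BatchPublisher`, `backoffMillis` and `publishWithRetry` are hypothetical names, and the publisher is assumed to return the events that were not published.

```java
import java.util.ArrayList;
import java.util.List;

final class BackoffRetrySketch {

    /** Hypothetical stand-in for a publish call: returns the events that were NOT published. */
    interface BatchPublisher<T> {
        List<T> publish(List<T> batch);
    }

    /** Exponential backoff: initialMillis doubled for each further attempt (1-based), capped at maxMillis. */
    static long backoffMillis(int attempt, long initialMillis, long maxMillis) {
        long delay = initialMillis;
        for (int i = 1; i < attempt && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    /**
     * Publishes the batch, then re-publishes only the events reported as unpublished,
     * sleeping between attempts. Returns whatever is still unpublished after
     * maxAttempts (or when the thread is interrupted while waiting).
     */
    static <T> List<T> publishWithRetry(BatchPublisher<T> publisher, List<T> batch,
                                        int maxAttempts, long initialMillis, long maxMillis) {
        List<T> pending = new ArrayList<>(batch);
        for (int attempt = 1; attempt <= maxAttempts && !pending.isEmpty(); attempt++) {
            pending = new ArrayList<>(publisher.publish(pending));
            if (!pending.isEmpty() && attempt < maxAttempts) {
                try {
                    Thread.sleep(backoffMillis(attempt, initialMillis, maxMillis));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // keep the interrupt flag for the caller
                    return pending;
                }
            }
        }
        return pending;
    }
}
```

Returning the leftover events, rather than silently dropping them, keeps the partial success visible to the caller. Note that re-publishing only the failed events can change the order in which events reach Nakadi, which matters if the application relies on ordering.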

ePaul · 2023-09-14T09:36:52Z

Current situation

In case of a partial success (or also in cases like validation errors, which are complete failures), Fahrschein will throw an EventPublishingException with the BatchItemResponses (as returned from Nakadi) for the failed items in the responses property.

These objects have the eid of the failed event, a publishingStatus (failed or aborted; submitted items are filtered out), the step at which it failed, and a detail string.

If the application sets the eids itself (i.e. doesn't let Nakadi do it) and keeps track of them, this allows it to resend only the failed items later. (This is what nakadi-producer is doing.)
It also allows differentiating between validation errors (which likely don't need to be retried, as they are unlikely to succeed the next time, unless the event type definition is changed) and publishing errors (which should be retried, possibly with some back-off).

But it needs some work on the application side.
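
As a rough illustration of that application-side work, the sketch below picks the events to resend from the eids of the failed items. `FailedItem` is a simplified, hypothetical stand-in for the response objects described above, not Fahrschein's own class, and the rule in `isRetryable` (skip failures at the validating or partitioning step, retry everything else) is an assumption that each application would need to check for itself.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class FailedItemSelection {

    /** Simplified, hypothetical stand-in for one failed batch item. */
    static final class FailedItem {
        final String eid;
        final String step;   // e.g. "validating" or "publishing"
        final String detail;

        FailedItem(String eid, String step, String detail) {
            this.eid = eid;
            this.step = step;
            this.detail = detail;
        }
    }

    /** Assumption: failures at the validating or partitioning step will not succeed if resent unchanged. */
    static boolean isRetryable(FailedItem item) {
        return !"validating".equalsIgnoreCase(item.step)
                && !"partitioning".equalsIgnoreCase(item.step);
    }

    /**
     * Returns the events worth resending, in the iteration order of eventsByEid.
     * Pass a LinkedHashMap filled in batch order to keep the original event order.
     */
    static <T> List<T> eventsToResend(Map<String, T> eventsByEid, List<FailedItem> failed) {
        Set<String> retryableEids = new HashSet<>();
        for (FailedItem item : failed) {
            if (isRetryable(item)) {
                retryableEids.add(item.eid);
            }
        }
        List<T> result = new ArrayList<>();
        for (Map.Entry<String, T> entry : eventsByEid.entrySet()) {
            if (retryableEids.contains(entry.getKey())) {
                result.add(entry.getValue());
            }
        }
        return result;
    }
}
```

The application would fill the map with the eids it assigned before publishing, and build the `FailedItem` list from the failed responses carried by the exception.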

Possible improvements
(My personal ideas, I'm not a Fahrschein maintainer.)

Instead of the plain response from Nakadi, it could help if the actual event object which was submitted is also included in this exception. This makes it easier for an application to retry just these. (This would require Fahrschein to invent eids where the application doesn't provide them, though.)
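
A sketch of what this idea implies: every outgoing event gets an eid (generated if the application did not set one), so failed responses can be mapped back to the submitted objects. `EidTrackingSketch`, `TrackedEvent`, `track` and `payloadsFor` are hypothetical names for illustration only, and generating the eid with a random UUID is an assumption about how a client might do it.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.UUID;

final class EidTrackingSketch {

    /** Hypothetical pairing of an event payload with the eid it was sent under. */
    static final class TrackedEvent<T> {
        final String eid;
        final T payload;

        TrackedEvent(String eid, T payload) {
            this.eid = eid;
            this.payload = payload;
        }
    }

    /** Keeps the application's eid if present, otherwise generates a random UUID. */
    static <T> TrackedEvent<T> track(String eidOrNull, T payload) {
        String eid = (eidOrNull != null) ? eidOrNull : UUID.randomUUID().toString();
        return new TrackedEvent<>(eid, payload);
    }

    /** Returns the submitted payloads whose eid is among the failed eids, in sending order. */
    static <T> List<T> payloadsFor(List<TrackedEvent<T>> sent, Collection<String> failedEids) {
        List<T> result = new ArrayList<>();
        for (TrackedEvent<T> event : sent) {
            if (failedEids.contains(event.eid)) {
                result.add(event.payload);
            }
        }
        return result;
    }
}
```

An exception built this way could hand the application the failed event objects directly instead of only their eids.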

Splitting the failed events by failure type (validation/partitioning vs. enriching/publishing) might also help, but one needs to be careful here to not mess up event ordering if that's important, so maybe that's better left for the application.

It might also be an option to have a "retry" method in the exception, but this might be difficult to combine with a back-off strategy.

adyach · 2023-09-14T10:07:45Z

> It might also be an option to have a "retry" method in the exception, but this might be difficult to combine with a back-off strategy.

I see there is documentation about the retry strategy; maybe it can be extended so that teams can use it: https://github.com/zalando-nakadi/fahrschein#fahrschein-compared-to-other-nakadi-client-libraries

ePaul · 2023-09-14T11:25:29Z

> I see there is a documentation about retry strategy, maybe can document it further that teams can use it, https://github.com/zalando-nakadi/fahrschein#fahrschein-compared-to-other-nakadi-client-libraries

As I understand it, that's for the consumption part. Looking at the code, there is no retry logic for publishing, just the exception with the failed events, on which the application can build its own retry logic.

I'm also not sure an automated re-try on publishing with back-off logic would fit into the philosophy of Fahrschein, but that's something for the maintainers to decide.

otrosien · 2023-09-15T14:33:10Z

breaking this into two PRs:

documentation of the current behaviour (thanks @ePaul )
implement a default retry-with-backoff behaviour

ePaul · 2023-09-19T15:54:30Z

Related old ticket: #219

otrosien · 2023-10-11T14:01:35Z

Done except for the retry, which we track in #219

Closing this.

otrosien mentioned this issue Sep 15, 2023

Document Nakadi Publishing failure handling #416

Merged

mesut self-assigned this Oct 6, 2023

otrosien mentioned this issue Oct 9, 2023

Allow clients to correctly handle partial publishing failures #424

Merged

otrosien closed this as completed Oct 11, 2023
