This repository has been archived by the owner on May 3, 2024. It is now read-only.

Nakadi clients resilience to partial outage and partial success #415

Closed
adyach opened this issue Sep 8, 2023 · 6 comments

@adyach
Contributor

adyach commented Sep 8, 2023

Is your feature request related to a problem? Please describe.

The Nakadi publishing API accepts events in batches. It can fail to publish some events from a batch to the underlying storage (Apache Kafka). In that case, the publishing API returns an error indicating that the batch was only partially successful.
Depending on how the Nakadi client and the publishing application deal with this partial-success response, this can cause the following problems:

  • increased traffic on the Nakadi publishing API, because Nakadi clients retry the whole batch over and over
  • the application retries identical batches, which prevents it from making progress

Describe the solution you'd like

  • The Nakadi client should contain a note to developers that publishing can result in partial success. This should be in the client documentation and ideally also in the in-code documentation (e.g. docstrings), to raise awareness among users.

  • An optional retry method can be provided at batch level that retries the whole batch, but its default strategy must include a back-off in case publishing to Nakadi keeps failing.

  • An optional retry method can be provided that re-publishes only the unsuccessful events to Nakadi. This retry must also support a back-off strategy by default.

  • Clients must expose the result of a publishing request in a way that makes it clear to developers that batch publishing can be partially successful.
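
To make the third bullet more concrete, here is a minimal sketch of a retry that re-publishes only the unsuccessful events and doubles its delay between attempts. Everything in it is hypothetical: `PartialRetry`, `publishWithRetry` and the `publishOnce` callback are illustration names, not part of Fahrschein or any other Nakadi client, and the sketch assumes the application assigns eids itself so that failed items can be matched back to events. It uses only the JDK.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch, not part of any Nakadi client library. */
final class PartialRetry {

    private PartialRetry() {}

    /**
     * Publishes the batch, then re-sends only the events whose eids were
     * reported as not published, doubling the delay after each failed attempt.
     *
     * @param batch       events keyed by eid, in publishing order
     * @param publishOnce sends a batch and returns the eids that were NOT published
     *                    (hypothetical callback wrapping the real client call)
     * @return the events still unpublished when the attempts ran out
     *         (empty if everything was published)
     */
    static <E> Map<String, E> publishWithRetry(
            Map<String, E> batch,
            Function<Map<String, E>, Set<String>> publishOnce,
            int maxAttempts,
            long initialDelayMillis) {
        Map<String, E> pending = new LinkedHashMap<>(batch);
        long delay = initialDelayMillis;
        for (int attempt = 1; attempt <= maxAttempts && !pending.isEmpty(); attempt++) {
            Set<String> failed = publishOnce.apply(Collections.unmodifiableMap(pending));
            // Keep only the events that failed; successful ones are never re-sent.
            pending.keySet().retainAll(failed);
            if (!pending.isEmpty() && attempt < maxAttempts) {
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    break;                              // and stop retrying
                }
                delay *= 2; // simple exponential back-off, no jitter or upper bound
            }
        }
        return pending;
    }
}
```

A real implementation would probably cap the delay, add jitter, and treat validation failures differently from publishing failures. Returning the leftover events instead of throwing is one way to make the partial result visible to the caller.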

@ePaul
Member

ePaul commented Sep 14, 2023

Current situation

In case of a partial success (and also in cases such as validation errors, which are complete failures), Fahrschein throws an EventPublishingException with the BatchItemResponses (as returned from Nakadi) for the failed items in its responses property.

These objects contain the eid of the failed event, a publishingStatus (failed or aborted; items with status submitted are filtered out), the step at which it failed, and a detail string.

If the application sets the eids itself (i.e. doesn't let Nakadi do it) and keeps track of them, this allows it to resend only the failed items later. (This is what nakadi-producer is doing.)
It also allows differentiating between validation errors (which likely don't need to be retried, as they are unlikely to succeed the next time, unless the event type definition is changed) and publishing errors (which should be retried, possibly with some back-off).

But it needs some work on the application side.
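
As a rough illustration of that application-side work, the sketch below matches failure responses back to the events that were sent, using the eids the application assigned. `ItemResponse` is a hypothetical stand-in that only mirrors the fields described above; it is not Fahrschein's actual class, and `FailedEventSelector` is an invented name. Records require Java 16 or newer.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the per-item response described above. */
record ItemResponse(String eid, String publishingStatus, String step, String detail) {}

/** Hypothetical helper, not part of Fahrschein. */
final class FailedEventSelector {

    private FailedEventSelector() {}

    /**
     * Returns the events whose eids appear in the failure responses,
     * in the order in which they were originally sent.
     *
     * @param sentByEid events keyed by the eid the application assigned; the map
     *                  must iterate in sending order (e.g. a LinkedHashMap)
     * @param failures  responses for the items that were not published
     */
    static <E> List<E> eventsToResend(Map<String, E> sentByEid, List<ItemResponse> failures) {
        Set<String> failedEids = new HashSet<>();
        for (ItemResponse response : failures) {
            failedEids.add(response.eid());
        }
        List<E> resend = new ArrayList<>();
        for (Map.Entry<String, E> entry : sentByEid.entrySet()) {
            if (failedEids.contains(entry.getKey())) {
                resend.add(entry.getValue());
            }
        }
        return resend;
    }
}
```

Iterating over the sent events rather than over the responses keeps the original publishing order, whatever order the failures are reported in.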

Possible improvements
(My personal ideas, I'm not a Fahrschein maintainer.)

Instead of the plain response from Nakadi, it could help if the actual event object which was submitted is also included in this exception. This makes it easier for an application to retry just these. (This would require Fahrschein to invent eids where the application doesn't provide them, though.)

Splitting the failed events by failure type (validation/partitioning vs. enriching/publishing) might also help, but one needs to be careful here to not mess up event ordering if that's important, so maybe that's better left for the application.
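
A split along those lines could be sketched as follows. The names `FailedItem`, `FailureSplit` and `FailureSplitter` are hypothetical, and the sketch assumes the step is available as a plain string such as "validating" or "publishing"; how a given client exposes it may differ. Each resulting list keeps the relative order of the input, but retrying only one of the lists can still reorder events relative to the other.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical minimal view of a failed item: its eid and the step where it failed. */
record FailedItem(String eid, String step) {}

/** Result of the split; each list keeps the relative order of the input. */
record FailureSplit(List<FailedItem> notWorthRetrying, List<FailedItem> retryable) {}

/** Hypothetical helper, not part of Fahrschein. */
final class FailureSplitter {

    // Assumption: failures in these steps will not succeed on an unchanged retry.
    private static final Set<String> PERMANENT_STEPS = Set.of("validating", "partitioning");

    private FailureSplitter() {}

    /** Splits failures into validation/partitioning errors and everything else. */
    static FailureSplit split(List<FailedItem> failures) {
        List<FailedItem> permanent = new ArrayList<>();
        List<FailedItem> retryable = new ArrayList<>();
        for (FailedItem item : failures) {
            // Compare case-insensitively, since the casing of the step is an assumption.
            String step = item.step().toLowerCase(Locale.ROOT);
            if (PERMANENT_STEPS.contains(step)) {
                permanent.add(item);
            } else {
                retryable.add(item);
            }
        }
        return new FailureSplit(permanent, retryable);
    }
}
```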

It might also be an option to have a "retry" method in the exception, but this might be difficult to combine with a back-off strategy.

@adyach
Contributor Author

adyach commented Sep 14, 2023

> It might also be an option to have a "retry" method in the exception, but this might be difficult to combine with a back-off strategy.

I see there is documentation about the retry strategy. Maybe it can be documented further so that teams can use it: https://github.com/zalando-nakadi/fahrschein#fahrschein-compared-to-other-nakadi-client-libraries

@ePaul
Member

ePaul commented Sep 14, 2023

> I see there is documentation about the retry strategy. Maybe it can be documented further so that teams can use it: https://github.com/zalando-nakadi/fahrschein#fahrschein-compared-to-other-nakadi-client-libraries

As I understand it, that's for the consumption part. Looking at the code, there is no retry logic for publishing, just the exception with the failed events, based on which the application can build its own retry logic.

I'm also not sure an automated re-try on publishing with back-off logic would fit into the philosophy of Fahrschein, but that's something for the maintainers to decide.

@otrosien
Member

breaking this into two PRs:

  1. documentation of the current behaviour (thanks @ePaul )
  2. implement a default retry-with-backoff behaviour

@ePaul
Member

ePaul commented Sep 19, 2023

Related old ticket: #219

@otrosien
Member

Done except for the retry, which we track in #219

Closing this.
