This repository has been archived by the owner on Jul 30, 2024. It is now read-only.

Update publishing guidelines #219

Closed
wants to merge 1 commit into from
16 changes: 3 additions & 13 deletions README.md
@@ -502,19 +502,9 @@ reconnect after a minute. Note that when you are deploying a new instance of you
more instances than partitions the above code will handle this situation (when the old instance will terminate
and disconnect from the stream it will free up some slots, so the new instance will eventually reconnect)

#### Automatically retrying sending of events

Kanadi has a configuration option `kanadi.http-config.failed-publish-event-retry` which allows Kanadi to automatically
resend events should they fail. The setting can also be set using the environment variable
`KANADI_HTTP_CONFIG_FAILED_PUBLISH_EVENT_RETRY`. By default this setting is `false` since enabling this can cause
events to be sent out of order, in other words you shouldn't enable it if you (or your consumers) rely on ordering
of events. Kanadi will only resend the events which actually failed to send and it will refuse to send
events which failed due to schema validation (since resending such events is pointless).

Since Nakadi will only fail to publish an event in extreme circumstances (i.e. under heavy load) the retry
uses an exponential backoff which can be configured with `kanadi.exponential-backoff-config` settings (see
`reference.conf` for information on the settings). If the maximum number of retries is reached then `Events.publish`
will fail with the original `Events.Errors.EventValidation` exception.
#### Resilience to Partial Outage and Partial Success

When publishing events with the library, publishing can either _succeed_, _partially succeed_, or _fail_. The application should be [prepared to handle](RESILIENCE.md) these cases.

#### Modifying the pekko-stream source

46 changes: 46 additions & 0 deletions RESILIENCE.md
@@ -0,0 +1,46 @@
# Resilience to Partial Outage and Partial Success

When publishing a batch of events with the [Events.publish](https://github.com/zalando-nakadi/kanadi/blob/d77479b5ef4a6837303fe71a9dc625e7bb8c573d/src/main/scala/org/zalando/kanadi/api/EventsInterface.scala#L9) API,

```scala
def publish[T](name: EventTypeName, events: List[Event[T]], fillMetadata: Boolean = true): Future[Unit]
```

publishing can either _succeed_, _partially succeed_, or _fail_. On success, the returned `Future[Unit]` completes with a `Unit` value. On _partial success_ or _failure_, the `Future` fails with an exception:

- `EventValidation(batchItemResponse: List[BatchItemResponse])` - on partial success;

Review comment:

It is not completely clear whether this covers the 422 case, where event validation failed, or the 207 case, where the actual publishing failed. A 422 does not need a retry. Is there any differentiation?

Reply from @gchudnov (collaborator, PR author), Oct 11, 2023:

After checking, it seems that the current implementation has a bug that treats 422 as recoverable if retries are enabled (they are disabled by default).

Will make a fix.

- `HttpServiceError | GeneralError | OtherError` - on failure;

On partial success, the `EventValidation` exception contains a `List[BatchItemResponse]` with the submission [status](https://nakadi.io/manual.html#definition_BatchItemResponse) of each event in the batch.
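The three outcomes described above can be made explicit in application code. The following is a minimal sketch, not Kanadi's own API: `BatchItemResponse` and `EventValidation` are simplified stand-ins declared locally so that the example compiles on its own, the string status `"submitted"` follows the Nakadi `BatchItemResponse` definition, and `Outcome` and `classify` are hypothetical names.

```scala
import scala.concurrent.{ExecutionContext, Future}

object PublishOutcomes {
  // Stand-ins for the Kanadi types (assumed shapes, not the library's definitions),
  // so that the sketch compiles without the library on the classpath.
  final case class BatchItemResponse(eid: Option[String], publishingStatus: String, detail: Option[String])
  final case class EventValidation(batchItemResponse: List[BatchItemResponse])
      extends Exception("batch was only partially published")

  // Hypothetical result type that makes the three outcomes explicit.
  sealed trait Outcome
  case object Published extends Outcome
  final case class PartiallyPublished(notSubmitted: List[BatchItemResponse]) extends Outcome
  final case class Failed(cause: Throwable) extends Outcome

  // Turns the failed Future into a value that the caller has to inspect.
  def classify(result: Future[Unit])(implicit ec: ExecutionContext): Future[Outcome] =
    result
      .map[Outcome](_ => Published)
      .recover {
        case EventValidation(items) =>
          // Keep only the items that were not reported as "submitted".
          PartiallyPublished(items.filterNot(_.publishingStatus == "submitted"))
        case other =>
          Failed(other)
      }
}
```

With the real library, the argument to `classify` would be the `Future[Unit]` returned by `Events.publish`, and the pattern match would use the library's own exception types instead of the stand-ins.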

The Kanadi library provides the following strategies for handling publishing errors:

## Fail on Partial Success or Error (Default)

The publishing operation fails immediately if the server reports that the batch was only partially successful or that it failed.
In this case the application can retry the whole batch or only the failed events, depending on the type of exception.
The decision is up to the application. Note, however, that the application **should not retry the whole batch without a backoff strategy**; otherwise it can create problems for the server when many clients retry the same batch over and over.
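An application-side retry with backoff could be sketched as follows. This is an illustration under stated assumptions, not part of Kanadi: `BackoffRetry`, `backoffDelay`, `retryWithBackoff` and the `shouldRetry` predicate are hypothetical names, and only the Scala and Java standard libraries are used. The delay doubles with every attempt up to a cap, and the predicate lets the caller skip retries for errors that cannot succeed on a second attempt, such as validation failures.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, ThreadFactory, TimeUnit}
import scala.concurrent.{ExecutionContext, Future, Promise}
import scala.concurrent.duration._
import scala.util.control.NonFatal

object BackoffRetry {
  // One daemon thread that is used only to time the retries.
  private val scheduler: ScheduledExecutorService =
    Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
      def newThread(r: Runnable): Thread = {
        val t = new Thread(r, "publish-retry-scheduler")
        t.setDaemon(true)
        t
      }
    })

  // Delay before retry number `attempt` (0-based): initial * 2^attempt, capped at `max`.
  def backoffDelay(attempt: Int, initial: FiniteDuration, max: FiniteDuration): FiniteDuration = {
    val uncapped = initial.toMillis.toDouble * math.pow(2.0, attempt.toDouble)
    math.min(uncapped, max.toMillis.toDouble).toLong.millis
  }

  // Starts `op` after `delay` without blocking a thread while waiting.
  private def after[T](delay: FiniteDuration)(op: => Future[T]): Future[T] = {
    val promise = Promise[T]()
    val task = new Runnable {
      def run(): Unit = promise.completeWith(try op catch { case NonFatal(e) => Future.failed(e) })
    }
    scheduler.schedule(task, delay.toMillis, TimeUnit.MILLISECONDS)
    promise.future
  }

  // Runs `op`, retrying at most `maxRetries` times while `shouldRetry` accepts the error.
  def retryWithBackoff[T](maxRetries: Int, initial: FiniteDuration, max: FiniteDuration)(
      shouldRetry: Throwable => Boolean
  )(op: () => Future[T])(implicit ec: ExecutionContext): Future[T] = {
    def loop(attempt: Int): Future[T] =
      op().recoverWith {
        case NonFatal(error) if attempt < maxRetries && shouldRetry(error) =>
          after(backoffDelay(attempt, initial, max))(loop(attempt + 1))
      }
    loop(0)
  }
}
```

A call might look like `retryWithBackoff(5, 200.millis, 10.seconds)(isRetryable)(() => events.publish(name, batch))`, where `events`, `name`, `batch` and `isRetryable` are supplied by the application. When many clients can fail at the same moment, adding random jitter to the delay helps to keep them from retrying in lockstep.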

## Retry with Exponential Backoff

Review comment:

Is there a way to retry the whole batch, for example through a retry strategy that is configurable in code?

Reply (collaborator, PR author):

At the moment this feature is absent from the library, but some internal projects have wrappers on top that implement exactly this functionality.

It can obviously be added; this just describes the current state.


👍


Kanadi has a configuration option `kanadi.http-config.failed-publish-event-retry` which allows Kanadi to automatically
resend events should they fail. The setting can also be set using the environment variable
`KANADI_HTTP_CONFIG_FAILED_PUBLISH_EVENT_RETRY`. By default this setting is `false`, since enabling it can cause
events to be sent out of order. In other words, you shouldn't enable it if you (or your consumers) rely on the ordering
of events. Kanadi will only resend the events which actually failed to send, and it will refuse to resend
events which failed due to schema validation (since resending such events is pointless).

Since Nakadi will only fail to publish an event in extreme circumstances (e.g. under heavy load), the retry
uses an exponential backoff which can be configured with the `kanadi.exponential-backoff-config` settings (see
`reference.conf` for information on the settings). If the maximum number of retries is reached, `Events.publish`
fails with the original `Events.Errors.EventValidation` exception.

## Note

No matter which strategy is used, the application should be prepared to handle the case where the server reports that the batch was only partially successful or that it failed.

If not properly handled,

- the application can get stuck in a loop of retrying the same batch over and over
- the application can increase load on the server, if it retries the whole batch without a backoff strategy
- the application can lose events, if the returned `Future` is not checked for errors (fire and forget).
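To avoid the fire-and-forget case from the list above, the returned `Future` has to be observed. The sketch below shows one way to do that under stated assumptions: `CheckedPublish` and `publishOrKeep` are hypothetical names, and the `publish` parameter stands in for a call to `Events.publish`.

```scala
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}

object CheckedPublish {
  // Runs `publish` and, if the Future fails, hands the batch to `onFailure`
  // (for example a log, a metric or a local buffer) instead of dropping it silently.
  // Completes with true when the batch was published and false otherwise.
  def publishOrKeep[E](events: List[E])(publish: List[E] => Future[Unit])(
      onFailure: (List[E], Throwable) => Unit
  )(implicit ec: ExecutionContext): Future[Boolean] =
    publish(events).transform {
      case Success(_) => Success(true)
      case Failure(error) =>
        onFailure(events, error)
        Success(false)
    }
}
```

The handler receives both the batch and the error, so the application can decide whether to store the events for a later attempt or only record the failure.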