diff --git a/README.md b/README.md
index 9379aa7..3c3ba35 100644
--- a/README.md
+++ b/README.md
@@ -502,19 +502,9 @@ reconnect after a minute. Note that when you are deploying a new instance of you
 more instances than partitions the above code will handle this situation (when the old instance will terminate and
 disconnect from the stream it will free up some slots, so the new instance will eventually reconnect)
 
-#### Automatically retrying sending of events
-
-Kanadi has a configuration option `kanadi.http-config.failed-publish-event-retry` which allows Kanadi to automatically
-resend events should they fail. The setting can also be set using the environment variable
-`KANADI_HTTP_CONFIG_FAILED_PUBLISH_EVENT_RETRY`. By default this setting is `false` since enabling this can cause
-events to be sent out of order, in other words you shouldn't enable it if you (or your consumers) rely on ordering
-of events. Kanadi will only resend the events which actually failed to send and it will refuse to send
-events which failed due to schema validation (since resending such events is pointless).
-
-Since Nakadi will only fail to publish an event in extreme circumstances (i.e. under heavy load) the retry
-uses an exponential backoff which can be configured with `kanadi.exponential-backoff-config` settings (see
-`reference.conf` for information on the settings). If reach the maximum number of retries then `Events.publish`
-will fail with the original `Events.Errors.EventValidation` exception.
+#### Resilience to Partial Outage and Partial Success
+
+When publishing events with the library, the publishing could either _succeed_, _partially succeed_ or _fail_. The application should be [prepared to handle](RESILIENCE.md) these cases.
 
 #### Modifying the pekko-stream source
 
diff --git a/RESILIENCE.md b/RESILIENCE.md
new file mode 100644
index 0000000..2265fb2
--- /dev/null
+++ b/RESILIENCE.md
@@ -0,0 +1,46 @@
+# Resilience to Partial Outage and Partial Success
+
+When publishing a batch of events with the [Events.publish](https://github.com/zalando-nakadi/kanadi/blob/d77479b5ef4a6837303fe71a9dc625e7bb8c573d/src/main/scala/org/zalando/kanadi/api/EventsInterface.scala#L9) API,
+
+```scala
+def publish[T](name: EventTypeName, events: List[Event[T]], fillMetadata: Boolean = true): Future[Unit]
+```
+
+the publishing can either _succeed_, _partially succeed_ or _fail_. On success, the returned `Future[Unit]` resolves to a `Unit` value. On _partial success_ or _failure_, the `Future` fails with an exception:
+
+- `EventValidation(batchItemResponse: List[BatchItemResponse])` - on partial success;
+- `HttpServiceError | GeneralError | OtherError` - on failure.
+
+On partial success, the `EventValidation` exception contains a `List[BatchItemResponse]` with the submission [statuses](https://nakadi.io/manual.html#definition_BatchItemResponse) for each event in the batch.
+
+The Kanadi library provides the following strategies for handling publishing errors:
+
+## Fail on Partial Success or Error (Default)
+
+The publishing operation fails immediately if the server reports that the batch was partially successful or failed.
+In this case the application can retry the whole batch or retry only the failed events, depending on the type of exception.
+The decision is up to the application. However, the application **should not retry the whole batch without a backoff strategy**; otherwise it can create problems for the server when many clients retry the same batch over and over.
+
+## Retry with Exponential Backoff
+
+Kanadi has a configuration option `kanadi.http-config.failed-publish-event-retry` which allows Kanadi to automatically
+resend events should they fail. The setting can also be set using the environment variable
+`KANADI_HTTP_CONFIG_FAILED_PUBLISH_EVENT_RETRY`. By default this setting is `false` since enabling it can cause
+events to be sent out of order; in other words, you shouldn't enable it if you (or your consumers) rely on ordering
+of events. Kanadi will only resend the events which actually failed to send and it will refuse to resend
+events which failed due to schema validation (since resending such events is pointless).
+
+Since Nakadi will only fail to publish an event in extreme circumstances (i.e. under heavy load) the retry
+uses an exponential backoff which can be configured with `kanadi.exponential-backoff-config` settings (see
+`reference.conf` for information on the settings). If the maximum number of retries is reached, `Events.publish`
+fails with the original `Events.Errors.EventValidation` exception.
+
+## Note
+
+No matter which strategy is used, the application should be prepared to handle the case where the server reports that the batch was partially successful or failed.
+
+If not properly handled,
+
+- the application can get stuck in a loop of retrying the same batch over and over;
+- the application can increase load on the server, if it retries the whole batch without a backoff strategy;
+- the application can lose events, if the returned `Future` is not checked for errors (fire and forget).
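
The partial-success handling that RESILIENCE.md asks applications to implement could be sketched roughly as follows. This is an illustration under stated assumptions, not Kanadi's implementation: `Status`, `ItemResponse`, `PartialSuccess` and `RetryPlanning` are hypothetical stand-ins that only mimic the shape of `BatchItemResponse` and `EventValidation`, and the doubling backoff is one common choice rather than what `kanadi.exponential-backoff-config` necessarily does.

```scala
import scala.concurrent.{ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical stand-ins, NOT Kanadi's classes: they only mirror the idea of a
// per-event status list carried by the partial-success exception.
sealed trait Status
case object Submitted extends Status
case object Failed extends Status
case object Aborted extends Status

final case class ItemResponse(eid: String, status: Status, validationError: Boolean)

final case class PartialSuccess(responses: List[ItemResponse])
    extends Exception("batch partially published")

object RetryPlanning {
  // Events worth resending: anything not submitted, except events rejected by
  // validation (resending those would fail again). Assumption: aborted events
  // are treated as retryable.
  def retryable(responses: List[ItemResponse]): Set[String] =
    responses.collect {
      case ItemResponse(eid, status, false) if status != Submitted => eid
    }.toSet

  // Exponential backoff: initial * 2^attempt, capped at max.
  def backoff(attempt: Int, initial: FiniteDuration, max: FiniteDuration): FiniteDuration = {
    val factor = math.pow(2, attempt.toDouble)
    math.min(initial.toMillis * factor, max.toMillis.toDouble).toLong.millis
  }

  // Inspect the result instead of firing and forgetting: success yields no ids,
  // partial success yields the ids to resend, any other failure propagates.
  def idsToResend(result: Future[Unit])(implicit ec: ExecutionContext): Future[Set[String]] =
    result.map(_ => Set.empty[String]).recover {
      case PartialSuccess(responses) => retryable(responses)
    }
}
```

An application would wait for `backoff(attempt, ...)` before resending only the events named by `idsToResend`, and give up after a bounded number of attempts so that it neither loops forever nor hammers the server.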