Too many consumers from a single application
##TL;DR;## The number of consuming connections from one app to one partition will be limited to a maximum of 5.
##Problem## Sometimes our users' consuming applications have bugs that create a huge number of consumers for an event type, which in the end has a big impact on Nakadi availability.
##Proposed solution## We should limit the number of consumers from one app to one partition of an event type. There should be no more than 5 consuming connections active at a time. If the limit is reached, Nakadi should answer with 429 (Too Many Requests).
##Implementation details## ZooKeeper should be used to store the number of connections.
##Plan A## The Shared Semaphore recipe from the Curator Framework is a perfect tool for implementing the feature (http://curator.apache.org/curator-recipes/shared-semaphore.html). When a connection is created, the Nakadi instance should acquire a lease from the semaphore. If it is not possible to acquire one, we should answer with 429. The name of the ZK node used for the semaphore should contain the name of the application, the event type and the partition (e.g. “my-app|order-created|7”). So if the consuming connection reads from 4 partitions, Nakadi should acquire a lease from each of the 4 corresponding nodes.
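The lease handling in Plan A can be sketched roughly as follows. This is a minimal in-memory model, not Curator code: `SemaphoreNodes`, `open_connection` and `close_connection` are hypothetical names, and Python's `threading.BoundedSemaphore` stands in for the ZK-backed semaphore. It also assumes that a connection which fails on one partition gives back the leases it already took on the others, which the proposal does not spell out.

```python
import threading

MAX_CONNECTIONS = 5  # limit taken from the proposal


class SemaphoreNodes:
    """In-memory stand-in for one semaphore node per app/event-type/partition."""

    def __init__(self, max_leases=MAX_CONNECTIONS):
        self._max_leases = max_leases
        self._nodes = {}
        self._lock = threading.Lock()

    def _node(self, name):
        # Create the semaphore for a node name on first use.
        with self._lock:
            if name not in self._nodes:
                self._nodes[name] = threading.BoundedSemaphore(self._max_leases)
            return self._nodes[name]

    def try_acquire(self, name):
        # Non-blocking: returns False when all leases of the node are taken.
        return self._node(name).acquire(blocking=False)

    def release(self, name):
        self._node(name).release()


def open_connection(nodes, app, event_type, partitions):
    """Return (status, held_leases); 429 if any partition has no free lease."""
    held = []
    for partition in dict.fromkeys(partitions):  # ignore duplicate partitions
        name = f"{app}|{event_type}|{partition}"
        if nodes.try_acquire(name):
            held.append(name)
        else:
            # Assumption: roll back leases already taken for this connection.
            for taken in held:
                nodes.release(taken)
            return 429, []
    return 200, held


def close_connection(nodes, held):
    """Give back every lease the connection was holding."""
    for name in held:
        nodes.release(name)
```

Here `200` only marks "the connection may proceed". In a real deployment the leases would live in ZK so that all Nakadi instances share the same count.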
##Plan B## If for some reason it is not possible to implement this with the Shared Semaphore from the Curator Framework, we can consider implementing it in the following way:
When a consuming connection is created, an ephemeral node is created in ZK (the name of the node can be just a UUID). The name of the parent of this ephemeral node should contain the name of the application, the name of the event type and the partition, so that all connections to a partition from a single app are stored under this node. When the connection is closed, the ephemeral node is removed.
Example of data in ZK for a partition having 3 active connections:
    /my-cool-app|order-created|7
        /54d489e4-8b51-4acf-808a-8ae66ca351f2
        /0ec3de1f-0ce5-48d3-b500-080cabb7c43d
        /565c28ca-7e50-41b6-99e6-580e43855c2d
When the last connection is closed, the parent node should also be removed. If the last connection was closed in a bad way (e.g. the Nakadi instance was terminated), ZooKeeper drops the ephemeral node once the session ends, but the parent node is left hanging empty. To fix this, there should be a job that runs once every 12 hours and removes all such empty nodes.
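The node lifecycle of Plan B could look roughly like this. The sketch keeps the tree in a plain dictionary instead of ZK, and `ConnectionTree`, `session_lost` and `remove_empty_parents` are hypothetical names chosen for illustration. `session_lost` imitates an unclean shutdown, where the ephemeral child disappears but nobody removes the parent.

```python
import uuid


class ConnectionTree:
    """In-memory model of the ZK layout: parent node -> ephemeral children."""

    def __init__(self):
        self.children = {}  # parent node name -> set of child node names

    def register(self, app, event_type, partition):
        """Create an ephemeral child (a UUID) under the per-partition parent."""
        parent = f"/{app}|{event_type}|{partition}"
        child = str(uuid.uuid4())
        self.children.setdefault(parent, set()).add(child)
        return parent, child

    def close(self, parent, child):
        """Clean close: remove the child, and the parent if it became empty."""
        nodes = self.children.get(parent)
        if nodes is None:
            return
        nodes.discard(child)
        if not nodes:
            del self.children[parent]

    def session_lost(self, parent, child):
        """Unclean shutdown: the child vanishes, the empty parent stays."""
        self.children[parent].discard(child)

    def remove_empty_parents(self):
        """Body of the periodic cleanup job; returns the removed parents."""
        empty = [parent for parent, nodes in self.children.items() if not nodes]
        for parent in empty:
            del self.children[parent]
        return empty
```

In a deployment, `remove_empty_parents` would be the part triggered by a scheduler every 12 hours, as the proposal suggests.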
Before a new consuming connection is created, we should first check whether the limit has already been reached. If there are already 5 existing connections from an app to the partition, Nakadi should answer with a 429 status code. If at least one of the partitions the connection wants to read from has reached the limit, the whole connection will be refused.
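A minimal sketch of this admission check, assuming the per-partition children have already been read from ZK into a dictionary (`children_by_parent`, `partitions_over_limit` and `admit` are hypothetical names):

```python
MAX_CONNECTIONS = 5  # limit taken from the proposal


def partitions_over_limit(children_by_parent, app, event_type, partitions,
                          limit=MAX_CONNECTIONS):
    """Return the requested partitions that already have `limit` connections."""
    full = []
    for partition in partitions:
        parent = f"/{app}|{event_type}|{partition}"
        # A missing parent node means no active connections for that partition.
        if len(children_by_parent.get(parent, ())) >= limit:
            full.append(partition)
    return full


def admit(children_by_parent, app, event_type, partitions):
    """429 if any requested partition is at the limit, otherwise 200."""
    if partitions_over_limit(children_by_parent, app, event_type, partitions):
        return 429
    return 200
```

Because the check and the later node creation are two separate steps, two Nakadi instances handling requests at the same moment could both pass the check. Whether that brief overshoot is acceptable is a design decision this sketch leaves open.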
The new limiting functionality only covers low-level consumption (the subscription API doesn’t have this problem).