Skip to content

Brooklin Architecture

Ahmed Elbahtemy edited this page Apr 10, 2019 · 20 revisions

Brooklin is a Java server application that is typically deployed to a cluster of machines. Each machine can host one or more instances of Brooklin, each of which offers the exact same set of capabilities.

Key Concepts

  • The most fundamental concept in Brooklin is a Datastream.
  • A Datastream is a description of a data pipe between two systems; a source system from which data is consumed and a destination system to which this data is produced.
  • Brooklin allows you to create as many Datastreams as you need to set up different data pipes between various source and destination systems.
  • To support high scalability, Brooklin expects the data streamed between source and destination systems to be partitioned. If the data is not partitioned, it is treated as a stream with a single partition.
  • Likewise, Brooklin breaks every partitioned Datastream into multiple DatastreamTasks, each of which is limited to a subset of the total partitions, that are all processed concurrently for higher throughput.
  • Connector is the abstraction representing modules that consume data from source systems.
  • Different Connector implementations can be written to support consuming data from different source systems.
  • To support producing data to different destinations, Connectors employ a different abstraction — TransportProviders.
  • An example Connector implementation is KafkaConnector, which is intended for consuming data from Kafka.
  • TransportProvider is the abstraction representing modules that produce data to destination systems.
  • Different TransportProvider implementations can be written to support producing data to different source systems.
  • An example TransportProvider implementation is KafkaTransportProvider, which is intended for producing data to Kafka.
  • Brooklin Coordinator is the module responsible for managing the different Connector implementations.
  • There is only a single Coordinator object in every Brooklin server app instance.
  • A Coordinator can either be a leader or non-leader.
  • In a Brooklin cluster, only one Coordinator is designated as leader while the rest are non-leaders.
  • Brooklin employs the ZooKeeper election recipe for electing the leader Coordinator.
  • In addition to managing Connectors, the leader Coordinator is responsible for dividing the work among all the Coordinators in the cluster (including itself) by assigning DatastreamTasks to them.
  • The leader Coordinator can be configured to do DatastreamTask assignment using different strategies (implementations of AssignmentStrategy).
  • An example AssignmentStrategy offered by Brooklin is the LoadbalancingStrategy, which causes the leader Coordinator to evenly distribute all the available DatastreamTasks across all Coordinator instances.

Architecture

  • Brooklin server application is typically deployed to one or more machines, all using ZooKeeper as the source of truth for Datastream and DatastreamTask metadata.
  • Information about the different instances of Brooklin server app as well as their DatastreamTask assignments is also stored in ZooKeeper.
  • Every Brooklin instance exposes a REST endpoint — aka Datastream Management Service (DMS) — that offers CRUD operations on Datastreams over HTTP.

A good way to understand the architecture of Brooklin is to go through the workflow of creating a new Datastream.

Datastream Creation Workflow

The figure below illustrates the main steps involved in Datastream creation.

Brooklin Datastream Creation Workflow

  1. A Brooklin client sends a Datastream creation request to a Brooklin cluster.

  2. The request is routed to the Datastream Management Service REST endpoint of any instance of the Brooklin server app.

  3. The Datastream data is verified and written to ZooKeeper under a certain znode that the leader Coordinator is watching for changes.

  4. The leader Coordinator gets notified of the new Datastream znode creation.

  5. The leader Coordinator reads the metadata of the newly created Datastream and breaks it down into one or more DatastreamTasks. It also uses the AssignmentStrategy of the Connector specified in the Datastream to assign the different DatastreamTasks to the available Coordinator instances, by writing the breakdown and assignment info to ZooKeeper.

  6. The affected Coordinators get notified of the new DatastreamTask assignments created under their respective znodes, which they read and process immediately using the relevant Connectors (i.e. consume from source, produce to destination).

ZooKeeper

In addition to leader Coordinator election, Brooklin maintains the following pieces of info in ZooKeeper:

  • Registered Connector types
  • Datastream and DatastreamTasks metadata
  • DatastreamTask progress information (e.g. offsets/checkpoints)
  • Brooklin app instances
  • DatastreamTask assignments of Brooklin app instances

The figure below illustrates the overall structure of the data Brooklin maintains in ZooKeeper.

Brooklin Data Structure in ZooKeeper

Structure

  • Each Brooklin cluster is mapped to one top-level znode.
  • /dms is where the definitions of individual Datastreams are presisted (in JSON)
  • /connectors
    • A sub-znode for every registered Connector type in this cluster is created under this znode.
    • /connectors/<connector-type>/ is where the definitions of all the DatastreamTasks handled by this Connector type are located, one znode per DatastreamTask.
    • There are two child nodes under every /connectors/<connector-type>/<datastreamtask> znode: config and state.
    • config stores the definitions of the DatastreamTasks
    • state stores the progress info of the Connectors (offsets/checkpoints)
  • /liveinstances
    • This is where the different Brooklin instances create ephemeral znodes with incremental sequence numbers (created using the EPHEMERAL_SEQUENTIAL mode) for the purposes of leader Coordinator election.
    • The value associated with each sub-znode is the hostname of the corresponding Brooklin instance.
  • /instances
    • This is where every Brooklin instance creates a persistent znode for itself. Each znode has a name composed of the concatenation of the instance hostname and its unique sequence number under /liveinstances.
    • Two sub-znodes are created under each Brooklin instance znode — assignments and errors.
    • assignments contains the names of all the DatastreamTasks assigned to this Brooklin instance.
    • errors is where messages about errors encountered by this Brooklin instance are persisted.

Workflow

  • The leader Coordinator watches the /dms znode for changes in order to get notified when Datastreams are created, deleted, or altered.
  • Leader or not, every Coorindator watches its corresponding znode under /instances in order to get notified when changes are made to its assignments child znode.
  • The leader Coordinator assigns DatastreamTasks to individual Brooklin instances by changing the assignment znode under /instances/<instance-name>.