Skip to content

Data Stream Replication

Bei Chu edited this page Feb 21, 2024 · 3 revisions

Problem Statement

There's data being produced continuously from a data source.

The data consuming clients need to be able to consume the whole data stream from any point of time.

At any time, a client may come and go.

Client needs real-time access to data, meaning that the client can request future data, and as soon as the requested data is produced, the client will receive it. The latency should be as low as possible.

Solution

Source Data Structure

The data source keeps track of two pieces of data:

  • Persisted data
  • In-memory data

Whenever data is produced, it's added to in-memory data. And when the in-memory data reaches a certain size, it's persisted.

Persisting Strategy

In-memory data gets added to a queue for being persisted. When the persisting completes, the data is removed from memory.

The queue has a fixed size and back pressure is applied to the data source when the queue is full.

Requesting Data

Client always request a slice of data. When a data request comes in, there are three cases:

  • The requested data is persisted and not in memory
  • The requested data is in memory
  • The requested data hasn't been produced yet

In the first case, the persisted data location is returned to the client. The client needs to fetch the data itself.

In the second case, the data is returned to the client.

In the third case, a watcher with the slice request is added to the data source.

Watcher Strategy

Upon data production, the data source checks if any slice from the watchers is satisfied. If there are, the data is sent and the watcher is removed.

Watcher has a timeout. If the timeout is reached, any available data is sent to the client, and the watcher is removed.

The client can use the timeout to request a very large future slice, and get whatever data that's produced within the timeout. The timeout balances the latency and the network request overhead. Larger timeout means lower overhead, but higher latency.

Future improvements

Persisted Data Compaction

Often data can be combined to reduce the volume of data. For example, an "Insert" followed by a "Delete" can be compacted to nothing.

A compaction job can run periodically to compact the persisted data. The data source must be notified of the compaction, so it can return correct data location to the client.

On-the-fly Data Compaction

Compaction can also happen on the fly as the data is being produced. The limitation here is that on-the-fly data can only be combined with in-memory data.

For example, if a "Delete" is produced, but the corresponding "Insert" is already persisted, the compaction has to wait until it's also persisted and is executed by the persisted data compaction job.