This document describes how the storage implementation for running Tessera on a POSIX-compliant filesystem is intended to work.
POSIX provides for a small number of atomic operations on compliant filesystems.
This design leverages those to safely maintain a Merkle tree log on disk, in a format
which can be exposed directly via a read-only endpoint to clients of the log (for example,
using nginx
or similar).
In contrast with some of the other storage backends, sequencing and integration of entries into the tree are synchronous.
The implementation uses a `.state/` directory to coordinate operation.
This directory does not need to be visible to log clients, but it contains no sensitive
data, so it isn't a problem if it is made visible.
In the description below, whenever we talk about writing to files (whether creating new ones or appending to existing ones), the actual process always follows this pattern:
- Create a temporary file on the same filesystem as the target location.
- If we're appending data, copy the existing contents of the target file into the temporary file.
- Write any new/additional data into the temporary file.
- Close the temporary file.
- Rename the temporary file into the target location.
The final step in the dance above is atomic according to the POSIX spec, so by performing this sequence of actions we avoid corrupt or partially written files ever becoming part of the tree.
- Leaves are submitted by the binary built using Tessera via a call to the storage's `Add` func.
- The storage library batches these entries up, and, after a configurable period of time has elapsed
  or the batch reaches a configurable size threshold, the batch is sequenced and integrated into the tree:
  - An advisory lock is taken on the `.state/treeState.lock` file. This helps prevent multiple frontends from stepping on each other, but isn't necessary for safety.
  - Flushed entries are assigned contiguous sequence numbers and written out into entry bundle files.
  - Newly added leaves are integrated into the Merkle tree, and the affected tiles are written out as files.
  - The `.state/treeState` file is updated with the new size & root hash.
- Asynchronously, at an interval determined by the `WithCheckpointInterval` option, the `checkpoint` file will be updated:
  - An advisory lock is taken on `.state/publish.lock`.
  - If the last-modified date of the `checkpoint` file is older than the checkpoint update interval, a new checkpoint which commits to the latest tree state is produced and written to the `checkpoint` file.
This implementation has been somewhat tested on both a local ext4
filesystem and on a distributed
CephFS instance on GCP, in both cases with multiple
personality binaries attempting to add new entries concurrently.