Cluster Metadata
The following is intended for those wishing to use the Cluster
Metadata subsystem inside Riak or another riak_core
application. It is not intended for those wishing to support, debug,
or contribute to the subsystem, although it may be helpful. For a
deeper understanding of how Cluster Metadata works, please see
Cluster Metadata Internals.
Cluster Metadata is intended to be used by riak_core applications
that need to work with information stored cluster-wide. It is useful for
storing application metadata or any information that needs to be
read without blocking on communication over the network.
Cluster Metadata is a key-value store. It treats values as opaque
Erlang terms that are fully addressed by their "Full Prefix" and
"Key". A Full Prefix is a {atom() | binary(), atom() | binary()},
while a Key is any Erlang term. The first element of the Full Prefix
is referred to as the "Prefix" and the second as the "Sub-Prefix".
Values are stored on disk, and a full copy is also maintained in memory. This allows reads to be served entirely from memory, while writes are applied to both.
Updates in Cluster Metadata are eventually consistent. Writing a value requires acknowledgment from only a single node and, as previously mentioned, reads return values from the local node only.
Updates are replicated to every node in the cluster, including nodes that join the cluster after the update has already reached all nodes in the previous set of members.
The interface to Cluster Metadata is provided by the
riak_core_metadata
module. The module's documentation is the official source for
information about the API, but some details are re-iterated here.
Reading the local value for a key can be done with the get/2,3
functions. Like most riak_core_metadata functions, the higher-arity
version takes a list of options, while the lower-arity version uses
the defaults.
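As a minimal sketch of reading a value (the Full Prefix and Key below are illustrative, not part of the API, and the {default, D} option name is an assumption based on the module's documented options):

```erlang
%% Illustrative names: {my_app, settings} and max_connections are
%% just an example Full Prefix and Key, not part of the API.
FullPrefix = {my_app, settings},

%% get/2 returns the locally stored value, or undefined if the key
%% has never been written (or has been deleted).
Value = riak_core_metadata:get(FullPrefix, max_connections),

%% get/3 takes a list of options; {default, D} (assumed option
%% name) is returned in place of undefined when the key is absent.
V2 = riak_core_metadata:get(FullPrefix, max_connections,
                            [{default, 100}]).
```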
Updating a key is done using put/3,4. Performing a put only blocks
until the write is applied locally. The broadcast of the update is
done asynchronously.
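For example (names are illustrative):

```erlang
%% put/3 blocks only until the write is applied to local storage;
%% the broadcast to other cluster members happens asynchronously.
ok = riak_core_metadata:put({my_app, settings}, max_connections, 200).
```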
Deletion of keys is logical, and tombstones are not reaped.
delete/2,3 act the same as put/3,4 with respect to blocking and
broadcast.
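A sketch of a logical delete, assuming (as described above) that a deleted key subsequently reads as undefined:

```erlang
%% delete/2 writes a logical tombstone locally and broadcasts it
%% asynchronously, just like put; the tombstone is never reaped.
ok = riak_core_metadata:delete({my_app, settings}, max_connections),

%% A deleted key reads as undefined (assumption based on the
%% logical-deletion behaviour described in the text).
undefined = riak_core_metadata:get({my_app, settings}, max_connections).
```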
Along with reading individual keys, the API also allows Full Prefixes to be iterated over. Iterators can visit both keys and values. They are not ordered, nor are they read-isolated. However, they do guarantee that each key is seen at most once for the lifetime of an iterator.
See iterator/2 and the itr_* functions.
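A sketch of draining an iterator with the itr_* functions (the exact set of itr_* helpers shown here is an assumption; consult the module documentation for the authoritative list):

```erlang
%% Collect every key/value pair under a Full Prefix. Each key is
%% seen at most once, but order is undefined and reads are not
%% isolated from concurrent writes.
collect(FullPrefix) ->
    It = riak_core_metadata:iterator(FullPrefix, []),
    collect_loop(It, []).

collect_loop(It, Acc) ->
    case riak_core_metadata:itr_done(It) of
        true ->
            ok = riak_core_metadata:itr_close(It),
            lists:reverse(Acc);
        false ->
            %% itr_key_values/1 returns the key and its (possibly
            %% conflicting) values at the iterator's position.
            KV = riak_core_metadata:itr_key_values(It),
            collect_loop(riak_core_metadata:itr_next(It), [KV | Acc])
    end.
```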
Conflict resolution can be done on read or on write.
On read, if the conflict is resolved, an option, allow_put, passed
to get/3 or iterator/2 controls whether or not the resolved value
will be written back to local storage and broadcast asynchronously.
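A sketch of read-time resolution. The option name resolver, and the assumption that the resolver receives the list of conflicting sibling values, are both inferred rather than confirmed here; only allow_put is named in the text:

```erlang
%% Assumed option shape: {resolver, Fun} where Fun receives the
%% list of sibling values. Keeping the largest value is purely an
%% illustrative resolution policy.
Resolver = fun(Siblings) -> lists:max(Siblings) end,
V = riak_core_metadata:get({my_app, counters}, hits,
                           [{resolver, Resolver},
                            %% write the resolved value back locally
                            %% and broadcast it asynchronously
                            {allow_put, true}]).
```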
On write, conflicts are resolved by passing a function, instead of a
new value, to put/3,4. The function is passed the list of existing
values, and it can use these, along with any values captured within
its closure, to produce a new value to store.
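For example, a write-time resolver (the resolution policy here is illustrative):

```erlang
%% Passing a function instead of a value to put/3: the function
%% receives the list of existing values and returns the single
%% value to store. Here we keep the numerically largest sibling.
ok = riak_core_metadata:put({my_app, counters}, hits,
                            fun(Existing) -> lists:max(Existing) end).
```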
The prefix_hash/1 function can be polled to determine when groups
of keys, by Prefix or Full Prefix, have changed.
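Change detection can be sketched as comparing successive hashes (names are illustrative):

```erlang
%% Poll prefix_hash/1 and compare results to detect whether any key
%% under a Full Prefix (or a bare Prefix) has changed in between.
H1 = riak_core_metadata:prefix_hash({my_app, settings}),
%% ... some time later ...
H2 = riak_core_metadata:prefix_hash({my_app, settings}),
case H2 =:= H1 of
    true  -> no_changes;
    false -> something_changed
end.
```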
The following is by no means a complete list of things to keep in mind when developing on top of Cluster Metadata.
Cluster Metadata uses dets for on-disk storage. There is one dets
table per Full Prefix, which limits the amount of data stored under
each Full Prefix to 2GB. This size includes the overhead of
information stored alongside values, such as the logical clock and
the key.
Since a full copy of the data is also kept in memory, memory usage must be considered as well.
Cluster Metadata uses disterl for message delivery, like most Erlang applications. The standard caveats and issues with large and/or overly frequent messages still apply.
The default conflict resolution strategy on read is last-write-wins. The usual caveats about the dangers of this method apply.
Extremely frequent writing back of resolved values after reads, in an eventually consistent store where both reads and writes require acknowledgment from only one node, can lead to an interesting pathological case in which siblings continue to be produced (although the set does not grow unbounded). A very rough exploration of this can be found here.
If a riak_core application is likely to have concurrent writes and
wishes to read extremely frequently, e.g. in the Riak request path,
it may be advisable to use {allow_put, false} with get/3.
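For example, in a hot read path (names are illustrative):

```erlang
%% With concurrent writers, suppress the asynchronous write-back of
%% resolved values to avoid feeding the sibling-producing cycle
%% described above.
V = riak_core_metadata:get({my_app, settings}, max_connections,
                           [{allow_put, false}]).
```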
Needs Content
Needs Content
See riak_core_metadata_manager:start_link/1.
Needs Content