-
A Look at the Clarity DB

The astute reader will have noticed above that I was looking at the block headers MARF, which only contains about 230k keys. This was an oversight on my part. I will show the same information below for the Clarity DB MARF, which is a few orders of magnitude bigger, and will show how much overhead sqlite imposes on the step of resolving a MARF's value hash to a Clarity value. To do this, I added two more benchmark commands to `stacks-inspect`.
Does the size of the Clarity MARF impact read performance?

The Clarity MARF is much, much bigger than the block headers MARF. With the recent chain tip hash I'm using here, there are 33 million Clarity values alone. I don't have the exact number of MARF paths, but it's not much larger (recall that the MARF inserts five paths per block at minimum, so this number is no greater than 34 million). Here's the output of a benchmark for reading 100,000 MARF paths from the Clarity MARF that map to Clarity values. Note that we're using
This is on the same production VM as before, with the Intel Haswell CPU. As you can see, the time per MARF query isn't substantially different -- 29 us versus 25 us. The reason is that the MARF's largest nodes have a radix of 256, so a path in a MARF with 34 million keys requires on average about 4.14 node visits (log_256(34,000,000) ≈ 3.14 intermediate nodes, plus the leaf). For the block headers MARF tested above, it's about 3.22 node visits (2.22 intermediate nodes plus the leaf). The code is reading less than 1 additional MARF node per query. So the verdict here is "not really" -- the larger MARF's path read performance is still very fast. Here's the full flame graph of this benchmark:

How much does Sqlite impact Clarity value lookups?

Recall that once the Clarity VM has resolved the MARF path to the value hash, it queries its side store DB for the value. This data is stored in the side store's sqlite database.
What's going on here? We went from 29 us to 69 us -- reading the serialized Clarity value from the side store adds a whopping 40 us of time. Where is that time getting spent? Let's look at the flamegraph.

Look at all that overhead from calls into sqlite! So the verdict here is "sqlite somewhat impacts Clarity value lookups."

Performance Improvement Opportunities

Looking at the above data, the single biggest win I think would be to interleave the MARF-indexed data into the MARF itself. Instead of storing the serialized values in a separate sqlite table, we could store them within the MARF trie flat file alongside the leaf nodes. This would not only get sqlite off of the critical read path, but also let us leverage better locality-of-reference: filling a page of RAM with disk block data when reading the MARF leaf would also bring in substantial parts of (or even the entirety of) the associated Clarity value, thus saving us additional I/O. I'm going to experiment with doing this.
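To make the idea concrete, an interleaved leaf record might look roughly like this (a purely illustrative sketch; this is not the actual trie encoding in stacks-blockchain, and the field layout is an assumption):

```rust
// Purely illustrative layout for an interleaved leaf record; the real MARF
// trie encoding differs. The point is that one sequential disk read can pull
// in both the leaf and the serialized value it maps to.
struct InterleavedLeaf {
    value_hash: [u8; 32], // what the MARF leaf stores today
    value_len: u32,       // length of the serialized Clarity value
    value_bytes: Vec<u8>, // the serialized Clarity value, stored adjacent to the leaf
}
```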
-
An update on this:
I implemented a data-interleaving scheme whereby MARF leaf data is written directly into the trie, right next to the leaf node. You can find it in

Here is today's behavior, which looks up a MARF value hash from the MARF and then queries the serialized Clarity value from the side store:
Here is the new behavior, which embeds the serialized Clarity value directly into the MARF itself:
I don't think it's a good use of time to pursue this further, but I suspect that the main reasons why the interleaving system is slower are:
I'll just leave the

Anyway, it's refreshing to see that the act of resolving a MARF-indexed value from both the MARF index and side-storage is very fast -- on the order of 10s of microseconds. We can certainly afford to increase the indexed read budget.
-
Just want to resurface this topic as it pertains to recent discussions on SIP calls for Nakamoto. A key take-away from this discussion was: MARF reads from Clarity should be much, much cheaper than they are currently assessed in block limits. Jude's benchmarking found that a MARF read + deserialization takes ~70 microseconds per operation. Those operations are currently assessed at ~10ms in the block limits (a ~142x larger assessment). In theory, this means that the indexed read budget could be made substantially larger.

Practically speaking, this would need to be confirmed through further benchmarking of actual block validation performance, but it does mean that the current I/O bounds are likely very pessimistic and could be increased.
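As a rough back-of-envelope check using only the figures quoted above (a sketch, not a benchmark):

```rust
// Back-of-envelope check of the figures above: ~70 µs measured per MARF read
// + deserialization, versus a ~10 ms assessment in the block limits, and a
// budget of 15,000 reads per block.
fn main() {
    let measured_us = 70.0_f64;     // measured cost per read
    let assessed_us = 10_000.0_f64; // ~10 ms assessed per read
    println!("assessment / measurement ≈ {:.1}x", assessed_us / measured_us);
    // A block that spends its entire 15,000-read budget would cost only about
    // 15,000 * 70 µs ≈ 1.05 s of measured read time.
    println!("full read budget ≈ {:.2} s", 15_000.0 * measured_us / 1_000_000.0);
}
```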
-
@cylewitruk I believe this is the best place for you to consider benchmarking SQLite to help Nakamoto blocks include the right number of transactions, and then to offer ideas. @kantai do you agree?
-
This discussion focuses on making chainstate DB reads faster.
Background
A fork-indexed read happens whenever Clarity reads some persisted data from a fork in the blockchain. This happens each time the code reads a data var or map entry, each time it calls another contract, and each time it runs `(at-block ...)`. It also happens internally in a few places, such as in the `get-block-info?` and `get-burn-block-info?` built-ins, as well as with block validation. Fork-indexed reads are an essential operation supported by the chainstate database: they permit the system to efficiently track multiple forks while ensuring that each transaction's Clarity code can only query state from the fork in which it runs.

Looking at block execution budgets over the past year, the scarcest resource in the Stacks blockchain by far is the number of fork-aware reads permitted. This is tracked in the execution budget's `read_count` metric. Right now, only 15,000 such operations are allowed per block, and we regularly see over 50% utilization (often as high as 99%).

Internally, fork-indexed reads happen through a specially-designed index data structure called a Merklized Adaptive-Radix Forest, or MARF. The MARF, when combined with a sqlite database, implements a fork-aware key/value store. It is designed to service the following two queries efficiently, in O(1) time:
1. Given the hash of a block and a key, look up the hash of the mapped value in the fork that the given block hash represents. That is, the value's hash will only be found if it was set by a previous MARF write in an ancestor Stacks block.
2. Given the hash of a block and a key, get a Merkle proof that the mapped value exists in the fork that this block hash represents.
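A rough sketch of what these two queries amount to as an interface (hypothetical trait and type names, not the actual MARF API in stacks-blockchain):

```rust
// Hypothetical interface for the two fork-aware queries above; names and
// types are illustrative only.
type BlockHash = [u8; 32];
type ValueHash = [u8; 32];

trait ForkIndexedStore {
    /// Query 1: resolve `key` to the hash of its value as seen from the fork
    /// whose tip is `tip`; None if no ancestor block ever wrote the key.
    fn get_value_hash(&self, tip: &BlockHash, key: &str) -> Option<ValueHash>;

    /// Query 2: additionally produce a Merkle proof that the mapping exists
    /// in the fork identified by `tip`.
    fn get_with_proof(&self, tip: &BlockHash, key: &str) -> Option<(ValueHash, Vec<u8>)>;
}
```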
State of the MARF
As the name implies, a MARF is a collection of trees (specifically, hash array-mapped tries). The keys of the MARF are arbitrary strings; they are hashed to calculate a fixed-length path. The leaves of each trie are the hashes of values stored in the associated sqlite database. Once the caller has found a MARF leaf (the value's hash), it then uses it to query the database for the value itself. That is, querying data in the Stacks blockchain is a 3-step operation: (1) given the chain tip and key, hash the key to get a path, (2) walk the path to look up the value's hash in the MARF if it exists, and (3) if it does exist, query the value from the sqlite database. To service part (3) of the query, the sqlite database simply maintains a table with two columns -- the value hash, and the value.
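Here is a minimal, self-contained sketch of that three-step read. `TrieIndex` and `SideStore` are stand-ins for the real MARF and sqlite side store; only the shape of the operation is meant to match the real code:

```rust
// Minimal sketch of the three-step indexed read described above.
use sha2::{Digest, Sha512_256};
use std::collections::HashMap;

struct TrieIndex {
    /// MARF path -> value hash, as visible from a particular chain tip
    leaves: HashMap<[u8; 32], [u8; 32]>,
}

struct SideStore {
    /// value hash -> serialized Clarity value
    values: HashMap<[u8; 32], Vec<u8>>,
}

fn read_indexed_value(marf: &TrieIndex, side: &SideStore, key: &str) -> Option<Vec<u8>> {
    // (1) hash the key to a fixed-length MARF path (SHA512/256)
    let mut path = [0u8; 32];
    path.copy_from_slice(&Sha512_256::digest(key.as_bytes()));
    // (2) walk the MARF to the leaf, which holds the value's hash
    let value_hash = marf.leaves.get(&path)?;
    // (3) resolve the value hash to the serialized value in the side store
    side.values.get(value_hash).cloned()
}
```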
I have spent a lot of time getting the MARF to be very, very fast at step (2) while also being crash-consistent. It achieves this by storing all of its tries in an append-only flat file on disk, and by relying on an associated sqlite database to index it (note that this sqlite database is not the same as the sqlite database that stores the MARF-indexed values). That is, the bulk of I/O in any MARF query happens directly through the VFS; sqlite is not involved on the read path.
The sqlite database records the block hash, block ID, and file offset and length of each trie in the flat file. Because chain state is append-only, all tries and index data are immutable once written -- the MARF will cache block hash, block ID, and trie offsets in RAM indefinitely, and will optionally cache trie nodes in RAM once loaded (thereby keeping sqlite off of the read path when traversing multiple tries).
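A sketch of what that in-RAM metadata cache might look like (hypothetical types; the node's actual structures differ):

```rust
// Hypothetical in-RAM cache of immutable trie metadata, keyed by block ID.
// Because chain state is append-only, entries never need to be invalidated.
use std::collections::HashMap;

#[derive(Clone)]
struct TrieLocation {
    block_hash: [u8; 32],
    file_offset: u64,
    length: u64,
}

#[derive(Default)]
struct TrieIndexCache {
    by_block_id: HashMap<u32, TrieLocation>,
}

impl TrieIndexCache {
    /// Return the cached location, or load it from the sqlite index once and
    /// cache it forever (the underlying row is immutable).
    fn get_or_load(&mut self, block_id: u32, load: impl FnOnce(u32) -> TrieLocation) -> TrieLocation {
        self.by_block_id
            .entry(block_id)
            .or_insert_with(|| load(block_id))
            .clone()
    }
}
```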
Crash consistency is achieved by appending new tries, flushing the data and inode metadata to disk, and inserting a new record for the trie into the sqlite index: a trie only exists if the sqlite index says it exists, and sqlite's crash consistency ensures that each trie it represents was persisted to disk via the aforementioned disk flushes.
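The commit ordering described above might be sketched like this (illustrative only; the table and function names are assumptions, not the real storage code):

```rust
// Illustrative append-then-index commit order: a trie "exists" only once the
// sqlite index row is committed, and the fsync before that insert guarantees
// the trie bytes are durable first.
use std::fs::OpenOptions;
use std::io::{Seek, SeekFrom, Write};

fn commit_trie(
    flat_file: &str,
    conn: &rusqlite::Connection,
    block_hash: &str,
    trie_bytes: &[u8],
) -> rusqlite::Result<()> {
    // 1. append the new trie to the flat file, remembering its offset
    let mut f = OpenOptions::new().append(true).open(flat_file).expect("open trie file");
    let offset = f.seek(SeekFrom::End(0)).expect("seek to end");
    f.write_all(trie_bytes).expect("append trie");
    // 2. flush file data and inode metadata before touching the index
    f.sync_all().expect("fsync");
    // 3. only now record the trie in the sqlite index; a crash before this
    //    point merely leaves unreferenced bytes at the end of the flat file
    conn.execute(
        "INSERT INTO trie_index (block_hash, offset, length) VALUES (?1, ?2, ?3)",
        rusqlite::params![block_hash, offset as i64, trie_bytes.len() as i64],
    )?;
    Ok(())
}
```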
A concern that has cropped up multiple times over the past year is that the MARF is not "fast enough" to handle things like subnets or to achieve speed goals in the Nakamoto release. I hope now to show that this is a myth.
MARF path lookups are fast
I have created the branch `perf/marf-dump-and-bench` to help explore the runtime performance of the MARF index. In this branch, `stacks-inspect` gains two new commands:

- `marf-dump`: Given a block hash, a path to a MARF, and a positive integer N, dump the first N paths and leaves in lexicographical order by path. If N is greater than the number of leaves in the MARF at the given block hash, then all paths and leaves are dumped.
- `marf-get-bench`: Given a block hash, a path to a MARF, and a list of paths, load each leaf and report the average number of milliseconds taken to read a leaf.

Using these new commands, I dumped all of the leaves of the Stacks chainstate MARF (i.e. the one Clarity uses) as of the current chain tip, and ran `marf-get-bench` on them. Here's what I found:

The host machine on which this test ran stores the MARF on an SSD, and has an Intel Haswell CPU with 8 GB RAM. The host machine also runs a Stacks node, so the MARF file blocks are more likely than not to be cached in the block layer already.
As you can see, even with a modest VM, it takes 25 microseconds on average to resolve a MARF path to its leaf. Keep in mind that this does not include the time taken to hash the key to its path, nor to look up the value from the leaf's value hash. This is just a measure of how long it takes to walk the MARF path to the value hash. But, we can do this step about 40,000 times per second!
So, clearly, MARF path resolution is not the limiting factor for a higher indexed read budget.
Profiling
For the record, where does `marf-get-bench` spend its time? Here's an annotated flame graph:

As you can see, the node spends most of its time loading trie nodes from disk, as well as a modest amount of time in the MARF walking from node to node. Also, note that in this test, all of the block hashes, block IDs, and trie offsets were fetched and cached eagerly (this is a different but compatible behavior from `master`, and one I'd like to merge).

You can reproduce the flame graph below:
Getting More Indexed Reads
How do we make the Stacks blockchain support more indexed reads? Recall there are two other steps in an indexed read operation:
1. Hashing the key to get a MARF path
2. Looking up the value from the hash returned by the MARF
The former is just the act of taking the SHA512/256 of the key, and we already use a hand-optimized x86_64 assembler implementation for this.
The latter could use some substantial improvement.
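For reference, step (3) today boils down to a single-row sqlite lookup along these lines (an illustrative rusqlite sketch; the table and column names are assumptions, not necessarily the real schema):

```rust
// Illustrative sketch of the side-store lookup (step 3): map a value hash to
// the serialized Clarity value. Table/column names are assumed.
use rusqlite::{Connection, OptionalExtension};

fn lookup_value(conn: &Connection, value_hash_hex: &str) -> rusqlite::Result<Option<String>> {
    conn.query_row(
        "SELECT value FROM data_table WHERE key = ?1",
        [value_hash_hex],
        |row| row.get(0),
    )
    .optional()
}
```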