-
A Look at the Clarity DB

The astute reader will have noticed above that I was looking at the block headers MARF, which only contains about 230k keys. This was an oversight on my part. I will show the same information below for the Clarity DB MARF, which is a few orders of magnitude bigger, and will show how much overhead sqlite imposes on the step of resolving a MARF's value hash to a Clarity value. To do this, I added two more benchmark commands to `stacks-inspect`.
Does the size of the Clarity MARF impact read performance?

The Clarity MARF is much, much bigger than the block headers MARF. With the recent chain tip hash I'm using here, there are 33 million Clarity values alone. I don't have the exact number of MARF paths, but it's not much larger (recall that the MARF inserts five paths per block at minimum, so this number is no greater than 34 million). Here's the output of a benchmark for reading 100,000 MARF paths from the Clarity MARF that map to Clarity values. Note that we're using
This is on the same production VM as before, with the Intel Haswell CPU. As you can see, the time per MARF query isn't substantially different -- 29 us versus 25 us. The reason is that the MARF's largest nodes have a radix of 256, so a path in a MARF with 34 million keys requires on average about 4.14 node visits (log_256(34,000,000) ≈ 3.14 intermediate nodes, plus the leaf). For the block headers MARF tested above, it's about 3.22 node visits (2.22 intermediate nodes plus the leaf). The code is reading less than 1 additional MARF node per query. So the verdict here is "not really" -- the larger MARF's path read performance is still very fast. Here's the full flame graph of this benchmark:

How much does Sqlite impact Clarity value lookups?

Recall that once the Clarity VM has resolved the MARF path to the value hash, it queries its side store DB for the value. This data is stored in the side store's sqlite database.
What's going on here? We went from 29 us to 69 us -- reading the serialized Clarity value from the side store adds a whopping 40 us of time. Where is that time getting spent? Let's look at the flamegraph.

Look at all that overhead from calls into sqlite! So the verdict here is "sqlite somewhat impacts Clarity value lookups."

Performance Improvement Opportunities

Looking at the above data, the single biggest win I think would be to interleave the MARF-indexed data into the MARF itself. Instead of storing the serialized values in a separate sqlite table, we could store them within the MARF trie flat file alongside the leaf nodes. This would not only get sqlite off of the critical read path, but also let us leverage better locality-of-reference: filling a page of RAM with disk block data when reading the MARF leaf would also bring in substantial parts of (or even the entirety of) the associated Clarity value, thus saving us additional I/O. I'm going to experiment with doing this.
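To make the idea concrete, an interleaved leaf record might look roughly like this (a purely illustrative sketch; this is not the actual trie encoding in stacks-blockchain, and the field layout is an assumption):

```rust
// Purely illustrative layout for an interleaved leaf record; the real MARF
// trie encoding differs. The point is that one sequential disk read can pull
// in both the leaf and the serialized value it maps to.
struct InterleavedLeaf {
    value_hash: [u8; 32], // what the MARF leaf stores today
    value_len: u32,       // length of the serialized Clarity value
    value_bytes: Vec<u8>, // the serialized Clarity value, stored adjacent to the leaf
}
```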
-
An update on this:
I implemented a data-interleaving scheme whereby MARF leaf data is written directly into the trie, right next to the leaf node. You can find it in

Here is today's behavior, which looks up a MARF value hash from the MARF and then queries the serialized Clarity value from the side store:
Here is the new behavior, which embeds the serialized Clarity value directly into the MARF itself:
I don't think it's a good use of time to pursue this further, but I suspect that the main reasons why the interleaving system is slower are:
I'll just leave the

Anyway, it's refreshing to see that the act of resolving a MARF-indexed value from both the MARF index and side-storage is very fast -- on the order of 10s of microseconds. We can certainly afford to increase the indexed read budget.
-
Just want to resurface this topic as it pertains to recent discussions on SIP calls for Nakamoto. A key take-away from this discussion was: MARF reads from Clarity should be much, much cheaper than they are currently assessed in block limits. Jude's benchmarking found that a MARF read + deserialization takes ~70 microseconds per operation. Those operations are currently assessed at ~10ms in the block limits (a ~142x larger assessment). In theory, this means that the indexed read budget could be made substantially larger.

Practically speaking, this would need to be confirmed through further benchmarking of actual block validation performance, but it does mean that the current I/O bounds are likely very pessimistic and could be increased.
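As a rough back-of-envelope check using only the figures quoted above (a sketch, not a benchmark):

```rust
// Back-of-envelope check of the figures above: ~70 µs measured per MARF read
// + deserialization, versus a ~10 ms assessment in the block limits, and a
// budget of 15,000 reads per block.
fn main() {
    let measured_us = 70.0_f64;     // measured cost per read
    let assessed_us = 10_000.0_f64; // ~10 ms assessed per read
    println!("assessment / measurement ≈ {:.1}x", assessed_us / measured_us);
    // A block that spends its entire 15,000-read budget would cost only about
    // 15,000 * 70 µs ≈ 1.05 s of measured read time.
    println!("full read budget ≈ {:.2} s", 15_000.0 * measured_us / 1_000_000.0);
}
```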
-
@cylewitruk I believe this is the best place for you to consider benchmarking SQLite to help Nakamoto blocks include the right number of transactions, and then to offer ideas. @kantai do you agree?
-
This discussion focuses on making chainstate DB reads faster.
Background
A fork-indexed read happens whenever Clarity reads some persisted data from a fork in the blockchain. This happens each time the code reads a data var or map entry, each time it calls another contract, and each time it runs `(at-block ...)`. It also happens internally in a few places, such as in the `get-block-info?` and `get-burn-block-info?` built-ins, as well as with block validation. Fork-indexed reads are an essential operation supported by the chainstate database: they permit the system to efficiently track multiple forks while ensuring that each transaction's Clarity code can only query state from the fork in which it runs.

Looking at block execution budgets over the past year, the scarcest resource in the Stacks blockchain by far is the number of fork-aware reads permitted. This is tracked in the execution budget's `read_count` metric. Right now, only 15,000 such operations are allowed per block, and we regularly see over 50% utilization (often as high as 99%).

Internally, fork-indexed reads happen through a specially-designed index data structure called a Merklized Adaptive-Radix Forest, or MARF. The MARF, when combined with a sqlite database, implements a fork-aware key/value store. It is designed to service the following two queries efficiently, in O(1) time:
1. Given the hash of a block and a key, look up the hash of the mapped value in the fork that the given block hash represents. That is, the value's hash will only be found if it was set by a previous MARF write in an ancestor Stacks block.
2. Given the hash of a block and a key, get a Merkle proof that the mapped value exists in the fork that this block hash represents.
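A rough sketch of what these two queries amount to as an interface (hypothetical trait and type names, not the actual MARF API in stacks-blockchain):

```rust
// Hypothetical interface for the two fork-aware queries above; names and
// types are illustrative only.
type BlockHash = [u8; 32];
type ValueHash = [u8; 32];

trait ForkIndexedStore {
    /// Query 1: resolve `key` to the hash of its value as seen from the fork
    /// whose tip is `tip`; None if no ancestor block ever wrote the key.
    fn get_value_hash(&self, tip: &BlockHash, key: &str) -> Option<ValueHash>;

    /// Query 2: additionally produce a Merkle proof that the mapping exists
    /// in the fork identified by `tip`.
    fn get_with_proof(&self, tip: &BlockHash, key: &str) -> Option<(ValueHash, Vec<u8>)>;
}
```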
State of the MARF
As the name implies, a MARF is a collection of trees (specifically, hash array-mapped tries). The keys of the MARF are arbitrary strings; they are hashed to calculate a fixed-length path. The leaves of each trie are the hashes of values stored in the associated sqlite database. Once the caller has found a MARF leaf (the value's hash), it then uses it to query the database for the value itself. That is, querying data in the Stacks blockchain is a 3-step operation: (1) given the chain tip and key, hash the key to get a path, (2) walk the path to look up the value's hash in the MARF if it exists, and (3) if it does exist, query the value from the sqlite database. To service part (3) of the query, the sqlite database simply maintains a table with two columns -- the value hash, and the value.
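Here is a minimal, self-contained sketch of that three-step read. `TrieIndex` and `SideStore` are stand-ins for the real MARF and sqlite side store; only the shape of the operation is meant to match the real code:

```rust
// Minimal sketch of the three-step indexed read described above.
use sha2::{Digest, Sha512_256};
use std::collections::HashMap;

struct TrieIndex {
    /// MARF path -> value hash, as visible from a particular chain tip
    leaves: HashMap<[u8; 32], [u8; 32]>,
}

struct SideStore {
    /// value hash -> serialized Clarity value
    values: HashMap<[u8; 32], Vec<u8>>,
}

fn read_indexed_value(marf: &TrieIndex, side: &SideStore, key: &str) -> Option<Vec<u8>> {
    // (1) hash the key to a fixed-length MARF path (SHA512/256)
    let mut path = [0u8; 32];
    path.copy_from_slice(&Sha512_256::digest(key.as_bytes()));
    // (2) walk the MARF to the leaf, which holds the value's hash
    let value_hash = marf.leaves.get(&path)?;
    // (3) resolve the value hash to the serialized value in the side store
    side.values.get(value_hash).cloned()
}
```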
I have spent a lot of time getting the MARF to be very, very fast at step (2) while also being crash-consistent. It achieves this by storing all of its tries in an append-only flat file on disk, and by relying on an associated sqlite database to index it (note that this sqlite database is not the same as the sqlite database that stores the MARF-indexed values). That is, the bulk of I/O in any MARF query happens directly through the VFS; sqlite is not involved on the read path.
The sqlite database records the block hash, block ID, and file offset and length of each trie in the flat file. Because chain state is append-only, all tries and index data are immutable once written -- the MARF will cache block hash, block ID, and trie offsets in RAM indefinitely, and will optionally cache trie nodes in RAM once loaded (thereby keeping sqlite off of the read path when traversing multiple tries).
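A sketch of what that in-RAM metadata cache might look like (hypothetical types; the node's actual structures differ):

```rust
// Hypothetical in-RAM cache of immutable trie metadata, keyed by block ID.
// Because chain state is append-only, entries never need to be invalidated.
use std::collections::HashMap;

#[derive(Clone)]
struct TrieLocation {
    block_hash: [u8; 32],
    file_offset: u64,
    length: u64,
}

#[derive(Default)]
struct TrieIndexCache {
    by_block_id: HashMap<u32, TrieLocation>,
}

impl TrieIndexCache {
    /// Return the cached location, or load it from the sqlite index once and
    /// cache it forever (the underlying row is immutable).
    fn get_or_load(&mut self, block_id: u32, load: impl FnOnce(u32) -> TrieLocation) -> TrieLocation {
        self.by_block_id
            .entry(block_id)
            .or_insert_with(|| load(block_id))
            .clone()
    }
}
```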
Crash consistency is achieved by appending new tries, flushing the data and inode metadata to disk, and inserting a new record for the trie into the sqlite index: a trie only exists if the sqlite index says it exists, and sqlite's crash consistency ensures that each trie it represents was persisted to disk via the aforementioned disk flushes.
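The commit ordering described above might be sketched like this (illustrative only; the table and function names are assumptions, not the real storage code):

```rust
// Illustrative append-then-index commit order: a trie "exists" only once the
// sqlite index row is committed, and the fsync before that insert guarantees
// the trie bytes are durable first.
use std::fs::OpenOptions;
use std::io::{Seek, SeekFrom, Write};

fn commit_trie(
    flat_file: &str,
    conn: &rusqlite::Connection,
    block_hash: &str,
    trie_bytes: &[u8],
) -> rusqlite::Result<()> {
    // 1. append the new trie to the flat file, remembering its offset
    let mut f = OpenOptions::new().append(true).open(flat_file).expect("open trie file");
    let offset = f.seek(SeekFrom::End(0)).expect("seek to end");
    f.write_all(trie_bytes).expect("append trie");
    // 2. flush file data and inode metadata before touching the index
    f.sync_all().expect("fsync");
    // 3. only now record the trie in the sqlite index; a crash before this
    //    point merely leaves unreferenced bytes at the end of the flat file
    conn.execute(
        "INSERT INTO trie_index (block_hash, offset, length) VALUES (?1, ?2, ?3)",
        rusqlite::params![block_hash, offset as i64, trie_bytes.len() as i64],
    )?;
    Ok(())
}
```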
A concern that has cropped up multiple times over the past year is that the MARF is not "fast enough" to handle things like subnets or to achieve speed goals in the Nakamoto release. I hope now to show that this is a myth.
MARF path lookups are fast
I have created the branch `perf/marf-dump-and-bench` to help explore the runtime performance of the MARF index. In this branch, `stacks-inspect` gains two new commands:

- `marf-dump`: Given a block hash, a path to a MARF, and a positive integer N, dump the first N paths and leaves in lexicographical order by path. If N is greater than the number of leaves in the MARF at the given block hash, then all paths and leaves are dumped.
- `marf-get-bench`: Given a block hash, a path to a MARF, and a list of paths, load each leaf and report the average number of milliseconds taken to read a leaf.

Using these new commands, I dumped all of the leaves of the Stacks chainstate MARF (i.e. the one Clarity uses) as of the current chain tip, and ran `marf-get-bench` on them. Here's what I found:

The host machine on which this test ran stores the MARF on an SSD, and has an Intel Haswell CPU with 8 GB RAM. The host machine also runs a Stacks node, so the MARF file blocks are more likely than not to be cached in the block layer already.
As you can see, even with a modest VM, it takes 25 microseconds on average to resolve a MARF path to its leaf. Keep in mind that this does not include the time taken to hash the key to its path, nor to look up the value from the leaf's value hash. This is just a measure of how long it takes to walk the MARF path to the value hash. But, we can do this step about 40,000 times per second!
So, clearly, MARF path resolution is not the limiting factor for a higher indexed read budget.
Profiling
For the record, where does `marf-get-bench` spend its time? Here's an annotated flame graph:

As you can see, the node spends most of its time loading trie nodes from disk, as well as a modest amount of time in the MARF walking from node to node. Also, note that in this test, all of the block hashes, block IDs, and trie offsets were fetched and cached eagerly (this is a different but compatible behavior from `master`, and one I'd like to merge).

You can reproduce the flame graph below:
Getting More Indexed Reads
How do we make the Stacks blockchain support more indexed reads? Recall there are two other steps in an indexed read operation:
1. Hashing the key to get a MARF path
2. Looking up the value from the hash returned by the MARF
The former is just the act of taking the SHA512/256 of the key, and we already use a hand-optimized x86_64 assembler implementation for this.
The latter could use some substantial improvement.
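For reference, step (3) today boils down to a single-row sqlite lookup along these lines (an illustrative rusqlite sketch; the table and column names are assumptions, not necessarily the real schema):

```rust
// Illustrative sketch of the side-store lookup (step 3): map a value hash to
// the serialized Clarity value. Table/column names are assumed.
use rusqlite::{Connection, OptionalExtension};

fn lookup_value(conn: &Connection, value_hash_hex: &str) -> rusqlite::Result<Option<String>> {
    conn.query_row(
        "SELECT value FROM data_table WHERE key = ?1",
        [value_hash_hex],
        |row| row.get(0),
    )
    .optional()
}
```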