Persistence Replacement -- squashed + rebased #1500

Closed · wants to merge 8 commits

@oten91 (Contributor) commented Nov 14, 2022

tl;dr Better, faster, smaller, simpler persistence layer

closes #1442

Pocket Network Persistence Replacement

Background:

The IAVL store (Cosmos’ app-level storage tree) is the legacy implementation that serves as both the merkle root integrity structure
and the general-purpose database structure for reads. Though the structure is seemingly efficient at generating merkle roots for the
ABCI AppHash, Pocket Network observes a considerable downgrade in performance for historical queries, block processing, and overall
disk size management. The idea behind Pocket Network’s Persistence Replacement is to preserve the efficient state commitment root
generation while providing an alternative implementation for DB reads. The approach is to prune N-X IAVL versions so that merkle
tree generation remains possible, while maintaining a completely separate storage layer for efficient storage, historical queries, and block processing.
The idea and design are very similar to Cosmos’ ADR-40.
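
To make the intended split concrete, here is a minimal Go sketch of the dual-layer commit flow, assuming a generic state-commitment tree (IAVL in V0) and a flat query store. The interface and field names (StateCommitment, QueryStore, pruneDepth) are illustrative assumptions for this sketch, not the PR's actual types; the point is only that writes fan out to both layers and old tree versions are pruned after each commit.

// Hypothetical sketch of the dual-store commit flow described above.
// Names and interfaces are illustrative, not the PR's actual API.
package persistence

// StateCommitment produces the merkle root for the ABCI AppHash (IAVL in V0).
type StateCommitment interface {
	Set(key, value []byte)
	Delete(key []byte)
	SaveVersion() (rootHash []byte, version int64, err error)
	DeleteVersion(version int64) error
}

// QueryStore is the flat, height-keyed store that serves all reads.
type QueryStore interface {
	SetAt(height int64, key, value []byte)
	DeleteAt(height int64, key []byte)
	Commit(height int64) error
}

// Store writes every mutation to both layers and prunes IAVL versions
// older than pruneDepth, so the tree only exists to produce the AppHash.
type Store struct {
	sc         StateCommitment
	qs         QueryStore
	height     int64
	pruneDepth int64 // keep the last N IAVL versions (the "N-X" pruning above)
}

func (s *Store) Set(key, value []byte) {
	s.sc.Set(key, value)
	s.qs.SetAt(s.height+1, key, value)
}

func (s *Store) Delete(key []byte) {
	s.sc.Delete(key)
	s.qs.DeleteAt(s.height+1, key)
}

// Commit produces the AppHash from IAVL, commits the query store,
// then drops the IAVL version that fell out of the pruning window.
func (s *Store) Commit() ([]byte, error) {
	root, version, err := s.sc.SaveVersion()
	if err != nil {
		return nil, err
	}
	if err := s.qs.Commit(version); err != nil {
		return nil, err
	}
	if old := version - s.pruneDepth; old > 0 {
		_ = s.sc.DeleteVersion(old) // best-effort prune of stale tree versions
	}
	s.height = version
	return root, nil
}

Under this split, only the last pruneDepth tree versions are retained for AppHash generation; every read path goes through the query store instead.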

Required Goals:

  • Simpler Codebase
  • Non-Consensus Breaking
  • Comparable syncing performance
  • Comparable disk usage

Additional Goals:

  • Faster historical queries
  • Faster current queries
  • Smaller disk usage
  • Better syncing performance
  • Slower disk usage growth rate

Approaches Considered / Tested / Combined for the query database:

  1. Cache with warmup (load last N heights into memory).
  • A fully in-memory, goleveldb-compatible cache was created just for this test. The real challenge of creating an in-memory goleveldb
    is that it must match goleveldb’s exact behavior when deleting items while iterating. This behavior is often undefined / undocumented, so
    matching it was a matter of trial-and-error testing. A weakness of the cache design is that the only way to guarantee the exact
    behavior is to do a seek during each Next() call, which defeats the O(log n) performance of the BTree iterator implementation. The idea
    with the cache is to write at the current height and read from the last N heights; during Commit() you persist the height to disk and get a
    little speed-up from leveldb’s batch-write functionality. In the end, little to no performance gain was seen with this cacheDB, and it was
    omitted due to long ‘warmup’ times and code complexity.
  • NOTE: goleveldb comes with a memdb, but it is append-only - don’t get your hopes up
  • NOTE: Tendermint’s tm-db comes with a memdb, but it is not iterator-compatible with goleveldb - don’t get your hopes up
  2. Postgres implementation
  • The idea was to implement a V1-standard PostgresDB in order to take advantage of PostgreSQL’s considerable scalability and lookup capabilities.
    Though the idea is logically sound, the issue is the way the Pocket Core V0 ABCI app is designed. Pocket Core inherits the Cosmos SDK architecture,
    which uses Protobuf (previously Amino) encoding and serves arbitrary bytes to the storage layer. This architecture leaves a big challenge for
    the Postgres implementation: a custom key interpreter / ORM needs to be designed in order to assign the key/values to the
    proper tables. With careful design, this overhead may be minimized from a performance standpoint, but it will never be zero. The other major
    challenge with a Postgres implementation is that Node Runners’ current setups would need to be migrated to support a database engine as
    opposed to a filesystem db. In early prototyping, block processing performance was heavily degraded for the Postgres implementation; however, the
    datadir size shrank considerably. The design was ultimately discarded due to the above challenges and the disappointing performance.
  3. Straight K/V (AKA Waves)
  • A straightforward, naive approach: just store K/V pairs in a goleveldb for each height without any deduplication. Performance was by far the best with
    this branch, with block production times and query performance improving by a minimum of 3x over the RC-0.6.3 implementation. The issue with this branch
    is that, without any deduplication logic, the disk space requirement increased dramatically (almost 2x). Unfortunately, this rendered ‘waves’
    an unacceptable solution.
  4. Deduplicated Waves Using Index Store
  • Spec

    Two Datastores:

    • Link Store (Indexing) - index keys recorded at every possible height that ‘link’ to the data blob

    • Data Store (Data Bytes) - hash keys that contain the actual data value

    • Link Store:

      • KEY: height uint32 + prefixByte byte + dataKey
      • VAL: 16 Byte Hash Of LinkKEY
    • Data Store

      • KEY: 16 Byte Hash Of LinkKEY
      • VAL: Data Blob
  • Results:
    Comparable datastore size, comparable block processing performance, faster historical queries, and far lower code complexity.
    This is likely an ‘acceptable’ solution, but not an optimal one, and somewhat underwhelming given the scale of the overhaul.

  5. Deduplication (2) Using 'Heights' Index
  • Spec

    Two Datastores:

    • Link Store (Indexing) - index keys recorded at every possible height that ‘link’ to the data blob

    • Data Store (Data Bytes) - hash keys that contain the actual data value

    • Link Store:

      • KEY: prefixByte byte + dataKey
      • VAL: Array of heights Ex. (0, 1, 2, 100)
    • Data Store

      • KEY: 16 Byte Hash Of LinkKEY
      • VAL: Data Blob
  • Details

GET:
// first lookup
<prefix> + <dataKey> -> array of heights (choose latest compared to query)
// second lookup
<prefix> + <dataKey> + <latestHeight> -> data blob

SET/DEL
// first write
<prefix> + <dataKey> -> APPEND TO (array of heights 1,2,3,101)
// second write
<prefix> + <dataKey> + <newHeight> -> data blob (or DEL for deleted)

ITER (Issue: iterating over orphans)
// step 1 iterate over first write ^
<prefix> + <dataKey>
// step 2 upon first lookup (Next() call)
get latest height relative to query
// step 3: ensure the value is not nil (deleted)
if nil -> skip and call Next() again
  • Results
    Approximately 10x slower than the current implementation, with 40% disk savings. Deemed unacceptable due to performance.
  6. Dedup (3) AKA Slim ✅
  • Specification
    No index; use a reverse iterator 'seek' to get the ELEN-encoded, lexicographically sorted
    heightKeys from the datastore. Load the last N heights into memory and do nothing upon store Commit() besides incrementing the height.
    Use specially optimized functions, DeleteHeight() and PrepareHeight(), which take advantage of Golang's
    built-in delete(map, key) and copy(dst, src) functions to advance the in-memory store. (A hedged Go sketch of the lookup path follows this list.)

  • Details

GET:
// get the latest version by taking advantage of goleveldb.Seek() and use a reverse iterator to find the next appropriate lexicographically sorted height key
REVERSE ITERATOR -> <prefix> + <dataKey> + <latestHeight> -> data blob

SET/DEL (2 sets per write)
// exists store (cached) <the main reason we do this is to have a space to iterate over>
<exists>+<prefix>+<dataKey> -> (true if set, false if del)
// ‘main’ set
<prefix> + <dataKey> + <newHeight> -> data blob (or DEL for deleted)

ITER (Issue: iterating over orphans)
// step 1 iterate over exists store
<prefix> + <dataKey>
// step 2 do the GET() logic to find the latest version key
latestKey = <prefix> + <dataKey> + <latestHeightRespectiveToQuery> -> data blob
// step 3: ensure the value is not nil (deleted)
if DEL or future height -> skip
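
Since Slim is the approach that was ultimately selected, the following is a hedged Go sketch of the GET path described above: the height is appended to the key (big-endian below as a stand-in for the ELEN encoding) so that a goleveldb reverse-iterator seek lands on the newest version at or below the query height. The key layout, helper names, and the empty-value tombstone convention are illustrative assumptions, not the PR's actual code.

// Hedged sketch of the "Slim" GET path over goleveldb.
// Key layout: <prefix> + <dataKey> + <height>, with height encoded so that
// lexicographic order matches numeric order (big-endian here, ELEN in the spec).
package slim

import (
	"bytes"
	"encoding/binary"

	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/util"
)

var tombstone = []byte{} // assumption: an empty value marks a DEL at that height

func versionKey(prefix byte, dataKey []byte, height uint64) []byte {
	k := append([]byte{prefix}, dataKey...)
	var h [8]byte
	binary.BigEndian.PutUint64(h[:], height)
	return append(k, h[:]...)
}

// Get returns the value of dataKey as of queryHeight, or nil if it was
// deleted or never written at or below that height.
func Get(db *leveldb.DB, prefix byte, dataKey []byte, queryHeight uint64) ([]byte, error) {
	base := append([]byte{prefix}, dataKey...)
	it := db.NewIterator(util.BytesPrefix(base), nil)
	defer it.Release()

	// Seek to the first key strictly above the query height, then step back:
	// the previous key (still under our base prefix) is the latest version.
	if !it.Seek(versionKey(prefix, dataKey, queryHeight+1)) {
		if !it.Last() { // nothing stored under this prefix at all
			return nil, it.Error()
		}
	} else if !it.Prev() {
		return nil, it.Error() // seek landed on the very first key; nothing older exists
	}
	if !bytes.HasPrefix(it.Key(), base) { // defensive: stay within this dataKey
		return nil, it.Error()
	}
	if bytes.Equal(it.Value(), tombstone) {
		return nil, nil // deleted at or before queryHeight
	}
	return append([]byte(nil), it.Value()...), it.Error()
}

A SET at height h would simply write versionKey(prefix, dataKey, h) alongside the exists-store marker described above, and iteration would walk the exists store and reuse this Get logic per key, skipping tombstones and future heights.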

Result Data:

Block Processing (Sampled 15 heights starting at height 44376)

Before:

    Total: 154s

    Avg: 10s

After:

    Total: 135s

    Avg: 9s

Historical Querying (20 historical queries while syncing: query nodes 4000<0-9> --nodeLimit=20000)

Before:

    Total: 60.66s

    Avg: 3.03s

After:

    Total: 39.37s

    Avg: 1.97s

Disk Usage (Height 44335)

Before:

    du -h data/

    12G data/txindexer.db

    63G data/application.db

    …

    TOTAL: 131G data/

After:

    du -h data/

    6.6G data/txindexer.db

    13G data/application.db

    …

    TOTAL: 77G data/

41% Total reduction and 79% reduction in applicationDB

It is important to note that since the sample was taken at height 44335 and not the current height of 65K+,
we’ll likely see even more reduction going forward - the applicationDB is the fastest-growing database in the current iteration.

Conclusion:

  • Simpler Codebase
  • Non-Consensus Breaking
  • Comparable syncing performance
  • Comparable disk usage
  • Faster historical queries
  • Faster current queries
  • Smaller disk usage
  • Better syncing performance
  • Slower disk usage growth rate

Release:

  • Only one option: full archival (but storage requirements significantly reduced)
  • Requires a snapshot for others to download, as syncing is still very time-consuming
  • Non-consensus-breaking change; uses leveldb and requires a config.json change

Learnings:

  • Retrofitting the behaviors of IAVL is the biggest challenge in this project
  • Working with 200GB datadirs for testing is difficult
  • The tradeoffs between all of the goals continuously conflict with each other
  • Developing against mainnet is the only real way to debug the current store, and the appHash error is no help
  • When you develop a persistence module for a blockchain, you should invest the most engineering effort in getting the storage layer absolutely optimized before launching

@oten91 mentioned this pull request Nov 14, 2022
@oten91 changed the title from "Persistence Replacemente -- squashed + rebased" to "Persistence Replacement -- squashed + rebased" Nov 14, 2022
@oten91 force-pushed the slim-squash branch 2 times, most recently from 6456fa6 to a40845d on November 25, 2022 01:01
jorgecuesta and others added 8 commits December 1, 2022 08:53
Fix high memory consumption that is also part of issue #1457.
Under a high load of requests (1000 rps or more) the RAM usage went crazy and scaled up to 40GB or close to that.
Now, after the fix of #1457 with the worker pool, the node remains under 14GB of RAM in my local tests.
rearrange test ordering
TestPrefixEndBytes fix
removed benchmarking print
made persisted database generic in order to allow mem.DB when testing
slim code complete
completely removed memdb from codebase
rolled back memdb cache with commenting
codec fix for 30024 bug
working memdb using btree
really slow memdb with custom implementation
stable with growing memory due to golevelmemdb
added previous validators cache for session generation
added signing info cache
added codec cache
updated proto for slim + optimized indexer
working prototype with cache layer
working prototype with no caching
dedup implementation one, rollback point
foundation laying for persistence replacement
upstreamed waves cache-multi
file consolidation
removed proof / query functionality from iavl
cleaned up store packages, removed height cache
removed transient
removed deprecated pruning, list, queue
removed prefix store from store package
removed trace
removed gas
sfs fix
adding -v flag to test in ci
removing println and fixing tests
update circle ci / go mod tidy
Update Version
@oten91 (Contributor, Author) commented Dec 4, 2022

merged to integration branch thorny-devil

@oten91 marked this pull request as ready for review December 5, 2022 21:03
@Olshansk (Member) commented Jul 5, 2023

A decision was made not to merge this body of work due to the risks / tradeoffs involved, as well as due to LeanPOKT solving a similar problem.

@Olshansk closed this Jul 5, 2023