Redesign disk storage and checkpoint/scavenger processes #17
Draft Design

Disk Storage
I'll replace the local log files and live-hunk location files with a single metadata DB. I'll use an embedded DB such as HanoiDB or LevelDB for it, which will help me unload a portion of the metadata from RAM. I'll continue to use long-term log files to store the value part of key-values. I believe these embedded DBs are not good at handling large binary values, so I want to keep that part as is.
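For illustration, here is a rough sketch of what a metadata DB entry might look like; the record and field names are assumptions for this example, not the actual Hibari definitions:

```erlang
%% A minimal sketch (not the actual Hibari record definitions) of a metadata
%% DB entry: the embedded DB maps each key to its metadata plus the location
%% of the value blob in a long-term log file.
-module(metadata_sketch).
-export([encode_entry/1, decode_entry/1]).

-record(meta, {key            :: binary(),
               timestamp      :: non_neg_integer(),
               exp_time       :: non_neg_integer(),
               props          :: [{atom(), term()}],
               %% value blob lives in an hlog file: {SeqNum, Offset, Size}
               value_location :: {non_neg_integer(), non_neg_integer(),
                                  non_neg_integer()}}).

%% Serialize a metadata record to a binary suitable for an embedded
%% key-value store such as LevelDB or HanoiDB.
encode_entry(#meta{} = Meta) ->
    term_to_binary(Meta).

%% Deserialize it back when serving a read.
decode_entry(Bin) when is_binary(Bin) ->
    binary_to_term(Bin).
```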
Maintenance Processes

The scavenger will be merged into the write-back process, and the checkpoint process will no longer exist. Both write-back and scavenger (aka compactor) will be executed sequentially by a single process to avoid race conditions between them (like the one causing hibari#33).
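To make the sequential execution concrete, here is a minimal sketch, using made-up module and function names rather than Hibari's actual ones, of one process running the two tasks strictly one after the other:

```erlang
%% Sketch only: a single maintenance process that always runs write-back
%% before compaction, so the two tasks cannot race with each other.
-module(maintenance_loop_sketch).
-export([start/1]).

%% IntervalMs: how long to sleep between maintenance rounds.
start(IntervalMs) ->
    spawn_link(fun() -> loop(IntervalMs) end).

loop(IntervalMs) ->
    ok = write_back_wal_to_stores(),   %% step 1: flush WAL to metadata DB + hlog
    ok = maybe_compact_hlog_files(),   %% step 2: compaction, only after step 1
    timer:sleep(IntervalMs),
    loop(IntervalMs).

%% Placeholders standing in for the real write-back and compaction work.
write_back_wal_to_stores() -> ok.
maybe_compact_hlog_files() -> ok.
```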
Note that the compactor will no longer update the (in-memory) ETS table for the metadata after it moves some live hunks to a long-term log file. Instead, it will only update the (on-disk) metadata DB. This will reduce the performance impact that the current compactor has. When a get request fails to locate the value because the value has been moved by the compactor and the ETS table has stale location info, the brick server will read the updated location from the metadata DB, refresh the ETS entry, and finally read the value and return it to the client.
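The read fallback could look roughly like this sketch (module and helper names are illustrative, not the actual brick server code):

```erlang
%% Sketch of the read fallback: if the ETS location is stale because the
%% compactor moved the value, consult the metadata DB, refresh the ETS
%% entry, and retry the read.
-module(read_path_sketch).
-export([get_value/2]).

get_value(EtsTab, Key) ->
    case ets:lookup(EtsTab, Key) of
        [{Key, Location}] ->
            case read_blob(Location) of
                {ok, Value} ->
                    {ok, Value};
                {error, not_found} ->
                    %% Stale location: the compactor moved the hunk and only
                    %% updated the on-disk metadata DB.
                    {ok, FreshLocation} = metadata_db_lookup(Key),
                    true = ets:insert(EtsTab, {Key, FreshLocation}),
                    read_blob(FreshLocation)
            end;
        [] ->
            {error, key_not_exist}
    end.

%% Placeholders for reading a value blob from an hlog file and for looking
%% up the authoritative location in the metadata DB.
read_blob({_SeqNum, _Offset, _Size}) -> {error, not_found}.
metadata_db_lookup(_Key) -> {ok, {1, 0, 0}}.
```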
As for the upcoming release (v0.3.0), only issue #1 above will be addressed. However, a major rework of the storage is being done for v0.3.0, and this will help future releases address the other issues above. Also, v0.3.0 will no longer have the checkpoint operation, and the scavenger steps are reorganized for efficiency.

Disk Storage for v0.3.0
*1: Paper: Don’t Settle For Eventual: Scalable Causal Consistency For Wide-Area Storage With COPS
Status as of January 13th, 2014: I have almost finished the metadata DB part. Once that is finished, I will work on the brick-private value blob store. Diff between dev HEAD and the topic branch HEAD: 6eec707...5901f65
Started to work on the following items:
Added various modules in commit b5fba54a03 to implement the above items.
I replaced LevelDB with HyperLevelDB, which is a fork and drop-in replacement of LevelDB with two key improvements: improved parallelism between concurrent writers, and improved compaction.
Hibari will not get much benefit from the first point because it uses a single writer (the WAL write-back process) per brick metadata DB, but it will get some benefit from the second point. I loaded 1 million key-values into a Hibari table with 12 bricks and chain length 1, and so far, so good.
I open-sourced the Erlang binding to HyperLevelDB. I'll update the brick server's code to utilize it.
Done.
Finished implementing the new hlog format with the following key enhancements:
In January, I implemented a gen_server for writing the WAL from scratch. The next step will be to implement the write-back process from the WAL to the metadata DB (HyperLevelDB) and the value blob hlog file. After that, I will update the brick_ets server to utilize the new hlog format for writing and reading. Then finally, I will implement the scavenger (aka compaction) process from scratch.
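As a rough illustration of the planned write-back step (the shapes of the hunks and the helper functions below are assumptions, not the actual gdss_brick code):

```erlang
%% Sketch of a write-back pass: drain hunks from the WAL, append value blobs
%% to the brick-private hlog file, and put metadata entries into the
%% metadata DB.
-module(write_back_sketch).
-export([write_back/3]).

%% WalHunks  : [{Key, Metadata, ValueBlob}] drained from the WAL
%% HlogFd    : file descriptor of the current value-blob hlog file
%% MetaDbPut : fun(KeyBin, ValBin) -> ok, wrapping the metadata DB put
write_back(WalHunks, HlogFd, MetaDbPut) ->
    lists:foreach(
      fun({Key, Metadata, ValueBlob}) ->
              %% Remember where this blob lands in the hlog file.
              {ok, Offset} = file:position(HlogFd, cur),
              ok = file:write(HlogFd, ValueBlob),
              Location = {Offset, byte_size(ValueBlob)},
              %% Store metadata plus the blob location in the metadata DB.
              ok = MetaDbPut(Key, term_to_binary({Metadata, Location}))
      end, WalHunks),
    ok.
```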
I started to work on the above. I actually started from the latter one, and Hibari can now bootstrap from the new hlog modules (but it's not very useful without the write-back process and scavenger). Also, before I started to commit these work-in-progress changes, I created a git annotated tag.
Merged recent changes on the dev branch (post v0.1.11) into the gbrick-gh17-redesign-disk-storage branch.
After a long pause (Oct 2014 -- May 2015), I resumed working on this topic (to implement the write-back process). I made a couple of commits on the topic branch of gdss_brick, and so far I have confirmed that all WAL hunks written to a WAL file can be parsed back to hunk records. I'm trying to complete this task by the end of this month (May 2015).
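A minimal round-trip check of the kind described here could look like the following sketch; the hunk encoding is deliberately simplified and does not match the real hlog hunk layout:

```erlang
%% Sketch: encode hunks, concatenate them as a WAL file would, and verify
%% they parse back unchanged.
-module(wal_roundtrip_sketch).
-export([encode_hunk/1, parse_hunks/1, roundtrip_ok/1]).

%% Encode one hunk as <<Size:32, Blob/binary>> (simplified framing).
encode_hunk(Blob) when is_binary(Blob) ->
    <<(byte_size(Blob)):32, Blob/binary>>.

%% Parse a binary containing zero or more encoded hunks back into blobs.
parse_hunks(<<Size:32, Blob:Size/binary, Rest/binary>>) ->
    [Blob | parse_hunks(Rest)];
parse_hunks(<<>>) ->
    [].

%% True if every hunk written can be parsed back unchanged.
roundtrip_ok(Blobs) ->
    Encoded = << <<(encode_hunk(B))/binary>> || B <- Blobs >>,
    parse_hunks(Encoded) =:= Blobs.
```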
OK. I'm basically done with the above two TODO items. Now a key-value's metadata (key, timestamp, user-provided property list, and expiration time) is stored in the brick-private metadata DB (HyperLevelDB), and the value is stored in a brick-private hlog file.
Now the last big part will be re-implementing the scavenger (aka compaction process) from scratch. I hope I can finish it in two weeks.
I spent the last few days on the following:
The new compaction process should be much more efficient than the current scavenger implementation in the v0.1 series. Here is the current design:
Also, I'm planning to store small values in the metadata DB (HyperLevelDB) rather than in the blob hlog files. HyperLevelDB has an efficient compaction implementation in C++, so I hope this design change will improve the overall compaction efficiency too.
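A sketch of the size-threshold decision; the threshold value and the helper names are assumptions, not settled design:

```erlang
%% Sketch: small values go inline into the metadata DB entry, larger ones
%% into a value-blob hlog file with only their location kept in the
%% metadata DB.
-module(small_value_sketch).
-export([store_value/2]).

-define(SMALL_VALUE_THRESHOLD, 512).   %% bytes; illustrative only

store_value(Key, Value) when is_binary(Value) ->
    case byte_size(Value) =< ?SMALL_VALUE_THRESHOLD of
        true ->
            %% Store the value inline in the metadata DB entry.
            {metadata_db, Key, {inline, Value}};
        false ->
            %% Store the value in an hlog file and keep only its location.
            Location = append_to_hlog(Value),
            {metadata_db, Key, {hlog, Location}}
    end.

%% Placeholder for appending a blob to the current hlog file.
append_to_hlog(_Value) -> {seqnum_1, offset_0}.
```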
As for the compaction process, I have implemented the following main functions:

```erlang
-spec estimate_live_hunk_ratio(brickname(), seqnum()) -> {ok, float()} | {error, term()}.
-spec compact_hlog_file(brickname(), seqnum()) -> ok | {error, term()}.
```

The former estimates the live/dead blob hunk ratio per hlog file by comparing randomly sampled keys against the metadata DB. The latter runs compaction on an hlog file to reclaim disk space and updates the storage locations of live hunks. The next step will be to implement a periodic task to estimate the live hunk ratio, score hlog files, and pick an hlog file to run a compaction on.
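Such a periodic task might score and pick files roughly along these lines; the scoring formula (estimated dead bytes = (1 - live ratio) * file size) is my illustration, not the actual implementation:

```erlang
%% Sketch: score hlog files by estimated reclaimable dead space and pick
%% the best compaction candidate.
-module(compaction_picker_sketch).
-export([pick_victim/2]).

%% EstimateFun : fun(SeqNum) -> {ok, LiveRatio} | {error, term()}, e.g. a
%%               closure over estimate_live_hunk_ratio/2 for one brick.
%% Files       : [{SeqNum, FileSizeBytes}] for that brick.
pick_victim(EstimateFun, Files) ->
    Scored =
        lists:foldl(
          fun({SeqNum, FileSize}, Acc) ->
                  case EstimateFun(SeqNum) of
                      {ok, LiveRatio} ->
                          %% Score = estimated dead bytes in the file.
                          [{(1.0 - LiveRatio) * FileSize, SeqNum} | Acc];
                      {error, _Reason} ->
                          Acc
                  end
          end, [], Files),
    case lists:reverse(lists:sort(Scored)) of
        [{_BestScore, SeqNum} | _] -> {ok, SeqNum};
        []                         -> none
    end.
```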
As the first step, I created a temporary function to estimate live hunk ratios of all value blob hlog files on a node (commit: 53a697e#diff-8bd48a5a77dd1a285b7729b3766317e0R110). Here is a sample run on a single-node Hibari with the perf1 table (chain length = 3, number of chains = 8):

```erlang
(hibari@127.0.0.1)43> F = fun() -> lists:foreach(
(hibari@127.0.0.1)43>       fun({B, S, unknown}) ->
(hibari@127.0.0.1)43>               io:format("~s (~w): unknown~n", [B, S]);
(hibari@127.0.0.1)43>          ({B, S, R}) ->
(hibari@127.0.0.1)43>               io:format("~s (~w): ~.2f%~n", [B, S, R * 100])
(hibari@127.0.0.1)43>       end, brick_blob_store_hlog_compaction:list_hlog_files())
(hibari@127.0.0.1)43>     end.
#Fun<erl_eval.20.90072148>
(hibari@127.0.0.1)44> F().
bootstrap_copy1 (1): 21.43%
perf1_ch1_b1 (1): 4.91%
perf1_ch1_b1 (2): 26.89%
perf1_ch1_b1 (3): 65.67%
perf1_ch1_b2 (1): 3.35%
...
```

Here are the numbers for perf1 chain 1, ordered by chain, brick, and hlog sequence number:

```
perf1_ch1_b1 (1): 4.91%
perf1_ch1_b2 (1): 3.35%
perf1_ch1_b3 (1): 5.84%
perf1_ch1_b1 (2): 26.89%
perf1_ch1_b2 (2): 25.12%
perf1_ch1_b3 (2): 25.17%
perf1_ch1_b1 (3): 65.82%
perf1_ch1_b2 (3): 82.28%
perf1_ch1_b3 (3): 72.60%
```
All bricks (b1, b2, b3) in a chain should have exactly the same contents for each hlog file with the same sequence number (1, 2, or 3). However, the estimated live hunk ratios differ (for example, 4.91%, 3.35%, 5.84%, or 65.82%, 82.28%, 72.60%). This is because the estimation is done on randomly sampled keys; it currently uses about 5% of the keys in an hlog file. I think the current setting still provides estimated ratios with good enough precision.
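The estimation logic is roughly the following; a simplified sketch that shows only the sampling and ratio math, with a placeholder callback standing in for the metadata DB check:

```erlang
%% Sketch: draw a random ~5% sample of the keys recorded in an hlog file and
%% check each one against the metadata DB; the fraction still pointing at
%% this file approximates the live hunk ratio.
-module(live_ratio_sketch).
-export([estimate_live_ratio/3]).

%% Keys       : all keys whose blobs live in the hlog file under inspection
%% IsLiveFun  : fun(Key) -> boolean(), true if the metadata DB still points
%%              at this hlog file for Key
%% SampleRate : e.g. 0.05 for a 5% sample
estimate_live_ratio(Keys, IsLiveFun, SampleRate) ->
    Sample = [K || K <- Keys, rand:uniform() =< SampleRate],
    case length(Sample) of
        0 ->
            unknown;
        N ->
            Live = length([K || K <- Sample, IsLiveFun(K)]),
            Live / N
    end.
```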
I have implemented a very basic version of the above (the last commit was de7c990). There are lots of places to improve, but the new disk storage format now has a complete set of functions. I'll shift my focus to other tasks but will continue improving this feature too. I'm planning to release Hibari v0.3 with this feature sometime this fall (2015).
I found and fixed a bug in the write-back process that would miss some log hunks in the WAL when group commit is enabled (commit: 32428e4).
It seems HyperLevelDB is not actively developed anymore; the last commit was made in September 2014. I'm thinking of switching to RocksDB, another fork of LevelDB that is very actively developed and has a large user base.
I found and fixed a couple of bugs related to node restart (91b62e0...9f9efe9). I also removed the obsolete gmt_hlog* and scavenger modules from the v0.3 stream (44fd692), and updated the app.src of the gdss_brick application (e0e71ea).
Redesign and re-implement disk storage and maintenance processes to address the issues Hibari is having right now (v0.3RC1).
Issues
Disk Storage
Maintenance Processes