Redesign disk storage and checkpoint/scavenger processes #17

Open
tatsuya6502 opened this issue Aug 1, 2013 · 22 comments

Redesign and re-implement the disk storage and maintenance processes to address the issues Hibari currently has (v0.3RC1).

Issues

  • Rename operation broke the scavenger process (hibari#33)
  • The scavenger process has a major performance impact on read/write operations
  • Metadata for all key-values has to be kept in memory

Disk Storage

  • common-log files (write-ahead-log)
  • local log files (brick private metadata store)
  • long-term log files (value store)
  • live-hunk location files (scavenger work files)

Maintenance Processes

  • common-log write-back process
  • checkpoint process
  • scavenger
@ghost ghost assigned tatsuya6502 Aug 1, 2013
@tatsuya6502

Draft Design

Disk Storage

  • common-log files (write-ahead-log)
  • metadata DB (HanoiDB or LevelDB)
  • long-term log files (value store)

I'll replace the local log files and live-hunk location files with a single metadata DB. I'll use an embedded DB such as HanoiDB or LevelDB for it, which will help me unload some portion of the metadata from RAM. I'll continue to use long-term log files to store the value part of key-values; I believe these embedded DBs are not good at handling large binary values, so I want to keep the value store as is.

Maintenance Processes

  • common-log write-back and continuous compactor process

The scavenger will be merged into the write-back process, and the checkpoint process will no longer exist. Both the write-back and the scavenger (aka compactor) will be executed sequentially by a single process to avoid race conditions between them (like the one causing hibari#33).
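To make the intended control flow concrete, here is a minimal sketch of a single maintenance process that runs the write-back and compaction steps one after another. The module and function names are placeholders, not the actual gdss_brick API:

%% Sketch only: one maintenance process runs write-back and compaction
%% sequentially, so the two steps can never race with each other.
%% All names below are placeholders, not real gdss_brick modules.
-module(maintenance_sketch).
-export([start/0]).

start() ->
    spawn_link(fun() -> loop(#{}) end).

loop(State0) ->
    State1 = do_write_back(State0),   %% 1. flush WAL hunks to long-term storage
    State2 = maybe_compact(State1),   %% 2. then (maybe) run one compaction pass
    timer:sleep(1000),                %% placeholder pacing
    loop(State2).

do_write_back(State) -> State.        %% placeholder
maybe_compact(State) -> State.        %% placeholder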

@tatsuya6502

Note that the compactor will no longer update the (in-memory) ETS table for the metadata after it moves some live-hunks to a long-term log file. Instead, it will only update the (on-disk) metadata DB. This will reduce the performance impact that the current compactor has.

When a get request fails to locate the value because the value has been moved by the compactor and the ETS table still holds the stale location info, the brick server will read the updated location from the metadata DB, refresh the ETS entry, and finally read the value and return it to the client.
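A minimal sketch of that read path (the accessor functions here are placeholders, not the real brick_server/brick_ets code):

-module(read_path_sketch).
-export([get_value/3]).

%% Placeholder accessors; the real ones live in the brick / hlog modules.
read_hunk({1, _Offset, _Size}) -> {error, stale_location};  %% pretend seq 1 was compacted away
read_hunk({_Seq, _Offset, _Size}) -> {ok, <<"value">>}.

metadata_db_lookup(_MetadataDB, _Key) -> {ok, {2, 4096, 128}}.

get_value(Key, Ets, MetadataDB) ->
    [{Key, Location0}] = ets:lookup(Ets, Key),
    case read_hunk(Location0) of
        {ok, Value} ->
            {ok, Value};
        {error, stale_location} ->
            %% The compactor moved the hunk, so the ETS entry is stale:
            %% read the new location from the metadata DB, refresh ETS,
            %% then read the value and return it.
            {ok, Location1} = metadata_db_lookup(MetadataDB, Key),
            true = ets:insert(Ets, {Key, Location1}),
            read_hunk(Location1)
    end.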

@tatsuya6502

Issues

  1. Rename operation broke the scavenger process (hibari#33)
  2. The scavenger process has a major performance impact on read/write operations
  3. Metadata for all key-values has to be kept in memory

As for the upcoming release (v0.3.0), only issue 1 above will be addressed. However, a major rework of the storage layer is being done for v0.3.0, and it will help future releases address the other issues above. Also, v0.3.0 will no longer have the checkpoint operation, and the scavenger steps are reorganized for efficiency.

Disk Storage for v0.3.0

  • common-log files (write ahead log)
    • The internal format will be changed for a more efficient and controlled write-back process.
  • brick private, metadata DB (LevelDB)
    • This DB replaces the brick private hlog files.
    • This DB eliminates the checkpoint process and the shadow ETS table.
    • In a future version, this DB will help implement on-disk metadata (metadata tiering).
    • In a future version, this DB can be used for key-value versioning. (The key for metadata is {key(), reversed_timestamp()} encoded by sext; see the sketch below. v0.3.0 will use 0 for the reversed timestamp, which is older than actual time.) Key-value versioning will work well for wide-area replication with the causal+ consistency level [*1].
  • brick private, value blob store (long-term hlog files)
    • Changing the long-term logs from shared to brick private will help the scavenger become more efficient because it only has to check the usage of value hunks within a single brick.
    • This will also make it easier to relocate a brick from one server to another without service interruption. A future version may automate this process as an alternative to chain migration.

*1: Paper: Don’t Settle For Eventual: Scalable Causal Consistency For Wide-Area Storage With COPS
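A quick sketch of the sext-encoded metadata key mentioned in the list above. This assumes the sext library; the example key and timestamp values are made up:

%% sext (https://github.com/uwiger/sext) preserves Erlang term order, so
%% entries for the same Hibari key stay adjacent in the metadata DB and
%% sort by the reversed timestamp.
Key = <<"some/key">>,
K0  = sext:encode({Key, 0}),            %% v0.3.0 always stores 0 here
K1  = sext:encode({Key, 123456789}),    %% a hypothetical reversed timestamp
true = (K0 < K1),                       %% encoding preserves term order
{Key, 0} = sext:decode(K0).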

@tatsuya6502

Status as of January 13th, 2014:

Almost finished the metadata DB part. Once it is finished, I will work on the brick private value blob store. Diff between the dev HEAD and the topic branch HEAD: 6eec707...5901f65

  • common-log files (write ahead log)
    • Change internal format for more efficient and controlled write-back process. -- TODO
  • brick private, metadata DB (LevelDB)
    • Add metadata DB and write-back process for it. -- DONE
    • Remove brick private hlog files. -- DONE
    • Remove checkpoint process and shadow ETS table. -- DONE except updating test cases.
    • Update brick_ets to load metadata records (store tuples) from metadata DB. -- DONE
  • brick private, value blob store (long-term hlog files)
    • Change long-term logs from shared to brick private. -- TODO
    • Note: scavenger-related code has been moved to a new module, brick_hlog_scavenger

@tatsuya6502

Started to work on the following items:

  • common-log files (write ahead log)
    • Change internal format for more efficient and controlled write-back process.
  • brick private, value blob store (long-term hlog files)
    • Change long-term logs from shared to brick private.

Added various modules in commit b5fba54a03 to implement the above items.

gdss-brick >> gh17 Redesign disk storage:

  • Introduce new hunk log format (brick_hlog_hunk module):
    • Unlike the generic gmt_hlog format, the new format is specialized
      for Hibari usage. It has four hunk types.
      • metadata (for WAL files, many metadata blobs in one hunk)
      • blob_wal (for WAL files, one value blob in one hunk)
      • blob_single (for brick private files, one value blob in one hunk)
      • blob_multi (for brick private files, many value blobs in one hunk)
    • To make WAL (Write Ahead Log) write-back process simpler, these
      WAL hunks have a dedicated 'brick_name' field.
    • Add API to read a value blob from a blob_* hunk without parsing
      the hunk header.
  • Introduce new WAL and store file modules that will replace the old
    gmt_hlog_* modules. (in progress)
    • brick_hlog_wal -
      provides access to the WAL files including group committing.
    • brick_hlog_writeback -
      does write back from a WAL file to the metadata and blob stores.
    • brick_hlog_scavenger -
      (was introduced in an earlier commit) reclaims unused space from
      hunk log based store files.
    • brick_metadata_store - is the common interface of metadata store.
    • brick_blob_store - is the common interface of value blob store.
    • brick_metadata_store_leveldb -
      is a LevelDB implementation of the metadata store.
    • brick_blob_store_hlog -
      is a hunk log implementation of the value blob store.

DETAILS:
...
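For orientation, here is a sketch of the four hunk types as Erlang type/record declarations. The type names come from the commit message above; the record name and fields are my assumptions, not the real brick_hlog_hunk definitions:

-type hunk_type() :: metadata | blob_wal | blob_single | blob_multi.

-record(hunk, {
          type       :: hunk_type(),
          brick_name :: atom() | undefined,   %% only set for the WAL hunk
                                              %% types (metadata, blob_wal)
          blobs      :: [binary()],           %% one or many blobs, per type
          md5        :: binary() | undefined  %% optional checksum
         }).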

@tatsuya6502

  • brick private, metadata DB (LevelDB)
    • Add metadata DB and write-back process for it. -- DONE
    • Remove brick private hlog files. -- DONE
    • Remove checkpoint process and shadow ETS table. -- DONE except updating test cases.
    • Update brick_ets to load metadata records (store tuples) from metadata DB. -- DONE

I replaced LevelDB with HyperLevelDB, which is a fork and drop-in replacement of LevelDB with two key improvements:

  • Improved parallelism: HyperLevelDB uses more fine-grained locking internally to provide higher throughput for multiple writer threads.
  • Improved compaction: HyperLevelDB uses a different method of compaction that achieves higher throughput for write-heavy workloads, even as the database grows.

Hibari will not get much benefit from the first point because it uses a single writer (the WAL write-back process) per brick metadata DB, but it will get some benefit from the second point. I loaded 1 million key-values into a Hibari table with 12 bricks and chain length 1, and so far, so good.

@tatsuya6502

I open-sourced the Erlang binding to HyperLevelDB. I'll update brick server's code to utilize it.
https://github.com/leveldb-erlang/h2leveldb

@tatsuya6502

#17 (comment)

I open-sourced the Erlang binding to HyperLevelDB. I'll update brick server's code to utilize it.
https://github.com/leveldb-erlang/h2leveldb

Done.
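For reference, a minimal h2leveldb usage sketch. I'm writing the calls from memory of the binding's README, so treat the exact function names and return values as assumptions and check the h2leveldb docs before relying on them:

%% Assumed API -- see the h2leveldb README for the authoritative names.
{ok, DB} = h2leveldb:create_db("/tmp/h2leveldb_sketch"),
ok = h2leveldb:put(DB, <<"key1">>, <<"metadata blob">>),
{ok, <<"metadata blob">>} = h2leveldb:get(DB, <<"key1">>),
ok = h2leveldb:delete(DB, <<"key1">>).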

@tatsuya6502

#17 (comment)

  • common-log files (write ahead log)
    • Change internal format for more efficient and controlled write-back process.
  • brick private, value blob store (long-term hlog files)
    • Change long-term logs from shared to brick private.

Finished implementing the new hlog format with the following key enhancements:

  • Reading a value blob no longer requires deserializing the Erlang external term format; it is now just a single file:pread/3 call. (This eliminated ~50 lines of Erlang code on the read path; see the sketch after this list.)
  • The MD5 checksum can be turned on/off per key-value, instead of per brick.
  • To save disk space, the hunk for value blobs now supports packing multiple values in one hunk. (Only one MD5 checksum is added for the entire packed values in the hunk.) It is still read-efficient because each value in the hunk can be read with a single file:pread/3 call; no unpacking is needed. Note that, by default, no MD5 check will be performed on each read; instead, the scavenger and a background scrub process will verify checksums when scanning an entire hlog file.
  • Instead of having only one hunk format, I introduced four different hunk formats for optimal read performance and space efficiency. These formats are: metadata, blob_wal, blob_single and blob_multi. metadata and blob_wal are used in the WAL. blob_single and blob_multi are used in the brick private blob storage; the former is for larger blobs (e.g. > 4KB) and the latter is for smaller blobs.
  • The header part of the hunk is now fixed length, and the hunk size can be calculated from the header. This makes it easier to scan through an hlog file.
  • The header and footer parts have magic numbers, and the footer contains an index of the packed value blobs. These will make it easier to recover a broken hlog file.
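Here is a sketch of that single-pread read path. Only file:pread/3 and erlang:md5/1 are real APIs here; the module name, the location tuple and the flag are illustrative, not the actual brick code:

-module(read_blob_sketch).
-export([read_blob/4]).

%% Read one value blob straight out of an hlog file: one positioned read,
%% no Erlang external-term deserialization and no hunk-header parsing.
read_blob(Fd, {Offset, Size}, VerifyMD5, ExpectedMD5) ->
    {ok, Bin} = file:pread(Fd, Offset, Size),
    case VerifyMD5 of
        false ->
            %% Default: skip the per-read check; the scavenger and the
            %% background scrub verify checksums when scanning whole files.
            {ok, Bin};
        true ->
            case erlang:md5(Bin) of
                ExpectedMD5 -> {ok, Bin};
                _Other      -> {error, checksum_mismatch}
            end
    end.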

In January, I implemented a gen_server for writing the WAL from scratch. The next step will be to implement the write-back process from the WAL to the metadata DB (HyperLevelDB) and the value blob hlog file.

After that, I will update the brick_ets server to utilize the new hlog format for writing and reading. Then, finally, I will implement the scavenger (aka compaction) process from scratch.

@tatsuya6502

#17 (comment)

The next step will be to implement the write-back process from the WAL to the metadata DB (HyperLevelDB) and the value blob hlog file.

After that, I will update the brick_ets server to utilize the new hlog format for writing and reading.

I started to work on the above. I actually started with the latter, and Hibari can now bootstrap from the new hlog modules (but it's not very useful without the write-back process and the scavenger).

Also, before I start to commit these work-in-progress changes, I created an annotated git tag 2014-10-metadatadb on the current branch heads of the gdss_brick and gdss_admin projects. That revision was tested with basho_bench and had no errors in six-hour runs.

@tatsuya6502

Merged recent changes on the dev branch (post v0.1.11) into the gbrick-gh17-redesign-disk-storage branch.

@tatsuya6502

#17 (comment)

The next step will be to implement the write-back process from the WAL to the metadata DB (HyperLevelDB) and the value blob hlog file.

After that, I will update the brick_ets server to utilize the new hlog format for writing and reading.

I started to work on the above. I actually started with the latter, and Hibari can now bootstrap from the new hlog modules (but it's not very useful without the write-back process and the scavenger).

After a long pause (Oct 2014 -- May 2015), I resumed working on this topic (to implement the write-back process). I made a couple of commits on the topic branch of gdss_brick, and so far I have confirmed that all WAL hunks written to a WAL file can be parsed back into hunk records.

I'm trying to complete this task by the end of this month (May 2015).

@tatsuya6502

#17 (comment)

Status as of January 13th, 2014:

  • common-log files (write ahead log)
    • Change internal format for more efficient and controlled write-back process. -- TODO
  • brick private, metadata DB (LevelDB)
    • Add metadata DB and write-back process for it. -- DONE
    • Remove brick private hlog files. -- DONE
    • Remove checkpoint process and shadow ETS table. -- DONE except updating test cases.
    • Update brick_ets to load metadata records (store tuples) from metadata DB. -- DONE
  • brick private, value blob store (long-term hlog files)
    • Change long-term logs from shared to brick private. -- TODO
    • Note: scavenger-related code has been moved to a new module, brick_hlog_scavenger

OK. I'm basically done with the above two TODO items. Now a key-value's metadata (key, timestamp, user-provided property list and expiration time) is stored in the brick private metadata DB (HyperLevelDB), and the value is stored in a brick private hlog file.

% find data/ -type f | sort
data/brick/bootstrap_copy1/blob/000000000001.BLOB
data/brick/bootstrap_copy1/metadata/leveldb/000003.sst
data/brick/bootstrap_copy1/metadata/leveldb/000006.sst
data/brick/bootstrap_copy1/metadata/leveldb/000008.log
data/brick/bootstrap_copy1/metadata/leveldb/CURRENT
data/brick/bootstrap_copy1/metadata/leveldb/LOCK
data/brick/bootstrap_copy1/metadata/leveldb/LOG
data/brick/bootstrap_copy1/metadata/leveldb/LOG.old
data/brick/bootstrap_copy1/metadata/leveldb/MANIFEST-000007
data/brick/bootstrap_copy1/metadata/leveldb/lost/...
data/brick/perf1_ch10_b1/blob/000000000001.BLOB
data/brick/perf1_ch10_b1/metadata/leveldb/000003.sst
data/brick/perf1_ch10_b1/metadata/leveldb/000006.sst
data/brick/perf1_ch10_b1/metadata/leveldb/000008.log
data/brick/perf1_ch10_b1/metadata/leveldb/CURRENT
data/brick/perf1_ch10_b1/metadata/leveldb/LOCK
data/brick/perf1_ch10_b1/metadata/leveldb/LOG
data/brick/perf1_ch10_b1/metadata/leveldb/LOG.old
data/brick/perf1_ch10_b1/metadata/leveldb/MANIFEST-000007
data/brick/perf1_ch10_b1/metadata/leveldb/lost/...
data/brick/perf1_ch10_b2/blob/000000000001.BLOB
data/brick/perf1_ch10_b2/metadata/leveldb/000003.sst
data/brick/perf1_ch10_b2/metadata/leveldb/000006.sst
data/brick/perf1_ch10_b2/metadata/leveldb/000008.log
data/brick/perf1_ch10_b2/metadata/leveldb/CURRENT
data/brick/perf1_ch10_b2/metadata/leveldb/LOCK
data/brick/perf1_ch10_b2/metadata/leveldb/LOG
data/brick/perf1_ch10_b2/metadata/leveldb/LOG.old
data/brick/perf1_ch10_b2/metadata/leveldb/MANIFEST-000007
data/brick/perf1_ch10_b2/metadata/leveldb/lost/...
...
data/wal_hlog/000000000002.HLOG
data/wal_hlog/000000000003.HLOG

Now the last big part will be re-implementing the scavenger (aka the compaction process) from scratch. I hope I can finish it in two weeks.

@tatsuya6502

I spent the last few days on the following:

  1. Brushing up the write-back process. This was done in commit ad68753.
  2. Re-implementing the compaction process (aka scavenger process) from scratch.

The new compaction process should be much more efficient than the current scavenger implementation in the v0.1 series. Here is the current design:

  • The write-back process now maintains look-up tables from blob hunk to Hibari key and timestamp. A look-up table (an OTP disk_log; see the sketch below) is created for each blob hlog file and is used by the compaction process to find live hunks. This eliminates the scavenger work directory because no on-disk sorting is required anymore.
  • The compaction server periodically estimates the live/dead blob hunk ratio by randomly sampling blob hunks in the look-up tables and comparing them against the metadata DB. If a blob hlog file's ratio crosses the threshold, the compaction server triggers a compaction process on that hlog file. There is no longer a daily major compaction, which had a big impact on performance.
  • After copying live blob hunks to a new blob hlog file, the compaction process updates the storage locations in the metadata DB without interfering with the brick server's read/write path; this is done asynchronously by the write-back process. This will greatly improve read/write key-value latencies while a compaction process is running.

Also, I'm planning to store small values in the metadata DB (HyperLevelDB) rather than in the blob hlog files. HyperLevelDB has an efficient compaction implementation in C++, so I hope this design change will improve the overall compaction efficiency too.
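A minimal sketch of the per-hlog-file look-up table idea using OTP's disk_log; the file name, the log name and the entry shape are assumptions, not the actual implementation:

%% One disk_log per blob hlog file; the write-back process appends a
%% {Location, Key, Timestamp} entry for every blob hunk it writes.
{ok, Log} = disk_log:open([{name, {blob_index, 1}},
                           {file, "/tmp/000000000001.IDX"}]),
ok = disk_log:log(Log, {{4096, 128}, <<"key1">>, 1438387200}),
ok = disk_log:sync(Log),
%% The compaction process later streams the table back with chunk/2 and
%% checks each key/timestamp against the metadata DB, so no on-disk
%% sorting (and no scavenger work directory) is needed.
{_Cont, _Entries} = disk_log:chunk(Log, start),
ok = disk_log:close(Log).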

@tatsuya6502

As for the compaction process, I have implemented the following main functions:

-spec estimate_live_hunk_ratio(brickname(), seqnum()) -> {ok, float()} | {error, term()}.
-spec compact_hlog_file(brickname(), seqnum()) -> ok | {error, term()}.

The former estimates the live/dead blob hunk ratio per hlog file by comparing randomly sampled keys against the metadata DB. The latter runs compaction on an hlog file to reclaim disk space and updates the storage locations of live hunks.

The next step will be to implement a periodic task to estimate the live hunk ratios, score hlog files, and pick an hlog file to compact.
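A sketch of what that periodic task could look like. estimate_live_hunk_ratio/2 and compact_hlog_file/2 are the functions spec'd above (assumed here to be callable locally), while the 30% threshold and the "lowest live ratio first" policy are my assumptions:

%% Score all hlog files of a brick and compact the best candidate: the
%% file with the lowest estimated live-hunk ratio (i.e. mostly dead
%% hunks) frees the most space for the least copying.
-define(LIVE_RATIO_THRESHOLD, 0.30).   %% assumed threshold

score_and_compact(BrickName, SeqNums) ->
    Scored = [{Ratio, Seq} ||
                 Seq <- SeqNums,
                 {ok, Ratio} <- [estimate_live_hunk_ratio(BrickName, Seq)],
                 Ratio =< ?LIVE_RATIO_THRESHOLD],
    case lists:sort(Scored) of
        [] ->
            no_candidate;
        [{_LowestRatio, Seq} | _] ->
            compact_hlog_file(BrickName, Seq)
    end.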

@tatsuya6502

The next step will be to implement a periodic task to estimate the live hunk ratios, score hlog files, and pick an hlog file to compact.

As the first step, I created a temporary function to estimate the live hunk ratios of all value blob hlog files on a node (commit: 53a697e#diff-8bd48a5a77dd1a285b7729b3766317e0R110).

Here is a sample run on a single-node Hibari with a perf1 table (chain length = 3, number of chains = 8):

(hibari@127.0.0.1)43> F = fun() -> lists:foreach(
(hibari@127.0.0.1)43>                fun({B, S, unknown}) ->
(hibari@127.0.0.1)43>                        io:format("~s (~w): unknown~n", [B, S]);
(hibari@127.0.0.1)43>                   ({B, S, R}) ->
(hibari@127.0.0.1)43>                        io:format("~s (~w): ~.2f%~n", [B, S, R * 100])
(hibari@127.0.0.1)43>                end, brick_blob_store_hlog_compaction:list_hlog_files())
(hibari@127.0.0.1)43>     end.
#Fun<erl_eval.20.90072148>
(hibari@127.0.0.1)44> F().
bootstrap_copy1 (1): 21.43%
perf1_ch1_b1 (1): 4.91%
perf1_ch1_b1 (2): 26.89%
perf1_ch1_b1 (3): 65.67%
perf1_ch1_b2 (1): 3.35%
...

Here are the numbers for perf1 chain 1 ordered by chain, brick and hlog sequence number:

perf1_ch1_b1 (1): 4.91%
perf1_ch1_b2 (1): 3.35%
perf1_ch1_b3 (1): 5.84%

perf1_ch1_b1 (2): 26.89%
perf1_ch1_b2 (2): 25.12%
perf1_ch1_b3 (2): 25.17%

perf1_ch1_b1 (3): 65.82%
perf1_ch1_b2 (3): 82.28%
perf1_ch1_b3 (3): 72.60%

All bricks (b1, b2, b3) in a chain should have exactly the same contents for each hlog file with sequence numbers (1, 2, 3). However, the estimated live hunk ratios differ (for example, 4.91%, 3.35%, 5.84%, or 65.82%, 82.28%, 72.60%). This is because the estimation is done with randomly sampled keys; it currently uses about 5% of the keys in an hlog file for the estimation.

I think the current setting still provides ratio estimates with good enough precision.
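For a rough sense of the sampling error (my own back-of-the-envelope, not a measurement): the standard error of a sampled proportion is about sqrt(p * (1 - p) / n). Assuming an hlog file with roughly 100,000 hunks, a 5% sample gives n = 5,000, so for p ≈ 0.25 the standard error is sqrt(0.25 * 0.75 / 5000) ≈ 0.006, i.e. about ±0.6 percentage points per estimate. The brick-to-brick spread above (for example 25.12% vs 26.89%) is within a few standard errors of that, so the 5% sample looks reasonable for picking compaction candidates.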

@tatsuya6502

The next step will be to implement a periodic task to estimate the live hunk ratios, score hlog files, and pick an hlog file to compact.

I have implemented a very basic version of the above (the last commit was de7c990). There are lots of places to improve, but the new disk storage format now has a complete set of functions.

I'll shift my focus to other tasks but will continue improving this feature too. I'm planning to release Hibari v0.3 with this feature sometime this fall (2015).

@tatsuya6502

I found and fixed a bug in the write-back process that would miss some log hunks in the WAL when group commit is enabled:

Commit: 32428e4

@tatsuya6502

#17 (comment)

I open-sourced the Erlang binding to HyperLevelDB. I'll update brick server's code to utilize it.
https://github.com/leveldb-erlang/h2leveldb

Done.

It seems HyperLevelDB is not actively developed anymore; the last commit was made in Sep 2014. I'm thinking of switching to RocksDB, which is another fork of LevelDB. It is very actively developed and has a large user base.

@tatsuya6502

I found and fixed a couple of bugs related to node restart (91b62e0...9f9efe9).

I also removed the obsolete gmt_hlog* and scavenger modules from the v0.3 stream (44fd692), and updated the app.src of the gdss_brick application (e0e71ea).

