Redesign disk storage and checkpoint/scavenger processes #17

Open
tatsuya6502 opened this issue Aug 1, 2013 · 22 comments

Redesign and re-implement the disk storage and maintenance processes to address the issues Hibari currently has (v0.3RC1).

Issues

  • Rename operation broke the scavenger process (hibari#33)
  • The scavenger process has a major performance impact on read/write operations
  • Metadata for all key-values has to be kept in memory

Disk Storage

  • common-log files (write-ahead-log)
  • local log files (brick private metadata store)
  • long-term log files (value store)
  • live-hunk location files (scavenger work files)

Maintenance Processes

  • common-log write-back process
  • checkpoint process
  • scavenger
@ghost ghost assigned tatsuya6502 Aug 1, 2013
@tatsuya6502

Draft Design

Disk Storage

  • common-log files (write-ahead-log)
  • metadata DB (HanoiDB or LevelDB)
  • long-term log files (value store)

I'll replace the local log files and live-hunk location files with a single metadata DB. I'll use an embedded DB such as HanoiDB or LevelDB for it, which will help me unload some portion of the metadata from RAM. I'll continue to use long-term log files to store the value part of key-values; I believe these embedded DBs are not good at handling large binary values, so I want to keep the value store as is.

Maintenance Processes

  • common-log write-back and continuous compactor process

The scavenger will be merged into the write-back process, and the checkpoint process will no longer exist. Both the write-back and the scavenger (aka compactor) will be executed sequentially by a single process to avoid race conditions between them (like the one causing hibari#33).
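To make the intended control flow concrete, here is a minimal sketch of a single maintenance process that runs the write-back and compaction steps one after another. The module and function names are placeholders, not the actual gdss_brick API:

%% Sketch only: one maintenance process runs write-back and compaction
%% sequentially, so the two steps can never race with each other.
%% All names below are placeholders, not real gdss_brick modules.
-module(maintenance_sketch).
-export([start/0]).

start() ->
    spawn_link(fun() -> loop(#{}) end).

loop(State0) ->
    State1 = do_write_back(State0),   %% 1. flush WAL hunks to long-term storage
    State2 = maybe_compact(State1),   %% 2. then (maybe) run one compaction pass
    timer:sleep(1000),                %% placeholder pacing
    loop(State2).

do_write_back(State) -> State.        %% placeholder
maybe_compact(State) -> State.        %% placeholder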

@tatsuya6502

Note that the compactor will no longer update the (in-memory) ETS table for the metadata after it moves some live-hunks to a long-term log file. Instead, it will only update the (on-disk) metadata DB. This will reduce the performance impact that the current compactor has.

When a get request fails to locate the value because the value has been moved by the compactor and the ETS table still holds the stale location info, the brick server will read the updated location from the metadata DB, refresh the ETS entry, and finally read the value and return it to the client.
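A minimal sketch of that read path (the accessor functions here are placeholders, not the real brick_server/brick_ets code):

-module(read_path_sketch).
-export([get_value/3]).

%% Placeholder accessors; the real ones live in the brick / hlog modules.
read_hunk({1, _Offset, _Size}) -> {error, stale_location};  %% pretend seq 1 was compacted away
read_hunk({_Seq, _Offset, _Size}) -> {ok, <<"value">>}.

metadata_db_lookup(_MetadataDB, _Key) -> {ok, {2, 4096, 128}}.

get_value(Key, Ets, MetadataDB) ->
    [{Key, Location0}] = ets:lookup(Ets, Key),
    case read_hunk(Location0) of
        {ok, Value} ->
            {ok, Value};
        {error, stale_location} ->
            %% The compactor moved the hunk, so the ETS entry is stale:
            %% read the new location from the metadata DB, refresh ETS,
            %% then read the value and return it.
            {ok, Location1} = metadata_db_lookup(MetadataDB, Key),
            true = ets:insert(Ets, {Key, Location1}),
            read_hunk(Location1)
    end.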

@tatsuya6502

Issues

  1. Rename operation broke the scavenger process (hibari#33)
  2. The scavenger process has a major performance impact on read/write operations
  3. Metadata for all key-values has to be kept in memory

As for the upcoming release (v0.3.0), only issue 1 above will be addressed. However, a major rework of the storage layer is being done for v0.3.0, and it will help future releases address the other issues above. Also, v0.3.0 will no longer have the checkpoint operation, and the scavenger steps are reorganized for efficiency.

Disk Storage for v0.3.0

  • common-log files (write ahead log)
    • The internal format will be changed for a more efficient and controlled write-back process.
  • brick private, metadata DB (LevelDB)
    • This DB replaces the brick private hlog files.
    • This DB eliminates the checkpoint process and the shadow ETS table.
    • In a future version, this DB will help implement on-disk metadata (metadata tiering).
    • In a future version, this DB can be used for key-value versioning. (The key for metadata is {key(), reversed_timestamp()} encoded by sext; see the sketch below. v0.3.0 will use 0 for the reversed timestamp, which is older than actual time.) Key-value versioning will work well for wide-area replication with the causal+ consistency level [*1].
  • brick private, value blob store (long-term hlog files)
    • Changing the long-term logs from shared to brick private will help the scavenger become more efficient because it only has to check the usage of value hunks within a single brick.
    • This will also make it easier to relocate a brick from one server to another without service interruption. A future version may automate this process as an alternative to chain migration.

*1: Paper: Don’t Settle For Eventual: Scalable Causal Consistency For Wide-Area Storage With COPS
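A quick sketch of the sext-encoded metadata key mentioned in the list above. This assumes the sext library; the example key and timestamp values are made up:

%% sext (https://github.com/uwiger/sext) preserves Erlang term order, so
%% entries for the same Hibari key stay adjacent in the metadata DB and
%% sort by the reversed timestamp.
Key = <<"some/key">>,
K0  = sext:encode({Key, 0}),            %% v0.3.0 always stores 0 here
K1  = sext:encode({Key, 123456789}),    %% a hypothetical reversed timestamp
true = (K0 < K1),                       %% encoding preserves term order
{Key, 0} = sext:decode(K0).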

@tatsuya6502

Status as of January 13th, 2014:

Almost finished the metadata DB part. Once it is finished, I will work on the brick private value blob store. Diff between the dev HEAD and the topic branch HEAD: 6eec707...5901f65

  • common-log files (write ahead log)
    • Change internal format for more efficient and controlled write-back process. -- TODO
  • brick private, metadata DB (LevelDB)
    • Add metadata DB and write-back process for it. -- DONE
    • Remove brick private hlog files. -- DONE
    • Remove checkpoint process and shadow ETS table. -- DONE except updating test cases.
    • Update brick_ets to load metadata records (store tuples) from metadata DB. -- DONE
  • brick private, value blob store (long-term hlog files)
    • Change long-term logs from shared to brick private. -- TODO
    • Note: scavenger-related code has been moved to a new module, brick_hlog_scavenger

@tatsuya6502

Started to work on the following items:

  • common-log files (write ahead log)
    • Change internal format for more efficient and controlled write-back process.
  • brick private, value blob store (long-term hlog files)
    • Change long-term logs from shared to brick private.

Added various modules in commit b5fba54a03 to implement the above items.

gdss-brick >> gh17 Redesign disk storage:

  • Introduce new hunk log format (brick_hlog_hunk module):
    • Unlike the generic gmt_hlog format, the new format is specialized
      for Hibari usage. It has four hunk types.
      • metadata (for WAL files, many metadata blobs in one hunk)
      • blob_wal (for WAL files, one value blob in one hunk)
      • blob_single (for brick private files, one value blob in one hunk)
      • blob_multi (for brick private files, many value blobs in one hunk)
    • To make WAL (Write Ahead Log) write-back process simpler, these
      WAL hunks have a dedicated 'brick_name' field.
    • Add API to read a value blob from a blob_* hunk without parsing
      the hunk header.
  • Introduce new WAL and store file modules that will replace the old
    gmt_hlog_* modules. (in progress)
    • brick_hlog_wal -
      provides access to the WAL files including group committing.
    • brick_hlog_writeback -
      does write back from a WAL file to the metadata and blob stores.
    • brick_hlog_scavenger -
      (was introduced in an earlier commit) reclaims unused space from
      hunk log based store files.
    • brick_metadata_store - is the common interface of metadata store.
    • brick_blob_store - is the common interface of value blob store.
    • brick_metadata_store_leveldb -
      is a LevelDB implementation of the metadata store.
    • brick_blob_store_hlog -
      is a hunk log implementation of the value blob store.

DETAILS:
...
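For orientation, here is a sketch of the four hunk types as Erlang type/record declarations. The type names come from the commit message above; the record name and fields are my assumptions, not the real brick_hlog_hunk definitions:

-type hunk_type() :: metadata | blob_wal | blob_single | blob_multi.

-record(hunk, {
          type       :: hunk_type(),
          brick_name :: atom() | undefined,   %% only set for the WAL hunk
                                              %% types (metadata, blob_wal)
          blobs      :: [binary()],           %% one or many blobs, per type
          md5        :: binary() | undefined  %% optional checksum
         }).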

@tatsuya6502

  • brick private, metadata DB (LevelDB)
    • Add metadata DB and write-back process for it. -- DONE
    • Remove brick private hlog files. -- DONE
    • Remove checkpoint process and shadow ETS table. -- DONE except updating test cases.
    • Update brick_ets to load metadata records (store tuples) from metadata DB. -- DONE

I replaced LevelDB with HyperLevelDB, which is a fork and drop-in replacement of LevelDB with two key improvements:

  • Improved parallelism: HyperLevelDB uses more fine-grained locking internally to provide higher throughput for multiple writer threads.
  • Improved compaction: HyperLevelDB uses a different method of compaction that achieves higher throughput for write-heavy workloads, even as the database grows.

Hibari will not get much benefit from the first point because it uses a single writer (the WAL write-back process) per brick metadata DB, but it will get some benefit from the second point. I loaded 1 million key-values into a Hibari table with 12 bricks and chain length 1, and so far, so good.

@tatsuya6502

I open-sourced the Erlang binding to HyperLevelDB. I'll update brick server's code to utilize it.
https://github.com/leveldb-erlang/h2leveldb

@tatsuya6502

#17 (comment)

I open-sourced the Erlang binding to HyperLevelDB. I'll update brick server's code to utilize it.
https://github.com/leveldb-erlang/h2leveldb

Done.
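For reference, a minimal h2leveldb usage sketch. I'm writing the calls from memory of the binding's README, so treat the exact function names and return values as assumptions and check the h2leveldb docs before relying on them:

%% Assumed API -- see the h2leveldb README for the authoritative names.
{ok, DB} = h2leveldb:create_db("/tmp/h2leveldb_sketch"),
ok = h2leveldb:put(DB, <<"key1">>, <<"metadata blob">>),
{ok, <<"metadata blob">>} = h2leveldb:get(DB, <<"key1">>),
ok = h2leveldb:delete(DB, <<"key1">>).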

@tatsuya6502

#17 (comment)

  • common-log files (write ahead log)
    • Change internal format for more efficient and controlled write-back process.
  • brick private, value blob store (long-term hlog files)
    • Change long-term logs from shared to brick private.

Finished implementing the new hlog format with the following key enhancements:

  • Reading a value blob no longer requires deserializing the Erlang external term format; it is now just a single file:pread/3 call. (This eliminated ~50 lines of Erlang code on the read path; see the sketch after this list.)
  • The MD5 checksum can be turned on/off per key-value, instead of per brick.
  • To save disk space, the hunk for value blobs now supports packing multiple values in one hunk. (Only one MD5 checksum is added for the entire packed values in the hunk.) It is still read-efficient because each value in the hunk can be read with a single file:pread/3 call; no unpacking is needed. Note that, by default, no MD5 check will be performed on each read; instead, the scavenger and a background scrub process will verify checksums when scanning an entire hlog file.
  • Instead of having only one hunk format, I introduced four different hunk formats for optimal read performance and space efficiency. These formats are: metadata, blob_wal, blob_single and blob_multi. metadata and blob_wal are used in the WAL. blob_single and blob_multi are used in the brick private blob storage; the former is for larger blobs (e.g. > 4KB) and the latter is for smaller blobs.
  • The header part of the hunk is now fixed length, and the hunk size can be calculated from the header. This makes it easier to scan through an hlog file.
  • The header and footer parts have magic numbers, and the footer contains an index of the packed value blobs. These will make it easier to recover a broken hlog file.
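Here is a sketch of that single-pread read path. Only file:pread/3 and erlang:md5/1 are real APIs here; the module name, the location tuple and the flag are illustrative, not the actual brick code:

-module(read_blob_sketch).
-export([read_blob/4]).

%% Read one value blob straight out of an hlog file: one positioned read,
%% no Erlang external-term deserialization and no hunk-header parsing.
read_blob(Fd, {Offset, Size}, VerifyMD5, ExpectedMD5) ->
    {ok, Bin} = file:pread(Fd, Offset, Size),
    case VerifyMD5 of
        false ->
            %% Default: skip the per-read check; the scavenger and the
            %% background scrub verify checksums when scanning whole files.
            {ok, Bin};
        true ->
            case erlang:md5(Bin) of
                ExpectedMD5 -> {ok, Bin};
                _Other      -> {error, checksum_mismatch}
            end
    end.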

In January, I implemented a gen_server for writing the WAL from scratch. The next step will be to implement the write-back process from the WAL to the metadata DB (HyperLevelDB) and the value blob hlog file.

After that, I will update the brick_ets server to utilize the new hlog format for writing and reading. Then, finally, I will implement the scavenger (aka compaction) process from scratch.

@tatsuya6502

#17 (comment)

The next step will be to implement the write-back process from the WAL to the metadata DB (HyperLevelDB) and the value blob hlog file.

After that, I will update the brick_ets server to utilize the new hlog format for writing and reading.

I started to work on the above. I actually started with the latter, and Hibari can now bootstrap from the new hlog modules (but it's not very useful without the write-back process and the scavenger).

Also, before I start to commit these work-in-progress changes, I created an annotated git tag 2014-10-metadatadb on the current branch heads of the gdss_brick and gdss_admin projects. That revision was tested with basho_bench and had no errors in six-hour runs.

@tatsuya6502

Merged recent changes on the dev branch (post v0.1.11) into the gbrick-gh17-redesign-disk-storage branch.

@tatsuya6502

#17 (comment)

The next step will be to implement the write-back process from the WAL to the metadata DB (HyperLevelDB) and the value blob hlog file.

After that, I will update the brick_ets server to utilize the new hlog format for writing and reading.

I started to work on the above. I actually started with the latter, and Hibari can now bootstrap from the new hlog modules (but it's not very useful without the write-back process and the scavenger).

After a long pause (Oct 2014 -- May 2015), I resumed working on this topic (to implement the write-back process). I made a couple of commits on the topic branch of gdss_brick, and so far I have confirmed that all WAL hunks written to a WAL file can be parsed back into hunk records.

I'm trying to complete this task by the end of this month (May 2015).

@tatsuya6502

#17 (comment)

Status as of January 13th, 2014:

  • common-log files (write ahead log)
    • Change internal format for more efficient and controlled write-back process. -- TODO
  • brick private, metadata DB (LevelDB)
    • Add metadata DB and write-back process for it. -- DONE
    • Remove brick private hlog files. -- DONE
    • Remove checkpoint process and shadow ETS table. -- DONE except updating test cases.
    • Update brick_ets to load metadata records (store tuples) from metadata DB. -- DONE
  • brick private, value blob store (long-term hlog files)
    • Change long-term logs from shared to brick private. -- TODO
    • Note: scavenger-related code has been moved to a new module, brick_hlog_scavenger

OK. I'm basically done with the above two TODO items. Now a key-value's metadata (key, timestamp, user-provided property list and expiration time) is stored in the brick private metadata DB (HyperLevelDB), and the value is stored in a brick private hlog file.

% find data/ -type f | sort
data/brick/bootstrap_copy1/blob/000000000001.BLOB
data/brick/bootstrap_copy1/metadata/leveldb/000003.sst
data/brick/bootstrap_copy1/metadata/leveldb/000006.sst
data/brick/bootstrap_copy1/metadata/leveldb/000008.log
data/brick/bootstrap_copy1/metadata/leveldb/CURRENT
data/brick/bootstrap_copy1/metadata/leveldb/LOCK
data/brick/bootstrap_copy1/metadata/leveldb/LOG
data/brick/bootstrap_copy1/metadata/leveldb/LOG.old
data/brick/bootstrap_copy1/metadata/leveldb/MANIFEST-000007
data/brick/bootstrap_copy1/metadata/leveldb/lost/...
data/brick/perf1_ch10_b1/blob/000000000001.BLOB
data/brick/perf1_ch10_b1/metadata/leveldb/000003.sst
data/brick/perf1_ch10_b1/metadata/leveldb/000006.sst
data/brick/perf1_ch10_b1/metadata/leveldb/000008.log
data/brick/perf1_ch10_b1/metadata/leveldb/CURRENT
data/brick/perf1_ch10_b1/metadata/leveldb/LOCK
data/brick/perf1_ch10_b1/metadata/leveldb/LOG
data/brick/perf1_ch10_b1/metadata/leveldb/LOG.old
data/brick/perf1_ch10_b1/metadata/leveldb/MANIFEST-000007
data/brick/perf1_ch10_b1/metadata/leveldb/lost/...
data/brick/perf1_ch10_b2/blob/000000000001.BLOB
data/brick/perf1_ch10_b2/metadata/leveldb/000003.sst
data/brick/perf1_ch10_b2/metadata/leveldb/000006.sst
data/brick/perf1_ch10_b2/metadata/leveldb/000008.log
data/brick/perf1_ch10_b2/metadata/leveldb/CURRENT
data/brick/perf1_ch10_b2/metadata/leveldb/LOCK
data/brick/perf1_ch10_b2/metadata/leveldb/LOG
data/brick/perf1_ch10_b2/metadata/leveldb/LOG.old
data/brick/perf1_ch10_b2/metadata/leveldb/MANIFEST-000007
data/brick/perf1_ch10_b2/metadata/leveldb/lost/...
...
data/wal_hlog/000000000002.HLOG
data/wal_hlog/000000000003.HLOG

Now the last big part will be re-implementing the scavenger (aka the compaction process) from scratch. I hope I can finish it in two weeks.

@tatsuya6502

I spent the last few days on the following:

  1. Brushing up the write-back process. This was done in commit ad68753.
  2. Re-implementing the compaction process (aka scavenger process) from scratch.

The new compaction process should be much more efficient than the current scavenger implementation in the v0.1 series. Here is the current design:

  • The write-back process now maintains look-up tables from blob hunk to Hibari key and timestamp. A look-up table (an OTP disk_log; see the sketch below) is created for each blob hlog file and is used by the compaction process to find live hunks. This eliminates the scavenger work directory because no on-disk sorting is required anymore.
  • The compaction server periodically estimates the live/dead blob hunk ratio by randomly sampling blob hunks in the look-up tables and comparing them against the metadata DB. If a blob hlog file's ratio crosses the threshold, the compaction server triggers a compaction process on that hlog file. There is no longer a daily major compaction, which had a big impact on performance.
  • After copying live blob hunks to a new blob hlog file, the compaction process updates the storage locations in the metadata DB without interfering with the brick server's read/write path; this is done asynchronously by the write-back process. This will greatly improve read/write key-value latencies while a compaction process is running.

Also, I'm planning to store small values in the metadata DB (HyperLevelDB) rather than in the blob hlog files. HyperLevelDB has an efficient compaction implementation in C++, so I hope this design change will improve the overall compaction efficiency too.
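A minimal sketch of the per-hlog-file look-up table idea using OTP's disk_log; the file name, the log name and the entry shape are assumptions, not the actual implementation:

%% One disk_log per blob hlog file; the write-back process appends a
%% {Location, Key, Timestamp} entry for every blob hunk it writes.
{ok, Log} = disk_log:open([{name, {blob_index, 1}},
                           {file, "/tmp/000000000001.IDX"}]),
ok = disk_log:log(Log, {{4096, 128}, <<"key1">>, 1438387200}),
ok = disk_log:sync(Log),
%% The compaction process later streams the table back with chunk/2 and
%% checks each key/timestamp against the metadata DB, so no on-disk
%% sorting (and no scavenger work directory) is needed.
{_Cont, _Entries} = disk_log:chunk(Log, start),
ok = disk_log:close(Log).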

@tatsuya6502

As for the compaction process, I have implemented the following main functions:

-spec estimate_live_hunk_ratio(brickname(), seqnum()) -> {ok, float()} | {error, term()}.
-spec compact_hlog_file(brickname(), seqnum()) -> ok | {error, term()}.

The former estimates the live/dead blob hunk ratio per hlog file by comparing randomly sampled keys against the metadata DB. The latter runs compaction on an hlog file to reclaim disk space and updates the storage locations of live hunks.

The next step will be to implement a periodic task to estimate the live hunk ratios, score hlog files, and pick an hlog file to compact.
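A sketch of what that periodic task could look like. estimate_live_hunk_ratio/2 and compact_hlog_file/2 are the functions spec'd above (assumed here to be callable locally), while the 30% threshold and the "lowest live ratio first" policy are my assumptions:

%% Score all hlog files of a brick and compact the best candidate: the
%% file with the lowest estimated live-hunk ratio (i.e. mostly dead
%% hunks) frees the most space for the least copying.
-define(LIVE_RATIO_THRESHOLD, 0.30).   %% assumed threshold

score_and_compact(BrickName, SeqNums) ->
    Scored = [{Ratio, Seq} ||
                 Seq <- SeqNums,
                 {ok, Ratio} <- [estimate_live_hunk_ratio(BrickName, Seq)],
                 Ratio =< ?LIVE_RATIO_THRESHOLD],
    case lists:sort(Scored) of
        [] ->
            no_candidate;
        [{_LowestRatio, Seq} | _] ->
            compact_hlog_file(BrickName, Seq)
    end.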

@tatsuya6502

The next step will be to implement a periodic task to estimate the live hunk ratios, score hlog files, and pick an hlog file to compact.

As the first step, I created a temporary function to estimate the live hunk ratios of all value blob hlog files on a node (commit: 53a697e#diff-8bd48a5a77dd1a285b7729b3766317e0R110).

Here is a sample run on a single-node Hibari with a perf1 table (chain length = 3, number of chains = 8):

(hibari@127.0.0.1)43> F = fun() -> lists:foreach(
(hibari@127.0.0.1)43>                fun({B, S, unknown}) ->
(hibari@127.0.0.1)43>                        io:format("~s (~w): unknown~n", [B, S]);
(hibari@127.0.0.1)43>                   ({B, S, R}) ->
(hibari@127.0.0.1)43>                        io:format("~s (~w): ~.2f%~n", [B, S, R * 100])
(hibari@127.0.0.1)43>                end, brick_blob_store_hlog_compaction:list_hlog_files())
(hibari@127.0.0.1)43>     end.
#Fun<erl_eval.20.90072148>
(hibari@127.0.0.1)44> F().
bootstrap_copy1 (1): 21.43%
perf1_ch1_b1 (1): 4.91%
perf1_ch1_b1 (2): 26.89%
perf1_ch1_b1 (3): 65.67%
perf1_ch1_b2 (1): 3.35%
...

Here are the numbers for perf1 chain 1 ordered by chain, brick and hlog sequence number:

perf1_ch1_b1 (1): 4.91%
perf1_ch1_b2 (1): 3.35%
perf1_ch1_b3 (1): 5.84%

perf1_ch1_b1 (2): 26.89%
perf1_ch1_b2 (2): 25.12%
perf1_ch1_b3 (2): 25.17%

perf1_ch1_b1 (3): 65.82%
perf1_ch1_b2 (3): 82.28%
perf1_ch1_b3 (3): 72.60%

All bricks (b1, b2, b3) in a chain should have exactly the same contents for each hlog file with sequence numbers (1, 2, 3). However, the estimated live hunk ratios differ (for example, 4.91%, 3.35%, 5.84%, or 65.82%, 82.28%, 72.60%). This is because the estimation is done with randomly sampled keys; it currently uses about 5% of the keys in an hlog file for the estimation.

I think the current setting still provides ratio estimates with good enough precision.
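For a rough sense of the sampling error (my own back-of-the-envelope, not a measurement): the standard error of a sampled proportion is about sqrt(p * (1 - p) / n). Assuming an hlog file with roughly 100,000 hunks, a 5% sample gives n = 5,000, so for p ≈ 0.25 the standard error is sqrt(0.25 * 0.75 / 5000) ≈ 0.006, i.e. about ±0.6 percentage points per estimate. The brick-to-brick spread above (for example 25.12% vs 26.89%) is within a few standard errors of that, so the 5% sample looks reasonable for picking compaction candidates.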

@tatsuya6502

The next step will be to implement a periodic task to estimate the live hunk ratios, score hlog files, and pick an hlog file to compact.

I have implemented a very basic version of the above (the last commit was de7c990). There are lots of places to improve, but the new disk storage format now has a complete set of functions.

I'll shift my focus to other tasks but will continue improving this feature too. I'm planning to release Hibari v0.3 with this feature sometime this fall (2015).

@tatsuya6502

I found and fixed a bug in the write-back process that would miss some log hunks in the WAL when group commit is enabled:

Commit: 32428e4

@tatsuya6502

#17 (comment)

I open-sourced the Erlang binding to HyperLevelDB. I'll update brick server's code to utilize it.
https://github.com/leveldb-erlang/h2leveldb

Done.

It seems HyperLevelDB is not actively developed anymore; the last commit was made in Sep 2014. I'm thinking of switching to RocksDB, which is another fork of LevelDB. It is very actively developed and has a large user base.

@tatsuya6502

I found and fixed a couple of bugs related to node restart (91b62e0...9f9efe9).

I also removed the obsolete gmt_hlog* and scavenger modules from the v0.3 stream (44fd692), and updated the app.src of the gdss_brick application (e0e71ea).

