
Compaction tools #35

Open · Tracked by #92

maxux opened this issue May 22, 2018 · 7 comments
maxux commented May 22, 2018

Please see the comments below for updates; this comment follows the original idea, which is not how it is implemented now.

Introduction

The database is append-only, so after a while it grows with an (expected) lot of garbage and data that is no longer used. Some form of compaction is needed.

The idea behind the compaction process is something that runs as an external process, does everything on copies, and leaves the original database untouched. As a first step, let's do everything offline, replace the existing database with the new compacted one, and restart 0-db. The next step would be giving 0-db a way to reload/rehash the new database on-the-fly, but that won't be part of this feature request.

EDIT: reload was implemented in the meantime.

Datafiles are split into chunks of 256 MB (this is configurable).
If there are 8 files, at the end there still need to be 8 files, even if some of them are empty.
The idea is really to clean these files; no reordering is done.

EDIT: nothing will be empty, no entries will be deleted.

Expected workflow

  • We need to read the whole index to know its last state
  • From that, we know, for each key, which entry is the latest version
  • Rewrite the datafile, discarding entries that are not used anymore
  • Update the index file with the new offset in the datafile if data moved

We also need to support the ability to not drop anything before a specific timestamp. Each entry has a timestamp; if some kind of snapshot has been made, we need to keep the history up to that timestamp to be able to roll back to that state.
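As an illustration of this workflow (not the actual 0-db code: the struct layout, the timestamps and the snapshot rule below are simplified assumptions), a single offline pass over the index could look like this, using the "Original state" example below as input:

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define MAXKEY 32

typedef struct {
    char key[MAXKEY];    /* simplified fixed-size key */
    uint32_t offset;     /* payload offset in the old datafile */
    uint32_t length;     /* payload length */
    uint32_t timestamp;  /* when this version was written */
    int deleted;         /* deletion flag */
} idxentry_t;

/* the index is append-only, so the *last* occurrence of a key is its latest version */
static int latest_of(idxentry_t *index, int entries, const char *key) {
    int last = -1;
    for(int i = 0; i < entries; i++)
        if(strncmp(index[i].key, key, MAXKEY) == 0)
            last = i;
    return last;
}

int main(void) {
    /* toy index reproducing the 'Original state' example below */
    idxentry_t index[] = {
        {"A",  1,  5, 100, 0}, {"B",  6, 10, 101, 0},
        {"A", 17,  2, 102, 0}, {"C", 19,  4, 103, 0},
        {"B", 23,  8, 104, 0}, {"C",  0,  0, 105, 1},
        {"D", 31,  7, 106, 0},
    };
    int entries = sizeof(index) / sizeof(index[0]);
    uint32_t snapshot = 0;   /* 0 = no snapshot; set it to also keep versions written at or before that time */
    uint32_t newoffset = 1;  /* next free offset in the compacted datafile */

    for(int i = 0; i < entries; i++) {
        idxentry_t *e = &index[i];

        /* keep an entry if it's the latest (non-deleted) version of its key,
         * or if it's old enough to be protected by the snapshot timestamp */
        int keep = (latest_of(index, entries, e->key) == i && !e->deleted)
                 || e->timestamp <= snapshot;

        if(!keep)
            continue;

        /* a real tool would copy the payload to the new datafile here;
         * we only print the new index entry */
        printf("key %s: new offset %u, length %u\n", e->key, newoffset, e->length);
        newoffset += e->length;
    }

    return 0;
}
```

Running this on the example prints exactly the "Compacted version" index shown below (A at offset 1, B at offset 3, D at offset 11).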

Example

Original state

INDEX FILE

Entry 1: Key 'A': offset/length: 1, 5
Entry 2: Key 'B': offset/length: 6, 10
Entry 3: Key 'A': offset/length: 17, 2
Entry 4: Key 'C': offset/length: 19, 4
Entry 5: Key 'B': offset/length: 23, 8
Entry 6: Key 'C': offset/length: 0, 0 [flag deleted enabled]
Entry 7: Key 'D': offset/length: 31, 7

DATA FILE

Offset 1 : Key 'A': [...data...]
Offset 6 : Key 'B': [...data...]
Offset 17: Key 'A': [...data...]
Offset 19: Key 'C': [...data...]
Offset 23: Key 'B': [...data...]
Offset 31: Key 'D': [...data...]

Compacted version

INDEX FILE

Entry 1: Key 'A': offset/length: 1, 2
Entry 2: Key 'B': offset/length: 3, 8
Entry 3: Key 'D': offset/length: 11, 7

DATA FILE

Offset 1: Key 'A': [...data...]
Offset 3: Key 'B': [...data...]
Offset 11: Key 'D': [...data...]
maxux self-assigned this Jun 4, 2018
maxux added this to the 0.1 milestone Jun 7, 2018
maxux commented Jun 26, 2018

Additional information

The workflow described above is not exactly what happens in practice; here is a more realistic one.

Preliminary information

We have two kinds of dataset to handle: data and index. In addition, we need to support all modes (basically the user-key family and the direct-key family).

  • Data files are, at any point in time, append-only; nothing is ever modified inside them.
  • Index files are immutable too, except in direct mode: when a deletion occurs, the deleted flag is changed in place in the index (EDIT: and history pointers, ...)

In direct mode, the index always needs to contain the exact amount of entries, existing or not. Keys used in direct mode point to the entry id in the index; if an entry were removed, the whole index would shift and everything would break.

To make this work in all modes without tricky special cases (to keep the code simple and generic), let's say we don't completely remove an entry when compacting: we just truncate its payload to 0 and drop the id. This way we always keep the same amount of entries without consuming much space, and index and data stay in sync in any mode.

Don't forget we need to be able to rebuild an index from data files only.
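To illustrate what "truncate its payload to 0 and drop the id" could mean when rewriting an entry, here is a minimal sketch; the on-disk layout below is purely illustrative (the real fixed header is 26 bytes and has more fields), and none of these names come from 0-db:

```c
#include <stdio.h>
#include <stdint.h>

/* hypothetical, much-reduced data entry header */
typedef struct {
    uint8_t idlength;     /* length of the id that follows, 0 when the id is dropped */
    uint32_t datalength;  /* payload length, 0 when truncated */
    uint32_t timestamp;
    uint8_t flags;
} __attribute__((packed)) dataheader_t;

/* write one entry to the compacted datafile, either in full or truncated */
int write_entry(FILE *out, dataheader_t header, const void *id, const void *payload, int keep) {
    if(!keep) {
        /* obsolete entry: keep the fixed header so the entry count (and thus
         * direct-mode ids) stays identical, but drop the id and the payload */
        header.idlength = 0;
        header.datalength = 0;
    }

    if(fwrite(&header, sizeof(header), 1, out) != 1)
        return -1;

    if(header.idlength && fwrite(id, header.idlength, 1, out) != 1)
        return -1;

    if(header.datalength && fwrite(payload, header.datalength, 1, out) != 1)
        return -1;

    return 0;
}

int main(void) {
    dataheader_t h = { .idlength = 6, .datalength = 4096, .timestamp = 0, .flags = 0 };
    unsigned char id[6] = { 1, 2, 3, 4, 5, 6 };
    static char payload[4096];

    FILE *out = fopen("zdb-data-compacted", "wb");
    if(!out)
        return 1;

    write_entry(out, h, id, payload, 1);  /* latest version: kept as-is */
    write_entry(out, h, id, payload, 0);  /* obsolete version: header only */

    fclose(out);
    return 0;
}
```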

Real example

Imagine a database with 100,000,000 entries, a 4 KB payload per entry, and a 6-byte id per entry. (This is close to the reality of direct mode used for storage.)

  • Payload: 390,000 MB
  • Overhead: 3051 MB
  • Total size: 393,051 MB

Overhead is computed as: (26-byte fixed header + 6-byte id) × entries

Imagine we remove half of it (50,000,000 keys).

  • Payload: 195,312 MB
  • Overhead: 2765 MB
  • Total size: 198,077 MB

Overhead is computed as: (26-byte fixed header + 6-byte id) × remaining entries + 26-byte fixed header (still there) × removed entries

By removing half of the keys, we free about 99.2% of the storage that half was using, without losing the structure of the datafile (still the same number of entries, and the index can still be rebuilt in any mode).
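For reference, a tiny back-of-the-envelope check of those figures (plain arithmetic, nothing 0-db specific; 1 MB = 1024 × 1024 bytes here, which gives 390,625 MB of payload rather than the rounded 390,000 MB above):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    const uint64_t MB = 1024 * 1024;
    uint64_t entries = 100000000ULL;  /* 100,000,000 entries */
    uint64_t payload = 4096;          /* 4 KB payload per entry */
    uint64_t header  = 26;            /* fixed header, bytes */
    uint64_t id      = 6;             /* id length, bytes */

    /* before compaction */
    printf("payload:  %llu MB\n", (unsigned long long)(entries * payload / MB));        /* 390625 */
    printf("overhead: %llu MB\n", (unsigned long long)(entries * (header + id) / MB));  /* 3051   */

    /* after removing half of the keys: removed entries keep only the
     * 26-byte fixed header (payload truncated to 0, id dropped) */
    uint64_t kept = entries / 2, removed = entries / 2;
    printf("payload:  %llu MB\n", (unsigned long long)(kept * payload / MB));                             /* 195312 */
    printf("overhead: %llu MB\n", (unsigned long long)((kept * (header + id) + removed * header) / MB));  /* 2765   */

    return 0;
}
```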

Implementation

The idea is to implement the data compaction as a standalone process. Once the data is compacted, you call an index-rebuild process, and you end up with a fresh, compacted database.

Keeping the index rebuild as an extra step keeps that tool useful for multiple purposes (rebuilding a broken index from the data files, or simply rebuilding the index after compaction); either way it's the same process, so there is no reason to complicate compaction by folding that step into it.

despiegk modified the milestones: 1.0.0, 1.1 Jul 24, 2018
maxux commented Dec 27, 2018

Hot Compaction

In order to make hot compaction available, this implementation can be used:

  • Do an offline compaction of all datafiles that are already full (all except the current one)
  • Reload the namespace once compacted

In order to avoid any concurrency issue, we should implement something new:

  • FREEZE and UNFREEZE commands
  • Freezing would put the namespace in a special state where any action on this specific namespace is refused, but with a special message telling the client to try again later (not an error)

With this workflow, we can freeze the namespace, replace the files, then reload-and-unfreeze the namespace without risk of corruption.
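A rough sketch of what a client-side driver for this workflow could look like, assuming the proposed FREEZE/UNFREEZE commands exist (they don't yet) and using hiredis since zdb speaks the Redis protocol; the exact reload invocation is deliberately left out:

```c
/* Hypothetical driver for the freeze / replace / reload / unfreeze workflow. */
#include <stdio.h>
#include <hiredis/hiredis.h>

static int simple_command(redisContext *ctx, const char *command) {
    redisReply *reply = redisCommand(ctx, "%s", command);

    if(!reply) {
        fprintf(stderr, "%s: %s\n", command, ctx->errstr);
        return -1;
    }

    printf("%s: %s\n", command, reply->str ? reply->str : "(ok)");
    freeReplyObject(reply);
    return 0;
}

int main(void) {
    redisContext *ctx = redisConnect("localhost", 9900); /* default zdb port */

    if(!ctx || ctx->err) {
        fprintf(stderr, "could not connect to zdb\n");
        return 1;
    }

    /* 1. freeze: the namespace starts answering "try again later" to clients */
    if(simple_command(ctx, "FREEZE"))
        return 1;

    /* 2. replace the already-full datafiles with their compacted versions
     *    (done out of band: offline compaction + index rebuild) */

    /* 3. ask zdb to reload the namespace from the new files (the reload
     *    mechanism exists; the exact command is omitted here), then unfreeze */
    simple_command(ctx, "UNFREEZE");

    redisFree(ctx);
    return 0;
}
```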

cc @zaibon

maxux commented Dec 5, 2019

In addition to this issue, the index-rebuild tool needs to be working in order to rebuild the index after data compaction is done.

maxux commented Dec 26, 2019

Here is how I currently implement compaction (work in progress):

  • Read the full dataset forward (beginning to end), entry per entry, and fill an in-memory-only index
  • This index is used to know which entry is the latest version of each key
  • Re-read the full dataset forward, checking whether each entry is the latest one in the in-memory index: if it is, copy it to the new dataset, otherwise truncate its payload (in addition, a timestamp check can be made to keep history up to a specific point in time)

When this process is done, the index-rebuild tool can be used on the new dataset to rebuild the index. This tool is now stable.

The current implementation uses the same index as zdb itself (it uses the in-memory side of the index through libzdb) and takes advantage of the same code; nothing new is implemented, so it benefits from fixes applied to the upstream library.

Reading the dataset is done the same way the integrity-check tool does it, which also uses the library intensively, but not intensively enough; some changes were made to the library to make this even more generic.

maxux added this to the later milestone Dec 14, 2020
maxux modified the milestones: later, next Dec 29, 2020
despiegk modified the milestones: next, later Feb 18, 2021
dumblob commented Mar 24, 2022

Any plans to support "incremental" partial compaction? That is, compacting the DB only partially in order to increase the maximum possible utilization of the disk space?

The problem I see is that - borrowing the example above - the whole DB (metadata + data) can occupy at most 0.5× the size of the underlying storage (or less than that, due to the underlying filesystem's behavior & overhead) to guarantee that compaction can succeed.

That's because it's only known approximately (due to the underlying filesystem's behavior & overhead) how much can actually be "compacted away".

If there were partial compaction support, we could dynamically determine how much space the compacted part of the DB will need and change the 0.5 factor to something much bigger (perhaps up to 0.98 or so in practice, in the case of very frequent partial compactions when the DB is "too big" for the underlying storage).
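To make the factor argument concrete (this is my own framing, not something from zdb): if compaction can rewrite 1/N of the database at a time, the free headroom needed shrinks from a full copy to a single slice, so the maximum safe utilization grows roughly like N/(N+1):

```c
#include <stdio.h>

int main(void) {
    for(int slices = 1; slices <= 64; slices *= 2) {
        /* the database may use slices/(slices+1) of the disk and still
         * leave room to rewrite one slice next to the original */
        double utilization = (double)slices / (slices + 1);
        printf("%2d slice(s): max utilization ~%.2f\n", slices, utilization);
    }
    return 0;
}
```

With one slice (whole-DB compaction) this gives the 0.5 factor; with 64 slices it approaches 0.98.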

This could also reduce the hot-reload freeze time, since we would reload only part of the DB (i.e. only the part which was compacted). Second, this could significantly increase the median throughput of the underlying storage (measured as short-term, few-seconds-long samples at random points in time), because it would not need to serve end-user writes at 50% speed while the compaction process (writing large amounts of data for a long time) takes the other 50% of the storage bandwidth.

When implementing this, care must be taken about which key-value pairs are compacted: selecting them blindly would harm e.g. the common use case of "adding terabytes of data slowly for months and then deleting half of them in one second" (which would produce lots of "delete" entries at the recent end of the DB; if partial compaction chose only the oldest part of the DB to compact, there would be hardly any "delete" entries there and thus almost nothing would get compacted in the end). Therefore a more sophisticated approach will be needed.

One more thing - the hot reload could be made much faster if the compacted DB did not need to be read again to build the index. Perhaps make the DB keep its index in shared memory, and make the compaction process use a separately allocated shared-memory region with an identical structure, so that in the end the compacted one can be handed over to the DB process in a nanosecond?

This "shared memory" optimization would presumably be a bit more memory hungry (because the compaction process probably uses a slightly more efficient index in memory), but I'm 100% sure the difference cannot be that huge (perhaps 5-20%, IDK 😉) and the gain would IMHO be tremendous.

Btw. you could also consider running two threads instead of two processes to make it all even more efficient. But maybe you already do that, so take this with a grain of salt 😉.

maxux commented Mar 24, 2022

The compaction process was always intended to be run offline (by an external process) and to rewrite the database files into new files. The new files don't need to be on the same storage medium. Running this process offline doesn't require locking the database (you can read the database from beginning to end without interfering with the running zdb server), which is quite important for a large database, even more so if a large part of the database needs to be rewritten. This also allows the process to run on another machine and not affect the memory usage of the running production machine.

The reload time only affects user-key mode, since sequential-key mode doesn't keep any keys in memory. Keeping a compaction array in sync with the live array just to know which keys are in use won't bring any advantage IMO, it would only introduce new bugs. Having a second in-memory index to hot-swap isn't a solution either if the in-memory index is e.g. ~200 GB and you don't have 200 GB of free memory.

Btw, zdb is single threaded by design :)

dumblob commented May 12, 2022

Oh, I see - there are multiple obstacles to my proposal. So my understanding is that unless the design of zdb changes (e.g. it being single-threaded), there is little reason to try what I proposed any time soon.
