
Compaction tools #35

Open · Tracked by #92

maxux opened this issue May 22, 2018 · 7 comments
maxux commented May 22, 2018

Please see the comments below for updates; this comment follows the original idea, which is not how it is implemented now.

Introduction

The database is append-only, so after a while it grows with an (expected) lot of garbage and data that is no longer used. Some form of compaction is needed.

The idea behind the compaction process is something that runs as an external process, does everything on copies, and leaves the original database untouched. As a first step, let's do everything offline, replace the existing database with the new compacted one, and restart 0-db. The next step would be giving 0-db a way to reload/rehash the new database on-the-fly, but that won't be part of this feature request.

EDIT: reload was implemented in the meantime.

Datafiles are split into chunks of 256 MB (this is configurable).
If there are 8 files, at the end there still need to be 8 files, even if some of them are empty.
The idea is really to clean these files; no reordering is done.

EDIT: nothing will be empty, no entries will be deleted.

Expected workflow

  • We need to read the whole index to know its last state
  • From that, we know, for each key, which entry is the latest version
  • Rewrite the datafile, discarding entries that are not used anymore
  • Update the index file with the new offset in the datafile if data moved

We also need to support the ability to not drop anything before a specific timestamp. Each entry has a timestamp; if some kind of snapshot has been made, we need to keep the history up to that timestamp to be able to roll back to that state.
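As an illustration of this workflow (not the actual 0-db code: the struct layout, the timestamps and the snapshot rule below are simplified assumptions), a single offline pass over the index could look like this, using the "Original state" example below as input:

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define MAXKEY 32

typedef struct {
    char key[MAXKEY];    /* simplified fixed-size key */
    uint32_t offset;     /* payload offset in the old datafile */
    uint32_t length;     /* payload length */
    uint32_t timestamp;  /* when this version was written */
    int deleted;         /* deletion flag */
} idxentry_t;

/* the index is append-only, so the *last* occurrence of a key is its latest version */
static int latest_of(idxentry_t *index, int entries, const char *key) {
    int last = -1;
    for(int i = 0; i < entries; i++)
        if(strncmp(index[i].key, key, MAXKEY) == 0)
            last = i;
    return last;
}

int main(void) {
    /* toy index reproducing the 'Original state' example below */
    idxentry_t index[] = {
        {"A",  1,  5, 100, 0}, {"B",  6, 10, 101, 0},
        {"A", 17,  2, 102, 0}, {"C", 19,  4, 103, 0},
        {"B", 23,  8, 104, 0}, {"C",  0,  0, 105, 1},
        {"D", 31,  7, 106, 0},
    };
    int entries = sizeof(index) / sizeof(index[0]);
    uint32_t snapshot = 0;   /* 0 = no snapshot; set it to also keep versions written at or before that time */
    uint32_t newoffset = 1;  /* next free offset in the compacted datafile */

    for(int i = 0; i < entries; i++) {
        idxentry_t *e = &index[i];

        /* keep an entry if it's the latest (non-deleted) version of its key,
         * or if it's old enough to be protected by the snapshot timestamp */
        int keep = (latest_of(index, entries, e->key) == i && !e->deleted)
                 || e->timestamp <= snapshot;

        if(!keep)
            continue;

        /* a real tool would copy the payload to the new datafile here;
         * we only print the new index entry */
        printf("key %s: new offset %u, length %u\n", e->key, newoffset, e->length);
        newoffset += e->length;
    }

    return 0;
}
```

Running this on the example prints exactly the "Compacted version" index shown below (A at offset 1, B at offset 3, D at offset 11).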

Example

Original state

INDEX FILE

Entry 1: Key 'A': offset/length: 1, 5
Entry 2: Key 'B': offset/length: 6, 10
Entry 3: Key 'A': offset/length: 17, 2
Entry 4: Key 'C': offset/length: 19, 4
Entry 5: Key 'B': offset/length: 23, 8
Entry 6: Key 'C': offset/length: 0, 0 [flag deleted enabled]
Entry 7: Key 'D': offset/length: 31, 7

DATA FILE

Offset 1 : Key 'A': [...data...]
Offset 6 : Key 'B': [...data...]
Offset 17: Key 'A': [...data...]
Offset 19: Key 'C': [...data...]
Offset 23: Key 'B': [...data...]
Offset 31: Key 'D': [...data...]

Compacted version

INDEX FILE

Entry 1: Key 'A': offset/length: 1, 2
Entry 2: Key 'B': offset/length: 3, 8
Entry 3: Key 'D': offset/length: 11, 7

DATA FILE

Offset 1: Key 'A': [...data...]
Offset 3: Key 'B': [...data...]
Offset 11: Key 'D': [...data...]
maxux self-assigned this Jun 4, 2018
maxux added this to the 0.1 milestone Jun 7, 2018
maxux commented Jun 26, 2018

Additional information

The workflow described above is not exactly what happens in practice; here is a more realistic one.

Preliminary information

We have two kinds of dataset to handle: data and index. In addition, we need to support all modes (basically the user-key family and the direct-key family).

  • Data files are, at any point in time, append-only; nothing is ever modified inside them.
  • Index files are immutable too, except in direct mode: when a deletion occurs, the deleted flag is changed in place in the index (EDIT: and history pointers, ...)

In direct mode, the index always needs to contain the exact amount of entries, existing or not. Keys used in direct mode point to the entry id in the index; if an entry were removed, the whole index would shift and everything would break.

To make this work in all modes without tricky special cases (to keep the code simple and generic), let's say we don't completely remove an entry when compacting: we just truncate its payload to 0 and drop the id. This way we always keep the same amount of entries without consuming much space, and index and data stay in sync in any mode.

Don't forget we need to be able to rebuild an index from data files only.
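To illustrate what "truncate its payload to 0 and drop the id" could mean when rewriting an entry, here is a minimal sketch; the on-disk layout below is purely illustrative (the real fixed header is 26 bytes and has more fields), and none of these names come from 0-db:

```c
#include <stdio.h>
#include <stdint.h>

/* hypothetical, much-reduced data entry header */
typedef struct {
    uint8_t idlength;     /* length of the id that follows, 0 when the id is dropped */
    uint32_t datalength;  /* payload length, 0 when truncated */
    uint32_t timestamp;
    uint8_t flags;
} __attribute__((packed)) dataheader_t;

/* write one entry to the compacted datafile, either in full or truncated */
int write_entry(FILE *out, dataheader_t header, const void *id, const void *payload, int keep) {
    if(!keep) {
        /* obsolete entry: keep the fixed header so the entry count (and thus
         * direct-mode ids) stays identical, but drop the id and the payload */
        header.idlength = 0;
        header.datalength = 0;
    }

    if(fwrite(&header, sizeof(header), 1, out) != 1)
        return -1;

    if(header.idlength && fwrite(id, header.idlength, 1, out) != 1)
        return -1;

    if(header.datalength && fwrite(payload, header.datalength, 1, out) != 1)
        return -1;

    return 0;
}

int main(void) {
    dataheader_t h = { .idlength = 6, .datalength = 4096, .timestamp = 0, .flags = 0 };
    unsigned char id[6] = { 1, 2, 3, 4, 5, 6 };
    static char payload[4096];

    FILE *out = fopen("zdb-data-compacted", "wb");
    if(!out)
        return 1;

    write_entry(out, h, id, payload, 1);  /* latest version: kept as-is */
    write_entry(out, h, id, payload, 0);  /* obsolete version: header only */

    fclose(out);
    return 0;
}
```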

Real example

Imagine a database with 100,000,000 entries, a 4 KB payload per entry, and a 6-byte id per entry. (This is close to the reality of direct mode used for storage.)

  • Payload: 390,000 MB
  • Overhead: 3051 MB
  • Total size: 393,051 MB

Overhead is computed as: (26-byte fixed header + 6-byte id) × entries

Imagine we remove half of it (50,000,000 keys).

  • Payload: 195,312 MB
  • Overhead: 2765 MB
  • Total size: 198,077 MB

Overhead is computed as: (26-byte fixed header + 6-byte id) × remaining entries + 26-byte fixed header (still there) × removed entries

By removing half of the keys, we free about 99.2% of the storage that half was using, without losing the structure of the datafile (still the same number of entries, and the index can still be rebuilt in any mode).
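For reference, a tiny back-of-the-envelope check of those figures (plain arithmetic, nothing 0-db specific; 1 MB = 1024 × 1024 bytes here, which gives 390,625 MB of payload rather than the rounded 390,000 MB above):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    const uint64_t MB = 1024 * 1024;
    uint64_t entries = 100000000ULL;  /* 100,000,000 entries */
    uint64_t payload = 4096;          /* 4 KB payload per entry */
    uint64_t header  = 26;            /* fixed header, bytes */
    uint64_t id      = 6;             /* id length, bytes */

    /* before compaction */
    printf("payload:  %llu MB\n", (unsigned long long)(entries * payload / MB));        /* 390625 */
    printf("overhead: %llu MB\n", (unsigned long long)(entries * (header + id) / MB));  /* 3051   */

    /* after removing half of the keys: removed entries keep only the
     * 26-byte fixed header (payload truncated to 0, id dropped) */
    uint64_t kept = entries / 2, removed = entries / 2;
    printf("payload:  %llu MB\n", (unsigned long long)(kept * payload / MB));                             /* 195312 */
    printf("overhead: %llu MB\n", (unsigned long long)((kept * (header + id) + removed * header) / MB));  /* 2765   */

    return 0;
}
```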

Implementation

The idea is to implement the data compaction as a standalone process. Once the data is compacted, you call an index-rebuild process, and you end up with a fresh, compacted database.

Keeping the index rebuild as an extra step keeps that tool useful for multiple purposes (rebuilding a broken index from the data files, or simply rebuilding the index after compaction); either way it's the same process, so there is no reason to complicate compaction by folding that step into it.

despiegk modified the milestones: 1.0.0, 1.1 Jul 24, 2018
maxux commented Dec 27, 2018

Hot Compaction

In order to make hot compaction available, this implementation can be used:

  • Do an offline compaction of all datafiles that are already full (all except the current one)
  • Reload the namespace once compacted

In order to avoid any concurrency issue, we should implement something new:

  • FREEZE and UNFREEZE commands
  • Freezing would put the namespace in a special state where any action on this specific namespace is refused, but with a special message telling the client to try again later (not an error)

With this workflow, we can freeze the namespace, replace the files, then reload-and-unfreeze the namespace without risk of corruption.
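A rough sketch of what a client-side driver for this workflow could look like, assuming the proposed FREEZE/UNFREEZE commands exist (they don't yet) and using hiredis since zdb speaks the Redis protocol; the exact reload invocation is deliberately left out:

```c
/* Hypothetical driver for the freeze / replace / reload / unfreeze workflow. */
#include <stdio.h>
#include <hiredis/hiredis.h>

static int simple_command(redisContext *ctx, const char *command) {
    redisReply *reply = redisCommand(ctx, "%s", command);

    if(!reply) {
        fprintf(stderr, "%s: %s\n", command, ctx->errstr);
        return -1;
    }

    printf("%s: %s\n", command, reply->str ? reply->str : "(ok)");
    freeReplyObject(reply);
    return 0;
}

int main(void) {
    redisContext *ctx = redisConnect("localhost", 9900); /* default zdb port */

    if(!ctx || ctx->err) {
        fprintf(stderr, "could not connect to zdb\n");
        return 1;
    }

    /* 1. freeze: the namespace starts answering "try again later" to clients */
    if(simple_command(ctx, "FREEZE"))
        return 1;

    /* 2. replace the already-full datafiles with their compacted versions
     *    (done out of band: offline compaction + index rebuild) */

    /* 3. ask zdb to reload the namespace from the new files (the reload
     *    mechanism exists; the exact command is omitted here), then unfreeze */
    simple_command(ctx, "UNFREEZE");

    redisFree(ctx);
    return 0;
}
```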

cc @zaibon

maxux commented Dec 5, 2019

In addition to this issue, the index-rebuild tool needs to be working in order to rebuild the index after data compaction is done.

maxux commented Dec 26, 2019

Here is how I currently implement compaction (work in progress):

  • Read the full dataset forward (beginning to end), entry per entry, and fill an in-memory-only index
  • This index is used to know which entry is the latest version of each key
  • Re-read the full dataset forward, checking whether each entry is the latest one in the in-memory index: if it is, copy it to the new dataset, otherwise truncate its payload (in addition, a timestamp check can be made to keep history up to a specific point in time)

When this process is done, the index-rebuild tool can be used on the new dataset to rebuild the index. This tool is now stable.

The current implementation uses the same index as zdb itself (it uses the in-memory side of the index through libzdb) and takes advantage of the same code; nothing new is implemented, so it benefits from fixes applied to the upstream library.

Reading the dataset is done the same way the integrity-check tool does it, which also uses the library intensively, but not intensively enough; some changes were made to the library to make this even more generic.

maxux added this to the later milestone Dec 14, 2020
maxux modified the milestones: later, next Dec 29, 2020
despiegk modified the milestones: next, later Feb 18, 2021
dumblob commented Mar 24, 2022

Any plans to support "incremental" partial compaction? That is, compacting the DB only partially in order to increase the maximum possible utilization of the disk space?

The problem I see is that - borrowing the example above - the whole DB (metadata + data) can occupy at most 0.5× the size of the underlying storage (or less than that, due to the underlying filesystem's behavior & overhead) to guarantee that compaction can succeed.

That's because it's only known approximately (due to the underlying filesystem's behavior & overhead) how much can actually be "compacted away".

If there were partial compaction support, we could dynamically determine how much space the compacted part of the DB will need and change the 0.5 factor to something much bigger (perhaps up to 0.98 or so in practice, in the case of very frequent partial compactions when the DB is "too big" for the underlying storage).
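To make the factor argument concrete (this is my own framing, not something from zdb): if compaction can rewrite 1/N of the database at a time, the free headroom needed shrinks from a full copy to a single slice, so the maximum safe utilization grows roughly like N/(N+1):

```c
#include <stdio.h>

int main(void) {
    for(int slices = 1; slices <= 64; slices *= 2) {
        /* the database may use slices/(slices+1) of the disk and still
         * leave room to rewrite one slice next to the original */
        double utilization = (double)slices / (slices + 1);
        printf("%2d slice(s): max utilization ~%.2f\n", slices, utilization);
    }
    return 0;
}
```

With one slice (whole-DB compaction) this gives the 0.5 factor; with 64 slices it approaches 0.98.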

This could also reduce the hot-reload freeze time, since we would reload only part of the DB (i.e. only the part which was compacted). Second, this could significantly increase the median throughput of the underlying storage (measured as short-term, few-seconds-long samples at random points in time), because it would not need to serve end-user writes at 50% speed while the compaction process (writing large amounts of data for a long time) takes the other 50% of the storage bandwidth.

When implementing this, care must be taken about which key-value pairs are compacted: selecting them blindly would harm e.g. the common use case of "adding terabytes of data slowly for months and then deleting half of them in one second" (which would produce lots of "delete" entries at the recent end of the DB; if partial compaction chose only the oldest part of the DB to compact, there would be hardly any "delete" entries there and thus almost nothing would get compacted in the end). Therefore a more sophisticated approach will be needed.

One more thing - the hot reload could be made much faster if the compacted DB did not need to be read again to build the index. Perhaps make the DB keep its index in shared memory, and make the compaction process use a separately allocated shared-memory region with an identical structure, so that in the end the compacted one can be handed over to the DB process in a nanosecond?

This "shared memory" optimization would presumably be a bit more memory hungry (because the compaction process probably uses a slightly more efficient index in memory), but I'm 100% sure the difference cannot be that huge (perhaps 5-20%, IDK 😉) and the gain would IMHO be tremendous.

Btw. you could also consider running two threads instead of two processes to make it all even more efficient. But maybe you already do that, so take this with a grain of salt 😉.

maxux commented Mar 24, 2022

The compaction process was always intended to be run offline (by an external process) and to rewrite the database files into new files. The new files don't need to be on the same storage medium. Running this process offline doesn't require locking the database (you can read the database from beginning to end without interfering with the running zdb server), which is quite important for a large database, even more so if a large part of the database needs to be rewritten. This also allows the process to run on another machine and not affect the memory usage of the running production machine.

The reload time only affects user-key mode, since sequential-key mode doesn't keep any keys in memory. Keeping a compaction array in sync with the live array just to know which keys are in use won't bring any advantage IMO, it would only introduce new bugs. Having a second in-memory index to hot-swap isn't a solution either if the in-memory index is e.g. ~200 GB and you don't have 200 GB of free memory.

Btw, zdb is single threaded by design :)

dumblob commented May 12, 2022

Oh, I see - there are multiple obstacles to my proposal. So my understanding is that unless the design of zdb changes (e.g. it being single-threaded), there is little reason to try what I proposed any time soon.
