Compaction tools #35
Comments
Additional information

The workflow described above is not exactly what happens in practice; here is a more realistic one.

Preliminary information

We have two kinds of dataset to change:

In direct mode, the index always needs to contain the exact amount of entries, existing or not. Keys used in direct mode point to the entry id in the index; if an entry is removed, the whole index gets shifted and everything breaks. To make this work in every mode without tricky special cases (to keep the code simple and generic), let's say we don't completely remove an entry when compacting: we just truncate its payload to 0 and drop the id. This way we always keep the same amount of entries without consuming too much space, and index and data can always stay in sync, in any mode. Don't forget we need to be able to rebuild an index from the data files alone.
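To make the principle concrete, here is a minimal sketch in C, assuming a simplified entry header (field names and layout are illustrative only, not the actual libzdb on-disk format):

```c
#include <stdint.h>

/* Simplified on-disk entry header; the real libzdb layout differs,
 * this only illustrates the principle. */
typedef struct entry_header_t {
    uint8_t  idlength;    /* length of the id following the header  */
    uint32_t datalength;  /* length of the payload following the id */
    uint8_t  flags;       /* status flags, e.g. "deleted"           */
    uint32_t timestamp;   /* insertion time                         */
} entry_header_t;

/* "Removing" an entry during compaction: the header (and thus the slot)
 * is kept so the entry count and direct-mode offsets stay valid, but
 * the id is dropped and the payload is truncated to 0. */
static void truncate_entry(entry_header_t *header) {
    header->idlength = 0;     /* drop the id               */
    header->datalength = 0;   /* truncate the payload to 0 */
}
```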
Real example

Imagine a db with 100,000,000 entries, with a 4 KB payload per entry, and each id is 6 bytes. (This is about the reality for

Overhead is computed:

Now imagine we remove half of it (50,000,000 keys).
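As a rough back-of-the-envelope check of these numbers (the 32-byte per-entry header below is an assumption picked for illustration, not the actual 0-db header size):

```c
/* Back-of-the-envelope numbers for the example above. The per-entry
 * header size (32 bytes) is an assumption for illustration; the real
 * 0-db entry header may differ. */
#include <stdio.h>

int main(void) {
    const unsigned long long entries = 100000000ULL;  /* 100,000,000 entries */
    const unsigned long long payload = 4096ULL;       /* 4 KB payload each   */
    const unsigned long long idlen   = 6ULL;          /* 6-byte id each      */
    const unsigned long long header  = 32ULL;         /* assumed header size */
    const double GB = 1024.0 * 1024.0 * 1024.0;

    unsigned long long entry_size = header + idlen + payload;
    unsigned long long full_size  = entries * entry_size;

    /* Compacting half of the entries: payload truncated to 0 and id
     * dropped, but the header stays so the entry count never changes. */
    unsigned long long compacted_size =
        (entries / 2) * entry_size + (entries / 2) * header;

    printf("full dataset     : %.1f GB\n", full_size / GB);
    printf("after compaction : %.1f GB\n", compacted_size / GB);
    printf("reclaimed per removed entry: %.1f%%\n",
           100.0 * (1.0 - (double)header / (double)entry_size));
    return 0;
}
```

With these assumed sizes, each compacted entry shrinks from roughly 4 KB to a ~32-byte header, which is where a "~99.2% reclaimed" figure comes from.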
Overhead is computed:

By removing half of the keys, we removed 99.2% of the data on the real storage, without losing the structure of the datafile (still the same amount of entries, and the index can still be rebuilt in any mode).

Implementation

The idea is to implement the data compaction in a standalone process. When the data has been compacted, you call an index-rebuild process and you get a fresh new db, compacted. Keeping the index rebuild as an extra step keeps this tool usable for multiple purposes (rebuilding an index from the data file when the index is broken, or just rebuilding the index after compaction; it's the same process anyway, so why complicate the compaction by adding this step).
Hot Compaction

In order to make hot compaction available, this implementation can be used:

In order to avoid any concurrency issues, we should implement something new:

With this workflow, we can freeze the namespace, replace the files, then reload and unfreeze the namespace without risk of corruption. cc @zaibon
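A hypothetical driver for this freeze / replace / reload workflow could look like the sketch below, talking RESP to 0-db through the hiredis client. The FREEZE/RELOAD command names, the "default" namespace and the file paths are illustrative assumptions, not the actual 0-db command set:

```c
#include <stdio.h>
#include <stdlib.h>
#include <hiredis/hiredis.h>

int main(void) {
    redisContext *zdb = redisConnect("127.0.0.1", 9900);
    if (!zdb || zdb->err) {
        fprintf(stderr, "cannot reach 0-db\n");
        return 1;
    }

    /* 1. freeze the namespace so no write lands in files being replaced */
    redisReply *reply = redisCommand(zdb, "FREEZE %s", "default");
    if (reply)
        freeReplyObject(reply);

    /* 2. swap in the files produced offline by the compaction and
     *    index-rebuild tools (paths purely illustrative)              */
    if (system("mv /data/default-compacted/* /data/default/") != 0) {
        fprintf(stderr, "could not replace datafiles\n");
        redisFree(zdb);
        return 1;
    }

    /* 3. ask 0-db to reload the namespace and serve traffic again */
    reply = redisCommand(zdb, "RELOAD %s", "default");
    if (reply)
        freeReplyObject(reply);

    redisFree(zdb);
    return 0;
}
```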
In addition, in this issue, the
Here is how I currently implement compaction (in progress):
When this process is done,

The current implementation uses the same index as zdb itself (it uses the memory side of the index through libzdb) and takes advantage of the same code; nothing new is implemented, so it benefits from fixes applied to the upstream lib. Reading the dataset is done the same way it's done on
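For illustration, here is a rough sketch of such a sequential pass over one datafile, reusing the simplified header from before. The keep/drop decision is reduced here to a per-entry "deleted" flag, whereas the real tool consults the in-memory index built through libzdb to know whether a later version of the key exists:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Simplified entry header, same caveat as before: illustrative only. */
typedef struct {
    uint8_t  idlength;
    uint32_t datalength;
    uint8_t  flags;        /* FLAG_DELETED marks an obsolete entry */
    uint32_t timestamp;
} entry_t;

#define FLAG_DELETED 0x01

/* Copy one datafile into its compacted counterpart: live entries are
 * copied verbatim, obsolete entries keep only a zeroed header so the
 * total entry count (and direct-mode offsets) never changes. */
static int compact_datafile(FILE *in, FILE *out) {
    entry_t header;

    while (fread(&header, sizeof(header), 1, in) == 1) {
        size_t bodylen = (size_t)header.idlength + header.datalength;
        uint8_t *body = malloc(bodylen ? bodylen : 1);

        if (!body || fread(body, 1, bodylen, in) != bodylen) {
            free(body);
            return -1;   /* allocation failure or truncated input */
        }

        if (header.flags & FLAG_DELETED) {
            entry_t empty = { .idlength = 0, .datalength = 0,
                              .flags = FLAG_DELETED, .timestamp = header.timestamp };
            fwrite(&empty, sizeof(empty), 1, out);   /* keep the slot only */
        } else {
            fwrite(&header, sizeof(header), 1, out); /* copy entry as-is   */
            fwrite(body, 1, bodylen, out);
        }

        free(body);
    }

    return 0;
}
```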
Any plans to support "incremental" partial compaction? That is, to compact the DB only partially, to increase the maximum possible utilization of the disk space?

The problem I see is that, borrowing the example above, the whole DB (metadata + data) can have a maximum of 0.5x the size of the underlying storage (or less than that due to underlying filesystem behavior & overhead) to guarantee the compaction can succeed. It's because it's not known well enough (only approximately, due to the underlying filesystem behavior & overhead) how much can actually be "compacted away". If there was partial compaction support, we could dynamically determine how much space the compacted part of the DB will need and change the 0.5 factor to something much bigger (perhaps up to 0.98 or so in practice, in case of very frequent partial compactions due to the DB being "too big" for the underlying storage).

This could also alleviate the hot-reload freeze time, as we would reload only part of the DB (i.e. only the part which was compacted). Second, this could significantly increase the median throughput of the underlying storage (measured as short-term, few-seconds-long measurements at random points in time), as it would not need to perform the end-user writes at 50% speed because the compaction process (writing large amounts of data for long times) is taking the other 50% of the storage speed.

When implementing this, care must be taken as to which key-value pairs are being compacted, as blindly selecting them would harm e.g. the common use case of "adding terabytes of data slowly for months and then deleting half of them in one second" (which would produce lots of "delete" entries at the recent end of the DB; if partial compaction chose only the oldest part of the DB to compact, there would be hardly any "delete" entries there and thus almost nothing would get compacted in the end). Therefore a more sophisticated approach will be needed.

One more thing: the hot reload could be made much faster if the compacted DB did not need to be read again to build the index. Perhaps make the DB keep the index in shared memory and make the compaction process also use shared memory with an identical structure but allocated separately, and in the end just hand the compacted one over to the DB process in a nanosecond? This "shared memory" optimization will presumably be a bit more memory hungry (because the compaction process probably uses a slightly more efficient index in memory) but I'm 100% sure the difference can not be that huge (perhaps 5-20%, IDK 😉) and the gain would be IMHO tremendous.

Btw. you could also consider running two threads instead of two processes to make it all even more efficient. But maybe you already do that, so take this with a grain of salt 😉.
The compaction process was always intended to be run offline (by an external process) and to rewrite the database files to new files. The new files don't need to be on the same storage medium. Running this process offline doesn't require locking the database (you can read the database from beginning to end without interfering with the running zdb server), which is quite important for a large database, even more so if a large part of the database needs to be rewritten. This also allows the process to be run on another machine and not affect the running production machine's memory usage.

The reload time only affects user-key mode, since sequential-key mode doesn't keep any keys in memory. Keeping a compaction array in sync with the live array to know which keys are in use or not won't bring any advantage IMO, except introducing new bugs. Having a second in-memory index to hot-swap won't be a solution if the in-memory index is e.g. ~200 GB and you don't have 200 GB of free memory.

Btw,
Oh, I see - there are multiple obstacles to my proposal. So my understanding is that unless the design of
Introduction
The database is append-only; after a while it will grow with an (expected) lot of garbage and data that is not used anymore. Some compaction needs to happen.

The idea behind the compaction process is something which runs as an external process, does everything on copies and leaves the original database untouched. In a first step, let's do everything offline, replace the existing database with the new compacted one, and restart the 0-db. The next step is providing 0-db with a way to reload/rehash the new database on-the-fly, but that won't be part of this feature request.
EDIT: reload was implemented in the meantime.
Datafiles are split into chunks of 256 MB (this is configurable).
If there are 8 files, at the end there still need to be 8 files, even if some of them are empty.
The idea is really to clean these files; no reordering is done.
EDIT: nothing will be empty, no entries will be deleted.
Expected workflow
We need to support the ability to not drop anything before a specific timestamp. Each entry has a timestamp; if some kind of snapshot has been made, we need to keep the history up to that timestamp to be able to roll back to that state.
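A minimal sketch of such a retention check, assuming the compaction tool already knows whether an entry has been superseded or deleted and is given the snapshot timestamp to preserve (all names are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

/* Decide whether an obsolete entry may be compacted away (payload
 * truncated to 0), given a snapshot timestamp whose state must stay
 * reconstructible. Illustrative only. */
static bool can_truncate_entry(uint32_t entry_timestamp,
                               bool superseded_or_deleted,
                               uint32_t snapshot_timestamp) {
    if (!superseded_or_deleted)
        return false;   /* still the live version of its key */

    if (entry_timestamp <= snapshot_timestamp)
        return false;   /* history needed to roll back to the snapshot */

    return true;        /* obsolete and newer than the snapshot */
}
```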
Example
Original state
INDEX FILE
DATA FILE
Compacted version
INDEX FILE
DATA FILE
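As a purely hypothetical illustration of what such a before/after could look like (ids, lengths and layout are made up and do not reflect the actual 0-db file format):

```
Original state

  DATA FILE                                     INDEX FILE
  entry 0: id="key-a"  payload=4096 bytes       key-a -> entry 0
  entry 1: id="key-b"  payload=4096 bytes       key-b -> entry 1
  entry 2: id="key-a"  payload=4096 (update)    key-a -> entry 2
  entry 3: id="key-b"  payload=0    (delete)    key-b -> deleted

Compacted version

  DATA FILE                                     INDEX FILE
  entry 0: id=""       payload=0   (truncated)  (empty slot)
  entry 1: id=""       payload=0   (truncated)  (empty slot)
  entry 2: id="key-a"  payload=4096             key-a -> entry 2
  entry 3: id=""       payload=0   (truncated)  (empty slot)
```

Only the latest live version keeps its payload; the entry count stays the same, so direct-mode offsets remain valid and the index can still be rebuilt from the data file alone.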