feat: improve estimated disk usage #1676
Conversation
overall the PR looks good
```diff
@@ -252,7 +256,7 @@ impl<TContentKey: OverlayContentKey> IdIndexedV1Store<TContentKey> {
         let content_id = content_id.to_vec();
         let content_key = content_key.to_bytes().to_vec();
-        let content_size = content_id.len() + content_key.len() + content_value.len();
+        let content_size = Self::calculate_content_size(&content_id, &content_key, &content_value);
```
this helper function seems a little overkill
Is the implication that we might (someday) calculate it differently for different content stores?
I wanted to extract it into a separate function because currently:
- we use it in two different places (here and in tests), and it has to be the same value
- this is the value that is stored in the `column_size` column in the db, and I don't always remember how the value is calculated (i.e. does it take other columns into consideration, like `distance_short` and `column_size` itself, is `content_id` counted twice because it's part of another index, etc.). So instead of looking at the `insert` function to see how this value is calculated, I would rather have a dedicated function.

I documented the function to explain its purpose better.

On a side note, we will need a new store for ephemeral content, and I might need some of this logic there as well. I think that this function should probably go together with `extra_disk_usage_per_content_bytes`, as they are somewhat related. So I might extract them into a separate file. Stay tuned for future PRs.
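For reference, a minimal sketch of what such a helper might look like, matching the formula replaced in the diff above (the doc comment and exact signature are illustrative assumptions, not the trin implementation):

```rust
/// Returns the value stored in the `column_size` column: the combined length of the
/// content id, content key and content value. Other columns (e.g. `distance_short`)
/// and index overhead are intentionally not counted here.
fn calculate_content_size(content_id: &[u8], content_key: &[u8], content_value: &[u8]) -> usize {
    content_id.len() + content_key.len() + content_value.len()
}
```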
LGTM! 🚢
> big storage nodes - These would be impacted the most because their fragmentation is very low (so far they only insert and never delete content). Running this PR would cause them to delete 1/3 of their content, which would:
> - run pruning for a very long time (making them unresponsive for quite some time)

Do you think this is something we need to address/prevent? One option would be to simultaneously increase the config of how much data the big nodes are allowed to store, so we don't trigger a huge prune. But maybe it's just fine to trigger a huge prune.

I always get a little nervous when there is a refactor mixed in with a functional change, that there will be a bug that looks like an innocuous part of the refactor. It would help me if it were split into commits next time, if you don't want to split it into different PRs.
Interesting how big an effect the vacuum can have, from just the fragmentation! I guess I can make some sense of that, when each record can have such a different size. That would make it harder to reuse a deleted record than for a more uniform database. I don't think it's a top priority, but I can imagine wanting to schedule an occasional vacuum. I still lean against turning autovacuum on, though.
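For context, a hypothetical sketch (not part of trin) of how one might measure how much of the SQLite file sits on the free list and run an occasional VACUUM based on that, assuming the store is accessed through the `rusqlite` crate:

```rust
use rusqlite::Connection;

/// Rough fragmentation estimate: fraction of pages on SQLite's free list.
/// (Partially filled pages are not captured, so this is a lower bound.)
fn free_page_ratio(conn: &Connection) -> rusqlite::Result<f64> {
    let free_pages: i64 = conn.query_row("PRAGMA freelist_count", [], |row| row.get(0))?;
    let total_pages: i64 = conn.query_row("PRAGMA page_count", [], |row| row.get(0))?;
    Ok(free_pages as f64 / total_pages.max(1) as f64)
}

/// Run VACUUM only when enough of the file is reclaimable. VACUUM rewrites the
/// whole file and blocks other writers, which is why running it occasionally
/// (rather than enabling autovacuum) is being weighed here.
fn maybe_vacuum(conn: &Connection, threshold: f64) -> rusqlite::Result<()> {
    if free_page_ratio(conn)? > threshold {
        conn.execute_batch("VACUUM")?;
    }
    Ok(())
}
```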
> Do you think this is something we need to address/prevent? One option would be to simultaneously increase the config of how much data the big nodes are allowed to store, so we don't trigger a huge prune. But maybe it's just fine to trigger a huge prune.

We will have to update flags. Currently we specify the capacity as 500 GB, because estimated disk usage is ~400 GB while actual usage is ~640 GB. This was done so that when the estimated usage grows up to 500 GB, with a decent amount of fragmentation, we still stay under the disk limit.

With this change, estimated disk usage would grow to ~1.4 TB (while we would still actually use only 640 GB).

If I set the storage capacity to 900 GB (that is the plan), we would have to prune around 1/3 of the content (900 GB is roughly 64% of 1.4 TB).
Rollout plan would be the following:
- pick one big-storage node
- backup db on my machine (just in case, hopefully not needed)
- deploy and observe logs while pruning is happening
- once it's done, check the radius, remaining content and estimated disk usage, and verify that they are at the expected levels
- if not, roll back to the original db and debug
- if everything went ok but pruning takes too long (most likely), deploy to the other big-storage nodes in 2-3 phases
> I always get a little nervous when there is a refactor mixed in with a functional change, that there will be a bug that looks like an innocuous part of the refactor. It would help me if it were split into commits next time, if you don't want to split it into different PRs.
I will try to do it in the future.
What was wrong?
We underestimate the disk usage. See #1653 for details.
How was it fixed?
Based on the analysis in #1653, I added a fixed number of bytes per content item that is used when estimating disk usage (on top of the already counted bytes for content_id, content_key and content_value).
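For illustration, a minimal sketch of the idea (the function name and the constant's value are placeholders, not the actual trin implementation):

```rust
// Hypothetical per-item overhead covering indexes, row metadata and other columns;
// the real value comes from the analysis in #1653.
const EXTRA_DISK_USAGE_PER_CONTENT_BYTES: u64 = 100; // placeholder

// Estimated disk usage = raw content bytes already counted + fixed overhead per item.
fn estimated_disk_usage(total_content_size: u64, entry_count: u64) -> u64 {
    total_content_size + entry_count * EXTRA_DISK_USAGE_PER_CONTENT_BYTES
}
```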
This way, the data that is actually stored in the db (`content_size = content_id.len() + content_key.len() + content_value.len()`) doesn't have to change, and we can easily update our estimate in the future if we need to.

How was it tested?
I had a local trin client that had been running for some time (~20h) last week. Storage was configured to use 1 GB of data, with all 3 subnetworks enabled (state and history had 500 MB capacity each). The actual size of the DB was 1447 MB (45% more than desired).

Starting a trin client with this PR would trigger pruning at startup. Pruning would run until our estimated disk usage is below 1 GB. This, however, doesn't affect the actual disk usage; to reclaim it, we would have to run "VACUUM".

Before running this code, I made a copy of the db. That way I was able to re-run the code and confirm that the results are stable.
Run 1: Prune + Vacuum
I started a new trin client (with this PR) and observed that:
Run 2: Vacuum + Prune + Vacuum
I ran VACUUM on the original DB, which decreased the db size to 1154 MB.
Then I repeated the steps from the first run and observed that:
Conclusion
It's worth observing that running VACUUM on the original db decreased its size from 1447 MB to 1154 MB. That means fragmentation added ~25% disk usage. If we assume the same fragmentation ratio, the resulting db would occupy 984 MB instead of 785 MB (785 MB × 1447/1154 ≈ 984 MB), which would leave us just under the target capacity, with a bit of room to spare.

This fragmentation ratio is not fixed. It depends on the content that is stored (it's different between state and history content) and on how long the client has been running (a longer-running client tends to have higher fragmentation, though there is probably a limit). The original DB had a radius between 2% and 4% for both the history and state networks, which implies that there was already significant fragmentation (if I remember correctly, I ran the code before the max 5% radius PR was merged); this suggests that fragmentation might not be significantly higher even in the worst case.
Things to consider
This would result in all our nodes (or at least the ones that have reached their capacity) trimming content at startup. Different nodes are affected differently: