feat: improve estimated disk usage #1676
Conversation
overall the PR looks good
```diff
@@ -252,7 +256,7 @@ impl<TContentKey: OverlayContentKey> IdIndexedV1Store<TContentKey> {
         let content_id = content_id.to_vec();
         let content_key = content_key.to_bytes().to_vec();
-        let content_size = content_id.len() + content_key.len() + content_value.len();
+        let content_size = Self::calculate_content_size(&content_id, &content_key, &content_value);
```
this helper function seems a little overkill
Is the implication that we might (someday) calculate it differently for different content stores?
I wanted to extract it into a separate function because currently:
- we use it in two different places (here and in tests), and it has to be the same value
- this is the value that is stored in the `column_size` column in the db, and I don't always remember how the value is calculated (i.e. does it take other columns into consideration, like `distance_short` and `column_size` itself, is `content_id` counted twice because it's part of another index, etc.). So instead of looking at the `insert` function to see how this value is calculated, I would rather have a dedicated function.

I documented the function to explain its purpose better.

On a side note, we will need a new store for ephemeral content, and I might need some of this logic there as well. I think that this function should probably go together with `extra_disk_usage_per_content_bytes`, as they are somewhat related. So I might extract them into a separate file. Stay tuned for future PRs.
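For reference, a minimal sketch of what such a helper might look like, matching the formula replaced in the diff above (the doc comment and exact signature are illustrative assumptions, not the trin implementation):

```rust
/// Returns the value stored in the `column_size` column: the combined length of the
/// content id, content key and content value. Other columns (e.g. `distance_short`)
/// and index overhead are intentionally not counted here.
fn calculate_content_size(content_id: &[u8], content_key: &[u8], content_value: &[u8]) -> usize {
    content_id.len() + content_key.len() + content_value.len()
}
```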
LGTM! 🚢
> big storage nodes - These would be impacted the most because their fragmentation is very low (so far they only insert and never delete content). Running this PR would cause them to delete 1/3 of their content, which would:
> - run pruning for a very long time (making them unresponsive for quite some time)

Do you think this is something we need to address/prevent? One option would be to simultaneously increase the config of how much data the big nodes are allowed to store, so we don't trigger a huge prune. But maybe it's just fine to trigger a huge prune.

I always get a little nervous when there is a refactor mixed in with a functional change, that there will be a bug that looks like an innocuous part of the refactor. It would help me if it were split into commits next time, if you don't want to split it into different PRs.
Interesting how big an effect the vacuum can have, from just the fragmentation! I guess I can make some sense of that, when each record can have such a different size. That would make it harder to reuse a deleted record than for a more uniform database. I don't think it's a top priority, but I can imagine wanting to schedule an occasional vacuum. I still lean against turning autovacuum on, though.
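For context, a hypothetical sketch (not part of trin) of how one might measure how much of the SQLite file sits on the free list and run an occasional VACUUM based on that, assuming the store is accessed through the `rusqlite` crate:

```rust
use rusqlite::Connection;

/// Rough fragmentation estimate: fraction of pages on SQLite's free list.
/// (Partially filled pages are not captured, so this is a lower bound.)
fn free_page_ratio(conn: &Connection) -> rusqlite::Result<f64> {
    let free_pages: i64 = conn.query_row("PRAGMA freelist_count", [], |row| row.get(0))?;
    let total_pages: i64 = conn.query_row("PRAGMA page_count", [], |row| row.get(0))?;
    Ok(free_pages as f64 / total_pages.max(1) as f64)
}

/// Run VACUUM only when enough of the file is reclaimable. VACUUM rewrites the
/// whole file and blocks other writers, which is why running it occasionally
/// (rather than enabling autovacuum) is being weighed here.
fn maybe_vacuum(conn: &Connection, threshold: f64) -> rusqlite::Result<()> {
    if free_page_ratio(conn)? > threshold {
        conn.execute_batch("VACUUM")?;
    }
    Ok(())
}
```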
> Do you think this is something we need to address/prevent? One option would be to simultaneously increase the config of how much data the big nodes are allowed to store, so we don't trigger a huge prune. But maybe it's just fine to trigger a huge prune.

We will have to update flags. Currently we specify the capacity as 500 GB, because estimated disk usage is ~400 GB while actual usage is ~640 GB. This was done so that when the estimated usage grows up to 500 GB, with a decent amount of fragmentation, we still stay under the disk limit.

With this change, estimated disk usage would grow to ~1.4 TB (while we would still actually use only 640 GB).

If I set the storage capacity to 900 GB (that is the plan), we would have to prune around 1/3 of the content (900 GB is roughly 64% of 1.4 TB).
Rollout plan would be the following:
- pick one big-storage node
- backup db on my machine (just in case, hopefully not needed)
- deploy and observe logs while pruning is happening
- once it's done, check the radius, remaining content and estimated disk usage, and verify that they are at the expected levels
- if not, roll back to the original db and debug
- if everything went ok but pruning takes too long (most likely), deploy to the other big-storage nodes in 2-3 phases
> I always get a little nervous when there is a refactor mixed in with a functional change, that there will be a bug that looks like an innocuous part of the refactor. It would help me if it were split into commits next time, if you don't want to split it into different PRs.
I will try to do it in the future.
What was wrong?
We underestimate the disk usage. See #1653 for details.
How was it fixed?
Based on the analysis in #1653, I added a fixed number of bytes per content item that is used when estimating disk usage (on top of the already counted bytes for content_id, content_key and content_value).
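For illustration, a minimal sketch of the idea (the function name and the constant's value are placeholders, not the actual trin implementation):

```rust
// Hypothetical per-item overhead covering indexes, row metadata and other columns;
// the real value comes from the analysis in #1653.
const EXTRA_DISK_USAGE_PER_CONTENT_BYTES: u64 = 100; // placeholder

// Estimated disk usage = raw content bytes already counted + fixed overhead per item.
fn estimated_disk_usage(total_content_size: u64, entry_count: u64) -> u64 {
    total_content_size + entry_count * EXTRA_DISK_USAGE_PER_CONTENT_BYTES
}
```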
This way, the data that is actually stored in the db (`content_size = content_id.len() + content_key.len() + content_value.len()`) doesn't have to change, and we can easily update our estimate in the future if we need to.

How was it tested?
I had a local trin client that had been running for some time (~20h) last week. Storage was configured to use 1 GB of data, with all 3 subnetworks enabled (state and history had 500 MB capacity each). The actual size of the DB was 1447 MB (45% more than desired).

Starting a trin client with this PR would trigger pruning at startup. Pruning would run until our estimated disk usage is below 1 GB. This, however, doesn't affect the actual disk usage; to reclaim it, we would have to run "VACUUM".

Before running this code, I made a copy of the db. That way I was able to re-run the code and confirm that the results are stable.
Run 1: Prune + Vacuum
I started a new trin client (with this PR) and observed that:
Run 2: Vacuum + Prune + Vacuum
I ran VACUUM on the original DB, which decreased the db size to 1154 MB.
Then I repeated the steps from the first run and observed that:
Conclusion
It's worth observing that running VACUUM on the original db decreased its size from 1447 MB to 1154 MB. That means fragmentation added ~25% disk usage. If we assume the same fragmentation ratio, the resulting db would occupy 984 MB instead of 785 MB (785 MB × 1447/1154 ≈ 984 MB), which would leave us just under the target capacity, with a bit of room to spare.

This fragmentation ratio is not fixed. It depends on the content that is stored (it's different between state and history content) and on how long the client has been running (a longer-running client tends to have higher fragmentation, though there is probably a limit). The original DB had a radius between 2% and 4% for both the history and state networks, which implies that there was already significant fragmentation (if I remember correctly, I ran the code before the max 5% radius PR was merged); this suggests that fragmentation might not be significantly higher even in the worst case.
Things to consider
This would result in all our nodes (or at least the ones that have reached their capacity) trimming content at startup. Different nodes are affected differently: