
CCIP-1496 Moving pruning to a separate loop and respecting paging there #12060

Merged: 11 commits into develop, Feb 23, 2024

Conversation

@mateusz-sekara (Collaborator) commented Feb 16, 2024

Motivation

The current implementation of pruning expired logs has two major drawbacks:

  • It removes all expired logs within a single database call. This can generate high pressure (especially IO) when removing a large number of logs at once (> 100k).
  • During long prunes/deletes, the regular PollAndSave is not executed in parallel, because there is a single LogPoller routine per chain. Long-running deletes therefore block polling and can impact a product's availability.

Solution

The solution is to remove data in batches instead of everything at once. Paging not only limits locking but also gives us finer control over the operation, allowing the database to handle other transactions between batches. To isolate pruning from LogPoller's main routine, we've also moved it to a separate routine.
Besides that, delete queries now report the number of records deleted (also tracked with a Prometheus metric). This improves our visibility into how fast logs are created and deleted, and based on this data we should be able to adjust the page size or prune threshold if necessary.

To keep backward compatibility, prune paging is enabled only when LogPrunePageSize is defined. Once merged, this should therefore not impact any existing products until LogPrunePageSize is set.

Contributor

I see that you haven't updated any README files. Would it make sense to do so?

@@ -147,6 +147,7 @@ func NewLogPoller(orm ORM, ec Client, lggr logger.Logger, pollPeriod time.Durati
backfillBatchSize: backfillBatchSize,
rpcBatchSize: rpcBatchSize,
keepFinalizedBlocksDepth: keepFinalizedBlocksDepth,
logPrunePageSize: logsPrunePageSize,
Collaborator (author):

I've added this param as a backward compatibility layer. Products already using "delete all" queries will not be affected by this PR; a product has to explicitly specify that it needs paging.

@reductionista (Contributor):

Since this adds a new toml config param, we should also mention it in docs/CHANGELOG.md

@mateusz-sekara mateusz-sekara changed the title Moving pruning to a separate loop and respecting paging there CCIP-1496 Moving pruning to a separate loop and respecting paging there Feb 21, 2024
@mateusz-sekara mateusz-sekara marked this pull request as ready for review February 21, 2024 15:31
@mateusz-sekara mateusz-sekara requested review from a team as code owners February 21, 2024 15:31
@@ -272,7 +272,7 @@ func (o *DbORM) DeleteExpiredLogs(limit int64, qopts ...pg.QOpt) (int64, error)
GROUP BY evm_chain_id, address, event
HAVING NOT 0 = ANY(ARRAY_AGG(retention))
) r ON l.evm_chain_id = $1 AND l.address = r.address AND l.event_sig = r.event
AND l.block_timestamp <= STATEMENT_TIMESTAMP() - (r.retention / 10^9 * interval '1 second')
Contributor:

Curious to hear your reasoning for the block_timestamp -> created_at change.

I was thinking block_timestamp makes more sense, because how long we retain a log seems like it should be based on when the log was emitted rather than when we happened to find it. For example, if a log is a week old and the retention is 24 hours, but the node happens to have been down for a week... when we spin up the node it would backfill that log and then retain it for another 24 hours. Seems like it ought to just get rid of it immediately since it's been way more than 24 hours since it was emitted.

@reductionista (Contributor) commented Feb 22, 2024

That actually makes me think of another optimization I should add: if a log comes back from the RPC server already past its retention time, we shouldn't save it to the db in the first place. I don't think that check is in there yet, but it should be. (Unless we do change it to created_at; I'd like to hear if there is a use case for that.)

Collaborator (author):

At first I wanted to use block_timestamp (#12060), but I'm not sure about replaying logs older than the retention period; they would be immediately wiped from the db.

@mateusz-sekara mateusz-sekara added this pull request to the merge queue Feb 22, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 22, 2024
@mateusz-sekara mateusz-sekara added this pull request to the merge queue Feb 23, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 23, 2024
@mateusz-sekara mateusz-sekara added this pull request to the merge queue Feb 23, 2024
Merged via the queue into develop with commit 5f212bb Feb 23, 2024
97 checks passed
@mateusz-sekara mateusz-sekara deleted the lp-deletin-in-batches branch February 23, 2024 07:49
mateusz-sekara added a commit to smartcontractkit/ccip that referenced this pull request Mar 20, 2024
## Motivation

The goal of this PR is to reduce the number of logs and blocks we keep
in the database by utilizing LogPoller's built-in retention mechanism.
It requires paging for smooth deployment:
smartcontractkit/chainlink#12060

## Solution

This PR enables retention for all the LogPoller's filters registered by
CCIP. Additionally, to avoid pushing too much pressure during deletion
(especially the first run will have a lot of logs to remove) we've
updated `LogPrunePageSize` to 10k. Please see the original PR in the
chainlink repo to learn more about paging and its impact on the
database. `LogPrunePageSize` is altered in the `fallback.toml` to avoid
the necessity of setting this value for every chain that CCIP is
deployed on.
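A fallback.toml fragment along these lines would set the page size once for every chain; the exact table name and placement depend on the chain config layout, so treat this as a hypothetical sketch rather than the actual file contents:

```toml
# Sketch only: section name assumed. A LogPrunePageSize of 0 would keep
# the old single-delete behaviour; 10k bounds each delete batch.
[EVM]
LogPrunePageSize = 10000
```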
asoliman92 pushed a commit to smartcontractkit/ccip that referenced this pull request Jul 31, 2024