Reduce memory consumption of RecentBlockCache #2102

Merged: 12 commits into main on Dec 1, 2023
Conversation

MartinquaXD
Contributor

Description

Our RecentBlockCache works like this:

  1. somebody requests liquidity
  2. the cache checks if it's already known
  3. if it's not in the cache, query the blockchain
  4. store the result in the cache
  5. remember the requested liquidity source for updating it in the background

Whenever we see a new block we fetch the current liquidity for all the remembered liquidity sources and write it to the cache together with its block number. We have a maximum cache duration; whenever the cached state exceeds that duration we remove the oldest entries.
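Roughly, the cache has this shape (simplified sketch only; the actual type in shared::recent_block_cache has more configuration and remembers the recently used keys in a size-limited cache rather than a plain set):

use std::collections::{BTreeMap, HashSet};

// Simplified, illustrative layout. `K` identifies a liquidity source
// (e.g. a token pair), `V` is the liquidity fetched for it.
struct RecentBlockCache<K: Ord, V> {
    // Cached liquidity keyed by (block number, liquidity source).
    entries: BTreeMap<(u64, K), Vec<V>>,
    // Sources that were requested recently and get refreshed on every new block.
    recently_used: HashSet<K>,
    // Entries older than this many blocks get evicted.
    number_of_blocks_to_cache: u64,
}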

This implementation uses more memory than necessary in two ways:

  1. We can fetch liquidity for quotes. For those requests it's okay to return liquidity that is not 100% up-to-date. However, we still remember the requested liquidity source for future updates. This is not great because we can receive quote requests for all sorts of random tokens we'll never see again.
  2. We cache state for the same liquidity source for multiple blocks. But the cache only has 2 access patterns:
    • "Give me the most recent liquidity available on the blockchain"
    • "Give me the most recent liquidity available in the cache"
      There is no access pattern like "give me cached liquidity specifically from an older block with number X".
      That means it's enough to keep only the most recent data cached for any liquidity pool at any point in time.

We can see these 2 things at play with this log.
After ~1h of operation it shows a single RecentBlockCache holding ~20K items. In an average auction we fetch ~800 uni v2 sources, and we currently cache up to 10 blocks' worth of data. That means we have roughly 8K cache entries for liquidity that is needed in auctions and ~12K entries that are only needed for quotes.
Also, this is only for a single UniV2-like liquidity source; in total we have 4 different ones configured in our backend.

Changes

We address (1) by not remembering liquidity sources for background updates when they were only requested for quotes.
We address (2) by throwing away all the duplicated data, keeping only the most recent state per liquidity source.
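Conceptually, the two changes boil down to something like this (hedged sketch on the simplified layout described above; identifiers such as CacheFetching are illustrative and not the exact names in the diff):

use std::collections::{BTreeMap, HashSet};
use std::hash::Hash;

// Hypothetical flag describing why liquidity is being fetched.
enum CacheFetching {
    // Needed for settling an auction: keep the data fresh in the background.
    Auction,
    // Needed for a quote only: slightly stale data is acceptable.
    Quote,
}

// Change (1): only auction requests register their keys for background
// updates; quote requests just reuse whatever happens to be cached.
fn remember_keys<K: Hash + Eq>(
    recently_used: &mut HashSet<K>,
    keys: Vec<K>,
    fetching: CacheFetching,
) {
    if matches!(fetching, CacheFetching::Auction) {
        recently_used.extend(keys);
    }
}

// Change (2): walk the cache entries from the newest block to the oldest and
// clear the values of every key that already has data at a newer block, so
// only the most recent state per liquidity source keeps its memory.
fn drop_duplicated_entries<K: Ord + Hash + Clone, V>(entries: &mut BTreeMap<(u64, K), Vec<V>>) {
    let mut seen = HashSet::new();
    for ((_block, key), values) in entries.iter_mut().rev() {
        if !seen.insert(key.clone()) {
            *values = vec![];
        }
    }
}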

How to test

I did a manual setup where I ran an autopilot locally in shadow mode (fetching the auction from prod mainnet) and a driver with all liquidity sources enabled.
I collected 3 graphs in total to measure the impact of this change on memory consumption.

  1. This graph shows the status quo (very noisy and not really reproducible across runs).
0_current_no_optimizations
  2. This graph applies one optimization that is not part of this PR to make memory consumption more predictable across runs. I want to merge that optimization as well, but right now it's very hacky. However, I will include it in all my graphs because it makes the impact of each optimization easier to spot.
1_with_univ2_call_optimization
  3. This graph shows the effects of this PR's optimization. The memory usage is more stable overall and grows less over time.
2_cache_eviction

@sunce86 (Contributor) left a comment

(2) is a safe one, I don't think anyone would care if liquidity for older blocks is gone.

For (1), do you expect this to significantly affect liquidity fetching times even for deep and well-known liquidity pools like ETH/USDC and similar?
Btw, I know it's not a popular way of solving problems in our system, but I do think (at some point) we should optimize our system to be:

  1. super fast and reliable for quoting and solving the most used and well-known tokens
  2. less reliable for quoting and solving rare tokens.

@MartinquaXD (Contributor, Author)

For (1), do you expect this to significantly affect liquidity fetching times even for deep and well-known liquidity pools like ETH/USDC and similar?

Liquidity fetching times for any tokens that are part of auctions should not be affected at all, and I expect the mentioned pool to always be part of an auction.
For those fringe tokens that only show up during quotes we'll fetch the pool once and keep reusing it for a while. This means that a user who stays on the page and waits for the price to change will likely receive the exact same price from quasimodo and baseline until we evict the cached pool after 10 blocks, but other price estimators that don't rely on the liquidity we fetch will continue to provide the most up-to-date prices.

@fleupold (Contributor) left a comment

Nice find, but I'm a bit sceptical whether this will fix the steady memory increase we see over a 3h time horizon in shadow (in your graphs I also don't really see the last one being significantly flatter than the 2nd).

I'd expect the fact that we already limit the number of keys we keep up to date to 1000 and the existing remove_cached_blocks_older_than logic to still evict pools regularly and lead to some upper bound of memory usage that should be reached after a few minutes (not hours).


E.g. look at these two logs:

  1. from 09:30 UTC: DEBUG uniswap_v2_cache: shared::recent_block_cache: the cache now contains entries for 27713 block-key combinations
  2. from 12:30 UTC: DEBUG uniswap_v2_cache: shared::recent_block_cache: the cache now contains entries for 22954 block-key combinations

Fewer values in the latter, yet memory consumption is 2x higher.
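(For context, a hedged sketch of what the age-based eviction mentioned above, remove_cached_blocks_older_than, plausibly does, based purely on its name and the description in this thread rather than the actual code:)

use std::collections::BTreeMap;

// Illustrative only: drop every (block, key) entry whose block is older
// than the oldest block we still want to keep cached.
fn remove_cached_blocks_older_than<K: Ord, V>(
    entries: &mut BTreeMap<(u64, K), Vec<V>>,
    oldest_to_keep: u64,
) {
    entries.retain(|(block, _key), _values| *block >= oldest_to_keep);
}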

// remember the key for future background updates of the cached
// liquidity.
for key in found_keys {
    mutexed.recently_used.cache_set(key, ());
}
Contributor

We also call cache_set in get. Should the same logic apply there? (also a 🤓 reminder that having this cache_set in two code paths is error-prone)

Contributor Author

Good catch. I'm not sure how we can get around the 2 cache_set()s nicely though.
If we only have a single one at the end of the main function, we won't remember any keys for background updates when requests time out after having to fetch all liquidity fresh (e.g. when we restart the system).
Will leave the 2 cache_set()s for now and keep thinking about ways to improve that.

let mut cached_keys = HashSet::new();
for ((_block, key), values) in self.entries.iter_mut().rev() {
    if !cached_keys.insert(key) {
        *values = vec![];
    }
}
Contributor

Is it possible that for new entries this vector is getting longer and longer for some reason (i.e. a bug with the BalancerPoolFetcher)?

Since we are now iterating over all entries anyway, maybe we can log the total number of values remaining after pruning?

Contributor Author

Is it possible that for new entries this vector is getting longer and longer for some reason (i.e. a bug with the BalancerPoolFetcher)?

I have no evidence that this is impossible. I also had the hunch that we could have a bug along those lines, so I temporarily added an assertion verifying that all values are unique. It didn't cause any crashes in my 5-minute test, so I'd say we probably don't have an issue with that.

Since we are now iterating over all entries anyway, maybe we can log the total number of values remaining after pruning?

Makes sense, will add that.
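For reference, a hedged sketch of what such a log line could look like, assuming the entries layout from the excerpt above; the field name and message are illustrative:

use std::collections::BTreeMap;

// Illustrative only: count the values that survived pruning and log them.
fn log_remaining_values<K: Ord, V>(entries: &BTreeMap<(u64, K), Vec<V>>) {
    let remaining: usize = entries.values().map(Vec::len).sum();
    tracing::debug!(remaining, "values remaining in the cache after pruning");
}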

@fleupold (Contributor)

🤯 Speaking of this log, I noticed we never actually see a log line for BalancerV2 liquidity. Could it be that we never actually prune the recent block cache for this data source? This would explain the leak.

It looks like the run_maintenance of UniswapV2 pool_cache is called here:

tracing::info_span!("maintenance", block)
    .in_scope(|| async move {
        if let Err(err) = pool_cache.run_maintenance().await {
            tracing::warn!(?err, "error updating pool cache");
        }
    })
    .await;

Whereas I don't see run_maintenance being called for balancer v2 pool_fetching.

Also I'm not sure how we handle Uni v3 liquidity?

@MartinquaXD enabled auto-merge (squash) December 1, 2023 09:02
@MartinquaXD merged commit 1fe4c44 into main Dec 1, 2023
8 checks passed
@MartinquaXD deleted the recent-block-cache-gc branch December 1, 2023 09:46