bug: Colocated solvers timing out on prod mainnet #2041
# Description

The driver has a routine to prioritize and filter orders based on the balance the trader currently has. With the introduction of pre-interactions this check became much trickier, because a user might not have any sell_tokens at the moment but unstakes the needed tokens in a pre-interaction. Because of that we introduced a way to get the actual tradable balance by simulating a balance check using a helper contract. This is very accurate but unfortunately also very slow (~10s just for fetching the balances of all traders in an auction). Luckily most of our orders don't have pre-interactions, so we can fall back to a simpler and faster way to query the current balance.

# Changes

Reimplement parts of the balance query of [Balances.sol](https://github.com/cowprotocol/services/blob/main/crates/contracts/solidity/Balances.sol#L59-L76) using simple `eth_call`s and use that for any order group that does not contain any pre-interactions. While testing locally, this reduced the time to fetch all balances to ~1.2-2 seconds.

## How to test

Covered by e2e tests.

## Related Issues

#2041
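To make the fallback concrete, here is a minimal Rust sketch of the decision, with hypothetical type and function names (the actual driver code is structured differently): order groups without pre-interactions take the cheap `eth_call` path, everything else keeps the accurate-but-slow simulation.

```rust
/// Hypothetical stand-ins for the driver's order types.
struct Interaction;
struct Order {
    pre_interactions: Vec<Interaction>,
}

/// Which balance-query strategy to use for a group of orders that share
/// the same trader and sell token.
enum BalanceQuery {
    /// Plain `eth_call`s (ERC-20 `balanceOf`/`allowance` style): fast, and
    /// correct whenever no pre-interaction can change the balance.
    Fast,
    /// Simulate the Balances.sol helper contract so balances that only
    /// materialize after pre-interactions execute are accounted for: slow.
    Simulated,
}

fn choose_query(group: &[Order]) -> BalanceQuery {
    if group.iter().all(|order| order.pre_interactions.is_empty()) {
        BalanceQuery::Fast
    } else {
        BalanceQuery::Simulated
    }
}
```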
A bit more information:

There are several issues here: …

I would suggest the following: …

This way, 20s would be given to …
# Description

Tackles checkboxes 1 and 2 from #2041 (comment).

![timeouts](https://github.com/cowprotocol/services/assets/19595624/0f7490e7-6c36-4278-bbcf-9a4ca0f91f1a)

# Changes

This is my attempt to make the timeouts more explicit and clear. First, we split the deadline given to the driver into 3 pieces:

1. Solving time given to solvers.
2. Competition time, used to validate, merge, simulate, score and rank solutions.
3. HTTP delay, to cover potentially slow propagation of the HTTP response back to the autopilot (`http_delay`).

The time is split in the following way: competition time and HTTP delay are values read from the configuration (so hardcoded), while the solving time is whatever is left after deducting those two from the deadline given to the driver. (A minimal sketch of this split follows this description.)

It's important to note that:

1. The HTTP delay is deducted from the driver deadline at the very beginning of the solve/quote handling, because it is non-domain information; the core domain logic doesn't care about network delays.
2. The competition time is deducted from the deadline in the domain, because how we split the given time between solving and competition is actually domain related. For some solvers we want 19s for solving and 1s for competition, while for others, like Baseline, we might want 10s for solving and 10s for competition.

The default values are set to something like:

- /solve: 20s given to driver, HTTP delay 0.5s, competition 4.5s, solving time 15s
- /quote: 3s given to driver, HTTP delay 0.5s, competition time 1s, solving time 1.5s

Release notes: check the default timeouts and whether they need to be adjusted.
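A minimal sketch of that split, using illustrative names rather than the actual config structs:

```rust
use std::time::{Duration, Instant};

/// Illustrative config: the two reserved slices are read from
/// configuration, the solving time is whatever remains.
struct Timeouts {
    /// Reserved for propagating the HTTP response back to the autopilot.
    http_delay: Duration,
    /// Reserved for validating, merging, simulating, scoring and ranking.
    competition: Duration,
}

impl Timeouts {
    /// Deadline handed to the solver engines.
    fn solving_deadline(&self, driver_deadline: Instant) -> Instant {
        driver_deadline - self.http_delay - self.competition
    }
}

fn main() {
    // The /solve defaults from the description: 20s given to the driver,
    // 0.5s http delay, 4.5s competition => 15s left for solving.
    let timeouts = Timeouts {
        http_delay: Duration::from_millis(500),
        competition: Duration::from_millis(4_500),
    };
    let driver_deadline = Instant::now() + Duration::from_secs(20);
    let solving_deadline = timeouts.solving_deadline(driver_deadline);
    assert_eq!(driver_deadline - solving_deadline, Duration::from_secs(5));
}
```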
# Description

One of the things that makes liquidity fetching unnecessarily slow at the moment is that we try to fetch data for pools that we know are missing over and over again.

# Changes

The lowest-level components that fetch Balancer and Uniswap v2 liquidity now keep track of pools that could not be found. Whenever a request for one of those pools comes in, it gets omitted. Currently we deny-list pools forever; eventually we should have a background task that periodically forgets those deny-listed pools, just in case somebody deploys a new one. (A sketch of this bookkeeping follows below.)

## Related Issues

#2041
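A sketch of that bookkeeping (illustrative names, not the actual fetcher components), using a plain string as the pool id:

```rust
use std::{collections::HashSet, sync::RwLock};

/// Remembers pools that could not be found so we stop asking for them.
struct MissingPools {
    denied: RwLock<HashSet<String>>,
}

impl MissingPools {
    /// Drops deny-listed pools from a fetch request before it is issued.
    fn filter_requests(&self, requested: Vec<String>) -> Vec<String> {
        let denied = self.denied.read().unwrap();
        requested.into_iter().filter(|id| !denied.contains(id)).collect()
    }

    /// Called when a fetch confirms the pool does not exist.
    fn mark_missing(&self, id: String) {
        self.denied.write().unwrap().insert(id);
    }

    /// The eventual background task would call this periodically so a
    /// pool deployed later at the same id gets picked up again.
    fn forget_all(&self) {
        self.denied.write().unwrap().clear();
    }
}
```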
# Description

Because all the traits and APIs around liquidity fetching are very weird, it's pretty hard to keep track of which liquidity pools are not worth fetching.

# Changes

1. This PR introduces request-sharing logic in the `RecentBlockCache`. With that, it shouldn't matter how many solvers have to fetch liquidity in parallel, as they all share the underlying RPC requests. (See the sketch after this list.)
2. Additionally, I modified the logic that keeps track of recently used keys to only contain ids of pools we were actually able to fetch. This is important because those keys are used to update the cache in a background task, and too many useless keys in there would waste a lot of work during those background updates.
3. Moved `RequestSharing` garbage collection to a background task instead of running it on every insert. This is mostly relevant because we now use this component for liquidity fetching, which issues **many** requests.

This improvement is actually pretty huge: together with the previous two liquidity changes it brought the liquidity fetching time down from ~8s to 0.7-1.5s, which should be fast enough for the Gnosis solvers.

Fixes: #2041
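A sketch of the request-sharing idea (a hypothetical shape, not the crate's actual `RequestSharing` API), built on `futures::future::Shared`: concurrent callers asking for the same key await one shared future instead of issuing duplicate RPC requests, and garbage collection is a separate method so it can run in a background task.

```rust
use std::{collections::HashMap, hash::Hash, sync::Mutex};

use futures::future::{BoxFuture, FutureExt, Shared};

struct RequestSharing<K, V: Clone> {
    in_flight: Mutex<HashMap<K, Shared<BoxFuture<'static, V>>>>,
}

impl<K: Hash + Eq, V: Clone + 'static> RequestSharing<K, V> {
    fn new() -> Self {
        Self { in_flight: Mutex::new(HashMap::new()) }
    }

    /// Returns the in-flight future for `key` if one exists; otherwise
    /// registers `compute` as the single shared request. If the key is
    /// already present, `compute` is dropped without ever running.
    fn shared(&self, key: K, compute: BoxFuture<'static, V>) -> Shared<BoxFuture<'static, V>> {
        let mut in_flight = self.in_flight.lock().unwrap();
        in_flight.entry(key).or_insert_with(|| compute.shared()).clone()
    }

    /// Intended to run periodically in a background task instead of on
    /// every insert: completed requests no longer need to be shared.
    fn collect_garbage(&self) {
        let mut in_flight = self.in_flight.lock().unwrap();
        in_flight.retain(|_, request| request.peek().is_none());
    }
}
```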
Problem
When testing colocated drivers on prod mainnet we saw basically all of them timing out.
This is concerning for 2 reasons:

1. The `driver` is not working.
2. Ideally, we should never see timeouts on the autopilot side. Instead, the `driver` should abort the computation right before the deadline and return 0 solutions.

Impact
Colocation is basically not usable like this on prod.
To reproduce
The only way is to apply colocation on prod mainnet, because that is where we have the required traffic to trigger the issues.
Expected behaviour
The driver should never time out and we should not see a reduced number of proposed solutions compared to the current setup.
Screenshots/logs