
bug: Colocated solvers timing out on prod mainnet #2041

Closed
MartinquaXD opened this issue Nov 6, 2023 · 1 comment · Fixed by #2061 or #2065
Labels
bug Something isn't working E:3.1 Driver Colocation See https://github.com/cowprotocol/pm/issues/14 for details


Problem

When testing colocated drivers on prod mainnet we saw basically all of them timing out.
This is concerning for two reasons:

  • something is making the whole process surprisingly slow (maybe contention on some lock 🤷‍♂️)
  • our timeout management in the driver is not working. Ideally we should never see timeouts on the autopilot side; instead the driver should abort the computation right before the deadline and return 0 solutions.

Impact

Colocation is basically not usable like this on prod.

To reproduce

The only way is to enable colocation on prod mainnet, because that is where we have the traffic required to trigger the issue.

Expected behaviour

The driver should never time out, and we should not see a reduced number of proposed solutions compared to the current setup.

Screenshots/logs

logs

@MartinquaXD MartinquaXD added bug Something isn't working E:3.1 Driver Colocation See https://github.com/cowprotocol/pm/issues/14 for details labels Nov 6, 2023
MartinquaXD added a commit that referenced this issue Nov 7, 2023
# Description
The driver has a routine to prioritize and filter orders based on the balance the
trader currently has.
With the introduction of pre-interactions this check became much trickier, because
a user might not have any sell_tokens at the moment but unstakes the needed tokens
in a pre-interaction. Because of that we introduced a way to get the actual
tradable balance by simulating a balance check using a helper contract.
This is very accurate but unfortunately also very slow (~10s just for fetching the
balances of all traders in an auction).
Luckily most of our orders don't have pre-interactions, so we can fall back to a
simpler and faster way to query the current balance.

# Changes
Reimplement parts of the balance query of
[balances.sol](https://github.com/cowprotocol/services/blob/main/crates/contracts/solidity/Balances.sol#L59-L76)
using simple `eth_call`s and use that for any order group that does not
contain any pre-interactions.
While testing locally this reduced the time to fetch all balances to
~1.2–2 seconds.
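The routing between the two query strategies can be sketched as follows. This is a minimal illustration, not the actual driver code; the type and function names (`BalanceQuery`, `OrderGroup`, `pick_query`) are invented for the example.

```rust
// Hypothetical sketch: pick the cheap balance query for order groups
// without pre-interactions, fall back to simulation otherwise.

#[derive(Debug, PartialEq)]
enum BalanceQuery {
    /// Plain `eth_call`s to the token contract; fast (~1.2-2s per auction).
    Simple,
    /// Simulation via the Balances.sol helper; accurate but slow (~10s).
    Simulated,
}

struct OrderGroup {
    /// Placeholder for the group's encoded pre-interactions.
    pre_interactions: Vec<String>,
}

fn pick_query(group: &OrderGroup) -> BalanceQuery {
    if group.pre_interactions.is_empty() {
        // Nothing can change the balance before settlement, so the
        // current on-chain balance already is the tradable balance.
        BalanceQuery::Simple
    } else {
        // A pre-interaction (e.g. unstaking) may create balance that a
        // plain query cannot see, so simulate with the helper contract.
        BalanceQuery::Simulated
    }
}

fn main() {
    let plain = OrderGroup { pre_interactions: vec![] };
    let staked = OrderGroup { pre_interactions: vec!["unstake()".into()] };
    println!("{:?}", pick_query(&plain));  // Simple
    println!("{:?}", pick_query(&staked)); // Simulated
}
```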

## How to test
covered by e2e tests

## Related Issues
#2041
@sunce86
Copy link
Contributor

sunce86 commented Nov 15, 2023

A bit more information:

The autopilot gives the driver a deadline to return the solution. Currently, the driver is given 15s.
The driver then gives 14s to the solvers for computation. Approximately ~1s is left for the driver to simulate solutions, do the merging, simulate merged settlements, do the scoring...

There are several issues here:

  1. The driver is left with 1s, but this deadline is not enforced anywhere, leading to timeout responses in the autopilot.
  2. The 1s is hardcoded and can't be changed via config.
  3. The 1s is the same for all solvers, which is not appropriate. Some solvers return a single solution while others return many. Some solvers return early, others spend the full time given to them (14s).
  4. The network delay for sending the response back to the autopilot is not taken into consideration.

I would suggest the following:

  • Add a time buffer for network delay (500ms) and make it configurable. The value is the same for all solvers. Subtract this time from the auction::Deadline on the driver side. Refactor driver timeouts #2061
  • Make the 1s time configurable per solver (or, if that is hard, at least configurable globally).
  • Respect the deadline in the driver and return a partial result in case of timeout. Partially tackled with Respect driver deadline in solution processing #2059
  • Change the deadline from 15s to 20s.

This way, 20s would be given to the driver: 0.5s reserved for network delay, 14s for solving, and 4.5s for encoding, simulations, merging, and scoring.
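The first suggestion above boils down to one subtraction on the driver side. A minimal sketch, assuming the buffer is a configurable `Duration` (the function name is invented for illustration):

```rust
use std::time::Duration;

// Shrink the deadline received from the autopilot by a configurable
// network-delay buffer before handing out any time budgets, so the
// response still arrives before the autopilot's own timeout fires.
fn effective_budget(autopilot_deadline: Duration, http_delay: Duration) -> Duration {
    // saturating_sub avoids an underflow panic if the buffer is misconfigured
    // to be larger than the deadline itself.
    autopilot_deadline.saturating_sub(http_delay)
}

fn main() {
    let budget = effective_budget(Duration::from_secs(20), Duration::from_millis(500));
    println!("{budget:?}");
}
```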

@fleupold fleupold linked a pull request Nov 20, 2023 that will close this issue
sunce86 added a commit that referenced this issue Nov 21, 2023
# Description
Tackles checkboxes 1 and 2 from
#2041 (comment)


![timeouts](https://github.com/cowprotocol/services/assets/19595624/0f7490e7-6c36-4278-bbcf-9a4ca0f91f1a)

# Changes
This is my attempt to make timeouts more explicit and clear.
First, we split the deadline given to the driver into 3 pieces:
1. Solving time, given to the solvers.
2. Competition time, used to validate, merge, simulate, score and rank
solutions.
3. Http delay, to cover potentially slow propagation of the http response back
to the autopilot (`http_delay`).

The time is split in the following way: the competition time and http delay
are fixed values read from the configuration, while the solving time is
whatever is left after deducting those two from the deadline given to the
driver.

It's important to note that:
1. The http delay is deducted from the driver deadline at the very beginning of
the solve/quote handling, because it is non-domain information. Core
domain logic doesn't care about network delays.
2. The competition time is deducted from the deadline in the domain, because how
we split the given time between solving and competition is actually domain
related. For some solvers we want 19s for solving and 1s for
competition, while for others like Baseline we might want 10s for
solving and 10s for competition.

Default values are set to something like:
/solve: 20s given to the driver, http delay is 0.5s, competition time is 4.5s,
solving time is 15s
/quote: 3s given to the driver, http delay is 0.5s, competition time is 1s,
solving time is 1.5s

Release notes: check the default timeouts and if they need to be
adjusted.
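The three-way split described above can be sketched in a few lines. This is an illustration under assumed names (`Timeouts`, `solving_time` are not necessarily the real config keys), using the default values from the description:

```rust
use std::time::Duration;

/// The two configured pieces of the driver deadline.
struct Timeouts {
    http_delay: Duration,
    competition: Duration,
}

impl Timeouts {
    /// Whatever remains after deducting the two configured pieces from the
    /// deadline given to the driver is handed to the solvers.
    fn solving_time(&self, driver_deadline: Duration) -> Duration {
        driver_deadline
            .saturating_sub(self.http_delay)
            .saturating_sub(self.competition)
    }
}

fn main() {
    // /solve defaults: 20s total, 0.5s http delay, 4.5s competition -> 15s solving.
    let solve = Timeouts {
        http_delay: Duration::from_millis(500),
        competition: Duration::from_millis(4_500),
    };
    println!("{:?}", solve.solving_time(Duration::from_secs(20))); // 15s

    // /quote defaults: 3s total, 0.5s http delay, 1s competition -> 1.5s solving.
    let quote = Timeouts {
        http_delay: Duration::from_millis(500),
        competition: Duration::from_secs(1),
    };
    println!("{:?}", quote.solving_time(Duration::from_secs(3))); // 1.5s
}
```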
MartinquaXD added a commit that referenced this issue Nov 22, 2023
# Description
One of the things that makes liquidity fetching unnecessarily slow at
the moment is that we try to fetch data for pools that we know are
missing over and over again.

# Changes
The lowest-level components that fetch Balancer and Uni v2 liquidity now
keep track of pools that could not be found.
Whenever a request for one of those pools comes in, it gets omitted.

Currently we deny-list those pools forever. Eventually we should have a
background task that periodically forgets the deny-listed pools, just
in case somebody deploys a new one.
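The deny-list idea can be sketched with a plain `HashSet`. The names here (`MissingPools`, `mark_missing`, `filter_request`, `forget_all`) are invented for illustration; `forget_all` stands in for the suggested periodic background cleanup:

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Tracks pool ids that a previous fetch reported as missing so that
/// later requests can skip them.
struct MissingPools<Id: Hash + Eq> {
    missing: HashSet<Id>,
}

impl<Id: Hash + Eq> MissingPools<Id> {
    fn new() -> Self {
        Self { missing: HashSet::new() }
    }

    /// Record pools that the last fetch could not find.
    fn mark_missing(&mut self, ids: impl IntoIterator<Item = Id>) {
        self.missing.extend(ids);
    }

    /// Drop known-missing pools from a request before it hits the RPC node.
    fn filter_request(&self, requested: Vec<Id>) -> Vec<Id> {
        requested
            .into_iter()
            .filter(|id| !self.missing.contains(id))
            .collect()
    }

    /// Periodically forget everything, in case a missing pool gets
    /// deployed later.
    fn forget_all(&mut self) {
        self.missing.clear();
    }
}

fn main() {
    let mut deny_list = MissingPools::new();
    deny_list.mark_missing(["0xdead".to_string()]);
    let filtered = deny_list.filter_request(vec!["0xdead".into(), "0xbeef".into()]);
    println!("{filtered:?}"); // only "0xbeef" survives
}
```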

## Related Issues
#2041
MartinquaXD added a commit that referenced this issue Nov 27, 2023
# Description
Because all the traits and APIs around liquidity fetching are very weird,
it's pretty hard to keep track of which liquidity pools are not worth
fetching.

# Changes
1. This PR introduces request sharing logic in the `RecentBlockCache`.
With that it shouldn't matter how many solvers have to fetch liquidity
in parallel, as they all share the underlying RPC requests.
2. Additionally, I modified the logic that keeps track of recently used
keys to only contain ids of pools we were able to fetch. This is
important because those keys are used to update the cache in a
background task, and if we have too many useless keys in there we'd waste
a lot of work during those background updates.
3. Moved `RequestSharing` garbage collection to a background task instead
of running it on every insert. This is mostly relevant because we now use this
component for liquidity fetching, which issues **MANY** requests. This
improvement is actually pretty huge.

Those changes (together with the previous two liquidity changes) brought
down the liquidity fetching time from ~8s to 0.7–1.5s, which should be
fast enough for the Gnosis solvers.
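The sharing idea can be illustrated with a simplified, single-threaded sketch (the real implementation shares in-flight futures across async callers; `SharedFetcher` and its methods are invented names). Concurrent requests for the same key reuse one underlying fetch instead of issuing duplicates, and garbage collection runs separately rather than on every insert:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

struct SharedFetcher {
    /// Results shared between all callers asking for the same key.
    in_flight: Mutex<HashMap<String, Arc<String>>>,
    /// Counts how many "RPC calls" were actually issued, to show the dedup.
    rpc_calls: Mutex<usize>,
}

impl SharedFetcher {
    fn new() -> Self {
        Self {
            in_flight: Mutex::new(HashMap::new()),
            rpc_calls: Mutex::new(0),
        }
    }

    fn fetch(&self, key: &str) -> Arc<String> {
        let mut in_flight = self.in_flight.lock().unwrap();
        if let Some(result) = in_flight.get(key) {
            // Another caller already requested this key: share its result.
            return Arc::clone(result);
        }
        // Simulate issuing the actual RPC request exactly once per key.
        *self.rpc_calls.lock().unwrap() += 1;
        let result = Arc::new(format!("pool data for {key}"));
        in_flight.insert(key.to_string(), Arc::clone(&result));
        result
    }

    /// Garbage collection, intended to run from a background task rather
    /// than on every insert: drop entries no caller holds on to anymore.
    fn gc(&self) {
        self.in_flight
            .lock()
            .unwrap()
            .retain(|_, v| Arc::strong_count(v) > 1);
    }
}

fn main() {
    let fetcher = SharedFetcher::new();
    let a = fetcher.fetch("uni-v2:WETH/USDC");
    let b = fetcher.fetch("uni-v2:WETH/USDC");
    assert!(Arc::ptr_eq(&a, &b)); // both callers got the shared result
    println!("rpc calls: {}", fetcher.rpc_calls.lock().unwrap()); // 1
}
```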

Fixes: #2041