
bug: Colocated solvers timing out on prod mainnet #2041

Closed
MartinquaXD opened this issue Nov 6, 2023 · 1 comment · Fixed by #2061 or #2065
Labels
bug Something isn't working E:3.1 Driver Colocation See https://github.com/cowprotocol/pm/issues/14 for details


Problem

When testing colocated drivers on prod mainnet we saw basically all of them timing out.
This is concerning for two reasons:

  • something is making the whole process surprisingly slow (maybe contention on some lock 🤷‍♂️)
  • our timeout management in the driver is not working. Ideally we should never see timeouts on the autopilot side; instead the driver should abort the computation right before the deadline and return 0 solutions.

Impact

Colocation is basically not usable like this on prod.

To reproduce

The only way is to enable colocation on prod mainnet, because that is where we have the traffic required to trigger the issue.

Expected behaviour

The driver should never time out, and we should not see a reduced number of proposed solutions compared to the current setup.

Screenshots/logs

logs

@MartinquaXD MartinquaXD added bug Something isn't working E:3.1 Driver Colocation See https://github.com/cowprotocol/pm/issues/14 for details labels Nov 6, 2023
MartinquaXD added a commit that referenced this issue Nov 7, 2023
# Description
The driver has a routine to prioritize and filter orders based on the balance the
trader currently has.
With the introduction of pre-interactions this check became much trickier, because
a user might not have any sell_tokens at the moment but unstakes the needed tokens
in a pre-interaction. Because of that we introduced a way to get the actual
tradable balance by simulating a balance check using a helper contract.
This is very accurate but unfortunately also very slow (~10s just for fetching the
balances of all traders in an auction).
Luckily most of our orders don't have pre-interactions, so we can fall back to a
simpler and faster way to query the current balance.

# Changes
Reimplement parts of the balance query of
[balances.sol](https://github.com/cowprotocol/services/blob/main/crates/contracts/solidity/Balances.sol#L59-L76)
using simple `eth_call`s and use that for any order group that does not
contain any pre-interactions.
While testing locally this reduced the time to fetch all balances to
~1.2–2 seconds.
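The routing between the two query strategies can be sketched as follows. This is a minimal illustration, not the actual driver code; the type and function names (`BalanceQuery`, `OrderGroup`, `pick_query`) are invented for the example.

```rust
// Hypothetical sketch: pick the cheap balance query for order groups
// without pre-interactions, fall back to simulation otherwise.

#[derive(Debug, PartialEq)]
enum BalanceQuery {
    /// Plain `eth_call`s to the token contract; fast (~1.2-2s per auction).
    Simple,
    /// Simulation via the Balances.sol helper; accurate but slow (~10s).
    Simulated,
}

struct OrderGroup {
    /// Placeholder for the group's encoded pre-interactions.
    pre_interactions: Vec<String>,
}

fn pick_query(group: &OrderGroup) -> BalanceQuery {
    if group.pre_interactions.is_empty() {
        // Nothing can change the balance before settlement, so the
        // current on-chain balance already is the tradable balance.
        BalanceQuery::Simple
    } else {
        // A pre-interaction (e.g. unstaking) may create balance that a
        // plain query cannot see, so simulate with the helper contract.
        BalanceQuery::Simulated
    }
}

fn main() {
    let plain = OrderGroup { pre_interactions: vec![] };
    let staked = OrderGroup { pre_interactions: vec!["unstake()".into()] };
    println!("{:?}", pick_query(&plain));  // Simple
    println!("{:?}", pick_query(&staked)); // Simulated
}
```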

## How to test
covered by e2e tests

## Related Issues
#2041
@sunce86
Copy link
Contributor

sunce86 commented Nov 15, 2023

A bit more information:

The autopilot gives the driver a deadline to return the solution. Currently, the driver is given 15s.
The driver then gives 14s to the solvers for computation. Approximately ~1s is left for the driver to simulate solutions, do the merging, simulate merged settlements, do the scoring...

There are several issues here:

  1. The driver is left with 1s, but this deadline is not enforced anywhere, leading to timeout responses in the autopilot.
  2. The 1s is hardcoded and can't be changed via config.
  3. The 1s is the same for all solvers, which is not appropriate. Some solvers return a single solution while others return many. Some solvers return early, others spend the full time given to them (14s).
  4. The network delay for sending the response back to the autopilot is not taken into consideration.

I would suggest the following:

  • Add a time buffer for network delay (500ms) and make it configurable. The value is the same for all solvers. Subtract this time from the auction::Deadline on the driver side. Refactor driver timeouts #2061
  • Make the 1s time configurable per solver (or, if that is hard, at least configurable globally).
  • Respect the deadline in the driver and return a partial result in case of timeout. Partially tackled with Respect driver deadline in solution processing #2059
  • Change the deadline from 15s to 20s.

This way, 20s would be given to the driver: 0.5s reserved for network delay, 14s for solving, and 4.5s for encoding, simulations, merging, and scoring.
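The first suggestion above boils down to one subtraction on the driver side. A minimal sketch, assuming the buffer is a configurable `Duration` (the function name is invented for illustration):

```rust
use std::time::Duration;

// Shrink the deadline received from the autopilot by a configurable
// network-delay buffer before handing out any time budgets, so the
// response still arrives before the autopilot's own timeout fires.
fn effective_budget(autopilot_deadline: Duration, http_delay: Duration) -> Duration {
    // saturating_sub avoids an underflow panic if the buffer is misconfigured
    // to be larger than the deadline itself.
    autopilot_deadline.saturating_sub(http_delay)
}

fn main() {
    let budget = effective_budget(Duration::from_secs(20), Duration::from_millis(500));
    println!("{budget:?}");
}
```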

@fleupold fleupold linked a pull request Nov 20, 2023 that will close this issue
sunce86 added a commit that referenced this issue Nov 21, 2023
# Description
Tackles checkboxes 1 and 2 from
#2041 (comment)


![timeouts](https://github.com/cowprotocol/services/assets/19595624/0f7490e7-6c36-4278-bbcf-9a4ca0f91f1a)

# Changes
This is my attempt to make timeouts more explicit and clear.
First, we split the deadline given to the driver into 3 pieces:
1. Solving time, given to the solvers.
2. Competition time, used to validate, merge, simulate, score and rank
solutions.
3. Http delay, to cover potentially slow propagation of the http response back
to the autopilot (`http_delay`).

The time is split in the following way: the competition time and http delay
are fixed values read from the configuration, while the solving time is
whatever is left after deducting those two from the deadline given to the
driver.

It's important to note that:
1. The http delay is deducted from the driver deadline at the very beginning of
the solve/quote handling, because it is non-domain information. Core
domain logic doesn't care about network delays.
2. The competition time is deducted from the deadline in the domain, because how
we split the given time between solving and competition is actually domain
related. For some solvers we want 19s for solving and 1s for
competition, while for others like Baseline we might want 10s for
solving and 10s for competition.

Default values are set to something like:
/solve: 20s given to the driver, http delay is 0.5s, competition time is 4.5s,
solving time is 15s
/quote: 3s given to the driver, http delay is 0.5s, competition time is 1s,
solving time is 1.5s

Release notes: check the default timeouts and if they need to be
adjusted.
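The three-way split described above can be sketched in a few lines. This is an illustration under assumed names (`Timeouts`, `solving_time` are not necessarily the real config keys), using the default values from the description:

```rust
use std::time::Duration;

/// The two configured pieces of the driver deadline.
struct Timeouts {
    http_delay: Duration,
    competition: Duration,
}

impl Timeouts {
    /// Whatever remains after deducting the two configured pieces from the
    /// deadline given to the driver is handed to the solvers.
    fn solving_time(&self, driver_deadline: Duration) -> Duration {
        driver_deadline
            .saturating_sub(self.http_delay)
            .saturating_sub(self.competition)
    }
}

fn main() {
    // /solve defaults: 20s total, 0.5s http delay, 4.5s competition -> 15s solving.
    let solve = Timeouts {
        http_delay: Duration::from_millis(500),
        competition: Duration::from_millis(4_500),
    };
    println!("{:?}", solve.solving_time(Duration::from_secs(20))); // 15s

    // /quote defaults: 3s total, 0.5s http delay, 1s competition -> 1.5s solving.
    let quote = Timeouts {
        http_delay: Duration::from_millis(500),
        competition: Duration::from_secs(1),
    };
    println!("{:?}", quote.solving_time(Duration::from_secs(3))); // 1.5s
}
```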
MartinquaXD added a commit that referenced this issue Nov 22, 2023
# Description
One of the things that makes liquidity fetching unnecessarily slow at
the moment is that we try to fetch data for pools that we know are
missing over and over again.

# Changes
The lowest-level components that fetch Balancer and Uni v2 liquidity now
keep track of pools that could not be found.
Whenever a request for one of those pools comes in, it gets omitted.

Currently we deny-list those pools forever. Eventually we should have a
background task that periodically forgets the deny-listed pools, just
in case somebody deploys a new one.
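The deny-list idea can be sketched with a plain `HashSet`. The names here (`MissingPools`, `mark_missing`, `filter_request`, `forget_all`) are invented for illustration; `forget_all` stands in for the suggested periodic background cleanup:

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Tracks pool ids that a previous fetch reported as missing so that
/// later requests can skip them.
struct MissingPools<Id: Hash + Eq> {
    missing: HashSet<Id>,
}

impl<Id: Hash + Eq> MissingPools<Id> {
    fn new() -> Self {
        Self { missing: HashSet::new() }
    }

    /// Record pools that the last fetch could not find.
    fn mark_missing(&mut self, ids: impl IntoIterator<Item = Id>) {
        self.missing.extend(ids);
    }

    /// Drop known-missing pools from a request before it hits the RPC node.
    fn filter_request(&self, requested: Vec<Id>) -> Vec<Id> {
        requested
            .into_iter()
            .filter(|id| !self.missing.contains(id))
            .collect()
    }

    /// Periodically forget everything, in case a missing pool gets
    /// deployed later.
    fn forget_all(&mut self) {
        self.missing.clear();
    }
}

fn main() {
    let mut deny_list = MissingPools::new();
    deny_list.mark_missing(["0xdead".to_string()]);
    let filtered = deny_list.filter_request(vec!["0xdead".into(), "0xbeef".into()]);
    println!("{filtered:?}"); // only "0xbeef" survives
}
```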

## Related Issues
#2041
MartinquaXD added a commit that referenced this issue Nov 27, 2023
# Description
Because all the traits and APIs around liquidity fetching are very weird,
it's pretty hard to keep track of which liquidity pools are not worth
fetching.

# Changes
1. This PR introduces request sharing logic in the `RecentBlockCache`.
With that it shouldn't matter how many solvers have to fetch liquidity
in parallel, as they all share the underlying RPC requests.
2. Additionally, I modified the logic that keeps track of recently used
keys to only contain ids of pools we were able to fetch. This is
important because those keys are used to update the cache in a
background task, and if we have too many useless keys in there we'd waste
a lot of work during those background updates.
3. Moved `RequestSharing` garbage collection to a background task instead
of running it on every insert. This is mostly relevant because we now use this
component for liquidity fetching, which issues **MANY** requests. This
improvement is actually pretty huge.

Those changes (together with the previous two liquidity changes) brought
down the liquidity fetching time from ~8s to 0.7–1.5s, which should be
fast enough for the Gnosis solvers.
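The sharing idea can be illustrated with a simplified, single-threaded sketch (the real implementation shares in-flight futures across async callers; `SharedFetcher` and its methods are invented names). Concurrent requests for the same key reuse one underlying fetch instead of issuing duplicates, and garbage collection runs separately rather than on every insert:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

struct SharedFetcher {
    /// Results shared between all callers asking for the same key.
    in_flight: Mutex<HashMap<String, Arc<String>>>,
    /// Counts how many "RPC calls" were actually issued, to show the dedup.
    rpc_calls: Mutex<usize>,
}

impl SharedFetcher {
    fn new() -> Self {
        Self {
            in_flight: Mutex::new(HashMap::new()),
            rpc_calls: Mutex::new(0),
        }
    }

    fn fetch(&self, key: &str) -> Arc<String> {
        let mut in_flight = self.in_flight.lock().unwrap();
        if let Some(result) = in_flight.get(key) {
            // Another caller already requested this key: share its result.
            return Arc::clone(result);
        }
        // Simulate issuing the actual RPC request exactly once per key.
        *self.rpc_calls.lock().unwrap() += 1;
        let result = Arc::new(format!("pool data for {key}"));
        in_flight.insert(key.to_string(), Arc::clone(&result));
        result
    }

    /// Garbage collection, intended to run from a background task rather
    /// than on every insert: drop entries no caller holds on to anymore.
    fn gc(&self) {
        self.in_flight
            .lock()
            .unwrap()
            .retain(|_, v| Arc::strong_count(v) > 1);
    }
}

fn main() {
    let fetcher = SharedFetcher::new();
    let a = fetcher.fetch("uni-v2:WETH/USDC");
    let b = fetcher.fetch("uni-v2:WETH/USDC");
    assert!(Arc::ptr_eq(&a, &b)); // both callers got the shared result
    println!("rpc calls: {}", fetcher.rpc_calls.lock().unwrap()); // 1
}
```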

Fixes: #2041