Notifies solver if their solution simulates successfully at least once #2077

Merged

Conversation


marcovc commented Nov 23, 2023

Description

As a solver, I am interested in automatically identifying "bad" tokens so that they can be automatically censored. A "bad" token is one that has some logic that makes it fail most of the time. When I receive a simulation failure, I run an online learning procedure to try to identify them. However, some (most?) simulation failures happen just due to volatility, in which case this procedure should be skipped, otherwise there can be false positives.

Changes

This PR adds a boolean to the notification sent to solvers, stating whether the driver was able to simulate the solver's solution at least once. If so, this is a very strong indicator that the solution was rejected due to volatility.
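
For illustration, here is a minimal sketch of how a solver might consume such a flag; every name below is hypothetical, not the actual services API:

    // Hypothetical solver-side handler; names are illustrative only.
    fn on_simulation_failed(solution_id: u64, succeeded_at_least_once: bool) {
        if succeeded_at_least_once {
            // The driver simulated this solution successfully at least once,
            // so the later failure is most likely volatility: skip the
            // bad-token learning step to avoid false positives.
            return;
        }
        // Otherwise, feed the failure into the bad-token detection procedure.
        learn_bad_tokens(solution_id);
    }

    fn learn_bad_tokens(_solution_id: u64) {
        // Placeholder for the solver's online learning procedure.
    }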

I considered other options to avoid this PR:

  1. Send the liquidity fetch block number in the instance.json. Since the notification comes with a block number, comparing the notification block number with the liquidity fetch block number could indicate that it simulated at least once. But this would still require a PR.
  2. Same as the previous, but fetch the liquidity block number from the solver competition endpoint. However, I recall that that endpoint was (is?) significantly behind the notification.
  3. Use Tenderly myself. This seems like a waste of resources just for doing something the driver is already doing anyway.

How to test

I did not test it since I forgot how to. In fact I would be very surprised if I changed things in the right places. Anyway, if I failed to do so, consider the PR as an elaborate feature request :)

Thanks!

marcovc requested a review from a team as a code owner November 23, 2023 20:24

github-actions bot commented Nov 23, 2023

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅


marcovc commented Nov 23, 2023

I have read the CLA Document and I hereby sign the CLA

github-actions bot added a commit that referenced this pull request Nov 23, 2023

marcovc commented Nov 24, 2023

I just realized that the information I need is actually being logged internally here:

} else {
    tracing::debug!(
        name = solver.name,
        ?error_at_latest_block,
        "simulation only failed on the latest block but not on the block the \
         auction started",
    );
}
So I guess the changes suggested in this PR are not what they should be :(


sunce86 left a comment


Good direction.

The important part is that we are missing the implementation in the solvers crate: populating the SolverRejectionReason with the new data.

The code you referred to in the solver crate is for the legacy driver and is not relevant to this PR.

while let Some(block) = stream.next().await {
    if let Err(err) = self.simulate_settlement(&settlement).await {
        observe::winner_voided(block, &err);
        *score_ref = None;
        *self.settlement.lock().unwrap() = None;
        if let Some(id) = settlement.notify_id() {
-           notify::simulation_failed(&self.solver, auction.id(), id, &err);
+           notify::simulation_failed(&self.solver, auction.id(), id, &err, simulated_once);
Contributor


No need for simulated_once, just call
notify::simulation_failed(&self.solver, auction.id(), id, &err, true)

Contributor Author


I guess I am misinterpreting the code then. So if the condition on (new) line 195 is true in the first iteration of the loop, then it would report that it simulated successfully at least once, when in fact it hasn't, right? What am I missing?

Contributor


The first ever simulation happens when the solution is encoded, so a successful simulation is a prerequisite for a solution to be encoded.

This whole code block is not even executed if the first simulation failed. So, if the simulation fails here, it means it is the second, third, ... attempt.
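
In other words, roughly (a self-contained sketch of the flow, with every name stubbed out; this is not the actual driver code):

    // Sketch of the control flow described above; all names are stand-ins.
    struct Settlement;

    fn encode_solution() -> Result<Settlement, ()> {
        // Encoding itself simulates the solution, so an Err here means the
        // very first simulation failed and the loop below never runs.
        Ok(Settlement)
    }

    fn simulate(_s: &Settlement) -> Result<(), ()> {
        Ok(())
    }

    fn notify_simulation_failed(succeeded_at_least_once: bool) {
        println!("simulation failed (succeeded before: {succeeded_at_least_once})");
    }

    fn main() {
        let Ok(settlement) = encode_solution() else { return };
        // From here on, any simulation failure is at least the second
        // attempt, so the flag can simply be hard-coded to `true`.
        for _block in 0..3 {
            if simulate(&settlement).is_err() {
                notify_simulation_failed(true);
            }
        }
    }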

Contributor Author


Got you, thanks!

-    SimulationFailed(eth::BlockNo, Transaction),
+    /// Failed simulation during competition. The last parameter is true
+    /// if it has simulated successfully at least once.
+    SimulationFailed(eth::BlockNo, Transaction, bool),
Contributor


nit: Would suggest this to be something like
    SimulationFailed(eth::BlockNo, Transaction, SucceededAtLeastOnce),
where one would define this type as
    type SucceededAtLeastOnce = bool;
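
Spelled out as a self-contained snippet (BlockNo and Transaction are stubbed here only so the example compiles on its own):

    /// `true` if the solution simulated successfully at least once.
    type SucceededAtLeastOnce = bool;

    struct BlockNo(u64);
    struct Transaction;

    enum Notification {
        SimulationFailed(BlockNo, Transaction, SucceededAtLeastOnce),
    }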

@@ -103,7 +104,7 @@ pub enum Kind {
     Timeout,
     EmptySolution,
     DuplicatedSolutionId,
-    SimulationFailed(BlockNo, Tx),
+    SimulationFailed(BlockNo, Tx, bool),
Contributor


Same here, substitute the bool with the more readable type.


marcovc commented Nov 24, 2023

Thanks for the help. I tried to follow your directions, but I am a bit confused, as there seem to be at least two different versions of the code in the same repo.

I was wondering whether a better name than SimulatesSuccessfullyAtLeastOnce could be SimulatesSuccessfullyWhenEncoded or SimulatesSuccessfullyAtLiquidityBlock, since, as you explained, if it passes encoding then the flag is true.

Anyone, please let me know what the next changes should be :)


sunce86 commented Nov 27, 2023

The next step would be to also update the solvers side here: https://github.com/cowprotocol/services/blob/main/crates/solvers/src/api/routes/notify/dto/notification.rs#L106
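
For orientation, the change there would be along these lines; this is a sketch only, and the exact enum shape and field names in the solvers crate may differ:

    use serde::Deserialize;

    // Sketch of the notification DTO gaining the flag; the surrounding
    // variants are taken from the diff above, the field names are assumed.
    #[derive(Deserialize)]
    #[serde(rename_all = "camelCase")]
    enum Kind {
        Timeout,
        EmptySolution,
        DuplicatedSolutionId,
        SimulationFailed {
            block: u64,
            succeeded_at_least_once: bool,
        },
    }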


marcovc commented Nov 27, 2023

I think I did here. No?

@fleupold

The change looks overall good to me. It would be good to run it end to end to see if you receive the expected notifications. We are currently in the process of creating a playground to facilitate running the whole system easily. In the meantime you can:

  1. Launch the autopilot in shadow mode: cargo run --bin autopilot -- --skip-event-sync true --node-url <YOUR_ETH_RPC> --shadow https://api.cow.fi/mainnet --native-price-estimators Baseline --enable-colocation true --drivers "mysolver|http://localhost:11088/mysolver"
  2. Launch a driver: cargo run -p driver -- --config driver.config.toml --ethrpc <YOUR_ETH_RPC>
  3. Launch a solver engine pointing at your endpoint: cargo run -p solvers -- legacy --config ./crates/solvers/config/example.legacy.toml

For correct configuration files, please look at the example .toml files.

@sunce86 do you have any remaining comments?


sunce86 left a comment


Lg


marcovc commented Dec 1, 2023

Thanks for this walkthrough. I have some questions:

  1. Launch the autopilot in shadow mode: cargo run --bin autopilot -- --skip-event-sync true --node-url <YOUR_ETH_RPC> --shadow https://api.cow.fi/mainnet --native-price-estimators Baseline --enable-colocation true --drivers "mysolver|http://localhost:11088/mysolver"

What is "mysolver" here? I thought this was my solver, but after running it I think it is what you call "solver engine" in step 3 (which is not my solver engine)? If so, what should I set "mysolver" to? It can't be the "solver-name" inside the toml, since these have spaces.

  2. Launch a driver: cargo run -p driver -- --config driver.config.toml --ethrpc <YOUR_ETH_RPC>

I can't find any driver.config.toml in any directory of the services repo. Am I missing it?

  3. Launch a solver engine pointing at your endpoint: cargo run -p solvers -- legacy --config ./crates/solvers/config/example.legacy.toml

This is clear.


fleupold commented Dec 1, 2023

What is "mysolver" here?

It's a placeholder for the name of the solver engine you are spawning in step 3. This one connects to your Quasilabs API. The colocated setup looks roughly like this:

autopilot -> driver -> solver engine -> legacy solver (Quasilabs)

You can call mysolver whatever you want, but you need to make sure the autopilot connects to http://driver_ip:driver_port/solver_name and that the driver contains a solver_name entry forwarding traffic to http://solver_engine_ip:solver_engine_port.
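
For illustration, such an entry might look roughly like this in the driver's TOML config (field names guessed from the example files, so double-check against the repo):

    # Hypothetical sketch; verify the schema against the example .toml files.
    [[solver]]
    name = "mysolver"                  # the solver_name the autopilot addresses
    endpoint = "http://localhost:7872" # where your solver engine listens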

I can't find any driver.config.toml in any directory of the services repo. Am I missing it?


marcovc commented Dec 1, 2023

@fleupold Thanks for clarifying, I was able to run the full stack. Well, sort of: it seems the solver engine makes too many requests per second:

0: RPC error: Error { code: ServerError(429), message: "Your app has exceeded its compute units per second capacity. If you have retries enabled, you can safely ignore this message. If not, check out https://docs.alchemy.com/reference/throughput", data: None }

Also I was wondering, since I don't have access to my solver's private key, will the (local) driver/solver engine be able to simulate the solutions I'm going to send?


marcovc commented Dec 1, 2023

I've exhausted both my Infura and my Alchemy usage for the day after using the driver for 5 minutes :)

Another question: the time_limit I am getting for solving is on the order of 2 seconds. I found the "--solve-deadline" parameter in the component of step 1 (the autopilot?), but that is set to 15.

Anyway, although it is very valuable to get to know the new stack, maybe this is not the place to ask for help. Since it seems it is not so easy to set up, I guess I'll open another issue or contact you on Telegram.


fleupold commented Dec 4, 2023

cc @mfw78, who is working on making it much easier to run the whole system locally end to end.


marcovc commented Dec 4, 2023

Do you think we can merge this in the meantime?


fleupold commented Dec 4, 2023

Sure, would you mind serving traffic in staging this week then to test this change?


marcovc commented Dec 5, 2023

I just ran

cargo clippy --locked --workspace --all-features --all-targets -- -D warnings

and it seems there are no changes to commit (?)

My cargo version seems pretty recent:
cargo 1.74.0 (ecb9851af 2023-10-18)


fleupold commented Dec 6, 2023

You can see what it complains about in https://github.com/cowprotocol/services/actions/runs/7100150555/job/19362557587?pr=2077 and you should be able to reproduce it locally with these tool versions:

cargo 1.74.0 (ecb9851af 2023-10-18)
rustfmt 1.7.0-nightly (e9013ac 2023-12-05)
clippy 0.1.74 (79e9716 2023-11-13)

cargo +nightly fmt --all -- --check

fleupold enabled auto-merge (squash) December 6, 2023 10:44
fleupold merged commit 3b6403f into cowprotocol:main Dec 6, 2023
8 checks passed
github-actions bot locked and limited conversation to collaborators Dec 6, 2023