This repository has been archived by the owner on May 28, 2022. It is now read-only.

refactor: isolate faulty channels and retry channel task on faults #187

Closed
wants to merge 5 commits into from


luketchang
Collaborator

@luketchang luketchang commented Feb 8, 2022

High Level Changes:

  • Allows kathy, relayer, and processor to isolate failures at the channel level and retry the failed channel's task, instead of crashing the whole agent when one channel fails
  • Note that isolating channel failures is not relevant to the updater (updater only touches home)
  • This behavior did not seem desirable for the watcher

Code Changes

  • Agent::run no longer borrows &self and instead takes an agent-specific <Agent>Channel struct that holds all the data needed to run one home <> replica channel
  • Agent::run_many builds an <Agent>Channel struct and hands it off to an Agent::run task; if the run task returns an error, run_many logs the error and restarts the task instead of returning the error to the top level
  • Watcher and updater ignore this pattern, as they must override Agent::run_all
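
For readers skimming the description, here is a minimal sketch of the pattern described above, not the code in this PR. It assumes a hypothetical `DemoChannel` struct standing in for an `<Agent>Channel`, uses plain OS threads rather than whatever async runtime the agents use, and caps retries with a `max_attempts` parameter (also an assumption) so the example terminates.

```rust
use std::collections::HashMap;
use std::thread;

/// Hypothetical stand-in for an agent-specific `<Agent>Channel` struct:
/// it owns everything one home <> replica task needs, so the task does
/// not have to borrow the agent itself.
#[derive(Clone, Debug)]
pub struct DemoChannel {
    pub home: String,
    pub replica: String,
}

/// Runs one task per replica. A fault in one channel is logged and that
/// channel's task is retried; the other channels are unaffected.
/// Returns the number of faults observed per replica.
pub fn run_many(
    home: &str,
    replicas: &[&str],
    max_attempts: u32, // assumption: bounded retries, to keep the sketch finite
    task: fn(&DemoChannel, u32) -> Result<(), String>,
) -> HashMap<String, u32> {
    let handles: Vec<_> = replicas
        .iter()
        .map(|replica| {
            // Build the channel up front and move it into its own task.
            let channel = DemoChannel {
                home: home.to_owned(),
                replica: (*replica).to_owned(),
            };
            thread::spawn(move || {
                let mut faults = 0u32;
                for attempt in 0..max_attempts {
                    match task(&channel, attempt) {
                        Ok(()) => break,
                        Err(err) => {
                            // Log and retry instead of propagating upwards.
                            faults += 1;
                            eprintln!(
                                "channel {} <> {} faulted: {}",
                                channel.home, channel.replica, err
                            );
                        }
                    }
                }
                (channel.replica, faults)
            })
        })
        .collect();

    handles
        .into_iter()
        .map(|handle| handle.join().expect("channel thread panicked"))
        .collect()
}
```

Because each task owns its channel struct, a restart only needs to call the task again with the same value; nothing has to be re-borrowed from the agent.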

TODO:
[ ] add unit tests to mock faulty RPC
[x] add exponential backoff for retries
[x] metric to track number of channel faults
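
As a rough illustration of the two checked items, the sketch below shows one way to compute a capped exponential backoff delay and to count faults. The names `backoff_delay` and `ChannelFaults`, the doubling factor, and the cap are all assumptions made for this example; the PR's actual delays and metric are not shown here, and a real agent would more likely export the count through its metrics system than through a bare atomic.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

/// Delay before the next retry: `base * 2^faults`, capped at `max`.
/// Overflow in either the power or the multiplication falls back to `max`.
pub fn backoff_delay(base: Duration, max: Duration, faults: u32) -> Duration {
    let factor = 2u32.checked_pow(faults).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(max, |delay| delay.min(max))
}

/// Hypothetical fault counter for a single channel.
#[derive(Debug, Default)]
pub struct ChannelFaults(AtomicU64);

impl ChannelFaults {
    /// Records one fault and returns the new total.
    pub fn record(&self) -> u64 {
        self.0.fetch_add(1, Ordering::Relaxed) + 1
    }

    /// Returns the number of faults recorded so far.
    pub fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}
```

A retry loop would call `record()` on each fault and sleep for `backoff_delay(base, max, faults)` before restarting the channel task.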

Closes #161

@luketchang luketchang self-assigned this Feb 8, 2022
@luketchang luketchang changed the title refactor: isolate single channel failures and retry on failure refactor: isolate agent channel failures and retry channel task on failure Feb 8, 2022
@luketchang luketchang changed the title refactor: isolate agent channel failures and retry channel task on failure refactor: isolate faulty channels and retry channel task on faults Feb 9, 2022
@luketchang
Collaborator Author

@arnaud036 @yourbuddyconner let's run this PR in dev before merging

@luketchang
Collaborator Author

This is up-to-date equivalent of nomad-xyz/rust#1

@luketchang
Collaborator Author

Closed in favor of nomad-xyz/rust#1

@luketchang luketchang closed this Mar 1, 2022