refactor: isolate faulty channels and retry channel task on faults #187

luketchang · 2022-02-08T22:58:30Z

High Level Changes:

Allows kathy, relayer, and processor to isolate failures at the channel level and retry the failed channel's task, instead of crashing the whole agent when one channel fails
Note that isolating channel failures is not relevant to the updater (updater only touches home)
This behavior did not seem desirable for the watcher

Code Changes

Agent::run no longer borrows &self and instead takes an agent-specific <Agent>Channel struct that defines all data types needed to run one home <> replica channel
Agent::run_many builds an <Agent>Channel struct and hands it off to an Agent::run task; if the run task errors out, run_many logs the error and restarts the task instead of returning the error to the top level
Watcher and updater ignore this pattern, as they must override Agent::run_all
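
For readers skimming the description, the pattern above can be sketched roughly as follows. This is a std-only, thread-based illustration and not the code in this PR: the real agents are async, and the trait shape, `RelayerChannel` fields, `flaky` fault injection, and `max_restarts` bound are all assumptions made so the sketch terminates and can be checked.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for an `<Agent>Channel` struct: everything one
/// home <> replica task needs, owned by value.
#[derive(Clone, Debug)]
pub struct RelayerChannel {
    pub home: String,
    pub replica: String,
    /// Sketch-only fault injection: how many upcoming runs should fail.
    pub faults_to_inject: Arc<AtomicU32>,
}

/// Hypothetical, simplified agent trait (not the real `Agent` trait).
pub trait Agent {
    type Channel: Clone + Send + 'static;

    /// Build the channel struct for one replica.
    fn build_channel(&self, replica: &str) -> Self::Channel;

    /// Run one channel task. Takes the channel by value, not `&self`.
    fn run(channel: Self::Channel) -> Result<(), String>;

    /// Supervise one task per replica. A faulty channel is logged and
    /// restarted without touching the others. `max_restarts` is an
    /// assumption that keeps this sketch finite.
    /// Returns the number of faults seen per replica.
    fn run_many(&self, replicas: &[&str], max_restarts: u32) -> Vec<(String, u32)> {
        let handles: Vec<_> = replicas
            .iter()
            .map(|replica| {
                let name = replica.to_string();
                let channel = self.build_channel(replica);
                thread::spawn(move || {
                    let mut faults = 0u32;
                    while let Err(e) = Self::run(channel.clone()) {
                        faults += 1;
                        eprintln!("channel {name} faulted ({e}); fault #{faults}");
                        if faults > max_restarts {
                            break; // give up on this channel only
                        }
                    }
                    (name, faults)
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("channel thread panicked"))
            .collect()
    }
}

/// Hypothetical relayer with one optionally flaky replica.
pub struct Relayer {
    pub home: String,
    pub flaky: Option<(String, u32)>,
}

impl Agent for Relayer {
    type Channel = RelayerChannel;

    fn build_channel(&self, replica: &str) -> RelayerChannel {
        let faults = match &self.flaky {
            Some((name, n)) if name.as_str() == replica => *n,
            _ => 0,
        };
        RelayerChannel {
            home: self.home.clone(),
            replica: replica.to_string(),
            faults_to_inject: Arc::new(AtomicU32::new(faults)),
        }
    }

    fn run(channel: RelayerChannel) -> Result<(), String> {
        // Fail while the shared counter is above zero, then succeed.
        let injected = channel
            .faults_to_inject
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |n| n.checked_sub(1));
        match injected {
            Ok(_) => Err(format!("simulated failure on {} <> {}", channel.home, channel.replica)),
            Err(_) => Ok(()),
        }
    }
}
```

Because `run` owns its channel, a restart only needs a fresh clone of that struct, which is what lets a single channel be retried without rebuilding the whole agent.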

TODO:
[ ] add unit tests to mock faulty RPC
[x] add exponential backoff for retries
[x] metric to track number of channel faults
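
A rough sketch of how the two checked items could fit together, assuming a doubling delay with a cap and a plain atomic counter standing in for the real metric. The names `backoff_delay`, `ChannelFaults`, and `retry_with_backoff`, the blocking `sleep`, and the `max_attempts` bound are hypothetical and chosen only for illustration; the PR's actual retry policy and metric type may differ.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

/// Delay before restart number `attempt` (0-based):
/// `base * 2^attempt`, capped at `max`.
pub fn backoff_delay(base: Duration, max: Duration, attempt: u32) -> Duration {
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}

/// Stand-in for a per-channel fault counter metric (an assumption; a real
/// agent would more likely export this through its metrics library).
#[derive(Default)]
pub struct ChannelFaults(AtomicU64);

impl ChannelFaults {
    /// Record one fault and return the new total.
    pub fn inc(&self) -> u64 {
        self.0.fetch_add(1, Ordering::Relaxed) + 1
    }

    /// Current number of recorded faults.
    pub fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Run `task` until it succeeds, sleeping with exponential backoff between
/// attempts and counting every fault. Gives up after `max_attempts` runs
/// (a bound added so the sketch terminates).
pub fn retry_with_backoff<F, E>(
    mut task: F,
    faults: &ChannelFaults,
    base: Duration,
    max: Duration,
    max_attempts: u32,
) -> Result<(), E>
where
    F: FnMut() -> Result<(), E>,
{
    let mut attempt = 0;
    loop {
        match task() {
            Ok(()) => return Ok(()),
            Err(e) => {
                faults.inc();
                if attempt + 1 >= max_attempts {
                    return Err(e);
                }
                std::thread::sleep(backoff_delay(base, max, attempt));
                attempt += 1;
            }
        }
    }
}
```

Capping the delay keeps a long-faulty channel from backing off indefinitely, while the counter makes repeated restarts visible even though the agent as a whole stays up.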

Closes #161


luketchang · 2022-02-22T20:55:54Z

@arnaud036 @yourbuddyconner let's run this PR in dev before merging

luketchang · 2022-02-22T20:56:32Z

This is the up-to-date equivalent of nomad-xyz/rust#1

luketchang · 2022-03-01T22:15:43Z

Closed in favor of nomad-xyz/rust#1

luketchang added agents rust 🦀 refactor labels Feb 8, 2022

luketchang self-assigned this Feb 8, 2022

luketchang changed the title ~~refactor: isolate single channel failures and retry on failure~~ refactor: isolate agent channel failures and retry channel task on failure Feb 8, 2022

luketchang changed the title ~~refactor: isolate agent channel failures and retry channel task on failure~~ refactor: isolate faulty channels and retry channel task on faults Feb 9, 2022

luketchang mentioned this pull request Feb 9, 2022

feature: dirty retry and channels not canceled #185

Closed

prestwich mentioned this pull request Feb 9, 2022

refactor: isolate faulty channels and retry channel task on faults nomad-xyz/rust#1

Merged

luketchang mentioned this pull request Feb 10, 2022

refactor: every agent implements an agent-specific AgentMetrics struct nomad-xyz/rust#2

Closed

luketchang added 5 commits February 22, 2022 13:48

refactor: agents now build channels and give to run tasks

6616e02

refactor: channel-specific agent structs take counter and gauge instead of vecs

0fe7628

refactor: use decl_channel! macro

48c6e18

feature: add exponential retry for channel faults and metrics to track number of faults

2a2d124

fix: pull nomad-xyz/agent changes into monorepo

c333cc9

luketchang force-pushed the luke/run-refactor branch from 786f0aa to c333cc9 Compare February 22, 2022 20:53

luketchang closed this Mar 1, 2022
