refactor: isolate faulty channels and retry channel task on faults #1

prestwich · 2022-02-09T22:58:25Z

High Level Changes:

Allows kathy, relayer, and processor to isolate failures at the channel level and retry the channel task, instead of crashing the whole agent when one channel fails.
Isolating channel failures is not relevant to the updater, which only touches the home.
This behavior did not seem desirable for the watcher.

Code Changes

Agent::run no longer borrows &self; it instead takes an agent-specific <Agent>Channel struct that holds all the data needed to run one home <> replica channel.
Agent::run_many builds an <Agent>Channel struct and hands it off to an Agent::run task; if the run task errors out, it logs the error and tries to start the task again instead of returning the error to the top level.
Watcher and updater do not follow this pattern, as they must override Agent::run_all.
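
The pattern described above can be illustrated with a minimal, synchronous sketch. It is not the code from this PR: `ExampleChannel`, `run_channel`, and `supervise` are hypothetical names, the flaky RPC is simulated with an attempt counter, and the retry loop is capped at `max_attempts` so the example terminates (the real agents presumably keep retrying as long-running async tasks).

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for an `<Agent>Channel` struct: everything one
/// home <> replica task needs, owned rather than borrowed from the agent.
#[derive(Clone, Debug)]
pub struct ExampleChannel {
    pub home: String,
    pub replica: String,
    /// Shared fault counter (a stand-in for a real metrics counter).
    pub faults: Arc<AtomicU64>,
}

/// Analogue of `Agent::run`: takes the channel by value, no `&self`.
/// The call fails while `attempt < fail_until` to simulate a faulty RPC.
pub fn run_channel(
    channel: ExampleChannel,
    attempt: u32,
    fail_until: u32,
) -> Result<String, String> {
    if attempt < fail_until {
        Err(format!("rpc error on {} <> {}", channel.home, channel.replica))
    } else {
        Ok(format!("{} <> {} ok", channel.home, channel.replica))
    }
}

/// Analogue of the supervising loop: on error, record the fault, log it,
/// and start the channel task again instead of propagating immediately.
/// Only after `max_attempts` failures is the last error returned.
pub fn supervise(
    channel: &ExampleChannel,
    fail_until: u32,
    max_attempts: u32,
) -> Result<String, String> {
    let mut last_err = String::from("no attempts made");
    for attempt in 0..max_attempts {
        // A fresh copy of the channel is handed to every run attempt.
        match run_channel(channel.clone(), attempt, fail_until) {
            Ok(done) => return Ok(done),
            Err(e) => {
                channel.faults.fetch_add(1, Ordering::Relaxed);
                eprintln!("channel fault (attempt {}): {}", attempt, e);
                last_err = e;
            }
        }
    }
    Err(last_err)
}
```

Because each channel owns its data and has its own supervising loop, a fault on one home <> replica pair only restarts that pair's task; the other channels are unaffected.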

TODO:
[ ] add unit tests to mock faulty RPC
[x] add exponential backoff for retries
[x] metric to track number of channel faults
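
As a rough illustration of the exponential backoff item, the delay before each retry could be derived from the number of faults seen so far. This is a sketch under assumptions, not the PR's implementation: `backoff_delay`, the doubling factor, and the cap are all hypothetical choices.

```rust
use std::time::Duration;

/// Delay before the next retry after `fault_count` earlier faults:
/// `base * 2^fault_count`, never exceeding `max`.
pub fn backoff_delay(base: Duration, max: Duration, fault_count: u32) -> Duration {
    // Saturate the multiplier rather than overflowing for large counts.
    let factor = 2u32.checked_pow(fault_count).unwrap_or(u32::MAX);
    // If the multiplication itself overflows, fall back to the cap.
    base.checked_mul(factor).map_or(max, |delay| delay.min(max))
}
```

Capping the delay keeps a channel that has faulted many times from waiting arbitrarily long before its next restart attempt.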

Closes #161


kekonen · 2022-02-18T17:57:23Z

@prestwich @ltchang2019 I merged the unit tests, so I believe this is ready to be merged.

luketchang · 2022-02-21T21:21:11Z

@arnaud036 @yourbuddyconner let's run this PR in dev before merging.

arnaud036 · 2022-02-23T16:31:48Z

This repo only builds a container once the PR is merged to main. We could change this behavior, but we should agree on our git flow strategy for CI/CD.

The flow I had in mind was to continuously deploy main to dev when a PR gets merged, and to deploy to prod once a tag is created. We could also consider eventually deploying to staging and using that environment as an integration env.

luketchang changed the title ~~Luke/run refactor~~ refactor: isolate faulty channels and retry channel task on faults Feb 10, 2022

luketchang added 5 commits February 10, 2022 21:59

refactor: agents now build channels and give to run tasks

60bd395

refactor: channel-specific agent structs take counter and gauge inste…

42ac5a9

…ad of vecs

refactor: use decl_channel! macro

d14c6f6

feature: add exponential retry for channel faults and metrics to trac…

c4ab6b1

…k number of faults

fix: move macro to macro.rs

5f957c5

luketchang force-pushed the luke/run-refactor branch from a24ac4d to 5f957c5 Compare February 11, 2022 06:00

luketchang added the refactor label Feb 11, 2022

prestwich assigned luketchang Feb 12, 2022

kekonen added 4 commits February 21, 2022 09:39

test: dirty attempt, not done yet

885f137

chore: small fixes

d6b2faa

chore: cleaned

a0402f3

chore: sanity check

1abde77

luketchang force-pushed the luke/run-refactor branch from a3ff450 to 1abde77 Compare February 21, 2022 16:43

luketchang mentioned this pull request Feb 22, 2022

refactor: isolate faulty channels and retry channel task on faults nomad-xyz/nomad-monorepo#187

Closed

luketchang merged commit 50267ae into main Feb 23, 2022
