
refactor: isolate faulty channels and retry channel task on faults #1

Merged 9 commits on Feb 23, 2022


prestwich
Member

@prestwich prestwich commented Feb 9, 2022

High Level Changes:

  • Allows kathy, relayer, and processor to isolate failures at the channel level and retry the channel task, instead of crashing the whole agent when one channel fails
  • Note that isolating channel failures is not relevant to the updater (updater only touches home)
  • This behavior did not seem desirable for the watcher

Code Changes

  • Agent::run no longer borrows &self; it instead takes an agent-specific <Agent>Channel struct that bundles all the data needed to run one home <> replica channel
  • Agent::run_many builds an <Agent>Channel struct and hands it off to an Agent::run task; if the run task errors out, the error is logged and the task is restarted instead of the error being returned to the top level
  • Watcher and updater do not follow this pattern, as they must override Agent::run_all
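
The pattern described in the list above can be sketched roughly as follows. This is a minimal, std-only illustration and not the code from this PR: `RelayerChannel`, `supervise_channel`, the `max_faults` cap, and this simplified `run_many` signature are all hypothetical names chosen for the example. The real agents presumably run each channel as a concurrent async task, whereas this sketch supervises channels one after another to stay self-contained.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an agent-specific `<Agent>Channel` struct.
/// It owns everything one home <> replica task needs, so the run
/// function does not have to borrow the agent itself.
#[derive(Clone, Debug)]
pub struct RelayerChannel {
    pub home: String,
    pub replica: String,
}

/// Runs one channel task and restarts it on error instead of
/// propagating the error upward. Returns the number of faults seen.
/// `max_faults` exists only so that this sketch always terminates.
pub fn supervise_channel<F>(channel: &RelayerChannel, max_faults: u32, mut run: F) -> u32
where
    F: FnMut(RelayerChannel) -> Result<(), String>,
{
    let mut faults = 0;
    while faults < max_faults {
        // Each attempt gets its own owned copy of the channel data.
        match run(channel.clone()) {
            Ok(()) => break,
            Err(err) => {
                faults += 1;
                eprintln!(
                    "channel {} <> {} faulted (fault #{}): {}",
                    channel.home, channel.replica, faults, err
                );
            }
        }
    }
    faults
}

/// Builds one channel per replica and supervises each independently,
/// so a fault in one channel never aborts the others.
/// Returns the fault count per replica name.
pub fn run_many<F>(
    home: &str,
    replicas: &[&str],
    max_faults: u32,
    mut run: F,
) -> HashMap<String, u32>
where
    F: FnMut(RelayerChannel) -> Result<(), String>,
{
    replicas
        .iter()
        .map(|replica| {
            let channel = RelayerChannel {
                home: home.to_string(),
                replica: replica.to_string(),
            };
            let faults = supervise_channel(&channel, max_faults, &mut run);
            (replica.to_string(), faults)
        })
        .collect()
}
```

Because the run function takes the channel by value, a failed attempt leaves nothing borrowed from the agent, which is what makes restarting the task straightforward.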

TODO:
[ ] add unit tests to mock faulty RPC
[x] add exponential backoff for retries
[x] add metric to track the number of channel faults
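
As a rough illustration of the backoff and fault-metric items in the list above, the sketch below computes a capped exponential delay and keeps an in-memory fault count. `backoff_delay` and `FaultCounter` are hypothetical names, and the doubling factor and cap are assumptions; the PR's actual backoff parameters and metrics backend are not described in this conversation.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

/// Delay before retry number `attempt` (0-based):
/// `base * 2^attempt`, never exceeding `max`.
pub fn backoff_delay(base: Duration, max: Duration, attempt: u32) -> Duration {
    // Saturate instead of overflowing for very large attempt numbers.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}

/// Hypothetical in-memory stand-in for a "channel faults" metric.
pub struct FaultCounter(AtomicU64);

impl FaultCounter {
    pub const fn new() -> Self {
        Self(AtomicU64::new(0))
    }

    /// Records one fault and returns the updated total.
    pub fn inc(&self) -> u64 {
        self.0.fetch_add(1, Ordering::Relaxed) + 1
    }

    /// Returns the total number of faults recorded so far.
    pub fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}
```

A retry loop would call `inc()` on each fault and sleep for `backoff_delay(base, max, attempt)` before restarting the channel task.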

Closes #161

@luketchang changed the title from "Luke/run refactor" to "refactor: isolate faulty channels and retry channel task on faults" on Feb 10, 2022
@kekonen
Member

kekonen commented Feb 18, 2022

@prestwich @ltchang2019 I merged the unit tests, so I believe this is ready to be merged.

@luketchang
Collaborator

@arnaud036 @yourbuddyconner let's run this PR in dev before merging.

@arnaud036
Contributor

arnaud036 commented Feb 23, 2022

This repo only builds a container once the PR is merged to main. We could change this behavior, but we should first agree on our git flow strategy for CI/CD.

The flow I had in mind was to continuously deploy main to dev whenever a PR is merged, and to deploy to prod once a tag is created. We could also consider eventually deploying to staging and using that environment as an integration environment.

@luketchang luketchang merged commit 50267ae into main Feb 23, 2022