Deployment Health Check and Automatic Restart #191

Merged
movchan74 merged 2 commits into main from deployment_health_check
Oct 29, 2024

Conversation

movchan74
Contributor

Summary:
This PR introduces an automatic health check and restart mechanism for deployments, addressing an issue where deployments returned errors but were not terminated, preventing automatic recovery by Ray.

Key Changes:

  1. Exception-Based Health Check: Added a decorator, exception_handler, which logs exceptions raised by deployment methods so that deployment health can be monitored and a restart triggered if necessary.
  2. Automatic Restart Logic: Deployments now restart if more than 50% of requests fail, so that a failing deployment is replaced instead of continuing to return errors.
  3. Tests: Added tests to verify that a deployment restarts after consecutive failures.
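
For readers who want a concrete picture of the mechanism, here is a minimal sketch of how the decorator and the 50% threshold could fit together. It is not the code merged in this PR: the class name `HealthTrackedDeployment`, the attributes `window_size` and `failure_threshold`, the sliding-window bookkeeping, and the synchronous methods are all assumptions made for illustration. The one part taken from Ray Serve itself is the `check_health` hook: Ray Serve calls a deployment's `check_health` method periodically and treats a raised exception as an unhealthy replica, which it then restarts.

```python
import functools
from collections import deque


class HealthTrackedDeployment:
    """Hypothetical base class that records recent request outcomes."""

    def __init__(self, window_size=10, failure_threshold=0.5):
        # Both parameters are illustrative assumptions, not the PR's settings.
        self.failure_threshold = failure_threshold
        # Sliding window of outcomes: True means the request failed.
        self.outcomes = deque(maxlen=window_size)

    def check_health(self):
        """Raise if more than `failure_threshold` of recent requests failed.

        Ray Serve invokes a method with this name periodically; raising
        marks the replica unhealthy so that it gets restarted.
        """
        if len(self.outcomes) < self.outcomes.maxlen:
            return  # Not enough requests observed yet to judge health.
        failure_rate = sum(self.outcomes) / len(self.outcomes)
        if failure_rate > self.failure_threshold:
            raise RuntimeError(
                f"Deployment unhealthy: failure rate {failure_rate:.0%}"
            )


def exception_handler(method):
    """Record whether each call succeeded, then re-raise any exception."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        try:
            result = method(self, *args, **kwargs)
        except Exception:
            self.outcomes.append(True)  # Count the failure.
            raise  # The caller still sees the original error.
        self.outcomes.append(False)  # Count the success.
        return result

    return wrapper


class FlakyDeployment(HealthTrackedDeployment):
    """Toy deployment used only to exercise the sketch."""

    @exception_handler
    def predict(self, value):
        if value < 0:
            raise ValueError("negative input")
        return value * 2
```

In this sketch a failure rate of exactly 50% does not trigger a restart, matching the "more than 50%" wording above. A real deployment would likely use async methods and reset the window after a restart; both are left out here to keep the example short.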

movchan74 added the enhancement and Reliability labels Oct 29, 2024
movchan74 requested a review from HRashidi October 29, 2024 09:09
movchan74 self-assigned this Oct 29, 2024
Contributor

HRashidi left a comment

Some comments for now!

aana/deployments/idefics_2_deployment.py (outdated, resolved)
aana/deployments/base_deployment.py (outdated, resolved)
…Remove restart_exceptions setting from reconfigure.
movchan74 requested a review from HRashidi October 29, 2024 11:42
movchan74 merged commit 580090c into main Oct 29, 2024
6 checks passed
movchan74 deleted the deployment_health_check branch October 29, 2024 13:31
Labels
enhancement, Reliability