Letting each condition control its frequency #150

Merged 2 commits on Oct 3, 2024

Conversation

ashishgo-aws
Contributor

With throttling applied at the subprocess level, every condition was checked once per minute. Some conditions, such as the TaskMonitor condition, need to run their checks at a different pace. This change applies throttling to each individual condition instead of applying it generically at the subprocess level. The TaskMonitor condition will now be executed once every 10 seconds instead of once per minute. This allows faster evaluation of worker idleness, which helps scale down the worker fleet more effectively.
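
To illustrate the idea, here is a minimal sketch of per-condition throttling. It is not the code from this PR: `ThrottledCondition`, `maybe_check`, and `run_loop_once` are hypothetical names, and the 10 and 60 second intervals are taken from the description above. Each condition tracks its own last run time, so the subprocess loop can call every condition on each pass and let the condition decide whether to do any work.

```python
import time
from typing import Callable, List, Optional


class ThrottledCondition:
    """Hypothetical condition that enforces its own minimum check interval."""

    def __init__(
        self,
        name: str,
        interval_seconds: float,
        check_fn: Callable[[], bool],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.name = name
        self.interval_seconds = interval_seconds
        self._check_fn = check_fn
        self._clock = clock  # injectable so the behavior can be tested
        self._last_run: Optional[float] = None
        self.last_result: Optional[bool] = None

    def maybe_check(self) -> bool:
        """Run the check only if the interval has elapsed. Return True if it ran."""
        now = self._clock()
        if self._last_run is not None and now - self._last_run < self.interval_seconds:
            return False  # throttled: too soon since the previous run
        self._last_run = now
        self.last_result = self._check_fn()
        return True


def run_loop_once(conditions: List[ThrottledCondition]) -> List[str]:
    """One pass of the subprocess loop. Returns the names of conditions that ran."""
    return [c.name for c in conditions if c.maybe_check()]
```

With this shape, a condition created with `interval_seconds=10` runs on passes where at least 10 seconds have elapsed since its last run, while one created with `interval_seconds=60` keeps the once-per-minute pace.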

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@ashishgo-aws ashishgo-aws self-assigned this Sep 27, 2024
ashishgo-aws added a commit to ashishgo-aws/amazon-mwaa-docker-images that referenced this pull request Oct 1, 2024
The task monitor needs to process the following 4 MWAA signals for the graceful update project:

1. Termination signal: Graceful termination of the workers when the environment is going through a graceful update
2. Resume signal: Reverting the state of graceful termination and resuming work when the environment is going through a rollback after attempting a graceful update
3. Kill signal: Shutting down the worker without waiting for the current Airflow tasks to finish when the environment is going through a forced update
4. Activation signal: Starting consumption of work from the queue after termination protection has been enabled on the corresponding Fargate task

The processing is gated behind certain environment variables which are either absent or marked as false for an environment which does not have graceful updates enabled.
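
The four signals and the environment-variable gate can be pictured with a toy model. This is a sketch under assumptions, not the task monitor's actual implementation: the `Signal` enum, the `WorkerState` class, and the variable name `EXAMPLE_GRACEFUL_UPDATES_ENABLED` are all hypothetical, since the real gating variables are not named in this conversation.

```python
import os
from enum import Enum
from typing import Mapping, Optional


class Signal(Enum):
    TERMINATION = "termination"
    RESUME = "resume"
    KILL = "kill"
    ACTIVATION = "activation"


# Hypothetical variable name, used only for this illustration.
GATE_ENV_VAR = "EXAMPLE_GRACEFUL_UPDATES_ENABLED"


def signals_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Signal processing is on only if the gate variable is present and 'true'."""
    env = os.environ if env is None else env
    return env.get(GATE_ENV_VAR, "false").strip().lower() == "true"


class WorkerState:
    """Toy state holder. A real worker would pause or resume queue consumption."""

    def __init__(self) -> None:
        self.consuming = False  # in this toy model, work starts on activation
        self.draining = False   # True while finishing in-flight tasks
        self.killed = False     # True once shut down without waiting

    def handle(self, signal: Signal, env: Optional[Mapping[str, str]] = None) -> bool:
        """Apply a signal. Return False if signal processing is gated off."""
        if not signals_enabled(env):
            return False
        if signal is Signal.TERMINATION:
            self.draining, self.consuming = True, False
        elif signal is Signal.RESUME:
            self.draining, self.consuming = False, True
        elif signal is Signal.KILL:
            self.killed, self.consuming = True, False
        elif signal is Signal.ACTIVATION:
            self.consuming = True
        return True
```

In an environment where the gate variable is absent or set to false, `handle` returns `False` and leaves the state untouched, which mirrors the gating described above.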

The CR brings the changes that have been merged for 2.9.2 (aws#137 and aws#150) to the newer version.
Pausing task consumption in the close method helps reduce the probability of a worker picking up work during the shutdown procedure. In addition, AIRFLOW__CELERY__WORKER_AUTOSCALE was set to "80,80" when the 2xlarge environment class was introduced. This change was left out when porting the improved autoscaling changes to 2.9.
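
For context on the "80,80" value: Airflow documents the Celery `worker_autoscale` setting as a "max,min" pair of pool process counts. The helper below is a small illustrative sketch for reading such a value. `parse_worker_autoscale` is a hypothetical function, not part of Airflow or of this repository.

```python
from typing import Tuple


def parse_worker_autoscale(value: str) -> Tuple[int, int]:
    """Split a 'max,min' autoscale string into (max_concurrency, min_concurrency)."""
    parts = [p.strip() for p in value.split(",")]
    if len(parts) != 2:
        raise ValueError(f"expected 'max,min', got {value!r}")
    max_concurrency, min_concurrency = int(parts[0]), int(parts[1])
    if min_concurrency > max_concurrency:
        raise ValueError("min concurrency must not exceed max concurrency")
    return max_concurrency, min_concurrency
```

With equal values, as in "80,80", the pool size is effectively fixed at 80 processes.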
Contributor

@kuyperse kuyperse left a comment


These changes also need to be ported to 2.10.1.

@ashishgo-aws
Contributor Author

> These changes also need to be ported to 2.10.1.

I believe I have covered these in #151. Is there anything you saw that is missing in that PR?

@ashishgo-aws ashishgo-aws requested a review from dhegberg October 3, 2024 22:08
@kuyperse
Contributor

kuyperse commented Oct 3, 2024

> > These changes also need to be ported to 2.10.1.
>
> I believe I have covered these in #151. Is there anything you saw that is missing in that PR?

I didn't check the other PRs to see if the changes were ported there. Nothing seems to be missing. LGTM.

@ashishgo-aws ashishgo-aws merged commit bd9c236 into aws:main Oct 3, 2024
1 check passed
ashishgo-aws added a commit that referenced this pull request Oct 3, 2024
@ashishgo-aws ashishgo-aws deleted the task-monitor-condition-speed branch October 3, 2024 22:52