
Simplify build-and-run-batch-run design #16

Open
dfsnow opened this issue Dec 27, 2023 · 2 comments

dfsnow (Member) commented Dec 27, 2023

The current Action for submitting Batch jobs to AWS has a few issues:

  • Job status is tracked by a long-running Actions job that polls Batch, which runs into the 6-hour Actions job limit for long jobs
  • Job resources can only be changed by editing the workflow YAML
  • The job queue and compute environment are instantiated for each PR, which is complicated and error-prone

After a bunch of research, I propose the following changes:

Switch to Deployments

Rather than polling the Batch job with a long-running Actions job, we should take advantage of GitHub Deployments. This would keep the current strategy for submitting jobs, but would offload status reporting to the Batch job itself.

This would involve adding an entrypoint script to all jobs that would authenticate with GitHub (via a GitHub App JWT), then update the deployment status as the Batch job runs.
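
A minimal sketch of what that entrypoint wrapper could look like, assuming the App ID, private key, installation ID, repository, and deployment ID are injected as environment variables (all variable names here are placeholders, not settled interface decisions):

```python
# Sketch only: a thin wrapper around the real job command that reports
# progress back to GitHub as deployment statuses. Environment variable
# names (GH_APP_ID, GH_APP_PRIVATE_KEY, GH_INSTALLATION_ID, GH_REPOSITORY,
# GH_DEPLOYMENT_ID) are placeholders for whatever we end up wiring through.
import os
import subprocess
import sys
import time

import jwt  # PyJWT, with the cryptography backend for RS256
import requests

API = "https://api.github.com"


def installation_token() -> str:
    """Exchange a short-lived GitHub App JWT for an installation token."""
    now = int(time.time())
    app_jwt = jwt.encode(
        {"iat": now - 60, "exp": now + 540, "iss": os.environ["GH_APP_ID"]},
        os.environ["GH_APP_PRIVATE_KEY"],
        algorithm="RS256",
    )
    resp = requests.post(
        f"{API}/app/installations/{os.environ['GH_INSTALLATION_ID']}/access_tokens",
        headers={
            "Authorization": f"Bearer {app_jwt}",
            "Accept": "application/vnd.github+json",
        },
    )
    resp.raise_for_status()
    return resp.json()["token"]


def set_deploy_status(state: str) -> None:
    """Post a deployment status, e.g. "in_progress", "success", "failure"."""
    resp = requests.post(
        f"{API}/repos/{os.environ['GH_REPOSITORY']}/deployments/"
        f"{os.environ['GH_DEPLOYMENT_ID']}/statuses",
        headers={
            "Authorization": f"token {installation_token()}",
            "Accept": "application/vnd.github+json",
        },
        json={"state": state},
    )
    resp.raise_for_status()


if __name__ == "__main__":
    set_deploy_status("in_progress")
    result = subprocess.run(sys.argv[1:])  # run the actual job command
    set_deploy_status("success" if result.returncode == 0 else "failure")
    sys.exit(result.returncode)
```

The Actions workflow that submits the Batch job would create the deployment up front and pass the deployment ID (plus the App credentials) into the container, so the Actions job can exit as soon as the submission succeeds.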

This completely obviates the need for the polling logic and sidesteps the 6-hour Actions job limit. It also allows us to...

Use different deploy environments to manage job types

Instead of controlling job resources by editing the workflow YAML, we can set up different deploy environments, each with an associated job size. This has several advantages:

  • It makes it much easier to change the job resources on the fly
  • It means we can add a dropdown on the workflow dispatch that lets you pick the deploy env (and therefore resources)
  • We can gate the largest, most costly jobs behind deploy protections

I envision this as basically a dropdown with the following environments (a sketch of how these could map to Batch resources follows the list):

  • Small (Fargate, 4 vCPU, 8 GB RAM)
  • Medium (Fargate, 16 vCPU, 32 GB RAM)
  • Large (EC2, 32 vCPU, 64 GB RAM)
  • LargeGPU (EC2, same as large + GPU)
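
As a rough illustration of the mapping from the dispatch input to a static queue plus resource overrides (queue names below are hypothetical; sizes mirror the list above):

```python
# Illustrative only: map the deploy environment chosen in the workflow
# dispatch dropdown to a static Batch job queue plus resource overrides.
import boto3

ENVIRONMENTS = {
    "small":     {"queue": "batch-queue-small-fargate",  "vcpu": "4",  "memory": "8192"},
    "medium":    {"queue": "batch-queue-medium-fargate", "vcpu": "16", "memory": "32768"},
    "large":     {"queue": "batch-queue-large-ec2",      "vcpu": "32", "memory": "65536"},
    "large-gpu": {"queue": "batch-queue-large-gpu-ec2",  "vcpu": "32", "memory": "65536"},
}


def submit_job(env_name: str, job_name: str, job_definition: str) -> str:
    """Submit a Batch job to the queue that corresponds to the chosen env."""
    env = ENVIRONMENTS[env_name]
    resources = [
        {"type": "VCPU", "value": env["vcpu"]},
        {"type": "MEMORY", "value": env["memory"]},  # MiB
    ]
    if env_name == "large-gpu":
        resources.append({"type": "GPU", "value": "1"})
    response = boto3.client("batch").submit_job(
        jobName=job_name,
        jobQueue=env["queue"],
        jobDefinition=job_definition,
        containerOverrides={"resourceRequirements": resources},
    )
    return response["jobId"]
```

In practice the environment name would come from the workflow_dispatch input, and gating the Large/LargeGPU options would be handled by deploy protection rules on those environments.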

Switch to static job queues and compute environments

We currently instantiate the job queue and compute environment for each PR. However, this seems unnecessarily complicated and error-prone. Instead, I propose we create four permanent job queue/compute environment pairs, one for each of the deploy environments above.

This way, the only thing we need to Terraform for each workflow is the job definition, based on the built container and the chosen deploy environment. We can further simplify the cleanup step to run after receiving a deployment status update from AWS; it would then only need to delete the job definition, since the other resources are static.
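
Under that design, the post-run cleanup could be as small as something like this (sketch; the job definition name is whatever the Terraform step registers):

```python
# Sketch of the simplified cleanup: with static queues and compute
# environments, only the per-workflow job definition has to go away.
import boto3


def cleanup_job_definition(job_definition_name: str) -> None:
    """Deregister every active revision of the workflow's job definition."""
    batch = boto3.client("batch")
    pages = batch.get_paginator("describe_job_definitions").paginate(
        jobDefinitionName=job_definition_name, status="ACTIVE"
    )
    for page in pages:
        for job_def in page["jobDefinitions"]:
            batch.deregister_job_definition(
                jobDefinition=job_def["jobDefinitionArn"]
            )
```

Everything else (the queues and compute environments) stays up permanently and is managed outside the per-PR workflow.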

dfsnow (Member, Author) commented Dec 27, 2023

@jeancochrane Can you take a look at this once you're back and drop a quick time estimate for these changes in this issue? Would also like your thoughts on whether these changes seem reasonable.

jeancochrane (Collaborator) commented:

As part of this issue, I think we should also rethink the comps compute environment -- perhaps we should separate the comps step from the pipeline and provision a separate set of resources for it, since it takes a long time to execute (and so is expensive) and could potentially have different resource requirements from the main modeling pipeline.
