Work in progress, provided as a minimal example.
The STILT model is a popular extension of the HYSPLIT Lagrangian particle dispersion model that uses atmospheric simulations to estimate the impact of surface emissions on a downstream receptor location. STILT takes meteorological data and a receptor location and outputs a footprint representing the receptor's sensitivity to upstream emissions. Refer to the STILT documentation for details about how the model is configured.
STILT's built-in batch processing workflows work well for traditional use cases on HPC systems, where nodes are assumed to be reliable and sets of simulations generally succeed or fail as a single unit. However, operationalizing large quantities of simulations requires distributed, fault-tolerant systems for reliable throughput. This is a minimal implementation for running STILT workloads at scale on Kubernetes, leveraging cost-conscious compute and storage systems.
- Scene: a group of associated simulations; represents all simulations related to a single gridded satellite retrieval.
- Pixel: represents a location within a scene. Each contains attributes for the retrieved column methane concentration, viewing geometry, and forecast meteorology, which provides wind and boundary layer characteristics.
- Simulation config: the configuration parameters used to tweak STILT's transport and dispersion methods. Configuration parameters often consist of shared attributes for a given scene.
- Receptor: the location and time (`x, y, z, t`) of interest, which represents the starting point of a time-backwards STILT simulation.
- Trajectories: the time resolved evolution of a particle ensemble traveling backward in time from the receptor location. STILT natively outputs particle trajectories as a data frame using R's built-in serialization format (`.rds`), which we selectively transform into parquet files for archival.
- Footprint: a grid of time resolved (`x, y, t`) or time integrated (`x, y`) sensitivities to fluxes, given as mole fraction per unit flux (`ppm / (umol m-2 s-1)`). STILT natively outputs footprints as NetCDF (`.nc`) files following the CF convention.
- Simulation: the STILT process which is given a receptor and relevant meteorology and returns trajectories and a footprint.
- Meteorology: meteorological model data in the ARL format, typically pulled from NOAA ARL or user-generated with the WRF model. The HRRR model (3 km) is often the best data available for the continental US, while the GFS (0.25 degree, ~25 km at midlatitudes) can be used globally.
- Meteorology Aggregate: a single-file collection of meteorology data, cropped to the space-time domain required by a simulation (`x, y, z, t`) to minimize resource consumption. Aggregates are commonly shared among simulations of a scene.
STILT is only compatible with UNIX systems but may run under WSL. Dependencies include:

- `postgres`
- `docker`
- `poetry`
Poetry defaults to maintaining virtual environments externally. To configure poetry to create virtual environments in the project directory, first run:

```
poetry config virtualenvs.in-project true
```
Install `stiltctl` with development dependencies into a virtual environment with:

```
make install
```
To execute the tests, we need a `postgres` database used for job queues. Run the local development dependencies defined in `docker-compose.yaml` with:

```
make dependencies
```
You can run the tests using any of the following options:
- `make ci`: check formatting, perform static analysis, and run tests.
- `make test`: run only `pytest` to check functionality and print test coverage and profiling reports.
- Use VSCode's built-in test runner for interactive development with the debugger.
This example deploys to GCP but most of the implementation is provider agnostic. Terraform deploys a single GKE cluster with `dev` and `prod` namespaces to isolate environments.
The Terraform configuration also deploys Kubernetes resources to reference the deployed infrastructure, including:
- A cluster-wide install of KEDA.
- A `db-secret` Kubernetes `Secret` in the `dev` and `prod` namespaces with `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_HOST`, and `POSTGRES_CONNECTION` keys.
- An `environment` Kubernetes `ConfigMap` in the `dev` and `prod` namespaces with `ENVIRONMENT`, `PROJECT`, `ARTIFACT_DRIVER`, and `ARTIFACT_BUCKET` keys.
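The notes at the end of this document mention that pods read `POSTGRES_CONNECTION` from their environment, so these resources are presumably injected as environment variables. A minimal sketch of how a service might consume them, assuming that injection:

```python
# Hedged sketch: reading configuration injected from the db-secret Secret and
# the environment ConfigMap, assuming both are exposed as environment variables.
import os

postgres_connection = os.environ["POSTGRES_CONNECTION"]  # from db-secret
environment = os.environ["ENVIRONMENT"]  # from the environment ConfigMap
artifact_bucket = os.environ["ARTIFACT_BUCKET"]  # e.g. a GCS bucket name

print(f"[{environment}] writing artifacts to {artifact_bucket}")
```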
Since database credentials are provisioned by Terraform during deployment, subsequent deployments will cycle passwords for the `dev` and `prod` Postgres instances.
Our test project runs on a 500 vCPU GKE cluster, but this requires quota increases from GCP. The example Terraform configurations limit the cluster to 50 vCPUs.
You'll need `gcloud` to point to an active configuration within an existing GCP project. See `gcloud init` and `gcloud config configurations list`.
Make sure that `gcloud` is configured to use application default credentials:

```
gcloud auth application-default login
```
Take a look in `terraform/main.tf` to be sure you know what you're deploying. Set these environment variables and run the deploy script:

```
export PROJECT=<project_name>
export REGION=us-central1
make infra
```
Set these environment variables and run the deploy script:
```
export PROJECT=<project_name>
export ENVIRONMENT=dev
make deploy
```
This deploys several services and a suspended `CronJob` specified in `helm/templates/scene-generator.yaml` that you can use as an example entrypoint. If you deployed to the `dev` environment as shown above, trigger the cronjob with:

```
kubectl -n dev create job scene-generator --from=cronjob/scene-generator
```
- Add use case to pass upstream aggregated retrieval DTO to create a scene and associated STILT configurations (currently `DomainConfig.get_simulation_manifests`).
- Using a scene's max PBLH, estimate the number of GFS vertical levels (sigma-pressure hybrid coordinates) required to run simulations (see `stiltctl.services.minimize_meteorology`).
- Pass a vertical weighting function into STILT to scale particle influence by retrieval sensitivity (a rough sketch of the rescaling appears at the end of this document). This requires a PR to upstream uataq/stilt adding a `before_footprint_path` option to `stilt_cli.r`, pointing to a file which contains a function named `before_footprint` that can rescale particle sensitivity (`output$particle$foot`) based on retrieval characteristics (water vapor, pressure, averaging kernel, etc.). More generally, STILT's HPC interface provides `before_footprint` and `before_trajec` arguments which can inject arbitrary user code at strategic points in the simulation workflow (see the STILT docs and the implementation).
- After all simulations of a scene are complete, materialize the footprint jacobian for the inverse model.
- Harden for various failure modes.
- KEDA should be namespace-scoped to test version updates.
- Default service accounts used here are overly permissive and should be revisited for production workloads.
- Postgres instances currently allow public access (with password auth) for development. Probably better off connecting via the Cloud SQL Auth Proxy mounted as a sidecar container in `stiltctl` service pods.
- The Kubernetes cluster uses node auto-provisioning, which should be replaced with user-defined node pools.
- This example uses surface-based simulations and will need to be modified for column simulations. See `tests.test_simulations.test_column_simulation` for an example of how to construct a column receptor (a conceptual sketch appears at the end of this document).
- This implementation makes use of sqlalchemy's imperative mapping (see `stiltctl.database` and `stiltctl.repositories`), which seems out of place in this stripped-down example. While the code where this is useful is stripped out here (not relevant to MethaneSAT), the imperative mapper can instrument domain models with persistence logic such that they can be used natively with sqlalchemy's `session` (after calling `stiltctl.database.configure_mappers`). It avoids a separate domain -> persistence model mapping service for each database-backed repository, and keeps domain models agnostic to their persistence mechanism (a minimal sketch appears at the end of this document).
- We use postgres as a job queue instead of pulling in a message broker (previous implementations have used RabbitMQ, Redis, and GCP Pub/Sub). Since our workload is compute constrained, the required messaging throughput is relatively small (on the order of 10 reads/writes per second) and can be handled by a minimally specced Postgres instance. It's one less connection to manage, provides clean transactional semantics around enqueue/dequeue operations that can be committed alongside other state changes, has a better failure/backup story, and can guarantee exactly-once processing of events (see the dequeue sketch at the end of this document). The main tradeoffs are that write throughput is constrained by how much you can vertically scale your Postgres instance (not an issue here), and modeling fan-in DAG dependencies pushes orchestration logic into application services. If modeling DAG dependencies becomes a challenge, you might be better off with the Airflow/Argo/Dagster/Flyte/Prefects of the world.
- Postgres credentials are provisioned by Terraform and passwords are rotated on each deploy. If you want to tinker around in the database, shell into one of the running pods and extract the postgres credentials from the `POSTGRES_CONNECTION` environment variable.
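Regarding the vertical weighting to-do above: the actual hook would be R code passed to STILT via `before_footprint`, but the rescaling idea is simple to sketch. A hedged numpy version with hypothetical names, not the proposed upstream implementation:

```python
# Hedged sketch of the rescaling a before_footprint hook might perform: weight
# each particle's sensitivity (output$particle$foot on the R side) by the
# retrieval's averaging kernel interpolated to the particle's pressure level.
import numpy as np


def rescale_particle_foot(
    foot: np.ndarray,  # per-particle sensitivity values
    particle_pressure: np.ndarray,  # pressure at each particle (hPa)
    ak_pressure: np.ndarray,  # averaging kernel pressure grid (hPa), ascending
    ak: np.ndarray,  # averaging kernel values on that grid
) -> np.ndarray:
    weights = np.interp(particle_pressure, ak_pressure, ak)
    return foot * weights
```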
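Regarding column simulations: `tests.test_simulations.test_column_simulation` is the authoritative example, but conceptually a column receptor is a set of releases distributed over height at the same horizontal location and time. A hypothetical sketch:

```python
# Hypothetical sketch of a column receptor: particles released at multiple
# heights above the same (x, y, t). See the referenced test for the actual
# construction used by stiltctl.
from datetime import datetime, timezone

time = datetime(2024, 1, 1, 18, tzinfo=timezone.utc)
heights_m = range(50, 3050, 100)  # release levels, meters above ground

column_receptor = [
    {"longitude": -111.85, "latitude": 40.77, "height": z, "time": time}
    for z in heights_m
]
```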
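Regarding the sqlalchemy note: a minimal sketch of imperative mapping, with a hypothetical table and model rather than the actual `stiltctl` schema:

```python
# Hedged sketch of sqlalchemy imperative mapping: the domain model is a plain
# class with no ORM base, and persistence is configured separately.
from sqlalchemy import Column, MetaData, String, Table
from sqlalchemy.orm import registry

mapper_registry = registry()
metadata = MetaData()


class Scene:  # plain domain model, unaware of the database
    def __init__(self, scene_id: str):
        self.scene_id = scene_id


scene_table = Table("scenes", metadata, Column("scene_id", String, primary_key=True))


def configure_mappers() -> None:
    """After this call, Scene instances work natively with a sqlalchemy Session."""
    mapper_registry.map_imperatively(Scene, scene_table)
```

And regarding the Postgres job queue: the transactional dequeue typically relies on `SELECT ... FOR UPDATE SKIP LOCKED` so concurrent workers pull different rows without blocking each other. A hedged sketch with a hypothetical table, not the actual `stiltctl` queue schema:

```python
# Hedged sketch of a transactional dequeue from a Postgres-backed job queue.
# If the handler raises, the transaction rolls back and the row stays queued;
# committing the handler's writes in the same transaction is what gives the
# exactly-once processing described above.
import sqlalchemy as sa

engine = sa.create_engine("postgresql+psycopg2://user:pass@host/stilt")

DEQUEUE = sa.text(
    """
    DELETE FROM simulation_queue
    WHERE id = (
        SELECT id FROM simulation_queue
        ORDER BY id
        FOR UPDATE SKIP LOCKED
        LIMIT 1
    )
    RETURNING id, payload
    """
)


def process_next(handler) -> bool:
    """Dequeue one job and process it within a single transaction."""
    with engine.begin() as conn:  # commits on success, rolls back on error
        row = conn.execute(DEQUEUE).first()
        if row is None:
            return False  # queue drained
        handler(row.payload)  # other state changes can join this transaction
        return True
```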