
Ideas for running dagster in the cloud #49

Open
peterdudfield opened this issue Jan 9, 2024 · 8 comments

Comments

@peterdudfield
Contributor

peterdudfield commented Jan 9, 2024

When we don't have enough internet bandwidth, it might be worth thinking about whether we can run things in the cloud.
There are quite a few different options.

  1. Stick with the current setup
  • wait it out, but prioritise things
  • no cloud costs
  2. Move ICON to use the Planetary Computer. This is free to run. Resources: 3.2 TB RAM + a 400-worker Dask cluster. Can start up VMs too.
  • good: it's free
  • bad: it's not in Dagster
  • might take time to get working
  • we can use this for any data we want to make public, as we can save straight to HF.
  3. Run some jobs in Dagster Cloud. https://dagster.io/
  • this could be expensive
  • might be annoying having two different Dagster deployments
  • might be a good option to just run some smaller jobs there. Smaller datasets like ECMWF can be stored in GCS.
  4. Move everything to Dagster Cloud. https://dagster.io/
  • think this is too expensive,
  • and expensive to store the data somewhere
  • not very flexible
  5. Deploy our own Dagster instance on GCP.
  • might be complicated to do and could take a while to get going.
  • can be scaled up and down.
  6. From local Dagster, trigger jobs on GCP.
  • https://docs.dagster.io/_apidocs/libraries/dagster-gcp
  • Not sure we can do it, but it needs looking into. Ideally we could trigger Cloud Run jobs and save data to GCS.
  • This could also be used for saving stuff to HF.
  • Might be expensive.
  • Could use Dataproc to run jobs; lots of different frameworks are supported.
  7. Buy another computer in a different location.
  • another computer to manage
  • need to communicate between machines, which might need some setup. Large setup.
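
Option 6 could look something like the sketch below: a plain function (which could be wrapped in a Dagster `@asset`) that fires a Cloud Run job via the `google-cloud-run` client and waits for it to finish. The project, region, and job names are placeholders, not real infrastructure, and this is an untested sketch rather than a working setup.

```python
"""Sketch for option 6: trigger a Cloud Run job from local Dagster.

Assumes the `google-cloud-run` package; all resource names below are
hypothetical placeholders.
"""


def cloud_run_job_path(project: str, region: str, job: str) -> str:
    """Build the fully-qualified Cloud Run job resource name."""
    return f"projects/{project}/locations/{region}/jobs/{job}"


def trigger_ecmwf_download() -> None:
    """Run the (hypothetical) consumer job on Cloud Run and block until done."""
    # Imported lazily so the pure helper above is usable without GCP libraries.
    from google.cloud import run_v2

    client = run_v2.JobsClient()
    operation = client.run_job(
        request=run_v2.RunJobRequest(
            name=cloud_run_job_path("our-project", "europe-west1", "nwp-consumer-ecmwf")
        )
    )
    operation.result()  # wait for the Cloud Run execution to finish
```

The job itself would be a container (e.g. nwp-consumer) configured to write its output straight to GCS, so the local Dagster only does orchestration, not data movement.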
@jacobbieker
Member

For 2, we can use kbatch to launch jobs; they run in a 32GB RAM VM by default and can include anything (nwp-consumer, HF uploading, etc.). They have a 24-hour time limit on the machines, so if jobs take longer than that, it won't work.
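
A kbatch submission could be scripted roughly like this; the CLI flags (`--name`, `--image`, `--command`, `--code`) are my recollection of the kbatch docs and the image name is a guess, so double-check both before relying on it.

```python
"""Sketch: submit the EUMETSAT conversion script to the Planetary Computer
via the kbatch CLI. Flag names and the image are assumptions."""
import json
import subprocess


def build_kbatch_cmd(name: str, image: str, command: list[str], code: str) -> list[str]:
    """Assemble the kbatch argv; kept pure so it is easy to test."""
    return [
        "kbatch", "job", "submit",
        f"--name={name}",
        f"--image={image}",
        f"--command={json.dumps(command)}",  # kbatch takes the command as JSON
        f"--code={code}",
    ]


def submit_eumetsat_job() -> None:
    """Fire off the job; kbatch must be installed and configured first."""
    subprocess.check_call(
        build_kbatch_cmd(
            name="eumetsat-to-zarr",
            image="mcr.microsoft.com/planetary-computer/python:latest",  # hypothetical
            command=["python", "eumetsat.py"],
            code="eumetsat.py",
        )
    )
```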

@devsjc
Collaborator

devsjc commented Jan 9, 2024

Also we could still track it from dagster using the ExternalAsset resource if we wanted to!
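
In recent Dagster (1.5+) the "external asset" idea in the comment above maps to `AssetSpec` plus `external_assets_from_specs`: the file produced elsewhere shows up in the asset graph, and materializations are reported back via the Dagster API when the external job finishes. A minimal sketch, with a placeholder asset key:

```python
"""Sketch of tracking an externally-produced file in Dagster (assumes
dagster>=1.5; the asset key is a hypothetical placeholder)."""

# Keys of datasets produced outside Dagster (e.g. on the Planetary Computer).
EXTERNAL_ASSET_KEYS = ["icon_global_zarr"]


def make_defs():
    """Build Definitions that surface the external files in the asset graph."""
    # Imported lazily so this module loads without dagster installed.
    from dagster import AssetSpec, Definitions, external_assets_from_specs

    specs = [AssetSpec(key) for key in EXTERNAL_ASSET_KEYS]
    return Definitions(assets=external_assets_from_specs(specs))
```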

@peterdudfield
Contributor Author

Also we could still track it from dagster using the ExternalAsset resource if we wanted to!

Does this mean we have the option to run it somewhere else, but then Dagster can just track whether the file is there at the end?

@peterdudfield
Contributor Author

I'm beginning to like 2 and 6 more. It means we have one mission control, but can scale things out to different platforms as we need to.

@jacobbieker
Member

For reference on Planetary Computer, this is basically the script I have that processes EUMETSAT data and saves to huggingface: https://github.com/jacobbieker/planetary-datasets/blob/main/planetary_datasets/conversion/zarr/eumetsat.py

@jacobbieker
Member

The only thing in the PC version is steps that install Satip in the VM

@peterdudfield
Contributor Author

The only thing in the PC version is steps that install Satip in the VM

what do you mean?

@jacobbieker
Member

I have a very slightly modified version of that script that just has this at the top of the file, to make it simpler to run the script there and to install the missing dependencies in the VM. The VMs come with geospatial stuff already installed.

"""This gathers and uses the global mosaic of geostationary satellites from NOAA on AWS"""
import subprocess

def install(package):
    """Install a package into the notebook environment's Python."""
    subprocess.check_call(["/srv/conda/envs/notebook/bin/python", "-m", "pip", "install", package])

install("datasets")
install("satip")

# Convert EUMETSAT raw imagery files to Zarr
try:
    import satip
except ImportError:
    print("Please install Satip to continue")