From a9a12ddfc1e0b401217a7fbcc92711ca2e1cef13 Mon Sep 17 00:00:00 2001
From: thalassemia
Date: Fri, 29 Nov 2024 01:46:31 -0800
Subject: [PATCH] Add Google Cloud documentation

---
 doc/gcloud.rst | 268 +++++++++++++++++++++++++++++++++++++++++++++++++
 doc/index.rst  |   1 +
 2 files changed, 269 insertions(+)
 create mode 100644 doc/gcloud.rst

diff --git a/doc/gcloud.rst b/doc/gcloud.rst
new file mode 100644
index 000000000..ab76205e6
--- /dev/null
+++ b/doc/gcloud.rst
@@ -0,0 +1,268 @@
+============
+Google Cloud
+============
+
+Large vEcoli workflows can be run cost-effectively on Google Cloud. This section
+covers setup starting from a fresh project, running workflows, and handling outputs.
+
+Members of the Covert Lab should skip to the `Create Your VM`_ section for setup.
+
+-------------------
+Fresh Project Setup
+-------------------
+
+Create a new project for vEcoli using `this link `_.
+Choose any name that you like, and you should be brought to the Google Cloud
+console dashboard for your new project. Use the top search bar to find
+the following APIs and enable them:
+
+- Compute Engine
+- Cloud Build
+- Artifact Registry
+
+You will be asked to link a billing account at this time.
+
+.. tip::
+    If you are new to Google Cloud, we recommend that you take some time to
+    familiarize yourself with the Cloud console after enabling the above APIs.
+
+.. warning::
+    By default, vEcoli workflows request spot virtual machines (VMs), which are
+    much cheaper than standard VMs but do not have guaranteed availability and
+    can be preempted. The workflow is configured to automatically retry jobs that
+    fail due to preemption. However, the free credit ($300 as of Nov. 2024)
+    offered to new users cannot be used to pay for spot VMs. If you would like to
+    try the model using free credits, you can tell vEcoli to use standard VMs by
+    deleting ``google.batch.spot = true`` from ``runscripts/nextflow/config.template``.
+
+Set a default region and zone for Compute Engine following
+`these instructions `_.
+This avoids unnecessary charges for multi-region data availability and access,
+improves latency, and is required for some of vEcoli's code to work.
+
+Create a new repository in Artifact Registry following the steps
+on `this page`_.
+Make sure to name the repository ``vecoli`` and create it in the same
+region as your Compute Engine default. This is where the Docker images
+used to run the workflow will be stored (see `Build Docker Images`_).
+
+Compute Engine VMs come with `service accounts `_ that
+allow users to control access to project resources (compute, storage, etc.).
+To run vEcoli workflows, only a small subset of the default
+service account permissions is necessary. For that reason, we strongly
+recommend that users either modify the default Compute Engine service
+account permissions or create a dedicated vEcoli service account.
+
+Using the `Google Cloud console `_,
+navigate to the "IAM & Admin" panel. You can edit the Compute Engine default
+service account on this page by clicking the pencil icon in the corresponding row.
+To create a new service account, click the "Service Accounts" tab in the sidebar
+and then "Create Service Account". Once you get to a section where you
+can assign roles to the service account, assign the following set of roles:
+
+ - Artifact Registry Writer
+ - Batch Agent Reporter
+ - Batch Job Editor
+ - Cloud Build Editor
+ - Compute Instance Admin (v1)
+ - Logs Writer
+ - Monitoring Metric Writer
+ - Service Account User
+ - Storage Object Admin
+ - Viewer
+
+If you created a dedicated service account, keep its associated email handy
+as you will need it to create your VM in the next step.
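+
+The same setup can also be scripted with the ``gcloud`` CLI. The following is a
+hedged sketch, not part of the official setup: the service account name
+``vecoli-sa`` and project ID ``MY_PROJECT`` are placeholders, and the role IDs
+correspond to the roles listed above::
+
+    # Create a dedicated service account (name is a placeholder)
+    gcloud iam service-accounts create vecoli-sa --project=MY_PROJECT
+
+    # Grant each of the roles listed above to the new service account
+    for ROLE in roles/artifactregistry.writer roles/batch.agentReporter \
+        roles/batch.jobsEditor roles/cloudbuild.builds.editor \
+        roles/compute.instanceAdmin.v1 roles/logging.logWriter \
+        roles/monitoring.metricWriter roles/iam.serviceAccountUser \
+        roles/storage.objectAdmin roles/viewer; do
+        gcloud projects add-iam-policy-binding MY_PROJECT \
+            --member="serviceAccount:vecoli-sa@MY_PROJECT.iam.gserviceaccount.com" \
+            --role="$ROLE"
+    done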
+
+--------------
+Create Your VM
+--------------
+
+Click on the terminal shell icon near the top-right corner of the
+`Cloud console `_. Run the command
+``gcloud init`` and choose to reinitialize your configuration. Choose
+the right account and project, allowing ``gcloud`` to pull in the
+project's default Compute Engine zone and region.
+
+.. tip::
+    We are using the Cloud Shell built into the Cloud console for convenience.
+    If you would like to do this locally, you can install the ``gcloud``
+    CLI on your machine following `these steps `_.
+
+Once done, run the following to create a Compute Engine VM to run your workflows,
+replacing ``INSTANCE_NAME`` with a unique name of your choosing and ``SERVICE_ACCT``
+as described below::
+
+    gcloud compute instances create INSTANCE_NAME \
+        --shielded-secure-boot \
+        --machine-type=e2-medium \
+        --scopes=cloud-platform \
+        --service-account=SERVICE_ACCT
+
+If you created a new service account earlier in the setup process, substitute
+the email address for that service account. If you are a member of the Covert Lab
+or have been granted access to the Covert Lab project, substitute
+``fireworker@allen-discovery-center-mcovert.iam.gserviceaccount.com``. Otherwise,
+including if you edited the default service account permissions, run
+the above command without the ``--service-account`` flag.
+
+.. warning::
+    Remember to stop your VM when you are done using it. You can either do this
+    through the Cloud console or by running ``gcloud compute instances stop INSTANCE_NAME``.
+    You can always restart the instance when you need it again, and your files
+    will persist across sessions.
+
+SSH into your newly created VM (if you get a connection error, wait a moment and
+then retry)::
+
+    gcloud compute ssh INSTANCE_NAME
+
+Now, on the VM, initialize ``gcloud`` by running ``gcloud init`` and selecting the
+right service account and project. Next, install Git and clone the vEcoli repository::
+
+    sudo apt update && sudo apt install git
+    git clone https://github.com/CovertLab/vEcoli.git
+
+Try running ``python3 -m venv vEcoli-env``; if it fails, the error message tells
+you which version of ``venv`` you need to ``sudo apt install``. Once that is
+installed, run ``python3 -m venv vEcoli-env`` again to create a virtual environment.
+Activate the virtual environment by running ``source vEcoli-env/bin/activate``.
+
+.. tip::
+    Instead of doing this manually every time you start your VM, you can append
+    ``source $HOME/vEcoli-env/bin/activate`` to your ``~/.bashrc``.
+
+With the virtual environment activated, navigate into the cloned vEcoli
+repository and install the required Python packages (check README.md and
+requirements.txt for the correct versions)::
+
+    cd vEcoli
+    pip install --upgrade pip setuptools==73.0.1 wheel
+    pip install numpy==1.26.4
+    pip install -r requirements.txt
+    make clean compile
+
+Then, install Java (through SDKMAN) and Nextflow following
+`these instructions `_.
+
+------------------
+Create Your Bucket
+------------------
+
+vEcoli workflows persist their final outputs to a Cloud Storage
+bucket. To create a bucket, follow the steps on `this page`_. By default,
+buckets are created in the US multi-region. We strongly recommend changing this to
+the same single region as your Compute Engine default (``us-west1`` for Covert Lab).
+All other settings can be kept as default.
+
+.. danger::
+    Do NOT use underscores or special characters in your bucket name. Hyphens are OK.
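+
+Alternatively, a bucket can be created from the command line. A minimal sketch,
+assuming a hypothetical bucket name ``my-vecoli-out`` and the ``us-west1`` region::
+
+    gcloud storage buckets create gs://my-vecoli-out --location=us-west1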
+
+Once you have created your bucket, tell vEcoli to use that bucket by setting the
+``out_uri`` key under the ``emitter_arg`` key in your config JSON (see `json_config`_).
+The URI should be in the form ``gs://{bucket name}``. Remember to remove the ``out_dir``
+key under ``emitter_arg`` if present.
+
+-------------------
+Build Docker Images
+-------------------
+
+On Google Cloud, each job in a workflow (ParCa, sim 1, sim 2, etc.) is run
+on its own temporary VM. To ensure reproducibility, workflows run on Google
+Cloud must be run using Docker containers. vEcoli contains scripts in the
+``runscripts/container`` folder to build the required Docker images from the
+current state of your repository.
+
+``build-runtime.sh`` builds a base Docker image containing the Python packages
+necessary to run vEcoli as listed in ``requirements.txt``. After the build is
+finished, the Docker image is automatically uploaded to the Artifact Registry
+repository called ``vecoli``.
+
+``build-wcm.sh`` builds on the base image created by ``build-runtime.sh`` by copying
+the files in the cloned vEcoli repository, including any uncommitted changes. Note
+that files matching any entry in ``.gitignore`` are not copied. The built image is
+also uploaded to the ``vecoli`` Artifact Registry repository.
+
+.. tip::
+    If you want to build these Docker images for local testing, you can run
+    these scripts locally as long as you have Docker installed.
+
+These scripts are mostly not meant to be run manually. Instead, users should let
+:py:mod:`runscripts.workflow` handle this automatically by setting the following
+keys in their configuration JSON::
+
+    {
+        "gcloud": {
+            "runtime_image_name": "Name of image build-runtime.sh built/will build",
+            "build_runtime_image": "Boolean; can be false if requirements.txt has
+                not changed since the last time this was true",
+            "wcm_image_name": "Name of image build-wcm.sh built/will build",
+            "build_wcm_image": "Boolean; can be false if nothing in the repository
+                has changed since the last time this was true"
+        }
+    }
+
+These configuration keys, in addition to the ``out_uri`` key under ``emitter_arg``,
+are necessary and sufficient to tell :py:mod:`runscripts.workflow` that you intend to
+run the workflow on Google Cloud. After setting these options in your configuration JSON,
+you can use ``screen`` to open a virtual console that will persist even after your SSH
+connection is closed. In that virtual console, invoke :py:mod:`runscripts.workflow`
+as normal to start your workflow::
+
+    python runscripts/workflow.py --config {}
+
+Once your workflow has started, you can press "ctrl+a d" to detach from the
+virtual console, then close your SSH connection to your VM. The VM must continue
+to run until the workflow is complete. You can SSH into the VM and reconnect to
+the virtual console with ``screen -r`` to monitor progress, or inspect the file
+``.nextflow.log`` in the root of the cloned repository.
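+
+Putting it all together, a complete cloud configuration might look like the
+following sketch, where the bucket name and image names are hypothetical::
+
+    {
+        "emitter_arg": {
+            "out_uri": "gs://my-vecoli-out"
+        },
+        "gcloud": {
+            "runtime_image_name": "vecoli-runtime",
+            "build_runtime_image": true,
+            "wcm_image_name": "vecoli-wcm",
+            "build_wcm_image": true
+        }
+    }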
+
+----------------
+Handling Outputs
+----------------
+
+Once a workflow is complete, all of its outputs should be contained within the Cloud
+Storage bucket at the URI in the ``out_uri`` key under ``emitter_arg`` in the
+configuration JSON. We strongly discourage users from trying to download this data,
+as that will incur significant egress charges. Instead, you should use your VM to run
+analyses, which avoids these charges as long as your VM and bucket are in the same region.
+
+Data stored in Cloud Storage is billed by volume and storage duration (prorated).
+Storing terabytes of simulation data on Cloud Storage can cost upwards of
+$1,000/year, dwarfing the cost of the compute needed to generate that data. For that
+reason, we recommend that you delete workflow output data from your bucket as soon as
+you are done with your analyses. If you need the data again later, it will likely be
+cheaper to re-run the workflow to regenerate it than to keep it around.
+
+---------------
+Troubleshooting
+---------------
+
+Cloud Storage Permission Issue
+------------------------------
+
+If you are trying to launch a cloud workflow or access cloud
+output (e.g. run an analysis script) from a local computer, you
+may encounter an error like the following::
+
+    HttpError: Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket. Permission 'storage.objects.list' denied on resource (or it may not exist)., 401
+
+We do not recommend using local computers to launch
+cloud workflows because that would require the computer to be on and connected
+to the internet for the entire duration of the workflow. We STRONGLY discourage
+using a local computer to run analyses on workflow output saved in
+Cloud Storage, as that will incur hefty data egress charges.
+
+Instead, users should stick to launching workflows and running analysis scripts
+on Compute Engine VMs. Small VMs are fairly cheap to keep running for the duration
+of a workflow, and larger VMs can be created to leverage DuckDB's multithreading
+for fast reading of workflow outputs stored in Cloud Storage. Assuming the VMs are
+in the same region as the Cloud Storage bucket being accessed, no egress charges
+will be applied, resulting in much lower costs.
+
+If you absolutely must interact with cloud resources from a local machine, the above
+error may be resolved by running the following command to generate credentials that
+will be automatically picked up by PyArrow::
+
+    gcloud auth application-default login
+
diff --git a/doc/index.rst b/doc/index.rst
index fa6036616..f89750e2e 100644
--- a/doc/index.rst
+++ b/doc/index.rst
@@ -43,4 +43,5 @@ for developing and running the model. We recommend new users read through the se
    output
    tutorial
    docs
+   gcloud
    API Reference