Add Google Cloud documentation
thalassemia committed Nov 29, 2024
1 parent e636624 commit a9a12dd
Showing 2 changed files with 269 additions and 0 deletions.
268 changes: 268 additions & 0 deletions doc/gcloud.rst
@@ -0,0 +1,268 @@
============
Google Cloud
============

Large vEcoli workflows can be run cost-effectively on Google Cloud. This section
covers setup starting from a fresh project, running workflows, and handling outputs.

Members of the Covert Lab should skip to the `Create Your VM`_ section for setup.

-------------------
Fresh Project Setup
-------------------

Create a new project for vEcoli using `this link <https://console.cloud.google.com/projectcreate>`_.
Choose any name that you like; once the project is created, you should be brought
to the Google Cloud console dashboard for it. Use the top search bar to find
the following APIs and enable them:

- Compute Engine
- Cloud Build
- Artifact Registry

You will be asked to link a billing account at this time.
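
If you prefer the command line, the same APIs can also be enabled from Cloud Shell.
A minimal sketch (the service names are the standard Google Cloud API identifiers)::

    # Enable the APIs needed by vEcoli workflows
    gcloud services enable \
        compute.googleapis.com \
        cloudbuild.googleapis.com \
        artifactregistry.googleapis.com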

.. tip::
    If you are new to Google Cloud, we recommend that you take some time to
    familiarize yourself with the Cloud console after enabling the above APIs.

.. warning::
    By default, vEcoli workflows request spot virtual machines (VMs), which are much
    cheaper than standard VMs but do not have guaranteed availability and can be
    preempted. The workflow is configured to automatically retry jobs that fail due
    to preemption. However, the free credit ($300 as of Nov. 2024) offered to new
    users cannot be used to pay for spot VMs. If you would like to try the model
    using free credits, you can tell vEcoli to use standard VMs by deleting
    ``google.batch.spot = true`` from ``runscripts/nextflow/config.template``.

Set a default region and zone for Compute Engine following
`these instructions <https://cloud.google.com/compute/docs/regions-zones/changing-default-zone-region#console>`_.
This avoids unnecessary charges for multi-region data availability and access,
improves latency, and is required for some of vEcoli's code to work.
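
If you are working from Cloud Shell, a sketch of the equivalent CLI command is shown
below; the region and zone are examples (the Covert Lab uses ``us-west1``), so pick
whichever is closest to you::

    # Project-wide defaults picked up by new Compute Engine resources
    gcloud compute project-info add-metadata \
        --metadata google-compute-default-region=us-west1,google-compute-default-zone=us-west1-b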

Create a new repository in Artifact Registry following the steps
on `this page <https://cloud.google.com/artifact-registry/docs/repositories/create-repos>`_.
Make sure to name the repository ``vecoli`` and create it in the same
region as your Compute Engine default. This is where the Docker images
used to run the workflow will be stored (see `Build Docker Images`_).
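
The repository can also be created from the command line; a sketch, assuming the
``us-west1`` region (substitute your own Compute Engine default)::

    gcloud artifacts repositories create vecoli \
        --repository-format=docker \
        --location=us-west1 \
        --description="Docker images for vEcoli workflows"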

Compute Engine VMs come with `service accounts <https://cloud.google.com/compute/docs/access/service-accounts>`_
that allow users to control access to project resources (compute, storage, etc.).
To run vEcoli workflows, only a small subset of the default
service account permissions is necessary. For that reason, we strongly
recommend that users either modify the default Compute Engine service
account permissions or create a dedicated vEcoli service account.

Using the `Google Cloud console <https://console.cloud.google.com>`_,
navigate to the "IAM & Admin" panel. You can edit the Compute Engine default
service account on this page by clicking the pencil icon in the corresponding row.
To create a new service account, click the "Service Accounts" tab in the side bar
and then "Create Service Account". Once you get to a section where you
can assign roles to the service account, assign the following set of roles:

- Artifact Registry Writer
- Batch Agent Reporter
- Batch Job Editor
- Cloud Build Editor
- Compute Instance Admin (v1)
- Logs Writer
- Monitoring Metric Writer
- Service Account User
- Storage Object Admin
- Viewer

If you created a dedicated service account, keep its associated email handy
as you will need it to create your VM in the next step.
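
If you would rather script this step, the sketch below creates a dedicated service
account and grants the roles listed above. The project ID and service account name
are placeholders, and the role IDs are our best mapping of the console role names::

    PROJECT_ID=my-vecoli-project     # placeholder: your project ID
    SA_NAME=vecoli-runner            # placeholder: any service account name
    gcloud iam service-accounts create "$SA_NAME" --display-name="vEcoli workflows"
    for ROLE in \
        roles/artifactregistry.writer \
        roles/batch.agentReporter \
        roles/batch.jobsEditor \
        roles/cloudbuild.builds.editor \
        roles/compute.instanceAdmin.v1 \
        roles/logging.logWriter \
        roles/monitoring.metricWriter \
        roles/iam.serviceAccountUser \
        roles/storage.objectAdmin \
        roles/viewer
    do
        gcloud projects add-iam-policy-binding "$PROJECT_ID" \
            --member="serviceAccount:${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com" \
            --role="$ROLE"
    done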

--------------
Create Your VM
--------------

Click on the terminal shell icon near the top-right corner of the
`Cloud console <https://console.cloud.google.com>`_. Run the command
``gcloud init`` and choose to reinitialize your configuration. Choose
the right account and project, allowing ``gcloud`` to pull in the
project's default Compute Engine zone and region.

.. tip::
    We are using the Cloud Shell built into Cloud console for convenience.
    If you would like to do this locally, you can install the ``gcloud``
    CLI on your machine following `these steps <https://cloud.google.com/sdk/docs/install>`_.

Once done, run the following to create a Compute Engine VM to run your workflows,
replacing ``INSTANCE_NAME`` with a unique name of your choosing and ``SERVICE_ACCT``
as described below::

    gcloud compute instances create INSTANCE_NAME \
        --shielded-secure-boot \
        --machine-type=e2-medium \
        --scopes=cloud-platform \
        --service-account=SERVICE_ACCT

If you created a new service account earlier in the setup process, substitute
the email address for that service account. If you are a member of the Covert Lab
or have been granted access to the Covert Lab project, substitute
``fireworker@allen-discovery-center-mcovert.iam.gserviceaccount.com``. Otherwise,
including if you edited the default service account permissions, run
the above command without the ``--service-account`` flag.

.. warning::
    Remember to stop your VM when you are done using it. You can either do this
    through the Cloud console or by running ``gcloud compute instances stop INSTANCE_NAME``.
    You can always restart the instance when you need it again, and your files will
    persist across sessions.

SSH into your newly created VM (if you get a connection error, wait a moment and retry)::

    gcloud compute ssh INSTANCE_NAME

Now, on the VM, initialize ``gcloud`` by running ``gcloud init`` and selecting the
right service account and project. Next, install Git and clone the vEcoli repository::

    sudo apt update && sudo apt install git
    git clone https://github.com/CovertLab/vEcoli.git

Try running ``python3 -m venv vEcoli-env`` and read the error message to find
which ``venv`` package you need to ``sudo apt install``. Once that is installed,
run ``python3 -m venv vEcoli-env`` again to create a virtual environment, then
activate it by running ``source vEcoli-env/bin/activate``.
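
As an illustration, a typical session might look like the following; the exact
package name comes from the error message on your VM, and ``python3.11-venv`` here
is only an example::

    python3 -m venv vEcoli-env          # fails and names the venv package to install
    sudo apt install python3.11-venv    # substitute the version the error message suggests
    python3 -m venv vEcoli-env          # now succeeds
    source vEcoli-env/bin/activate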

.. tip::
    Instead of doing this manually every time you start your VM, you can append
    ``source $HOME/vEcoli-env/bin/activate`` to your ``~/.bashrc``.

With the virtual environment activated, navigate into the cloned vEcoli
repository and install the required Python packages (check README.md and
requirements.txt for correct versions)::

    cd vEcoli
    pip install --upgrade pip setuptools==73.0.1 wheel
    pip install numpy==1.26.4
    pip install -r requirements.txt
    make clean compile

Then, install Java (through SDKMAN) and Nextflow following
`these instructions <https://www.nextflow.io/docs/latest/install.html>`_.
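For reference, a condensed sketch of those instructions; the Java version is only
an example, and any version Nextflow supports will do::

    curl -s https://get.sdkman.io | bash
    source "$HOME/.sdkman/bin/sdkman-init.sh"
    sdk install java 17.0.10-tem        # example distribution/version
    curl -s https://get.nextflow.io | bash
    chmod +x nextflow
    mkdir -p "$HOME/.local/bin" && mv nextflow "$HOME/.local/bin"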

------------------
Create Your Bucket
------------------

vEcoli workflows persist their final outputs to a Cloud Storage
bucket. To create a bucket, follow the steps on
`this page <https://cloud.google.com/storage/docs/creating-buckets>`_. By default,
buckets are created in the US multi-region. We strongly recommend changing this to
the same single region as your Compute Engine default (``us-west1`` for Covert Lab).
All other settings can be kept as default.

.. danger::
    Do NOT use underscores or special characters in your bucket name. Hyphens are OK.
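
If you prefer the CLI, a bucket can be created with a single command; the bucket
name below is a placeholder, and the location assumes the ``us-west1`` default::

    gcloud storage buckets create gs://my-vecoli-outputs --location=us-west1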

Once you have created your bucket, tell vEcoli to use that bucket by setting the
``out_uri`` key under the ``emitter_arg`` key in your config JSON (see `json_config`_).
The URI should be in the form ``gs://{bucket name}``. Remember to remove the ``out_dir``
key under ``emitter_arg`` if present.
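
For example, with a hypothetical bucket named ``my-vecoli-outputs``, the relevant
part of the configuration JSON would look like::

    {
        "emitter_arg": {
            "out_uri": "gs://my-vecoli-outputs"
        }
    }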

-------------------
Build Docker Images
-------------------

On Google Cloud, each job in a workflow (ParCa, sim 1, sim 2, etc.) is run
on its own temporary VM. To ensure reproducibility, workflows run on Google
Cloud must be run using Docker containers. vEcoli contains scripts in the
``runscripts/container`` folder to build the required Docker images from the
current state of your repository.

``build-runtime.sh`` builds a base Docker image containing the Python packages
necessary to run vEcoli as listed in ``requirements.txt``. After the build is
finished, the Docker image should be automatically uploaded to an Artifact Registry
repository called ``vecoli``.

``build-wcm.sh`` builds on the base image created by ``build-runtime.sh`` by copying
the files in the cloned vEcoli repository including any uncommitted changes. Note
that files matching any entry in ``.gitignore`` are not copied. The built image is
also uploaded to the ``vecoli`` Artifact Registry repository.

.. tip::
If you want to build these Docker images for local testing, you can run
these scripts locally as long as you have Docker installed.

These scripts are mostly not meant to be run manually. Instead, users should let
:py:mod:`runscripts.workflow` handle this automatically by setting the following
keys in your configuration JSON::

    {
        "gcloud": {
            "runtime_image_name": "Name of the image build-runtime.sh built/will build",
            "build_runtime_image": "Boolean; can be false if requirements.txt has not changed since the last time this was true",
            "wcm_image_name": "Name of the image build-wcm.sh built/will build",
            "build_wcm_image": "Boolean; can be false if nothing in the repository has changed since the last time this was true"
        }
    }
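
For concreteness, a filled-in sketch might look like this (the image names are
arbitrary placeholders, not required values)::

    {
        "gcloud": {
            "runtime_image_name": "vecoli-runtime",
            "build_runtime_image": true,
            "wcm_image_name": "vecoli-wcm",
            "build_wcm_image": true
        }
    }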

These configuration keys, in addition to the ``out_uri`` key under ``emitter_arg``,
are necessary and sufficient to tell :py:mod:`runscripts.workflow` that you intend to
run the workflow on Google Cloud. After setting these options in your configuration JSON,
you can use ``screen`` to open a virtual console that will persist even after your SSH
connection is closed. In that virtual console, invoke :py:mod:`runscripts.workflow`
as normal to start your workflow::

    python runscripts/workflow.py --config {}

Once your workflow has started, you can press ``Ctrl+a d`` to detach from the
virtual console and then close your SSH connection to the VM. The VM must continue
to run until the workflow is complete. You can SSH back into the VM and reattach to
the virtual console with ``screen -r`` to monitor progress, or inspect the file
``.nextflow.log`` in the root of the cloned repository.
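
Putting those steps together, a typical session might look like the following
(the config path here is a placeholder)::

    screen -S vecoli                   # start a persistent virtual console
    python runscripts/workflow.py --config configs/my_gcloud_config.json
    # Press Ctrl+a then d to detach; the workflow keeps running on the VM.
    # Later, SSH back into the VM and reattach:
    screen -r vecoli
    tail -f .nextflow.log              # or follow the Nextflow log directly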

----------------
Handling Outputs
----------------

Once a workflow is complete, all of the outputs should be contained within the Cloud
Storage bucket at the URI in the ``out_uri`` key under ``emitter_arg`` in the
configuration JSON. We strongly discourage users from trying to download this data,
as that will incur significant egress charges. Instead, you should use your VM to run
analyses, avoiding these charges as long as your VM and bucket are in the same region.

Data stored in Cloud Storage is billed for the amount of data and how long it is stored
(prorated). Storing terabytes of simulation data on Cloud Storage can cost upwards of
$1,000/year, dwarfing the cost of the compute needed to generate that data. For that
reason, we recommend that you delete workflow output data from your bucket as soon as
you are done with your analyses. If necessary, it will likely be cheaper to re-run the
workflow to regenerate that data later than to keep it around.
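
As an example, a finished workflow's outputs can be removed recursively with the
``gcloud storage`` CLI; the bucket and path below are placeholders, so double-check
them before deleting::

    gcloud storage rm --recursive gs://my-vecoli-outputs/PATH_TO_EXPERIMENT_OUTPUT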

---------------
Troubleshooting
---------------

Cloud Storage Permission Issue
------------------------------

If you are trying to launch a cloud workflow or access cloud
output (e.g. run an analysis script) from a local computer, you
may encounter an error like the following::

    HttpError: Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket. Permission 'storage.objects.list' denied on resource (or it may not exist)., 401

We do not recommend using local computers to launch
cloud workflows because that would require the computer to be on and connected
to the internet for the entire duration of the workflow. We STRONGLY discourage
using a local computer to run analyses on workflow output saved in
Cloud Storage as that will incur hefty data egress charges.

Instead, users should stick to launching workflows and running analysis scripts
on Compute Engine VMs. Small VMs are fairly cheap to keep running for the duration
of a workflow, and larger VMs can be created to leverage DuckDB's multithreading
for fast reading of workflow outputs stored in Cloud Storage. Assuming the VMs are
in the same region as the Cloud Storage bucket being accessed, no egress charges
will be applied, resulting in much lower costs.
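
For instance, a larger analysis VM can be created the same way as the workflow VM,
just with a bigger machine type; the instance name and machine type below are
examples only::

    gcloud compute instances create analysis-vm \
        --shielded-secure-boot \
        --machine-type=n2-highmem-16 \
        --scopes=cloud-platform \
        --service-account=SERVICE_ACCT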

If you absolutely must interact with cloud resources from a local machine, the above
error may be resolved by running the following command to generate credentials that
will be automatically picked up by PyArrow::

    gcloud auth application-default login

1 change: 1 addition & 0 deletions doc/index.rst
@@ -43,4 +43,5 @@ for developing and running the model. We recommend new users read through the se
output
tutorial
docs
gcloud
API Reference <reference/api_ref.rst>
