Vertex Pipelines Deployer

Deploy Vertex Pipelines within minutes

This tool is a wrapper around kfp and google-cloud-aiplatform that allows you to check, compile, upload, run, and schedule Vertex Pipelines in a standardized manner.

📚 Table of Contents
  1. Why this tool?
  2. Prerequisites
  3. Installation
    1. From PyPI
    2. From git repo
  4. Usage
    1. Setup
    2. Folder Structure
    3. CLI: Deploying a Pipeline with `deploy`
    4. CLI: Checking Pipelines are valid with `check`
    5. CLI: Other commands
      1. `config`
      2. `create`
      3. `init`
      4. `list`
  5. CLI: Options
  6. Configuration

Full CLI documentation

❓ Why this tool?

Three use cases:

  1. CI: Check pipeline validity.
  2. Dev mode: Quickly iterate over your pipelines by compiling and running them in multiple environments (test, dev, staging, etc.) without duplicating code or searching for the right kfp/aiplatform snippet.
  3. CD: Deploy your pipelines to Vertex Pipelines in a standardized manner in your CD with Cloud Build or GitHub Actions.

Two main commands:

  • check: Check your pipelines (imports, compile, check configs validity against pipeline definition).
  • deploy: Compile, upload to Artifact Registry, run, and schedule your pipelines.

📋 Prerequisites

  • Unix-like environment (Linux, macOS, WSL, etc.)
  • Python 3.8 to 3.10
  • Google Cloud SDK
  • A GCP project with Vertex Pipelines enabled

📦 Installation

From PyPI

pip install vertex-deployer

From git repo

Stable version:

pip install git+https://github.com/artefactory/vertex-pipelines-deployer.git@main

Develop version:

pip install git+https://github.com/artefactory/vertex-pipelines-deployer.git@develop

If you want to test this package on examples from this repo:

git clone git@github.com:artefactory/vertex-pipelines-deployer.git
cd vertex-pipelines-deployer
poetry install
poetry shell  # if you want to activate the virtual environment
cd example

🚀 Usage

🛠️ Setup

  1. Set up your GCP environment:
export PROJECT_ID=<gcp_project_id>
gcloud config set project $PROJECT_ID
gcloud auth login
gcloud auth application-default login
  2. Enable the following APIs:
  • Cloud Build API
  • Artifact Registry API
  • Cloud Storage API
  • Vertex AI API
gcloud services enable \
    cloudbuild.googleapis.com \
    artifactregistry.googleapis.com \
    storage.googleapis.com \
    aiplatform.googleapis.com
  3. Create an Artifact Registry repository for your base images (Docker format):
export GAR_DOCKER_REPO_ID=<your_gar_repo_id_for_images>
export GAR_LOCATION=<your_gar_location>
gcloud artifacts repositories create ${GAR_DOCKER_REPO_ID} \
    --location=${GAR_LOCATION} \
    --repository-format=docker
  4. Build and upload your base images to the repository. To do so, please follow Google Cloud Build documentation.

  5. Create an Artifact Registry repository for your pipelines (KFP format):

export GAR_PIPELINES_REPO_ID=<your_gar_repo_id_for_pipelines>
gcloud artifacts repositories create ${GAR_PIPELINES_REPO_ID} \
    --location=${GAR_LOCATION} \
    --repository-format=kfp
  6. Create a GCS bucket for Vertex Pipelines staging:
export GCP_REGION=<your_gcp_region>
export VERTEX_STAGING_BUCKET_NAME=<your_bucket_name>
gcloud storage buckets create gs://${VERTEX_STAGING_BUCKET_NAME} --location=${GCP_REGION}
  7. Create a service account for Vertex Pipelines:
export VERTEX_SERVICE_ACCOUNT_NAME=foobar
export VERTEX_SERVICE_ACCOUNT="${VERTEX_SERVICE_ACCOUNT_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

gcloud iam service-accounts create ${VERTEX_SERVICE_ACCOUNT_NAME}

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member="serviceAccount:${VERTEX_SERVICE_ACCOUNT}" \
    --role="roles/aiplatform.user"

gcloud storage buckets add-iam-policy-binding gs://${VERTEX_STAGING_BUCKET_NAME} \
    --member="serviceAccount:${VERTEX_SERVICE_ACCOUNT}" \
    --role="roles/storage.objectUser"

gcloud artifacts repositories add-iam-policy-binding ${GAR_PIPELINES_REPO_ID} \
   --location=${GAR_LOCATION} \
   --member="serviceAccount:${VERTEX_SERVICE_ACCOUNT}" \
   --role="roles/artifactregistry.admin"

You can use the deployer CLI (see example below) or import VertexPipelineDeployer in your code (try it yourself).

📁 Folder Structure

Your project must follow the folder structure below. If you already use the Vertex Pipelines Starter Kit folder structure, this tool should work with little or no change:

vertex
├─ configs/
│  └─ {pipeline_name}
│     └─ {config_name}.json
└─ pipelines/
   └─ {pipeline_name}.py

!!! tip "About folder structure"
    You must have at least these files. If you need to share some config elements between pipelines, you can have a shared folder in configs and import them in your pipeline configs.

If you're following a different folder structure, you can change the default paths in the `pyproject.toml` file.
See [Configuration](#configuration) section for more information.

Pipelines

Your file {pipeline_name}.py must contain a function called {pipeline_name}, decorated with kfp.dsl.pipeline. In previous versions, this function was called pipeline; it was renamed to {pipeline_name} to avoid confusion with the kfp.dsl.pipeline decorator.

# vertex/pipelines/dummy_pipeline.py
import kfp.dsl

# New name to avoid confusion with the kfp.dsl.pipeline decorator
@kfp.dsl.pipeline()
def dummy_pipeline():
    ...

# Old name
@kfp.dsl.pipeline()
def pipeline():
    ...

Configs

Config files can be in .py, .json, .toml, or .yaml format. They must be located in the configs/{pipeline_name} folder.

Why multiple formats?

.py files are useful for complex configs (e.g. a list of dicts), while .json / .toml / .yaml files are useful for simple configs (e.g. a string). Supporting several formats also lets you adopt the deployer with almost no migration cost.

How to format them?

  • .py files must be valid python files with two important elements:

    • parameter_values to pass arguments to your pipeline
    • input_artifacts if you want to retrieve and create input artifacts to your pipeline. See Vertex Documentation for more information.
  • .json files must be valid json files containing only one dict of key: value representing parameter values.

  • .toml files must follow the same rule: one dict of key: value pairs representing parameter values. Please note that TOML sections will be flattened, except for inline tables. Section names are joined to key names using the "_" separator, which is not configurable at the moment. Example:

    === "TOML file"
        [modeling]
        model_name = "my-model"
        params = { lambda = 0.1 }

    === "Resulting parameter values"
        {
            "modeling_model_name": "my-model",
            "modeling_params": { "lambda": 0.1 }
        }

  • .yaml files must be valid yaml files containing only one dict of key: value representing parameter values.
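As a minimal sketch of the .py format described above, a config file could look like the following. The parameter names and values are made up for illustration, and input_artifacts is left empty because the format of its values is not described here (see the Vertex documentation).

```python
# vertex/configs/dummy_pipeline/config.py
# Hypothetical example: parameter names and values are illustrative only.

# Arguments passed to the pipeline function, keyed by parameter name.
parameter_values = {
    "model_name": "my-model",
    "learning_rate": 0.1,
    # Nested or list values are the main reason to prefer a .py config.
    "feature_columns": ["age", "country"],
}

# Input artifacts to retrieve or create for the pipeline (none in this sketch).
input_artifacts = {}
```

The keys of parameter_values are expected to match the argument names of the pipeline function.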

??? question "Why are sections flattened when using TOML config files?"
    Vertex Pipelines parameter validation and parameter logging to Vertex Experiments are based on the parameter name. If you do not flatten your sections, you can only validate the section names and check that their values are of type dict, which is not very useful.
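The flattening rule can be pictured with a short sketch. This is not the deployer's implementation: flatten_sections is a hypothetical helper that works on an already-parsed dict and handles a single level of sections only. Because a plain dict cannot tell a section from an inline table, any dict found inside a section is kept as a value, and a top-level inline table would be treated like a section.

```python
from typing import Any, Dict


def flatten_sections(parsed_toml: Dict[str, Any], separator: str = "_") -> Dict[str, Any]:
    """Flatten top-level TOML sections into '<section><separator><key>' names.

    Simplification: only one level of sections is handled, so any dict found
    inside a section (such as an inline table) is kept as a dict value.
    """
    flat: Dict[str, Any] = {}
    for key, value in parsed_toml.items():
        if isinstance(value, dict):
            # A section: prefix each of its keys with the section name.
            for sub_key, sub_value in value.items():
                flat[f"{key}{separator}{sub_key}"] = sub_value
        else:
            # A top-level key outside any section is kept unchanged.
            flat[key] = value
    return flat


# The dict a TOML parser would return for the TOML example above.
parsed = {"modeling": {"model_name": "my-model", "params": {"lambda": 0.1}}}
print(flatten_sections(parsed))
# → {'modeling_model_name': 'my-model', 'modeling_params': {'lambda': 0.1}}
```

Each flattened name can then be matched one-to-one against a pipeline parameter, which is what makes per-parameter validation possible.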

??? question "Why aren't input_artifacts supported in TOML / JSON config files?"
    Because it's low on the priority list. Feel free to open a PR if you want to add it.

How to name them?

{config_name}.py, {config_name}.json, {config_name}.toml, or {config_name}.yaml. config_name is free but must be unique for a given pipeline.

Settings

You will also need the following ENV variables, either exported or in a .env file (see example in example.env):

PROJECT_ID=YOUR_PROJECT_ID  # GCP Project ID
GCP_REGION=europe-west1  # GCP Region

GAR_LOCATION=europe-west1  # Google Artifact Registry Location
GAR_PIPELINES_REPO_ID=YOUR_GAR_KFP_REPO_ID  # Google Artifact Registry Repo ID (KFP format)

VERTEX_STAGING_BUCKET_NAME=YOUR_VERTEX_STAGING_BUCKET_NAME  # GCS Bucket for Vertex Pipelines staging
VERTEX_SERVICE_ACCOUNT=YOUR_VERTEX_SERVICE_ACCOUNT  # Vertex Pipelines Service Account

!!! note "About env files"
    We're using env files and dotenv to load the environment variables. No default value is provided for the --env-file argument, to ensure that you don't accidentally deploy to the wrong project. An example.env file is provided in this repo. Env files also let you work with multiple environments (test.env, dev.env, prod.env, etc.).
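Before deploying, it can help to confirm that an env file defines every variable listed above. The sketch below is a standard-library stand-in, not the dotenv-based loading the deployer uses: read_env_file and missing_settings are hypothetical helpers, the parser only understands plain KEY=VALUE lines with optional trailing comments (no quoting, so a "#" inside a value is not supported), and letting file values take precedence over exported ones is an arbitrary choice.

```python
import os
from pathlib import Path
from typing import Dict, List

# The variables listed in the Settings section above.
REQUIRED_VARS = (
    "PROJECT_ID",
    "GCP_REGION",
    "GAR_LOCATION",
    "GAR_PIPELINES_REPO_ID",
    "VERTEX_STAGING_BUCKET_NAME",
    "VERTEX_SERVICE_ACCOUNT",
)


def read_env_file(path: Path) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blank lines and '#' comments."""
    values: Dict[str, str] = {}
    for raw_line in path.read_text().splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


def missing_settings(env_file: Path) -> List[str]:
    """Return required variables set neither in the env file nor in os.environ."""
    settings = {**os.environ, **read_env_file(env_file)}
    return [name for name in REQUIRED_VARS if not settings.get(name)]
```

Calling missing_settings(Path("example.env")) returns an empty list when all six variables are available.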

🚀 CLI: Deploying a Pipeline with deploy

Let's say you defined a pipeline in dummy_pipeline.py and a config file named config_test.json. You can deploy your pipeline using the following command:

vertex-deployer deploy dummy_pipeline \
    --compile \
    --upload \
    --run \
    --env-file example.env \
    --tags my-tag \
    --config-filepath vertex/configs/dummy_pipeline/config_test.json \
    --experiment-name my-experiment \
    --enable-caching \
    --skip-validation
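To run the same deployment against several environments, the command can be assembled programmatically. This is only a sketch: build_deploy_command is a hypothetical helper, it uses a subset of the flags shown above, and it assumes one env file per environment (test.env, dev.env) and config files named config_{environment}.json, neither of which is required by the tool.

```python
import shlex
from typing import List


def build_deploy_command(pipeline_name: str, environment: str) -> List[str]:
    """Build the `vertex-deployer deploy` argument list for one environment."""
    return [
        "vertex-deployer", "deploy", pipeline_name,
        "--compile", "--upload", "--run",
        "--env-file", f"{environment}.env",  # assumed naming: one env file per environment
        "--tags", environment,
        # Assumed naming convention for config files.
        "--config-filepath", f"vertex/configs/{pipeline_name}/config_{environment}.json",
    ]


for environment in ("test", "dev"):
    command = build_deploy_command("dummy_pipeline", environment)
    # Printed here; pass `command` to subprocess.run(command, check=True) to execute it.
    print(shlex.join(command))
```

Building the command as a list rather than a single string avoids shell quoting issues when it is passed to subprocess.run.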

✅ CLI: Checking Pipelines are valid with check

To check that your pipelines are valid, you can use the check command. It will:

  • check that your pipeline imports and definition are valid
  • check that your pipeline can be compiled
  • check that all configs related to the pipeline respect the pipeline definition (using a Pydantic model based on the pipeline signature)

To validate one or multiple pipeline(s):

vertex-deployer check dummy_pipeline <other pipeline name>

To validate all pipelines in the vertex/pipelines folder:

vertex-deployer check --all
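The config check compares each config against the pipeline signature. The sketch below illustrates the idea with the standard library only; it is a simplified stand-in for the Pydantic-based validation, not the deployer's code. check_config is a hypothetical helper, the pipeline function is shown without the kfp.dsl.pipeline decorator to keep the example dependency-free, and only plain class annotations (str, int, float, ...) are type-checked, without the type coercion Pydantic would apply.

```python
import inspect
from typing import Any, Callable, Dict, List


def check_config(pipeline_func: Callable, parameter_values: Dict[str, Any]) -> List[str]:
    """Return the problems found when matching a config to a function signature."""
    errors: List[str] = []
    parameters = inspect.signature(pipeline_func).parameters

    # Config keys that the pipeline does not accept.
    for name in parameter_values:
        if name not in parameters:
            errors.append(f"unexpected parameter: {name}")

    for name, parameter in parameters.items():
        if name not in parameter_values:
            # Only parameters without a default are mandatory.
            if parameter.default is inspect.Parameter.empty:
                errors.append(f"missing required parameter: {name}")
            continue
        expected = parameter.annotation
        is_plain_class = isinstance(expected, type) and expected is not inspect.Parameter.empty
        if is_plain_class and not isinstance(parameter_values[name], expected):
            errors.append(f"wrong type for {name}: expected {expected.__name__}")
    return errors


# Hypothetical pipeline signature.
def dummy_pipeline(model_name: str, lambda_: float = 0.1):
    ...


print(check_config(dummy_pipeline, {"model_name": "my-model"}))  # → []
print(check_config(dummy_pipeline, {"lambda_": "high"}))
```

The second call reports two problems: model_name is missing and lambda_ has the wrong type.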

🛠️ CLI: Other commands

config

You can check your vertex-deployer configuration options using the config command. Fields set in pyproject.toml will overwrite default values and will be displayed differently:

vertex-deployer config --all

create

You can create all files needed for a pipeline using the create command:

vertex-deployer create my_new_pipeline --config-type py

This will create a my_new_pipeline.py file in the vertex/pipelines folder and a vertex/configs/my_new_pipeline/ folder with multiple config files in it.

init

To initialize the deployer with default settings and folder structure, use the init command:

vertex-deployer init
$ vertex-deployer init
Welcome to Vertex Deployer!
This command will help you getting fired up.
Do you want to configure the deployer? [y/n]: n
Do you want to build default folder structure [y/n]: n
Do you want to create a pipeline? [y/n]: n
All done

list

You can list all pipelines in the vertex/pipelines folder using the list command:

vertex-deployer list --with-configs
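The folder convention makes this kind of listing easy to reproduce. The sketch below is not the list command's implementation: list_pipelines is a hypothetical helper that assumes the default layout (a pipelines/ folder of .py files and a configs/{pipeline_name}/ folder per pipeline) and the four config extensions mentioned earlier.

```python
from pathlib import Path
from typing import Dict, List

CONFIG_SUFFIXES = {".py", ".json", ".toml", ".yaml"}


def list_pipelines(vertex_folder: Path) -> Dict[str, List[str]]:
    """Map each pipeline name to the sorted config file names found for it."""
    pipelines: Dict[str, List[str]] = {}
    for pipeline_file in sorted((vertex_folder / "pipelines").glob("*.py")):
        if pipeline_file.stem == "__init__":
            continue  # package marker, not a pipeline
        config_dir = vertex_folder / "configs" / pipeline_file.stem
        if config_dir.is_dir():
            configs = sorted(
                p.name for p in config_dir.iterdir() if p.suffix in CONFIG_SUFFIXES
            )
        else:
            configs = []  # pipeline without any config yet
        pipelines[pipeline_file.stem] = configs
    return pipelines
```

For example, list_pipelines(Path("vertex")) returns one entry per file in vertex/pipelines, with an empty list for pipelines that have no config folder.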

🍭 CLI: Options

vertex-deployer --help

To see package version:

vertex-deployer --version

To adapt log level, use the --log-level option. Default is INFO.

vertex-deployer --log-level DEBUG deploy ...

Configuration

You can configure the deployer using the pyproject.toml file to better fit your needs. Values set there override the defaults, which is useful if you always pass the same options, e.g. the same --scheduler-timezone:

[tool.vertex_deployer]
vertex_folder_path = "my/path/to/vertex"
log_level = "INFO"

[tool.vertex_deployer.deploy]
scheduler_timezone = "Europe/Paris"

You can display all the configurable parameters with their default values by running:

$ vertex-deployer config --all
'*' means the value was set in config file

* vertex_folder_path=my/path/to/vertex
* log_level=INFO
deploy
  env_file=None
  compile=True
  upload=False
  run=False
  schedule=False
  cron=None
  delete_last_schedule=False
  * scheduler_timezone=Europe/Paris
  tags=['latest']
  config_filepath=None
  config_name=None
  enable_caching=False
  experiment_name=None
check
  all=False
  config_filepath=None
  raise_error=False
list
  with_configs=True
create
  config_type=json
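The way overrides and the '*' marker interact can be sketched in a few lines. This is an illustration only, not the deployer's code: render_config is a hypothetical helper, the defaults dict is a small subset of the fields listed above, and the default values for vertex_folder_path and scheduler_timezone are placeholders, since their real defaults are not stated here.

```python
from typing import Any, Dict, List


def render_config(defaults: Dict[str, Any], overrides: Dict[str, Any], indent: int = 0) -> List[str]:
    """Merge overrides into defaults and prefix overridden fields with '* '."""
    lines: List[str] = []
    pad = " " * indent
    for key, default in defaults.items():
        if isinstance(default, dict):
            # A command section such as "deploy": print its name, then its fields.
            lines.append(f"{pad}{key}")
            lines.extend(render_config(default, overrides.get(key, {}), indent + 2))
        elif key in overrides:
            lines.append(f"{pad}* {key}={overrides[key]}")
        else:
            lines.append(f"{pad}{key}={default}")
    return lines


defaults = {
    "vertex_folder_path": "vertex",  # placeholder default (assumption)
    "log_level": "INFO",
    "deploy": {
        "compile": True,
        "upload": False,
        "scheduler_timezone": "UTC",  # placeholder default (assumption)
    },
}

# Values from the pyproject.toml example above.
overrides = {
    "vertex_folder_path": "my/path/to/vertex",
    "log_level": "INFO",
    "deploy": {"scheduler_timezone": "Europe/Paris"},
}

print("\n".join(render_config(defaults, overrides)))
```

A field is marked with '*' whenever it appears in the config file, even if its value equals the default, as log_level does in the output above.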

Repository Structure

├─ .github
│  ├─ ISSUE_TEMPLATE/
│  ├─ workflows
│  │  ├─ ci.yaml
│  │  ├─ pr_agent.yaml
│  │  └─ release.yaml
│  ├─ CODEOWNERS
│  └─ PULL_REQUEST_TEMPLATE.md
├─ deployer                                     # Source code
│  ├─ __init__.py
│  ├─ cli.py
│  ├─ constants.py
│  ├─ pipeline_checks.py
│  ├─ pipeline_deployer.py
│  ├─ settings.py
│  └─ utils
│     ├─ config.py
│     ├─ console.py
│     ├─ exceptions.py
│     ├─ logging.py
│     ├─ models.py
│     └─ utils.py
├─ docs/                                        # Documentation folder (mkdocs)
├─ templates/                                   # Semantic Release templates
├─ tests/
├─ example                                      # Example folder with dummy pipeline and config
│   ├─ example.env
│   └─ vertex
│      ├─ components
│      │  └─ dummy.py
│      ├─ configs
│      │  ├─ broken_pipeline
│      │  │  └─ config_test.json
│      │  └─ dummy_pipeline
│      │     ├─ config_test.json
│      │     ├─ config.py
│      │     └─ config.toml
│      ├─ deployment
│      ├─ lib
│      └─ pipelines
│         ├─ broken_pipeline.py
│         └─ dummy_pipeline.py
├─ .gitignore
├─ .pre-commit-config.yaml
├─ catalog-info.yaml                            # Roadie integration configuration
├─ CHANGELOG.md
├─ CONTRIBUTING.md
├─ LICENSE
├─ Makefile
├─ mkdocs.yml                                   # Mkdocs configuration
├─ pyproject.toml
└─ README.md