From f4503f925b79a795c4344a214c792960f407e257 Mon Sep 17 00:00:00 2001
From: Jayesh Sharma
Date: Fri, 18 Oct 2024 12:33:33 +0530
Subject: [PATCH] [docs] Dedicated docs on how to skip building an image on pipeline run (#3079)

* add some info on docker skip build
* add docs on not building a docker image
* update toc and title
* added text to stress that this doesnt always happen
* Apply suggestions from code review
Co-authored-by: Hamza Tahir
* restructure headings
* more english
* Apply suggestions from code review
Co-authored-by: Alex Strick van Linschoten
* apply review changes
* add how to reuse builds page
* aoply hamza comments
* add redirect for new page name
* apply review changes
* move the artifact store block to the top
* update redirect
* add scarf
* Update .gitbook.yaml
* link to code repository
* fix relative link
* Apply suggestions from code review
Co-authored-by: Alex Strick van Linschoten
* add where the code should be added

---------

Co-authored-by: Hamza Tahir
Co-authored-by: Alex Strick van Linschoten
---
 .gitbook.yaml | 1 +
 .../docker-settings-on-a-pipeline.md | 2 +-
 ...-build-times.md => how-to-reuse-builds.md} | 47 ++++++-
 .../use-a-prebuilt-image.md | 123 ++++++++++++++++++
 docs/book/toc.md | 3 +-
 5 files changed, 168 insertions(+), 8 deletions(-)
 rename docs/book/how-to/customize-docker-builds/{use-code-repositories-to-speed-up-docker-build-times.md => how-to-reuse-builds.md} (53%)
 create mode 100644 docs/book/how-to/customize-docker-builds/use-a-prebuilt-image.md

diff --git a/.gitbook.yaml b/.gitbook.yaml
index ba16bb650c6..ee2e29729f9 100644
--- a/.gitbook.yaml
+++ b/.gitbook.yaml
@@ -5,6 +5,7 @@ structure:
   summary: toc.md
 
 redirects:
+  how-to/customize-docker-builds/use-code-repositories-to-speed-up-docker-build-times: how-to/customize-docker-builds/how-to-reuse-builds.md
   reference/migration-guide/README.md: how-to/manage-the-zenml-server/migration-guide/migration-guide.md
   reference/migration-guide/migration-zero-twenty.md: how-to/manage-the-zenml-server/migration-guide/migration-zero-twenty.md
   reference/migration-guide/migration-zero-thirty.md: how-to/manage-the-zenml-server/migration-guide/migration-zero-thirty.md
diff --git a/docs/book/how-to/customize-docker-builds/docker-settings-on-a-pipeline.md b/docs/book/how-to/customize-docker-builds/docker-settings-on-a-pipeline.md
index 4bc3b1c9849..17bbcd0bea5 100644
--- a/docs/book/how-to/customize-docker-builds/docker-settings-on-a-pipeline.md
+++ b/docs/book/how-to/customize-docker-builds/docker-settings-on-a-pipeline.md
@@ -133,7 +133,7 @@ def my_pipeline(...):
 ```
 
 {% hint style="warning" %}
-This is an advanced feature and may cause unintended behavior when running your pipelines. If you use this, ensure your code files are correctly included in the image you specified.
+This is an advanced feature and may cause unintended behavior when running your pipelines. If you use this, ensure your code files are correctly included in the image you specified. Read in detail about this feature [here](./use-a-prebuilt-image.md) before proceeding.
 {% endhint %}
diff --git a/docs/book/how-to/customize-docker-builds/use-code-repositories-to-speed-up-docker-build-times.md b/docs/book/how-to/customize-docker-builds/how-to-reuse-builds.md
similarity index 53%
rename from docs/book/how-to/customize-docker-builds/use-code-repositories-to-speed-up-docker-build-times.md
rename to docs/book/how-to/customize-docker-builds/how-to-reuse-builds.md
index f1c53381dc7..bffce76a90e 100644
--- a/docs/book/how-to/customize-docker-builds/use-code-repositories-to-speed-up-docker-build-times.md
+++ b/docs/book/how-to/customize-docker-builds/how-to-reuse-builds.md
@@ -1,8 +1,43 @@
-# Use code repositories to speed up Docker build times
+---
+description: >
+  Learn how to reuse builds to speed up your pipeline runs.
+---
 
-While reusing Docker builds is useful, it can be limited. This is because specifying a custom build when running a pipeline will **not run the code on your client machine** but will use the code **included in the Docker images of the build**. As a consequence, even if you make local code changes, reusing a build will _always_ execute the code bundled in the Docker image, rather than the local code. Therefore, if you would like to reuse a Docker build AND make sure your local code changes are also downloaded into the image, you need to disconnect your code from the build.
+# How to reuse builds
 
-You can do so by connecting a git repository. Registering a code repository lets you avoid building images each time you run a pipeline **and** quickly iterate on your code. When running a pipeline that is part of a local code repository checkout, ZenML can instead build the Docker images without including any of your source files, and download the files inside the container before running your code. This greatly speeds up the building process and also allows you to reuse images that one of your colleagues might have built for the same stack.
+When you run a pipeline, ZenML will check if a build with the same pipeline and stack exists. If it does, it will reuse that build. If it doesn't, ZenML will create a new build. This guide explains what a build is and the best practices around reusing builds.
+
+## What is a build?
+
+A pipeline build is an encapsulation of a pipeline and the stack it was run on. It contains the Docker images that were built for the pipeline with all the requirements from the stack, integrations and the user. Optionally, it also contains the pipeline code.
+
+You can list all the builds for a pipeline using the CLI:
+
+```bash
+zenml pipeline builds list --pipeline_id='startswith:ab53ca'
+```
+
+You can also create a build manually using the CLI:
+
+```bash
+zenml pipeline build --stack vertex-stack my_module.my_pipeline_instance
+```
+
+You can use the options to specify the configuration file and the stack to use for the build. The source should be a path to a pipeline instance. Learn more about the build function [here](https://sdkdocs.zenml.io/latest/core_code_docs/core-new/#zenml.new.pipelines.pipeline.Pipeline.build).
+
+## Reusing builds
+
+As mentioned above, ZenML automatically finds and reuses an existing build that matches your pipeline and stack. However, you can also force it to use a specific build by [passing the build ID](../../how-to/use-configuration-files/what-can-be-configured.md#build-id) to the `build` parameter of the pipeline configuration.
+
+While reusing Docker builds is useful, it can be limited. This is because specifying a custom build when running a pipeline will **not run the code on your client machine** but will use the code **included in the Docker images of the build**. As a consequence, even if you make local code changes, reusing a build will _always_ execute the code bundled in the Docker image, rather than the local code. Therefore, if you would like to reuse a Docker build AND make sure your local code changes are also downloaded into the image, you need to disconnect your code from the build. You can do this either by registering a code repository or by letting ZenML use the artifact store to upload your code.
+
+## Use the artifact store to upload your code
+
+You can let ZenML upload your code to the artifact store. This is the default behaviour if no code repository is detected and the `allow_download_from_artifact_store` flag is not set to `False` in your `DockerSettings`.
+
+## Use code repositories to speed up Docker build times
+
+One way to speed up Docker builds is to connect a git repository. Registering a [code repository](../../user-guide/production-guide/connect-code-repository.md) lets you avoid building images each time you run a pipeline **and** quickly iterate on your code. When running a pipeline that is part of a local code repository checkout, ZenML can instead build the Docker images without including any of your source files, and download the files inside the container before running your code. This greatly speeds up the building process and also allows you to reuse images that one of your colleagues might have built for the same stack.
 
 ZenML will **automatically figure out which builds match your pipeline and reuse the appropriate build id**. Therefore, you **do not** need to explicitly pass in the build id when you have a clean repository state and a connected git repository. This approach is **highly recommended**. See an end to end example [here](../../user-guide/production-guide/connect-code-repository.md).
 
@@ -14,18 +49,18 @@ zenml integration install github
 ```
 {% endhint %}
 
-## Detecting local code repository checkouts
+### Detecting local code repository checkouts
 
 Once you have registered one or more code repositories, ZenML will check whether the files you use when running a pipeline are tracked inside one of those code repositories. This happens as follows:
 
 * First, the [source root](./which-files-are-built-into-the-image.md) is computed
 * Next, ZenML checks whether this source root directory is included in a local checkout of one of the registered code repositories
 
-## Tracking code version for pipeline runs
+### Tracking code versions for pipeline runs
 
 If a [local code repository checkout](#detecting-local-code-repository-checkouts) is detected when running a pipeline, ZenML will store a reference to the current commit for the pipeline run, so you'll be able to know exactly which code was used. Note that this reference is only tracked if your local checkout is clean (i.e. it does not contain any untracked or uncommitted files). This is to ensure that your pipeline is actually running with the exact code stored at the specific code repository commit.
 
-## Tips and best practices
+### Tips and best practices
 
 It is also important to take some additional points into consideration:
 
diff --git a/docs/book/how-to/customize-docker-builds/use-a-prebuilt-image.md b/docs/book/how-to/customize-docker-builds/use-a-prebuilt-image.md
new file mode 100644
index 00000000000..052c5dea2a6
--- /dev/null
+++ b/docs/book/how-to/customize-docker-builds/use-a-prebuilt-image.md
@@ -0,0 +1,123 @@
+---
+description: "Skip building an image for your ZenML pipeline altogether."
+---
+
+# Use a prebuilt image for pipeline execution
+
+When running a pipeline on a remote Stack, ZenML builds a Docker image with a base ZenML image and adds all of your project dependencies to it. Optionally, if a code repository is not registered and `allow_download_from_artifact_store` is not set to `True` in your `DockerSettings`, ZenML will also add your pipeline code to the image. This process might take significant time depending on how big your dependencies are, how powerful your local system is and how fast your internet connection is. This is because Docker must pull base layers and push the final image to your container registry. Although this process only happens once and is skipped if ZenML detects no change in your environment, it might still be a bottleneck slowing down your pipeline execution.
+
+To save time and costs, you can choose to not build a Docker image every time your pipeline runs. This guide shows you how to do it using a prebuilt image, what you should include in your image for the pipeline to run successfully and other tips.
+
+{% hint style="info" %}
+Note that using this feature means that you won't be able to leverage any updates you make to your code or dependencies, outside of what your image already contains.
+{% endhint %}
+
+## How to use this feature
+
+The [`DockerSettings`](./docker-settings-on-a-pipeline.md#specify-docker-settings-for-a-pipeline) class in ZenML allows you to set a parent image to be used in your pipeline runs and lets you skip building an image on top of it.
+
+To do this, just set the `parent_image` attribute of the `DockerSettings` class to the image you want to use and set `skip_build` to `True`.
+
+```python
+from zenml import pipeline
+from zenml.config import DockerSettings
+
+docker_settings = DockerSettings(
+    parent_image="my_registry.io/image_name:tag",
+    skip_build=True
+)
+
+
+@pipeline(settings={"docker": docker_settings})
+def my_pipeline(...):
+    ...
+```
+
+{% hint style="warning" %}
+You should make sure that this image is pushed to a registry from which the orchestrator/step operator/other components that require the image can pull, without any involvement by ZenML.
+{% endhint %}
+
+## What the parent image should contain
+
+When you run a pipeline with a prebuilt image and skip the build process, ZenML will not build any image on top of it. This means that the image you provide to the `parent_image` attribute of the `DockerSettings` class has to contain all the dependencies that are needed to run your pipeline. If you don't have a code repository registered and the `allow_download_from_artifact_store` flag is set to `False`, it also has to contain your code files.
+
+{% hint style="info" %}
+Note that this is different from the case where you [only specify a parent image](./docker-settings-on-a-pipeline.md#using-a-pre-built-parent-image) and don't want to `skip_build`. In that case, ZenML still builds the image but does it on top of your parent image and not the base ZenML image.
+{% endhint %}
+{% hint style="info" %}
+If you're using an image that was already built by ZenML in a previous pipeline run, you don't need to worry about what goes in it as long as it was built for the **same stack** as your current pipeline run. You can use it directly.
+{% endhint %}
+
+The following points are derived from how ZenML builds an image internally and will help you make your own images.
+
+### Your stack requirements
+
+A ZenML Stack can have different components and each comes with its own requirements. You need to ensure that your image contains them. The following is how you can get a list of stack requirements.
+
+```python
+from zenml.client import Client
+
+stack_name = "YOUR_STACK_NAME"  # placeholder: replace with the name of your stack
+# set your stack as active if it isn't already
+Client().set_active_stack(stack_name)
+
+# get the requirements for the active stack
+active_stack = Client().active_stack
+stack_requirements = active_stack.requirements()
+```
+
+### Integration requirements
+
+For all integrations that you use in your pipeline, you need to have their dependencies installed too. You can get a list of them in the following way:
+
+```python
+import itertools
+
+from zenml.enums import OperatingSystemType
+from zenml.integrations.registry import integration_registry
+from zenml.integrations.constants import HUGGINGFACE, PYTORCH
+
+# define a list of all required integrations
+required_integrations = [PYTORCH, HUGGINGFACE]
+
+# Generate requirements for all required integrations
+integration_requirements = set(
+    itertools.chain.from_iterable(
+        integration_registry.select_integration_requirements(
+            integration_name=integration,
+            target_os=OperatingSystemType.LINUX,
+        )
+        for integration in required_integrations
+    )
+)
+```
+
+### Any project-specific requirements
+
+For any other dependencies that your project relies on, you can then install all of these different requirements through a line in your `Dockerfile` that looks like the following. It assumes you have accumulated all the requirements in one file.
+
+```Dockerfile
+RUN pip install -r FILE
+```
+
+### Any system packages
+
+If you have any `apt` packages that are needed for your application to function, be sure to include them too. This can be achieved in a `Dockerfile` as follows:
+
+```Dockerfile
+RUN apt-get update && apt-get install -y --no-install-recommends YOUR_APT_PACKAGES
+```
+
+### Your project code files
+
+The files containing your pipeline and step code and all other necessary functions should be available in your execution environment.
+
+- If you have a [code repository](../../user-guide/production-guide/connect-code-repository.md) registered, you don't need to include your code files in the image yourself. ZenML will download them from the repository to the appropriate location in the image.
+
+- If you don't have a code repository but `allow_download_from_artifact_store` is set to `True` in your `DockerSettings` (`True` by default), ZenML will upload your code to the artifact store and make it available to the image.
+
+- If both of these options are disabled, you can include your code files in the image yourself. This approach is not recommended and you should use one of the above options.
+
+Take a look at the [which files are built into the image](./which-files-are-built-into-the-image.md) page to learn more about what to include. Make sure that your code is in the `/app` directory and that this is set as the active working directory.
+
+
+{% hint style="info" %}
+Note that you also need Python, `pip` and `zenml` installed in your image.
+{% endhint %}
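The requirement lists gathered in the sections above can be merged into the single file that the `pip install -r` line expects. The following is a minimal sketch, assuming the stack, integration and project requirements are already available as plain collections of strings. The helper name `write_requirements_file` and the sample requirement strings are hypothetical and not part of ZenML.

```python
from pathlib import Path
from typing import Iterable, List


def write_requirements_file(path: str, *groups: Iterable[str]) -> List[str]:
    """Merge requirement collections, drop exact duplicates and write one file."""
    merged = sorted(
        {req.strip() for group in groups for req in group if req.strip()}
    )
    Path(path).write_text("\n".join(merged) + "\n")
    return merged


# Hypothetical stand-ins for `stack_requirements`, `integration_requirements`
# and your own project requirements.
stack_requirements = {"kubernetes>=21.7", "gcsfs"}
integration_requirements = {"torch", "transformers", "gcsfs"}
project_requirements = ["pandas"]

requirements = write_requirements_file(
    "requirements.txt",
    stack_requirements,
    integration_requirements,
    project_requirements,
)
```

This sketch only removes exact duplicate strings. It does not resolve conflicting version pins, so two different pins for the same package would both end up in the file and should be reconciled by hand.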
diff --git a/docs/book/toc.md b/docs/book/toc.md
index ea07c5a3e59..cfd639443f4 100644
--- a/docs/book/toc.md
+++ b/docs/book/toc.md
@@ -99,10 +99,11 @@
 * [🐳 Customize Docker builds](how-to/customize-docker-builds/README.md)
   * [Docker settings on a pipeline](how-to/customize-docker-builds/docker-settings-on-a-pipeline.md)
   * [Docker settings on a step](how-to/customize-docker-builds/docker-settings-on-a-step.md)
+  * [Use a prebuilt image for pipeline execution](how-to/customize-docker-builds/use-a-prebuilt-image.md)
   * [Specify pip dependencies and apt packages](how-to/customize-docker-builds/specify-pip-dependencies-and-apt-packages.md)
   * [Use your own Dockerfiles](how-to/customize-docker-builds/use-your-own-docker-files.md)
   * [Which files are built into the image](how-to/customize-docker-builds/which-files-are-built-into-the-image.md)
-  * [Use code repositories to automate Docker build reuse](how-to/customize-docker-builds/use-code-repositories-to-speed-up-docker-build-times.md)
+  * [How to reuse builds](how-to/customize-docker-builds/how-to-reuse-builds.md)
   * [Define where an image is built](how-to/customize-docker-builds/define-where-an-image-is-built.md)
 * [📔 Run remote pipelines from notebooks](how-to/run-remote-steps-and-pipelines-from-notebooks/README.md)
   * [Limitations of defining steps in notebook cells](how-to/run-remote-steps-and-pipelines-from-notebooks/limitations-of-defining-steps-in-notebook-cells.md)