[docs] Dedicated docs on how to skip building an image on pipeline run #3079

Merged
merged 26 commits into from
Oct 18, 2024
Changes from 16 commits
Commits
ae03fd1
add some info on docker skip build
wjayesh Oct 8, 2024
73dd4e0
add docs on not building a docker image
wjayesh Oct 14, 2024
7687aae
update toc and title
wjayesh Oct 14, 2024
ddf105c
added text to stress that this doesnt always happen
wjayesh Oct 14, 2024
057aa73
Apply suggestions from code review
wjayesh Oct 14, 2024
1df5ba3
restructure headings
wjayesh Oct 14, 2024
b69e138
Merge branch 'docs/docker-skip-build' of https://github.com/zenml-io/…
wjayesh Oct 14, 2024
bbe9e95
more english
wjayesh Oct 14, 2024
516a214
Apply suggestions from code review
wjayesh Oct 15, 2024
0b966c1
Merge branch 'docs/docker-skip-build' of https://github.com/zenml-io/…
wjayesh Oct 14, 2024
d373e44
apply review changes
wjayesh Oct 16, 2024
563cb04
add how to reuse builds page
wjayesh Oct 16, 2024
75d947c
aoply hamza comments
wjayesh Oct 16, 2024
44dc550
add redirect for new page name
wjayesh Oct 16, 2024
e5cd75e
apply review changes
wjayesh Oct 16, 2024
a369a8c
move the artifact store block to the top
wjayesh Oct 16, 2024
b626e21
update redirect
wjayesh Oct 16, 2024
d2acb0a
add scarf
wjayesh Oct 16, 2024
a3d8da2
Update .gitbook.yaml
wjayesh Oct 17, 2024
d9daabc
link to code repository
wjayesh Oct 16, 2024
1a0d4dc
Merge branch 'develop' into docs/docker-skip-build
wjayesh Oct 17, 2024
b382722
Merge branch 'develop' into docs/docker-skip-build
wjayesh Oct 17, 2024
a958607
fix relative link
wjayesh Oct 16, 2024
dcbfd8d
Apply suggestions from code review
wjayesh Oct 17, 2024
a407181
Merge branch 'develop' into docs/docker-skip-build
wjayesh Oct 17, 2024
740150d
add where the code should be added
wjayesh Oct 16, 2024
4 changes: 2 additions & 2 deletions .gitbook.yaml
@@ -4,5 +4,5 @@ structure:
readme: introduction.md
summary: toc.md

#redirects:
# help: ./support.md
redirects:
how-to/customize-docker-builds/use-code-repositories-to-speed-up-docker-build-times: how-to/customize-docker-builds/how-to-reuse-builds.md
@@ -1,8 +1,43 @@
# Use code repositories to speed up Docker build times
---
description: >
Learn how to reuse builds to speed up your pipeline runs.
---

While reusing Docker builds is useful, it can be limited. This is because specifying a custom build when running a pipeline will **not run the code on your client machine** but will use the code **included in the Docker images of the build**. As a consequence, even if you make local code changes, reusing a build will _always_ execute the code bundled in the Docker image, rather than the local code. Therefore, if you would like to reuse a Docker build AND make sure your local code changes are also downloaded into the image, you need to disconnect your code from the build.
# How to reuse builds

You can do so by connecting a git repository. Registering a code repository lets you avoid building images each time you run a pipeline **and** quickly iterate on your code. When running a pipeline that is part of a local code repository checkout, ZenML can instead build the Docker images without including any of your source files, and download the files inside the container before running your code. This greatly speeds up the building process and also allows you to reuse images that one of your colleagues might have built for the same stack.
When you run a pipeline, ZenML will check if a build with the same pipeline and stack exists. If it does, it will reuse that build. If it doesn't, ZenML will create a new build. This guide explains what a build is and the best practices around reusing builds.
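
The lookup described above can be pictured with a small sketch. It is a simplification under stated assumptions: the `Build` record and the `find_reusable_build` helper are hypothetical names and not ZenML APIs, and matching on pipeline and stack names alone stands in for whatever ZenML compares internally.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Build:
    """Hypothetical stand-in for a stored pipeline build."""

    build_id: str
    pipeline: str
    stack: str


def find_reusable_build(
    builds: Iterable[Build], pipeline: str, stack: str
) -> Optional[Build]:
    """Return the first build recorded for this pipeline/stack pair, if any."""
    for build in builds:
        if build.pipeline == pipeline and build.stack == stack:
            return build
    # No match: in this sketch the caller would create a new build.
    return None


# Illustrative data only.
existing = [Build("b-1", "training", "vertex-stack")]
reused = find_reusable_build(existing, "training", "vertex-stack")
fresh = find_reusable_build(existing, "training", "local-stack")
```

In this sketch, `reused` is the stored build, while `fresh` is `None`, which corresponds to the case where a new build would be created.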

## What is a build?

A pipeline build is an encapsulation of a pipeline and the stack it was run on. It contains the Docker images that were built for the pipeline with all the requirements from the stack, integrations and the user. Optionally, it also contains the pipeline code.

You can list all the builds for a pipeline using the CLI:

```bash
zenml pipeline builds list --pipeline_id='startswith:ab53ca'
```

You can also create a build manually using the CLI:

```bash
zenml pipeline build --stack vertex-stack my_module.my_pipeline_instance
```

You can use the options to specify the configuration file and the stack to use for the build. The source should be a path to a pipeline instance. Learn more about the build function [here](https://sdkdocs.zenml.io/latest/core_code_docs/core-new/#zenml.new.pipelines.pipeline.Pipeline.build).

## Reusing builds

As already mentioned, ZenML automatically finds an existing build that matches your pipeline and stack. However, you can also force it to use a specific build by [passing the build ID](../../how-to/use-configuration-files/what-can-be-configured.md#build-id) to the `build` parameter of the pipeline configuration.

While reusing Docker builds is useful, it can be limited. This is because specifying a custom build when running a pipeline will **not run the code on your client machine** but will use the code **included in the Docker images of the build**. As a consequence, even if you make local code changes, reusing a build will _always_ execute the code bundled in the Docker image, rather than the local code. Therefore, if you would like to reuse a Docker build AND make sure your local code changes are also downloaded into the image, you need to disconnect your code from the build. You can do this either by registering a code repository or by letting ZenML use the artifact store to upload your code.

## Use the artifact store to upload your code

You can also let ZenML use the artifact store to upload your code. This is the default behaviour if no code repository is detected and the `allow_download_from_artifact_store` flag is not set to `False` in your `DockerSettings`.
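
The order of precedence implied here can be written down as a small decision function. This is only a sketch of the behaviour as this page describes it, not ZenML's implementation: `resolve_code_source` and its return labels are hypothetical, and the keyword argument simply mirrors the `allow_download_from_artifact_store` flag, assumed here to default to `True`.

```python
def resolve_code_source(
    code_repository_detected: bool,
    allow_download_from_artifact_store: bool = True,
) -> str:
    """Pick where the pipeline code comes from at run time (illustrative labels)."""
    if code_repository_detected:
        # A registered code repository takes priority: files are downloaded from it.
        return "code_repository"
    if allow_download_from_artifact_store:
        # Default fallback: code is uploaded to the artifact store.
        return "artifact_store"
    # Both options unavailable: the code has to be part of the Docker image.
    return "docker_image"


default_choice = resolve_code_source(code_repository_detected=False)
```

Only the last branch ties your code to the image, which is the situation in which reusing a build also means reusing old code.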

## Use code repositories to speed up Docker build times

One way to speed up Docker builds is to connect a git repository. Registering a [code repository](../../user-guide/production-guide/connect-code-repository.md) lets you avoid building images each time you run a pipeline **and** quickly iterate on your code. When running a pipeline that is part of a local code repository checkout, ZenML can instead build the Docker images without including any of your source files, and download the files inside the container before running your code. This greatly speeds up the building process and also allows you to reuse images that one of your colleagues might have built for the same stack.

ZenML will **automatically figure out which builds match your pipeline and reuse the appropriate build id**. Therefore, you **do not** need to explicitly pass in the build id when you have a clean repository state and a connected git repository. This approach is **highly recommended**. See an end to end example [here](../../user-guide/production-guide/connect-code-repository.md).
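
The two conditions named above (a connected repository and a clean repository state) can be sketched as a pure function. Everything here is an assumption for illustration: `find_checkout` and `can_auto_reuse_build` are hypothetical helpers, paths are treated as POSIX paths, and whether the checkout has uncommitted changes is passed in by the caller instead of being read from git.

```python
from pathlib import PurePosixPath
from typing import Iterable, Optional


def find_checkout(source_root: str, checkout_roots: Iterable[str]) -> Optional[str]:
    """Return the first checkout root that contains the source root, if any."""
    root = PurePosixPath(source_root)
    for checkout in checkout_roots:
        try:
            # relative_to raises ValueError when root is not inside checkout.
            root.relative_to(PurePosixPath(checkout))
        except ValueError:
            continue
        return checkout
    return None


def can_auto_reuse_build(
    source_root: str,
    checkout_roots: Iterable[str],
    has_uncommitted_changes: bool,
) -> bool:
    """True when the source root is in a known checkout and the checkout is clean."""
    inside_checkout = find_checkout(source_root, checkout_roots) is not None
    return inside_checkout and not has_uncommitted_changes


# Illustrative paths only.
checkouts = ["/work/other", "/work/project"]
match = find_checkout("/work/project/src", checkouts)
```

The comparison is made on whole path components, so `/work/projectX` is not treated as being inside `/work/project`.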

@@ -14,18 +49,18 @@ zenml integration install github
```
{% endhint %}

## Detecting local code repository checkouts
### Detecting local code repository checkouts

Once you have registered one or more code repositories, ZenML will check whether the files you use when running a pipeline are tracked inside one of those code repositories. This happens as follows:

* First, the [source root](./which-files-are-built-into-the-image.md) is computed
* Next, ZenML checks whether this source root directory is included in a local checkout of one of the registered code repositories

## Tracking code version for pipeline runs
### Tracking code version for pipeline runs

If a [local code repository checkout](#detecting-local-code-repository-checkouts) is detected when running a pipeline, ZenML will store a reference to the current commit for the pipeline run, so you'll be able to know exactly which code was used. Note that this reference is only tracked if your local checkout is clean (i.e. it does not contain any untracked or uncommitted files). This is to ensure that your pipeline is actually running with the exact code stored at the specific code repository commit.

## Tips and best practices
### Tips and best practices

It is also important to take some additional points into consideration:

73 changes: 37 additions & 36 deletions docs/book/how-to/customize-docker-builds/use-a-prebuilt-image.md
@@ -4,23 +4,19 @@ description: "Skip building an image for your ZenML pipeline altogether."

# Use a prebuilt image for pipeline execution

When running a pipeline on a remote Stack, ZenML builds a Docker image with a base ZenML image and adds all of your project dependencies and your pipeline code to it. This process might take significant time depending on how big your dependencies are, how powerful your local system is and how fast your internet connection is (to pull base layers and push the final image to your container registry). Although this process only happens once and is skipped if ZenML detects no change in your environment, it might still be a bottleneck in your pipeline execution.
When running a pipeline on a remote Stack, ZenML builds a Docker image with a base ZenML image and adds all of your project dependencies to it. Optionally, if a code repository is not registered and `allow_download_from_artifact_store` is not set to `True` in your `DockerSettings`, ZenML will also add your pipeline code to the image. This process might take significant time depending on how big your dependencies are, how powerful your local system is and how fast your internet connection is. This is because Docker must pull base layers and push the final image to your container registry. Although this process only happens once and is skipped if ZenML detects no change in your environment, it might still be a bottleneck slowing down your pipeline execution.

To save time and costs, you can choose to not build a Docker image every time your pipeline runs. This guide shows you how to do it using a pre-built image, what you should include in your image for the pipeline to run successfully and other tips.
To save time and costs, you can choose to not build a Docker image every time your pipeline runs. This guide shows you how to do it using a prebuilt image, what you should include in your image for the pipeline to run successfully and other tips.

{% hint style="info" %}
Note that using this feature means that you won't be able to leverage any updates you make to your code or dependencies, outside of what your image already contains.
{% endhint %}

The DockerSettings class in ZenML allows you to set a parent image to be used in your pipeline runs and the ability to skip building an image on top of it. This approach works in two ways:
- [If ZenML built an image for your pipeline previously](#if-zenml-built-an-image-for-your-pipeline-previously)
- [If ZenML has never built an image for your pipeline](#if-zenml-has-never-built-an-image-for-your-pipeline)
## How do you use this feature

## If ZenML built an image for your pipeline previously
The [DockerSettings](../../../../docs/book/how-to/customize-docker-builds/docker-settings-on-a-pipeline.md#specify-docker-settings-for-a-pipeline) class in ZenML allows you to set a parent image to be used in your pipeline runs and the ability to skip building an image on top of it.

This is the case where you have had a previous pipeline run where ZenML did build an image and pushed it your registry.

Here, you can just reuse the image by setting it to the `parent_image` attribute of the `DockerSettings` class and setting `skip_build` to `True`.
Just set the `parent_image` attribute of the `DockerSettings` class to the image you want to use and set `skip_build` to `True`.

```python
docker_settings = DockerSettings(
@@ -34,26 +30,26 @@ def my_pipeline(...):
...
```

On every subsequent pipeline run now, ZenML will just use the code and the dependencies that were a part of the pipeline run whose image you are now using. This approach works without you having to worry about what goes in the image since ZenML built it in the first place.

{% hint style="warning" %}
You should make sure that this image is pushed to a registry where the orchestrator/step operator/other components that require the image can pull it from, without any involvement by ZenML.
{% endhint %}

## If ZenML has never built an image for your pipeline
## What the parent image should contain

This is the case where you are running a pipeline for the first time with ZenML and don't want ZenML to build an image for you. Here, you will have to provide an image to the `parent_image` attribute which already has all of your code and the right dependencies installed.
When you run a pipeline with a prebuilt image and skip the build process, ZenML will not build any image on top of it. This means that the image you provide to the `parent_image` attribute of the `DockerSettings` class has to contain all the dependencies needed to run your pipeline. It must also contain your code files if you don't have a code repository registered and the `allow_download_from_artifact_store` flag is set to `False`.

{% hint style="info" %}
Note that this is different from the case where you [only specify a parent image](../../../../docs/book/how-to/customize-docker-builds/docker-settings-on-a-pipeline.md#using-a-pre-built-parent-image) and don't want to `skip_build`. In the latter, ZenML still builds the image but does it on top of your parent image and not the base ZenML image.
Note that this is different from the case where you [only specify a parent image](./docker-settings-on-a-pipeline.md#using-a-pre-built-parent-image) and don't want to `skip_build`. In the latter, ZenML still builds the image but does it on top of your parent image and not the base ZenML image.
{% endhint %}
{% hint style="info" %}
If you're using an image that was already built by ZenML in a previous pipeline run, you don't need to worry about what goes in it as long as it was built for the **same stack** as your current pipeline run. You can use it directly.
{% endhint %}

### What should your image contain

It is crucial that you ensure that all of your code files and dependencies are included in the image you provide to `DockerSettings` as parent image. This is because ZenML expects to be able to execute this image right away, without any modifications, to run your pipeline steps.

The following points are derived from how ZenML builds an image internally and will help you make your own images:
The following points are derived from how ZenML builds an image internally and will help you make your own images.

#### Your stack requirements
### Your stack requirements

A ZenML Stack can have different tools and each comes with its own requirements. You need to ensure that your image contains them. The following is how you can get a list of stack requirements.
A ZenML Stack can have different components and each comes with its own requirements. You need to ensure that your image contains them. The following is how you can get a list of stack requirements.

```python
from zenml.client import Client
@@ -67,9 +63,9 @@ active_stack = Client().active_stack
stack_requirements = active_stack.requirements()
```

#### Integration requirements
### Integration requirements

For all integrations that you use in your pipeline, you need to have their dependencies installed too. The following is how you can get a list of them.
For all integrations that you use in your pipeline, you need to have their dependencies installed too. You can get a list of them in the following way:

```python
from zenml.integrations.registry import integration_registry
@@ -90,33 +86,38 @@ integration_requirements = set(
)
```

#### Any project specific requirements
### Any project-specific requirements

Any other dependencies that your project relies on. You can then install all of these different requirements through a line in your Dockerfile that looks like the following. It assumes you have accumulated all the requirements in one file.
For any other dependencies that your project relies on, you can then install all of these different requirements through a line in your `Dockerfile` that looks like the following. It assumes you have accumulated all the requirements in one file.

```Dockerfile
RUN pip insatll <ANY_ARGS> -r FILE
RUN pip install <ANY_ARGS> -r FILE
```
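
One way to accumulate the stack, integration and project requirements into the single file that the `pip install -r` line assumes is sketched below. It is a minimal illustration and not something ZenML provides: `merge_requirements` and `render_requirements_file` are hypothetical helpers, the package names are placeholders, and conflicting version pins are not resolved.

```python
from typing import Iterable, List


def merge_requirements(*groups: Iterable[str]) -> List[str]:
    """Combine several requirement collections into one sorted, de-duplicated list."""
    merged = set()
    for group in groups:
        for requirement in group:
            requirement = requirement.strip()
            if requirement:  # skip blank entries
                merged.add(requirement)
    return sorted(merged)


def render_requirements_file(requirements: Iterable[str]) -> str:
    """Render requirements as file contents, one requirement per line."""
    return "".join(f"{requirement}\n" for requirement in requirements)


# Placeholder values; in practice these come from your stack, integrations and project.
stack_requirements = {"kubernetes", "gcsfs"}
integration_requirements = {"gcsfs", "scikit-learn"}
project_requirements = ["pandas", " "]

lines = merge_requirements(
    stack_requirements, integration_requirements, project_requirements
)
contents = render_requirements_file(lines)
```

Writing `contents` to a file and copying that file into the image gives the `FILE` argument used in the `Dockerfile` line.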

#### Any system packages
### Any system packages

If you have any apt packages that are needed for your application to function, be sure to include them too. This can be achieved in a Dockerfile as follows:
If you have any `apt` packages that are needed for your application to function, be sure to include them too. This can be achieved in a `Dockerfile` as follows:

```Dockerfile
RUN apt-get update && apt-get install -y --no-install-recommends YOUR_APT_PACKAGES
```

#### Your project code files
### Your project code files

The files containing your pipeline and step code and all other necessary functions should also be available inside the image. Take a look at [which files are built into the image](../../../../docs/book/how-to/customize-docker-builds/which-files-are-built-into-the-image.md) page to learn more about what to include.
The files containing your pipeline and step code and all other necessary functions should be available in your execution environment.

- If you have a [code repository](../../user-guide/production-guide/connect-code-repository.md) registered, you don't need to include your code files in the image yourself. ZenML will download them from the repository to the appropriate location in the image.

{% hint style="info" %}
Note that you also need Python, pip and zenml installed in your image.
{% endhint %}
- If you don't have a code repository but `allow_download_from_artifact_store` is set to `True` in your `DockerSettings` (`True` by default), ZenML will upload your code to the artifact store and make it available to the image.

- If both of these options are disabled, you can include your code files in the image yourself. This approach is not recommended and you should use one of the above options.

Take a look at the [which files are built into the image](../../../../docs/book/how-to/customize-docker-builds/which-files-are-built-into-the-image.md) page to learn more about what to include.

## Where you can use this feature

- One way to use this is when ZenML has already built an image for your code in a previous pipeline run and you want to reuse it in a new run. This saves you build times at the cost of not being able to leverage any updates you made to your code (or your dependencies) since then.
- Another use case is when you are running in an environment that either doesn't have docker installed or doesn't have enough memory to pull your base image and build a new image on top of it (think Codespaces or other CI/CD environments).
{% hint style="info" %}
Note that you also need Python, `pip` and `zenml` installed in your image.
{% endhint %}
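
A quick sanity check for a candidate image is to ask the Python interpreter inside it whether the expected modules can be found. The helper below is a hypothetical sketch and not part of ZenML. It only confirms that the modules are discoverable by that interpreter, not that their versions are suitable.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that this interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Example with a standard-library module, which is always present.
stdlib_check = missing_modules(["json"])

# Inside the image you would instead run something like:
#   missing_modules(["pip", "zenml"])
# and expect an empty list.
```

Running this with the image's own `python` executable matters, because a different interpreter may see a different set of installed packages.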

<!-- For scarf -->
<figure><img alt="ZenML Scarf" referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=f0b4f458-0a54-4fcd-aa95-d5ee424815bc" /></figure>
2 changes: 1 addition & 1 deletion docs/book/toc.md
@@ -107,7 +107,7 @@
* [Specify pip dependencies and apt packages](how-to/customize-docker-builds/specify-pip-dependencies-and-apt-packages.md)
* [Use your own Dockerfiles](how-to/customize-docker-builds/use-your-own-docker-files.md)
* [Which files are built into the image](how-to/customize-docker-builds/which-files-are-built-into-the-image.md)
* [Use code repositories to automate Docker build reuse](how-to/customize-docker-builds/use-code-repositories-to-speed-up-docker-build-times.md)
* [How to reuse builds](how-to/customize-docker-builds/how-to-reuse-builds.md)
* [Define where an image is built](how-to/customize-docker-builds/define-where-an-image-is-built.md)
* [📔 Run remote pipelines from notebooks](how-to/run-remote-steps-and-pipelines-from-notebooks/README.md)
* [Limitations of defining steps in notebook cells](how-to/run-remote-steps-and-pipelines-from-notebooks/limitations-of-defining-steps-in-notebook-cells.md)