Leverage Docker to create a consistent and reproducible environment for running LayerSkip without requiring GPU support. This setup ensures that all dependencies are managed efficiently and secrets like the HuggingFace token are handled securely.
- **Docker installed**: Ensure Docker is installed on your machine (see "Get Docker" in the Docker documentation).
- **HuggingFace token**: Obtain your HuggingFace access token (see "Access Tokens" in your HuggingFace account settings).
Follow these steps to build the Docker image for LayerSkip:
- **Clone the repository:**

  ```bash
  git clone git@github.com:facebookresearch/LayerSkip.git
  cd LayerSkip
  ```
- **Ensure the Dockerfile and entrypoint script are present:**

  Make sure the `Dockerfile`, `entrypoint.sh`, and `.dockerignore` are located in the root directory of your project, as shown below:

  ```
  .
  ├── Dockerfile
  ├── entrypoint.sh
  ├── .dockerignore
  ├── arguments.py
  ├── benchmark.py
  ├── CODE_OF_CONDUCT.md
  ├── CONTRIBUTING.md
  ├── correctness.py
  ├── data.py
  ├── eval.py
  ├── generate.py
  ├── LICENSE
  ├── README.md
  ├── requirements.txt
  ├── self_speculation
  │   ├── autoregressive_generator.py
  │   ├── generator_base.py
  │   ├── llama_model_utils.py
  │   ├── self_speculation_generator.py
  │   └── speculative_streamer.py
  ├── sweep.py
  └── utils.py
  ```
- **Build the Docker image:**

  ```bash
  docker build -t layerskip:latest .
  ```
  **Explanation:**
  - `-t layerskip:latest`: Tags the image as `layerskip` with the `latest` tag.
  - `.`: Specifies the current directory as the build context.

  **Note:** The build process may take several minutes while all dependencies are installed.
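The repository ships its own `Dockerfile`; the sketch below is only an illustration of the kind of CPU-only image this guide assumes (a conda environment named `layer_skip`, dependencies from `requirements.txt`, and `entrypoint.sh` as the entrypoint). Base image and version numbers are illustrative and may not match the actual file:

```dockerfile
# Hypothetical sketch -- the repository's actual Dockerfile may differ.
FROM continuumio/miniconda3

WORKDIR /app

# Copy requirements first so this layer is cached across code-only changes
COPY requirements.txt /app/requirements.txt

# Create the conda env, install CPU-only PyTorch, then the Python deps
RUN conda create -n layer_skip python=3.10 -y && \
    conda install -n layer_skip pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 cpuonly -c pytorch -y && \
    conda run -n layer_skip pip install --no-cache-dir -r /app/requirements.txt

# Copy the rest of the code
COPY . /app

# entrypoint.sh activates the conda env and execs the passed command
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
```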
Once the Docker image is built, you can run your LayerSkip scripts inside the container. Below are instructions and examples for executing different scripts.
```bash
docker run -it --rm \
  -e HUGGINGFACE_TOKEN=your_huggingface_token_here \
  layerskip:latest \
  python your_script.py --help
```
**Flags and arguments:**
- `-it`: Runs the container in interactive mode with a pseudo-TTY.
- `--rm`: Automatically removes the container when it exits.
- `-e HUGGINGFACE_TOKEN=your_huggingface_token_here`: Sets the `HUGGINGFACE_TOKEN` environment variable inside the container. Replace `your_huggingface_token_here` with your actual token.
- `layerskip:latest`: Specifies the Docker image to use.
- `python your_script.py --help`: The command to execute inside the container. Replace `your_script.py --help` with your desired script and arguments.
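For intuition, here is a minimal, hypothetical sketch (not the actual LayerSkip code) of how a script inside the container can pick up a token passed via `-e HUGGINGFACE_TOKEN=...` through the process environment:

```python
import os

# Hypothetical helper -- the actual LayerSkip scripts may read the token
# differently. `docker run -e HUGGINGFACE_TOKEN=...` makes the value
# visible to every process in the container via its environment.
def get_hf_token(default=None):
    """Return the HuggingFace token from the environment, if set."""
    return os.environ.get("HUGGINGFACE_TOKEN", default)

# Simulate what `docker run -e HUGGINGFACE_TOKEN=...` does:
os.environ["HUGGINGFACE_TOKEN"] = "dummy_token"
print(get_hf_token())  # → dummy_token
```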
Run the `generate.py` script in interactive mode using regular autoregressive decoding:
```bash
docker run -it --rm \
  -e HUGGINGFACE_TOKEN=your_huggingface_token_here \
  layerskip:latest \
  python generate.py --model facebook/layerskip-llama2-7B \
    --sample True \
    --max_steps 512
```
To observe speedup with self-speculative decoding, specify `--exit_layer` and `--num_speculations`:
```bash
docker run -it --rm \
  -e HUGGINGFACE_TOKEN=your_huggingface_token_here \
  layerskip:latest \
  python generate.py --model facebook/layerskip-llama2-7B \
    --sample True \
    --max_steps 512 \
    --generation_strategy self_speculative \
    --exit_layer 8 \
    --num_speculations 6
```
Benchmark the model on a specific dataset:
```bash
docker run -it --rm \
  -e HUGGINGFACE_TOKEN=your_huggingface_token_here \
  -v /path/on/host/logs:/app/logs \
  layerskip:latest \
  python benchmark.py --model facebook/layerskip-llama2-7B \
    --dataset cnn_dm_summarization \
    --num_samples 100 \
    --generation_strategy self_speculative \
    --exit_layer 8 \
    --num_speculations 6 \
    --output_dir /app/logs
```
**Explanation:**
- `-v /path/on/host/logs:/app/logs`: Mounts the host directory `/path/on/host/logs` to the container's `/app/logs` directory, ensuring that logs are saved on the host.
Evaluate the model using the Eleuther Language Model Evaluation Harness:
```bash
docker run -it --rm \
  -e HUGGINGFACE_TOKEN=your_huggingface_token_here \
  -v /path/on/host/logs:/app/logs \
  layerskip:latest \
  python eval.py --model facebook/layerskip-llama2-7B \
    --tasks gsm8k \
    --limit 10 \
    --generation_strategy self_speculative \
    --exit_layer 8 \
    --num_speculations 6 \
    --output_dir /app/logs
```
Perform a sweep over different `exit_layer` and `num_speculations` hyperparameters:
```bash
docker run -it --rm \
  -e HUGGINGFACE_TOKEN=your_huggingface_token_here \
  -v /path/on/host/sweep:/app/sweep \
  layerskip:latest \
  python sweep.py --model facebook/layerskip-llama2-7B \
    --dataset human_eval \
    --generation_strategy self_speculative \
    --num_samples 150 \
    --max_steps 256 \
    --output_dir /app/sweep \
    --sample False
```
Verify the correctness of self-speculative decoding:
```bash
docker run -it --rm \
  -e HUGGINGFACE_TOKEN=your_huggingface_token_here \
  -v /path/on/host/correctness:/app/correctness \
  layerskip:latest \
  python correctness.py --model facebook/layerskip-llama2-7B \
    --dataset human_eval \
    --generation_strategy self_speculative \
    --num_speculations 6 \
    --exit_layer 4 \
    --num_samples 10 \
    --sample False \
    --output_dir /app/correctness
```
**Explanation:**
- `-v /path/on/host/correctness:/app/correctness`: Mounts the host directory `/path/on/host/correctness` to the container's `/app/correctness` directory, ensuring that correctness metrics are saved on the host.
To persist outputs and logs generated by your scripts, mount host directories to the corresponding directories inside the Docker container using the `-v` flag. This ensures that all results are stored on your host machine and are not lost when the container is removed.
Example: Mounting Logs Directory
```bash
docker run -it --rm \
  -e HUGGINGFACE_TOKEN=your_huggingface_token_here \
  -v /path/on/host/logs:/app/logs \
  layerskip:latest \
  python benchmark.py --model facebook/layerskip-llama2-7B \
    --dataset human_eval \
    --num_samples 100 \
    --generation_strategy self_speculative \
    --exit_layer 8 \
    --num_speculations 6 \
    --output_dir /app/logs
```
**Important:** Never hardcode sensitive information like the HuggingFace token directly into the `Dockerfile` or your scripts. Always pass it securely at runtime using environment variables.
When running the Docker container, pass the HuggingFace token using the `-e` flag:
```bash
docker run -it --rm \
  -e HUGGINGFACE_TOKEN=your_huggingface_token_here \
  layerskip:latest \
  python generate.py --help
```
For enhanced security, especially in production environments, consider using Docker secrets to manage sensitive data. This approach is more secure than passing environment variables directly.
Example Using Docker Secrets (Docker Swarm):
- **Create a secret:**

  ```bash
  echo "your_huggingface_token_here" | docker secret create huggingface_token -
  ```
- **Update `entrypoint.sh` to read the secret:**

  ```bash
  #!/bin/bash
  # entrypoint.sh

  # Activate the Conda environment
  source /opt/conda/etc/profile.d/conda.sh
  conda activate layer_skip

  # Read HuggingFace token from Docker secret
  export HUGGINGFACE_TOKEN=$(cat /run/secrets/huggingface_token)

  # Execute the passed command
  exec "$@"
  ```
- **Deploy the service with the secret:**

  ```bash
  docker service create --name layerskip_service \
    --secret huggingface_token \
    layerskip:latest \
    python generate.py --help
  ```
Note: Docker secrets are primarily designed for use with Docker Swarm. If you're not using Swarm, passing environment variables securely as shown earlier is the recommended approach.
To avoid re-downloading models every time you run the container, mount the HuggingFace cache directory:
```bash
docker run -it --rm \
  -e HUGGINGFACE_TOKEN=your_huggingface_token_here \
  -v /path/on/host/huggingface_cache:/root/.cache/huggingface \
  layerskip:latest \
  python generate.py --help
```
**Explanation:**
- `-v /path/on/host/huggingface_cache:/root/.cache/huggingface`: Mounts the host directory to the container's HuggingFace cache directory, speeding up model loading times.
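One hedged tip alongside this: on Linux, the Docker daemon creates a missing bind-mount source path owned by root, which can cause permission errors inside the container, so it can help to create the host cache directory yourself before mounting it (the path below is just an example):

```shell
# Create the host-side cache directory up front so it is owned by your
# user rather than being auto-created as root by the Docker daemon.
HOST_CACHE="$HOME/huggingface_cache"
mkdir -p "$HOST_CACHE"
ls -d "$HOST_CACHE"
```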
- **Leverage Docker caching:** By copying `requirements.txt` and installing dependencies before copying the rest of the code, Docker can cache these layers and speed up subsequent builds when dependencies haven't changed.
- **Combine `RUN` commands:** Reduce the number of Docker layers by combining multiple `RUN` commands where possible. Example:

  ```dockerfile
  RUN conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 cpuonly -c pytorch -y && \
      pip install --upgrade pip && \
      pip install --no-cache-dir -r /app/requirements.txt
  ```
If any of your scripts run a web server or need specific ports exposed, add the `EXPOSE` directive in the Dockerfile and map the ports when running the container.
Example: Exposing Port 8000
- **Update the Dockerfile:**

  ```dockerfile
  EXPOSE 8000
  ```

- **Run the container with port mapping:**

  ```bash
  docker run -it --rm \
    -e HUGGINGFACE_TOKEN=your_huggingface_token_here \
    -p 8000:8000 \
    layerskip:latest \
    python your_web_server_script.py
  ```
First, ensure that Python and PyTorch are correctly installed in your Docker image.
```bash
docker run -it --rm layerskip:latest python -c "import torch; print(torch.__version__)"
```
**Expected output:**

```
2.2.1
```

This confirms that PyTorch version 2.2.1 is installed.
Ensure that your scripts are accessible and functioning as expected by checking their help messages. This helps verify that all dependencies are correctly installed.
- **Generate script:**

  ```bash
  docker run -it --rm layerskip:latest python generate.py --help
  ```

  Expected output: the help message for `generate.py`, listing available arguments and usage instructions.

- **Benchmark script:**

  ```bash
  docker run -it --rm layerskip:latest python benchmark.py --help
  ```

  Expected output: the help message for `benchmark.py`.

- **Evaluate script:**

  ```bash
  docker run -it --rm layerskip:latest python eval.py --help
  ```

  Expected output: the help message for `eval.py`.

- **Sweep script:**

  ```bash
  docker run -it --rm layerskip:latest python sweep.py --help
  ```

  Expected output: the help message for `sweep.py`.

- **Correctness script:**

  ```bash
  docker run -it --rm layerskip:latest python correctness.py --help
  ```

  Expected output: the help message for `correctness.py`.
Ensure that environment variables are correctly set up within the Docker container.
```bash
docker run -it --rm layerskip:latest bash -c 'echo $HUGGINGFACE_TOKEN'
```
**Expected output:** If you haven't set `HUGGINGFACE_TOKEN` when running the container, this will likely be empty or show a default placeholder. This is expected if you don't have the token yet.
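Note the single quotes around `echo $HUGGINGFACE_TOKEN` in the command above: they keep the host shell from expanding the variable before it reaches the container, so you see the container's value rather than the host's. The same pitfall can be demonstrated locally with a dummy value, no Docker required:

```shell
unset HUGGINGFACE_TOKEN  # make sure the host (outer) shell does not have it

# Analogous to `docker run -e`: the variable is set only for the child process.
# Single quotes defer expansion to the child shell, which does have the value:
HUGGINGFACE_TOKEN=dummy_token bash -c 'echo $HUGGINGFACE_TOKEN'   # prints: dummy_token

# Double quotes let the *outer* shell expand it first (it is unset there),
# so the child just echoes an empty line even though it has the variable:
HUGGINGFACE_TOKEN=dummy_token bash -c "echo $HUGGINGFACE_TOKEN"   # prints an empty line
```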
Even without a valid token, you can pass a dummy value to ensure that environment variables are handled correctly. This won't allow you to access models, but it confirms that the token is being set.
```bash
docker run -it --rm \
  -e HUGGINGFACE_TOKEN=dummy_token \
  layerskip:latest \
  python generate.py --help
```
**Expected output:** The help message for `generate.py` is displayed without attempting to access any models, thereby avoiding authentication errors.
Once you obtain your HuggingFace token, you can run your scripts with proper authentication. Here's how to proceed:
Follow these steps to get your HuggingFace token:
- **Create a HuggingFace account:** If you haven't already, create an account on HuggingFace.
- **Generate a token:**
  - Navigate to your account settings.
  - Go to the "Access Tokens" section.
  - Click "New Token" and follow the prompts to generate a token.
  - **Note:** Keep your token secure and do not share it publicly.
Replace `your_huggingface_token_here` with your actual token in the following commands.
```bash
docker run -it --rm \
  -e HUGGINGFACE_TOKEN=your_huggingface_token_here \
  layerskip:latest \
  python generate.py --model facebook/layerskip-llama2-7B \
    --sample True \
    --max_steps 512 \
    --generation_strategy self_speculative \
    --exit_layer 8 \
    --num_speculations 6
```
```bash
docker run -it --rm \
  -e HUGGINGFACE_TOKEN=your_huggingface_token_here \
  -v /path/on/host/logs:/app/logs \
  layerskip:latest \
  python benchmark.py --model facebook/layerskip-llama2-7B \
    --dataset cnn_dm_summarization \
    --num_samples 100 \
    --generation_strategy self_speculative \
    --exit_layer 8 \
    --num_speculations 6 \
    --output_dir /app/logs
```
**Explanation:**
- `-v /path/on/host/logs:/app/logs`: Mounts the host directory `/path/on/host/logs` to the container's `/app/logs` directory, ensuring logs are saved on the host.
```bash
docker run -it --rm \
  -e HUGGINGFACE_TOKEN=your_huggingface_token_here \
  -v /path/on/host/logs:/app/logs \
  layerskip:latest \
  python eval.py --model facebook/layerskip-llama2-7B \
    --tasks gsm8k \
    --limit 10 \
    --generation_strategy self_speculative \
    --exit_layer 8 \
    --num_speculations 6 \
    --output_dir /app/logs
```
```bash
docker run -it --rm \
  -e HUGGINGFACE_TOKEN=your_huggingface_token_here \
  -v /path/on/host/sweep:/app/sweep \
  layerskip:latest \
  python sweep.py --model facebook/layerskip-llama2-7B \
    --dataset human_eval \
    --generation_strategy self_speculative \
    --num_samples 150 \
    --max_steps 256 \
    --output_dir /app/sweep \
    --sample False
```
```bash
docker run -it --rm \
  -e HUGGINGFACE_TOKEN=your_huggingface_token_here \
  -v /path/on/host/correctness:/app/correctness \
  layerskip:latest \
  python correctness.py --model facebook/layerskip-llama2-7B \
    --dataset human_eval \
    --generation_strategy self_speculative \
    --num_speculations 6 \
    --exit_layer 4 \
    --num_samples 10 \
    --sample False \
    --output_dir /app/correctness
```
- **Without a HuggingFace token:**
  - Run help commands to ensure scripts are accessible.
  - Verify the Python and PyTorch installations.
  - Check environment variable handling.
- **With a HuggingFace token:**
  - Run your scripts as intended by passing the token via the `-e` flag.
  - Mount host directories for logs and outputs as needed.