Candle is a minimalistic Machine/Deep Learning framework written in Rust by Hugging Face. It aims to provide a simpler interface for implementing models, along with GPU support. This is a modified implementation of the Llama2-Candle example, used to analyse benchmark performance across different devices and precisions.
For running this benchmark, make sure you have Rust installed. You can run the Candle benchmark using the following command:
```bash
./bench_candle/bench.sh \
  --prompt <value> \            # Prompt string to benchmark with
  --max_tokens <value> \        # Maximum number of tokens to generate
  --repetitions <value> \       # Number of repetitions for the prompt
  --log_file <file_path> \      # Path of the .log file to write results to
  --device <cpu/cuda/metal> \   # Device on which to run the benchmark
  --models_dir <path_to_models> # Directory containing the model weights
```
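As a concrete illustration of the flags above, a full invocation might look like the following. The prompt text, token counts, log file name, and models path here are illustrative assumptions, not the defaults baked into bench.sh:

```shell
# Hypothetical example run: benchmark on CUDA with a custom prompt,
# generating up to 256 tokens, repeating the prompt 10 times, and
# writing results to benchmark.log. Adjust --models_dir to wherever
# your Llama2-7B weights actually live.
./bench_candle/bench.sh \
  --prompt "Explain the difference between CPU and GPU inference." \
  --max_tokens 256 \
  --repetitions 10 \
  --log_file benchmark.log \
  --device cuda \
  --models_dir ./models
```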
To get started quickly you can simply run:
```bash
./bench_candle/bench.sh -d cuda
```
This will use all the default values (see the bench.sh file) and run the benchmarks. You can find all the benchmark results for Candle here.
- Running this benchmark requires the Hugging Face Llama2-7B weights, so it assumes that you have already agreed to the required terms and conditions and have been verified to download them.
- Candle does not have support for Metal devices.
- Candle does support quantized models. Benchmarks for quantized Candle models will be available in upcoming versions.