
v0.2.0

@tengomucho released this 20 Nov 13:06
1fc59ce

This is the first release of Optimum TPU that includes support for the JetStream PyTorch engine as a backend for Text Generation Inference (TGI).
JetStream is a throughput- and memory-optimized engine for LLM inference on TPUs, and its PyTorch implementation allows for seamless integration into the TGI code. The supported models are, for now, Llama 2 and Llama 3, Gemma 1, and Mixtral; serving inference on these models has yielded close to a 10x improvement in tokens/sec compared to the previously used backend (PyTorch XLA/transformers).
On top of that, it is possible to use quantization to serve models with even fewer resources while maintaining similar throughput and quality.
Details follow.
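
Since the backend change is transparent to TGI clients, a served model can be queried exactly as before. Below is a minimal sketch using the `huggingface_hub` `InferenceClient`; the endpoint URL and generation parameters are illustrative assumptions, not part of this release.

```python
# Minimal sketch of querying a TGI server backed by the JetStream PyTorch
# engine. The endpoint URL and parameters below are illustrative assumptions;
# any TGI-compatible client works the same way.
from huggingface_hub import InferenceClient

# Assumes a TGI container is already serving one of the supported models
# (e.g. Llama 2/3, Gemma 1, or Mixtral) at this address.
client = InferenceClient("http://localhost:8080")

# Standard TGI text-generation request; the backend swap is invisible
# to the client.
output = client.text_generation(
    "What is the capital of France?",
    max_new_tokens=64,
)
print(output)
```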

What's Changed

New Contributors

Full Changelog: v0.1.5...v0.2.0