Update results as we changed torch compile to optimal setting #1799

Merged · 3 commits · Mar 15, 2024
37 changes: 19 additions & 18 deletions templates/getting_started/benchmarks.md
@@ -6,7 +6,7 @@ We benchmark the three backends of Keras 3
alongside native PyTorch implementations ([HuggingFace](https://huggingface.co/)
and [Meta Research](https://github.com/facebookresearch/)) and alongside Keras 2
with TensorFlow. Find code and setup details for reproducing our results
-[here](https://github.com/haifeng-jin/keras-benchmarks/tree/v0.0.1).
+[here](https://github.com/haifeng-jin/keras-benchmarks/tree/v0.0.2).

## Models

@@ -34,9 +34,10 @@ PyTorch backend.

We employed synthetic data for all benchmarks. We used `bfloat16` precision for
all LLM training and inferencing, and LoRA<sup>6</sup> for all LLM training
-(fine-tuning). Additionally, we applied `torch.compile()` to compatible native
-PyTorch implementations (with the exception of Gemma training and Mistral
-training due to incompatibility).
+(fine-tuning). Based on the recommendations of the PyTorch team, we used
+`torch.compile(model, mode="reduce-overhead")` with native PyTorch
+implementations (with the exception of Gemma training and Mistral training due
+to incompatibility).
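The compile call the new wording describes can be sketched as follows. This is a minimal illustration assuming PyTorch >= 2.0; the `torch.nn.Linear` module is a stand-in, not one of the benchmarked models, which are wrapped the same way:

```python
import torch

# Stand-in module; the benchmarks apply the same call to the actual
# HuggingFace / Meta Research models.
model = torch.nn.Linear(16, 16)

# "reduce-overhead" trades extra compile time (and CUDA-graph capture when
# running on GPU) for lower per-step launch overhead, which matters most at
# small batch sizes. Compilation is lazy: it runs on the first forward call.
compiled = torch.compile(model, mode="reduce-overhead")
```

Note that `torch.compile` returns a wrapped module immediately; incompatibilities such as the Gemma and Mistral training cases only surface once the compiled model is actually invoked.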

To measure out-of-the-box performance, we use high-level APIs (e.g. `Trainer()`
from HuggingFace, plain PyTorch training loops and Keras `model.fit()`) with as
@@ -75,18 +76,18 @@ better.

| | Batch<br>size | Native<br>PyTorch | Keras 2<br>(TensorFlow) | Keras 3<br>(TensorFlow) | Keras 3<br>(JAX) | Keras 3<br>(PyTorch) | Keras 3<br>(best) |
|:---:|---:|---:|---:|---:|---:|---:|---:|
-| **SegmentAnything<br>(fit)** | 1 | 1,306.85 | 386.93 | **355.25** | 361.69 | 1,388.87 | **355.25** |
-| **SegmentAnything<br>(predict)** | 7 | 2,733.90 | 3,187.09 | 762.67 | **660.16** | 2,973.64 | **660.16** |
-| **Stable Diffusion<br>(fit)** | 8 | 481.22 | 1,023.21 | 392.24 | **391.21** | 823.44 | **391.21** |
-| **Stable Diffusion<br>(predict)** | 13 | 775.36 | 649.71 | **616.04** | 627.27 | 1,337.17 | **616.04** |
-| **BERT<br>(fit)** | 54 | 1,137.57 | 841.84 | **404.17** | 414.26 | 1,320.41 | **404.17** |
-| **BERT<br>(predict)** | 531 | 3,837.65 | 965.21 | 962.11 | **865.29** | 3,869.72 | **865.29** |
+| **SegmentAnything<br>(fit)** | 1 | 1,310.97 | 386.93 | **355.25** | 361.69 | 1,388.87 | **355.25** |
+| **SegmentAnything<br>(predict)** | 7 | 2,733.65 | 3,187.09 | 762.67 | **660.16** | 2,973.64 | **660.16** |
+| **Stable Diffusion<br>(fit)** | 8 | 484.56 | 1,023.21 | 392.24 | **391.21** | 823.44 | **391.21** |
+| **Stable Diffusion<br>(predict)** | 13 | 759.05 | 649.71 | **616.04** | 627.27 | 1,337.17 | **616.04** |
+| **BERT<br>(fit)** | 32 | 214.73 | 486.00 | **214.49** | 222.37 | 808.68 | **214.49** |
+| **BERT<br>(predict)** | 256 | 739.46 | 470.12 | 466.01 | **418.72** | 1,865.98 | **418.72** |
| **Gemma<br>(fit)** | 8 | 253.95 | NA | **232.52** | 273.67 | 525.15 | **232.52** |
-| **Gemma<br>(generate)** | 32 | 2,717.04 | NA | 1,134.91 | **1,128.21** | 7,952.67<sup>*</sup> | **1,128.21** |
-| **Gemma<br>(generate)** | 1 | 1,632.66 | NA | 758.57 | **703.46** | 7,649.40<sup>*</sup> | **703.46** |
+| **Gemma<br>(generate)** | 32 | 2,735.18 | NA | 1,134.91 | **1,128.21** | 7,952.67<sup>*</sup> | **1,128.21** |
+| **Gemma<br>(generate)** | 1 | 1,618.85 | NA | 758.57 | **703.46** | 7,649.40<sup>*</sup> | **703.46** |
| **Mistral<br>(fit)** | 8 | 217.56 | NA | **185.92** | 213.22 | 452.12 | **185.92** |
-| **Mistral<br>(generate)** | 32 | 1,594.65 | NA | 966.06 | **957.25** | 10,932.59<sup>*</sup> | **957.25** |
-| **Mistral<br>(generate)** | 1 | 1,532.63 | NA | 743.28 | **679.30** | 11,054.67<sup>*</sup> | **679.30** |
+| **Mistral<br>(generate)** | 32 | 1,633.50 | NA | 966.06 | **957.25** | 10,932.59<sup>*</sup> | **957.25** |
+| **Mistral<br>(generate)** | 1 | 1,554.79 | NA | 743.28 | **679.30** | 11,054.67<sup>*</sup> | **679.30** |

\* _LLM inference with the PyTorch backend is abnormally slow at this time
because KerasNLP uses static sequence padding, unlike HuggingFace. This will be
@@ -112,13 +113,13 @@ the throughput (steps/ms) increase of Keras 3 over native PyTorch from Table 2.
A 100% increase indicates Keras 3 is twice as fast, while 0% means both
frameworks perform equally.
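In code, the speedup figure is just a ratio of step times, since throughput (steps/ms) is the reciprocal of step time. The function name below is ours; the sample values are the Gemma (fit) row of Table 2:

```python
def speedup_pct(baseline_ms_per_step: float, keras_ms_per_step: float) -> float:
    """Throughput increase in percent. Throughput is steps per ms, i.e. the
    reciprocal of step time, so the throughput ratio is the inverse ratio
    of the two step times."""
    return (baseline_ms_per_step / keras_ms_per_step - 1.0) * 100.0

# A run that is twice as fast is a 100% increase:
print(speedup_pct(200.0, 100.0))  # 100.0

# Gemma (fit): native PyTorch 253.95 ms/step vs Keras 3 best 232.52 ms/step.
print(round(speedup_pct(253.95, 232.52), 1))  # 9.2
```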

-![Figure 1](https://i.imgur.com/3s3RZOx.png)
+![Figure 1](https://i.imgur.com/tFeLTbz.png)

**Figure 1**: Keras 3 speedup over PyTorch measured in throughput (steps/ms)

Keras 3 with the best-performing backend outperformed the reference native
PyTorch implementations for all the models. Notably, 5 out of 10 tasks
-demonstrated speedups exceeding 100%, with a maximum speedup of 340%.
+demonstrated speedups exceeding 50%, with a maximum speedup of 340%.

### Key Finding 3: Keras 3 delivers best-in-class "out-of-the-box" performance

@@ -149,7 +150,7 @@ We also calculated the throughput (steps/ms) increase of Keras 3 (using its
best-performing backend) over Keras 2 with TensorFlow from Table 1. Results are
shown in the following figure.

-![Figure 2](https://i.imgur.com/BUjRUK1.png)
+![Figure 2](https://i.imgur.com/lBAPgsY.png)

**Figure 2**: Keras 3 speedup over Keras 2 measured in throughput (steps/ms)

@@ -190,4 +191,4 @@ open models." The Keyword, Google (2024).
arXiv:2310.06825 (2023).

<sup>6</sup> Hu, Edward J., et al. "LoRA: Low-rank adaptation of large language
models." ICLR (2022).