diff --git a/templates/getting_started/benchmarks.md b/templates/getting_started/benchmarks.md
index 4667b56094..10f329420f 100644
--- a/templates/getting_started/benchmarks.md
+++ b/templates/getting_started/benchmarks.md
@@ -6,7 +6,7 @@ We benchmark the three backends of Keras 3
alongside native PyTorch implementations ([HuggingFace](https://huggingface.co/)
and [Meta Research](https://github.com/facebookresearch/)) and alongside Keras 2
with TensorFlow. Find code and setup details for reproducing our results
-[here](https://github.com/haifeng-jin/keras-benchmarks/tree/v0.0.1).
+[here](https://github.com/haifeng-jin/keras-benchmarks/tree/v0.0.2).
## Models
@@ -34,9 +34,10 @@ PyTorch backend.
We employed synthetic data for all benchmarks. We used `bfloat16` precision for
all LLM training and inferencing, and LoRA6 for all LLM training
-(fine-tuning). Additionally, we applied `torch.compile()` to compatible native
-PyTorch implementations (with the exception of Gemma training and Mistral
-training due to incompatibility).
+(fine-tuning). Based on the recommendations of the PyTorch team, we used
+`torch.compile(model, mode="reduce-overhead")` with native PyTorch
+implementations (with the exception of Gemma training and Mistral training due
+to incompatibility).
To measure out-of-the-box performance, we use high-level APIs (e.g. `Trainer()`
from HuggingFace, plain PyTorch training loops and Keras `model.fit()`) with as
@@ -75,18 +76,18 @@ better.
| | Batch size | Native PyTorch | Keras 2 (TensorFlow) | Keras 3 (TensorFlow) | Keras 3 (JAX) | Keras 3 (PyTorch) | Keras 3 (best) |
|:---:|---:|---:|---:|---:|---:|---:|---:|
-| **SegmentAnything (fit)** | 1 | 1,306.85 | 386.93 | **355.25** | 361.69 | 1,388.87 | **355.25** |
-| **SegmentAnything (predict)** | 7 | 2,733.90 | 3,187.09 | 762.67 | **660.16** | 2,973.64 | **660.16** |
-| **Stable Diffusion (fit)** | 8 | 481.22 | 1,023.21 | 392.24 | **391.21** | 823.44 | **391.21** |
-| **Stable Diffusion (predict)** | 13 | 775.36 | 649.71 | **616.04** | 627.27 | 1,337.17 | **616.04** |
-| **BERT (fit)** | 54 | 1,137.57 | 841.84 | **404.17** | 414.26 | 1,320.41 | **404.17** |
-| **BERT (predict)** | 531 | 3,837.65 | 965.21 | 962.11 | **865.29** | 3,869.72 | **865.29** |
+| **SegmentAnything (fit)** | 1 | 1,310.97 | 386.93 | **355.25** | 361.69 | 1,388.87 | **355.25** |
+| **SegmentAnything (predict)** | 7 | 2,733.65 | 3,187.09 | 762.67 | **660.16** | 2,973.64 | **660.16** |
+| **Stable Diffusion (fit)** | 8 | 484.56 | 1,023.21 | 392.24 | **391.21** | 823.44 | **391.21** |
+| **Stable Diffusion (predict)** | 13 | 759.05 | 649.71 | **616.04** | 627.27 | 1,337.17 | **616.04** |
+| **BERT (fit)** | 32 | 214.73 | 486.00 | **214.49** | 222.37 | 808.68 | **214.49** |
+| **BERT (predict)** | 256 | 739.46 | 470.12 | 466.01 | **418.72** | 1,865.98 | **418.72** |
| **Gemma (fit)** | 8 | 253.95 | NA | **232.52** | 273.67 | 525.15 | **232.52** |
-| **Gemma (generate)** | 32 | 2,717.04 | NA | 1,134.91 | **1,128.21** | 7,952.67* | **1,128.21** |
-| **Gemma (generate)** | 1 | 1,632.66 | NA | 758.57 | **703.46** | 7,649.40* | **703.46** |
+| **Gemma (generate)** | 32 | 2,735.18 | NA | 1,134.91 | **1,128.21** | 7,952.67* | **1,128.21** |
+| **Gemma (generate)** | 1 | 1,618.85 | NA | 758.57 | **703.46** | 7,649.40* | **703.46** |
| **Mistral (fit)** | 8 | 217.56 | NA | **185.92** | 213.22 | 452.12 | **185.92** |
-| **Mistral (generate)** | 32 | 1,594.65 | NA | 966.06 | **957.25** | 10,932.59* | **957.25** |
-| **Mistral (generate)** | 1 | 1,532.63 | NA | 743.28 | **679.30** | 11,054.67* | **679.30** |
+| **Mistral (generate)** | 32 | 1,633.50 | NA | 966.06 | **957.25** | 10,932.59* | **957.25** |
+| **Mistral (generate)** | 1 | 1,554.79 | NA | 743.28 | **679.30** | 11,054.67* | **679.30** |
\* _LLM inference with the PyTorch backend is abnormally slow at this time
because KerasNLP uses static sequence padding, unlike HuggingFace. This will be
@@ -112,13 +113,13 @@ the throughput (steps/ms) increase of Keras 3 over native PyTorch from Table 2.
A 100% increase indicates Keras 3 is twice as fast, while 0% means both
frameworks perform equally.
-
+
**Figure 1**: Keras 3 speedup over PyTorch measured in throughput (steps/ms)
Keras 3 with the best-performing backend outperformed the reference native
PyTorch implementations for all the models. Notably, 5 out of 10 tasks
-demonstrated speedups exceeding 100%, with a maximum speedup of 340%.
+demonstrated speedups exceeding 50%, with a maximum speedup of 340%.
### Key Finding 3: Keras 3 delivers best-in-class "out-of-the-box" performance
@@ -149,7 +150,7 @@ We also calculated the throughput (steps/ms) increase of Keras 3 (using its
best-performing backend) over Keras 2 with TensorFlow from Table 1. Results are
shown in the following figure.
-
+
**Figure 2**: Keras 3 speedup over Keras 2 measured in throughput (steps/ms)
@@ -190,4 +191,4 @@ open models." The Keyword, Google (2024).
arXiv:2310.06825 (2023).
6 Hu, Edward J., et al. "Lora: Low-rank adaptation of large language
-models." ICLR (2022).
\ No newline at end of file
+models." ICLR (2022).
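
The speedup figures this patch updates (Figure 1) follow directly from the step times in the table: throughput is the reciprocal of step time, so a 100% throughput increase means the step time halved. A minimal sketch of that arithmetic, using two rows from the updated table (the helper name `speedup_pct` is illustrative, not from the benchmark code):

```python
def speedup_pct(reference_ms: float, candidate_ms: float) -> float:
    """Throughput (steps/ms) increase of candidate over reference, in percent.

    Throughput is 1 / step_time, so the relative increase is
    reference_ms / candidate_ms - 1.
    """
    return (reference_ms / candidate_ms - 1.0) * 100.0

# Gemma (fit): native PyTorch 253.95 ms/step vs. Keras 3 best 232.52 ms/step.
print(round(speedup_pct(253.95, 232.52), 1))    # 9.2

# SegmentAnything (predict): 2,733.65 ms/step vs. 660.16 ms/step.
print(round(speedup_pct(2733.65, 660.16), 1))   # 314.1
```

This is why halving a step time reads as a "100% speedup" in the figure, while equal step times read as 0%.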