From 6737ff832893d2145dfc8cdeb1611baca7e53f87 Mon Sep 17 00:00:00 2001
From: Haifeng Jin <5476582+haifeng-jin@users.noreply.github.com>
Date: Fri, 15 Mar 2024 22:21:26 +0000
Subject: [PATCH 1/3] update results as we changed torch compile to optimal
setting
---
templates/getting_started/benchmarks.md | 28 ++++++++++++-------------
1 file changed, 14 insertions(+), 14 deletions(-)
diff --git a/templates/getting_started/benchmarks.md b/templates/getting_started/benchmarks.md
index 4667b56094..5fc178a29f 100644
--- a/templates/getting_started/benchmarks.md
+++ b/templates/getting_started/benchmarks.md
@@ -6,7 +6,7 @@ We benchmark the three backends of Keras 3
alongside native PyTorch implementations ([HuggingFace](https://huggingface.co/)
and [Meta Research](https://github.com/facebookresearch/)) and alongside Keras 2
with TensorFlow. Find code and setup details for reproducing our results
-[here](https://github.com/haifeng-jin/keras-benchmarks/tree/v0.0.1).
+[here](https://github.com/haifeng-jin/keras-benchmarks/tree/v0.0.2).
## Models
@@ -75,18 +75,18 @@ better.
| | Batch<br>size | Native<br>PyTorch | Keras 2<br>(TensorFlow) | Keras 3<br>(TensorFlow) | Keras 3<br>(JAX) | Keras 3<br>(PyTorch) | Keras 3<br>(best) |
|:---:|---:|---:|---:|---:|---:|---:|---:|
-| **SegmentAnything<br>(fit)** | 1 | 1,306.85 | 386.93 | **355.25** | 361.69 | 1,388.87 | **355.25** |
-| **SegmentAnything<br>(predict)** | 7 | 2,733.90 | 3,187.09 | 762.67 | **660.16** | 2,973.64 | **660.16** |
-| **Stable Diffusion<br>(fit)** | 8 | 481.22 | 1,023.21 | 392.24 | **391.21** | 823.44 | **391.21** |
-| **Stable Diffusion<br>(predict)** | 13 | 775.36 | 649.71 | **616.04** | 627.27 | 1,337.17 | **616.04** |
-| **BERT<br>(fit)** | 54 | 1,137.57 | 841.84 | **404.17** | 414.26 | 1,320.41 | **404.17** |
-| **BERT<br>(predict)** | 531 | 3,837.65 | 965.21 | 962.11 | **865.29** | 3,869.72 | **865.29** |
+| **SegmentAnything<br>(fit)** | 1 | 1,310.97 | 386.93 | **355.25** | 361.69 | 1,388.87 | **355.25** |
+| **SegmentAnything<br>(predict)** | 7 | 2,733.65 | 3,187.09 | 762.67 | **660.16** | 2,973.64 | **660.16** |
+| **Stable Diffusion<br>(fit)** | 8 | 484.56 | 1,023.21 | 392.24 | **391.21** | 823.44 | **391.21** |
+| **Stable Diffusion<br>(predict)** | 13 | 759.05 | 649.71 | **616.04** | 627.27 | 1,337.17 | **616.04** |
+| **BERT<br>(fit)** | 32 | 214.73 | 486.00 | **214.49** | 222.37 | 808.68 | **214.49** |
+| **BERT<br>(predict)** | 256 | 739.46 | 470.12 | 466.01 | **418.72** | 1,865.98 | **418.72** |
| **Gemma<br>(fit)** | 8 | 253.95 | NA | **232.52** | 273.67 | 525.15 | **232.52** |
-| **Gemma<br>(generate)** | 32 | 2,717.04 | NA | 1,134.91 | **1,128.21** | 7,952.67* | **1,128.21** |
-| **Gemma<br>(generate)** | 1 | 1,632.66 | NA | 758.57 | **703.46** | 7,649.40* | **703.46** |
+| **Gemma<br>(generate)** | 32 | 2,735.18 | NA | 1,134.91 | **1,128.21** | 7,952.67* | **1,128.21** |
+| **Gemma<br>(generate)** | 1 | 1,618.85 | NA | 758.57 | **703.46** | 7,649.40* | **703.46** |
| **Mistral<br>(fit)** | 8 | 217.56 | NA | **185.92** | 213.22 | 452.12 | **185.92** |
-| **Mistral<br>(generate)** | 32 | 1,594.65 | NA | 966.06 | **957.25** | 10,932.59* | **957.25** |
-| **Mistral<br>(generate)** | 1 | 1,532.63 | NA | 743.28 | **679.30** | 11,054.67* | **679.30** |
+| **Mistral<br>(generate)** | 32 | 1,633.50 | NA | 966.06 | **957.25** | 10,932.59* | **957.25** |
+| **Mistral<br>(generate)** | 1 | 1,554.79 | NA | 743.28 | **679.30** | 11,054.67* | **679.30** |
\* _LLM inference with the PyTorch backend is abnormally slow at this time
because KerasNLP uses static sequence padding, unlike HuggingFace. This will be
@@ -112,13 +112,13 @@ the throughput (steps/ms) increase of Keras 3 over native PyTorch from Table 2.
A 100% increase indicates Keras 3 is twice as fast, while 0% means both
frameworks perform equally.
-
+
**Figure 1**: Keras 3 speedup over PyTorch measured in throughput (steps/ms)
Keras 3 with the best-performing backend outperformed the reference native
PyTorch implementations for all the models. Notably, 5 out of 10 tasks
-demonstrated speedups exceeding 100%, with a maximum speedup of 340%.
+demonstrated speedups exceeding 50%, with a maximum speedup of 340%.
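To make the arithmetic behind these percentages concrete: throughput is the inverse of the per-step time, so the speedup is the ratio of the two times minus one. A minimal sketch, using the updated SegmentAnything (predict) row from the table above (illustrative only, not the benchmark harness):

```python
# Throughput (steps/ms) is 1 / t for a per-step time t (ms/step), so the
# throughput increase of Keras 3 over native PyTorch is t_ref / t_keras - 1.
t_pytorch = 2733.65  # native PyTorch, ms/step (SegmentAnything predict)
t_keras3 = 660.16    # Keras 3 best backend (JAX), ms/step

speedup_pct = (t_pytorch / t_keras3 - 1) * 100
print(f"{speedup_pct:.0f}%")  # ~314%, i.e. Keras 3 is ~4.1x as fast here
```

A result of 0% would mean the two frameworks tie, matching the definition above.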
### Key Finding 3: Keras 3 delivers best-in-class "out-of-the-box" performance
@@ -149,7 +149,7 @@ We also calculated the throughput (steps/ms) increase of Keras 3 (using its
best-performing backend) over Keras 2 with TensorFlow from Table 1. Results are
shown in the following figure.
-
+
**Figure 2**: Keras 3 speedup over Keras 2 measured in throughput (steps/ms)
From 70a09c9f86b49d3131e50937bac2bba4476b6642 Mon Sep 17 00:00:00 2001
From: Haifeng Jin <5476582+haifeng-jin@users.noreply.github.com>
Date: Fri, 15 Mar 2024 22:56:37 +0000
Subject: [PATCH 2/3] update
---
templates/getting_started/benchmarks.md | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/templates/getting_started/benchmarks.md b/templates/getting_started/benchmarks.md
index 5fc178a29f..2b5dfa625e 100644
--- a/templates/getting_started/benchmarks.md
+++ b/templates/getting_started/benchmarks.md
@@ -34,9 +34,10 @@ PyTorch backend.
We employed synthetic data for all benchmarks. We used `bfloat16` precision for
all LLM training and inferencing, and LoRA<sup>6</sup> for all LLM training
-(fine-tuning). Additionally, we applied `torch.compile()` to compatible native
-PyTorch implementations (with the exception of Gemma training and Mistral
-training due to incompatibility).
+(fine-tuning). Based on the recommendations of the PyTorch team, we used
+`torch.compile(model, mode="reduce-overhead")` to compatible native PyTorch
+implementations (with the exception of Gemma training and Mistral training due
+to incompatibility).
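For readers who want to reproduce this setting, a minimal sketch of the compilation call follows (the model below is a stand-in for illustration; the benchmarks compile the actual SAM, Stable Diffusion, BERT, and LLM implementations):

```python
import torch

# Stand-in model for illustration only.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

# "reduce-overhead" spends extra compile time (and uses CUDA graphs when
# available) to cut per-step launch overhead.
compiled_model = torch.compile(model, mode="reduce-overhead")

x = torch.randn(32, 128)
out = compiled_model(x)  # first call compiles; later calls reuse the result
```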
To measure out-of-the-box performance, we use high-level APIs (e.g. `Trainer()`
from HuggingFace, plain PyTorch training loops and Keras `model.fit()`) with as
From 754d5ea2d05787309362ae820a4c54ef651d4339 Mon Sep 17 00:00:00 2001
From: François Chollet
Date: Fri, 15 Mar 2024 16:09:17 -0700
Subject: [PATCH 3/3] Update benchmarks.md
---
templates/getting_started/benchmarks.md | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/templates/getting_started/benchmarks.md b/templates/getting_started/benchmarks.md
index 2b5dfa625e..10f329420f 100644
--- a/templates/getting_started/benchmarks.md
+++ b/templates/getting_started/benchmarks.md
@@ -35,7 +35,7 @@ PyTorch backend.
We employed synthetic data for all benchmarks. We used `bfloat16` precision for
all LLM training and inferencing, and LoRA<sup>6</sup> for all LLM training
(fine-tuning). Based on the recommendations of the PyTorch team, we used
-`torch.compile(model, mode="reduce-overhead")` to compatible native PyTorch
+`torch.compile(model, mode="reduce-overhead")` with native PyTorch
implementations (with the exception of Gemma training and Mistral training due
to incompatibility).
@@ -191,4 +191,4 @@ open models." The Keyword, Google (2024).
arXiv:2310.06825 (2023).
<sup>6</sup> Hu, Edward J., et al. "LoRA: Low-rank adaptation of large language
-models." ICLR (2022).
\ No newline at end of file
+models." ICLR (2022).