From 6737ff832893d2145dfc8cdeb1611baca7e53f87 Mon Sep 17 00:00:00 2001 From: Haifeng Jin <5476582+haifeng-jin@users.noreply.github.com> Date: Fri, 15 Mar 2024 22:21:26 +0000 Subject: [PATCH 1/3] update results as we changed torch compile to optimal setting --- templates/getting_started/benchmarks.md | 28 ++++++++++++------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/templates/getting_started/benchmarks.md b/templates/getting_started/benchmarks.md index 4667b56094..5fc178a29f 100644 --- a/templates/getting_started/benchmarks.md +++ b/templates/getting_started/benchmarks.md @@ -6,7 +6,7 @@ We benchmark the three backends of Keras 3 alongside native PyTorch implementations ([HuggingFace](https://huggingface.co/) and [Meta Research](https://github.com/facebookresearch/)) and alongside Keras 2 with TensorFlow. Find code and setup details for reproducing our results -[here](https://github.com/haifeng-jin/keras-benchmarks/tree/v0.0.1). +[here](https://github.com/haifeng-jin/keras-benchmarks/tree/v0.0.2). ## Models @@ -75,18 +75,18 @@ better. | | Batch
size | Native
PyTorch | Keras 2
(TensorFlow) | Keras 3
(TensorFlow) | Keras 3
(JAX) | Keras 3
(PyTorch) | Keras 3
(best) | |:---:|---:|---:|---:|---:|---:|---:|---:| -| **SegmentAnything
(fit)** | 1 | 1,306.85 | 386.93 | **355.25** | 361.69 | 1,388.87 | **355.25** | -| **SegmentAnything
(predict)** | 7 | 2,733.90 | 3,187.09 | 762.67 | **660.16** | 2,973.64 | **660.16** | -| **Stable Diffusion
(fit)** | 8 | 481.22 | 1,023.21 | 392.24 | **391.21** | 823.44 | **391.21** | -| **Stable Diffusion
(predict)** | 13 | 775.36 | 649.71 | **616.04** | 627.27 | 1,337.17 | **616.04** | -| **BERT
(fit)** | 54 | 1,137.57 | 841.84 | **404.17** | 414.26 | 1,320.41 | **404.17** | -| **BERT
(predict)** | 531 | 3,837.65 | 965.21 | 962.11 | **865.29** | 3,869.72 | **865.29** | +| **SegmentAnything
(fit)** | 1 | 1,310.97 | 386.93 | **355.25** | 361.69 | 1,388.87 | **355.25** | +| **SegmentAnything
(predict)** | 7 | 2,733.65 | 3,187.09 | 762.67 | **660.16** | 2,973.64 | **660.16** | +| **Stable Diffusion
(fit)** | 8 | 484.56 | 1,023.21 | 392.24 | **391.21** | 823.44 | **391.21** | +| **Stable Diffusion
(predict)** | 13 | 759.05 | 649.71 | **616.04** | 627.27 | 1,337.17 | **616.04** | +| **BERT
(fit)** | 32 | 214.73 | 486.00 | **214.49** | 222.37 | 808.68 | **214.49** | +| **BERT
(predict)** | 256 | 739.46 | 470.12 | 466.01 | **418.72** | 1,865.98 | **418.72** | | **Gemma
(fit)** | 8 | 253.95 | NA | **232.52** | 273.67 | 525.15 | **232.52** | -| **Gemma
(generate)** | 32 | 2,717.04 | NA | 1,134.91 | **1,128.21** | 7,952.67* | **1,128.21** | -| **Gemma
(generate)** | 1 | 1,632.66 | NA | 758.57 | **703.46** | 7,649.40* | **703.46** | +| **Gemma
(generate)** | 32 | 2,735.18 | NA | 1,134.91 | **1,128.21** | 7,952.67* | **1,128.21** | +| **Gemma
(generate)** | 1 | 1,618.85 | NA | 758.57 | **703.46** | 7,649.40* | **703.46** | | **Mistral
(fit)** | 8 | 217.56 | NA | **185.92** | 213.22 | 452.12 | **185.92** | -| **Mistral
(generate)** | 32 | 1,594.65 | NA | 966.06 | **957.25** | 10,932.59* | **957.25** | -| **Mistral
(generate)** | 1 | 1,532.63 | NA | 743.28 | **679.30** | 11,054.67* | **679.30** | +| **Mistral
(generate)** | 32 | 1,633.50 | NA | 966.06 | **957.25** | 10,932.59* | **957.25** | +| **Mistral
(generate)** | 1 | 1,554.79 | NA | 743.28 | **679.30** | 11,054.67* | **679.30** | \* _LLM inference with the PyTorch backend is abnormally slow at this time because KerasNLP uses static sequence padding, unlike HuggingFace. This will be @@ -112,13 +112,13 @@ the throughput (steps/ms) increase of Keras 3 over native PyTorch from Table 2. A 100% increase indicates Keras 3 is twice as fast, while 0% means both frameworks perform equally. -![Figure 1](https://i.imgur.com/3s3RZOx.png) +![Figure 1](https://i.imgur.com/tFeLTbz.png) **Figure 1**: Keras 3 speedup over PyTorch measured in throughput (steps/ms) Keras 3 with the best-performing backend outperformed the reference native PyTorch implementations for all the models. Notably, 5 out of 10 tasks -demonstrated speedups exceeding 100%, with a maximum speedup of 340%. +demonstrated speedups exceeding 50%, with a maximum speedup of 340%. ### Key Finding 3: Keras 3 delivers best-in-class "out-of-the-box" performance @@ -149,7 +149,7 @@ We also calculated the throughput (steps/ms) increase of Keras 3 (using its best-performing backend) over Keras 2 with TensorFlow from Table 1. Results are shown in the following figure. -![Figure 2](https://i.imgur.com/BUjRUK1.png) +![Figure 2](https://i.imgur.com/lBAPgsY.png) **Figure 2**: Keras 3 speedup over Keras 2 measured in throughput (steps/ms) From 70a09c9f86b49d3131e50937bac2bba4476b6642 Mon Sep 17 00:00:00 2001 From: Haifeng Jin <5476582+haifeng-jin@users.noreply.github.com> Date: Fri, 15 Mar 2024 22:56:37 +0000 Subject: [PATCH 2/3] update --- templates/getting_started/benchmarks.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/templates/getting_started/benchmarks.md b/templates/getting_started/benchmarks.md index 5fc178a29f..2b5dfa625e 100644 --- a/templates/getting_started/benchmarks.md +++ b/templates/getting_started/benchmarks.md @@ -34,9 +34,10 @@ PyTorch backend. We employed synthetic data for all benchmarks.
We used `bfloat16` precision for all LLM training and inferencing, and LoRA<sup>6</sup> for all LLM training -(fine-tuning). Additionally, we applied `torch.compile()` to compatible native -PyTorch implementations (with the exception of Gemma training and Mistral -training due to incompatibility). +(fine-tuning). Based on the recommendations of the PyTorch team, we used +`torch.compile(model, mode="reduce-overhead")` to compatible native PyTorch +implementations (with the exception of Gemma training and Mistral training due +to incompatibility). To measure out-of-the-box performance, we use high-level APIs (e.g. `Trainer()` from HuggingFace, plain PyTorch training loops and Keras `model.fit()`) with as From 754d5ea2d05787309362ae820a4c54ef651d4339 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Fran=C3=A7ois=20Chollet?= Date: Fri, 15 Mar 2024 16:09:17 -0700 Subject: [PATCH 3/3] Update benchmarks.md --- templates/getting_started/benchmarks.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/templates/getting_started/benchmarks.md b/templates/getting_started/benchmarks.md index 2b5dfa625e..10f329420f 100644 --- a/templates/getting_started/benchmarks.md +++ b/templates/getting_started/benchmarks.md @@ -35,7 +35,7 @@ PyTorch backend. We employed synthetic data for all benchmarks. We used `bfloat16` precision for all LLM training and inferencing, and LoRA<sup>6</sup> for all LLM training (fine-tuning). Based on the recommendations of the PyTorch team, we used -`torch.compile(model, mode="reduce-overhead")` to compatible native PyTorch +`torch.compile(model, mode="reduce-overhead")` with native PyTorch implementations (with the exception of Gemma training and Mistral training due to incompatibility). @@ -191,4 +191,4 @@ open models." The Keyword, Google (2024). arXiv:2310.06825 (2023). <sup>6</sup> Hu, Edward J., et al. "Lora: Low-rank adaptation of large language -models." ICLR (2022). \ No newline at end of file +models." ICLR (2022).
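The `torch.compile` setting the patches settle on can be reproduced in a few lines. A minimal sketch, using a toy module as a stand-in (the benchmarks compile the actual HuggingFace / Meta Research models, so the model definition here is purely illustrative):

```python
import torch
import torch.nn as nn

# Toy stand-in model; illustrative only. The benchmarks apply the same
# call to the real HuggingFace / Meta Research implementations.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 8))

# mode="reduce-overhead" trades extra compile time (and CUDA graphs,
# where available) for lower per-step framework overhead at run time,
# which matters most for the small-batch rows in the tables above.
compiled = torch.compile(model, mode="reduce-overhead")

# Compilation is lazy: the first call, e.g. compiled(torch.randn(4, 32)),
# triggers tracing and codegen, and the result is cached for later steps.
```

Note that the benchmarks skip this call for Gemma and Mistral training, where it was found to be incompatible.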
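The speedup percentages plotted in Figures 1 and 2 follow directly from the step times in the tables: throughput is proportional to 1 / (ms/step), so the increase is `t_baseline / t_keras3 - 1`, and a 100% increase means twice as fast. A minimal sketch, using two of the updated rows from Table 2:

```python
def speedup_pct(t_baseline_ms: float, t_keras3_ms: float) -> float:
    """Percent throughput increase of Keras 3 over a baseline.

    Both arguments are per-step latencies in ms; throughput is
    proportional to 1 / latency.
    """
    return (t_baseline_ms / t_keras3_ms - 1.0) * 100.0

# SegmentAnything (predict): native PyTorch 2,733.65 ms vs Keras 3 best 660.16 ms
print(round(speedup_pct(2733.65, 660.16)))  # → 314

# Gemma (generate, batch 1): native PyTorch 1,618.85 ms vs Keras 3 best 703.46 ms
print(round(speedup_pct(1618.85, 703.46)))  # → 130
```

The same formula applied to the Keras 2 column reproduces Figure 2.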