
Update release data (#491)
Signed-off-by: Wenxin Zhang <[email protected]>
Signed-off-by: Lv, Kaokao <[email protected]>
Co-authored-by: Lv, Kaokao <[email protected]>
VincyZhang and lkk12014402 authored Oct 19, 2023
1 parent ce3d24e commit 38dfeae
Showing 2 changed files with 10 additions and 77 deletions.
70 changes: 8 additions & 62 deletions docs/release_data.md
@@ -12,23 +12,19 @@ Validated Model Performance

3. [LLM Runtime (GGML-Compatible)](#llm-runtime-ggml-compatible)

-3.1 [LLama-7B-hf](#llama-7b-hf)
+3.1 [MPT-7B](#mpt-7b)

-3.2 [LLama2-7B-chat](#llama2-7b-chat)
+3.2 [GPT-j-6B](#gpt-j-6b)

-3.3 [MPT-7B](#mpt-7b)
+3.3 [Falcon-7B](#falcon-7b)

-3.4 [GPT-j-6B](#gpt-j-6b)
+3.4 [GPT-NEOX-20B](#gpt-neox-20b)

-3.5 [Falcon-7B](#falcon-7b)
+3.5 [Dolly-V2-3B](#dolly-v2-3b)

-3.6 [GPT-NEOX-20B](#gpt-neox-20b)
+3.6 [OPT-1.3B](#opt-13b)

-3.7 [Dolly-V2-3B](#dolly-v2-3b)
-
-3.8 [OPT-1.3B](#opt-13b)
-
-3.9 [StarCoder-3B](#starcoder-3b)
+3.7 [StarCoder-3B](#starcoder-3b)

4. [LLM Finetuning](#llm-finetuning)

@@ -58,8 +54,6 @@ Intel Neural Compressor: 2.3
| pytorch | bloom_7b1 | NeelNanda/pile-10k | 12.36 | 60.14% | 3.22 | 57.64% | 3.83 | 4.34% |
| pytorch | opt_2.7b | NeelNanda/pile-10k | 23.19 | 63.67% | 12.24 | 63.65% | 1.89 | 0.03% |
| pytorch | opt_6.7b | NeelNanda/pile-10k | 13.5 | 67.01% | 4.1 | 67.69% | 3.29 | \-1.00% |
-| pytorch | llama_13b | NeelNanda/pile-10k | 8.9 | 56.88% | 2.24 | 76.27% | 3.97 | \-25.42% |
-| pytorch | llama_7b | NeelNanda/pile-10k | 13.56 | 58.55% | 4.4 | 73.61% | 3.09 | \-20.46% |
| pytorch | gpt_j_6b | NeelNanda/pile-10k | 10.76 | 67.59% | 4.38 | 68.31% | 2.46 | \-1.05% |
| pytorch | flan_t5_large | samsum | 69.75 | 46.25 (rougeLsum) | 33.16 | 47.67 (rougeLsum) | 2.1 | \-2.99% |
| pytorch | gpt_neox_clm | wikitext | 1.47 | 4.04 (eval_loss) | 0.65 | 3.52 (eval_loss) | 2.27 | \-14.78% |
@@ -80,7 +74,6 @@ Pytorch: 2.0.1+cpu
| --------- | --------------------- | -------- | ------ | --------- | --------- | --------- | --------- | --------- | --------- | -------- |
| pytorch | gpt-neox-20b | 32 | 32 | 9283 (ms) | | | | | | |
| pytorch | dolly-v2-3b | 32 | 32 | 3191 (ms) | 3798 (ms) | 2689 (ms) | 1.19x | 1.41x | |
-| pytorch | llama-7b-hf | 32 | 32 | 1872 (ms) | 5402 (ms) | 2689 (ms) | 1935 (ms) | 2.89x | 2.01x | 2.79x |
| pytorch | gpt-j-6b-pruned | 32 | 32 | 4523 (ms) | 2421 (ms) | 1758 (ms) | 1.87x | 2.57x |
| pytorch | gpt-j-6b | 32 | 32 | 1658 (ms) | 4561 (ms) | 2429 (ms) | 1793 (ms) | 2.75x | 1.88x | 2.54x |

@@ -125,53 +118,6 @@ Pytorch: 2.0.1+cpu
Environment:
GCC / G++: 12.1.0
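
The Precision, Compute Type, and Group Size columns in the tables below are weight-only quantization settings: INT4 weights, INT8 compute, quantized in groups of 32 or 128 weights. As a minimal sketch of how such a model is loaded through LLM Runtime, assuming the project's Transformers-like Python API as of this release (the `WeightOnlyQuantConfig` name and its arguments may differ in other versions):

```
# Hedged sketch: INT4 weight-only quantization with INT8 compute and a
# group size of 128, matching one row of the tables below. Class and
# argument names are assumptions based on the
# intel_extension_for_transformers API around this release.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import (
    AutoModelForCausalLM,
    WeightOnlyQuantConfig,
)

model_name = "mosaicml/mpt-7b"  # any of the models benchmarked below
woq_config = WeightOnlyQuantConfig(
    compute_dtype="int8", weight_dtype="int4", group_size=128
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=woq_config, trust_remote_code=True
)

# Generate 32 new tokens, as in the Output=32 benchmark configuration.
input_ids = tokenizer("Once upon a time", return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0]))
```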

-### LLama-7B-hf
-
-| Backend | Input | Output | Cores/Instance | Precision | Compute Type | Group Size | Next Token(ms) | Memory mean used (Top 50%) MB | First Token(ms) | Total Latency(ms) | P90 Latency(ms) | P99 Latency(ms) |
-| ---------- | ----- | ------ | -------------- | --------- | ------------ | ---------- | -------------- | ----------------------------- | --------------- | ----------------- | --------------- | --------------- |
-| LLM Runtime | 32 | 32 | 32 | INT4 | INT8 | 128 | 27.2 | 4212 | 72.69 | 915 | 27.37 | 58.73 |
-| LLM Runtime | 1024 | 32 | 32 | INT4 | INT8 | 128 | 30.39 | 4495 | 3091 | 4033 | 30.75 | 2142 |
-| LLM Runtime | 32 | 32 | 48 | INT4 | INT8 | 128 | 24.41 | 4786 | 71.1 | 827 | 24.63 | 56.85 |
-| LLM Runtime | 1024 | 32 | 48 | INT4 | INT8 | 128 | 27.46 | 4751 | 2904 | 3755 | 27.56 | 2012 |
-| LLM Runtime | 32 | 32 | 56 | INT4 | INT8 | 128 | 24.84 | 4810 | 72.1 | 842.05 | 25.01 | 57.72 |
-| LLM Runtime | 1024 | 32 | 56 | INT4 | INT8 | 128 | 27.97 | 4790 | 2749 | 3616 | 28.09 | 1906 |
-| LLM Runtime | 32 | 32 | 32 | INT4 | INT8 | 32 | 28.95 | 4154 | 125.24 | 1022 | 29.04 | 95.51 |
-| LLM Runtime | 1024 | 32 | 32 | INT4 | INT8 | 32 | 32.49 | 4966 | 4645 | 5652 | 32.55 | 3215 |
-| LLM Runtime | 32 | 32 | 48 | INT4 | INT8 | 32 | 27.1 | 4780 | 113.01 | 953 | 27.28 | 86.56 |
-| LLM Runtime | 1024 | 32 | 48 | INT4 | INT8 | 32 | 30.51 | 4853 | 5077 | 6022 | 30.64 | 3513 |
-| LLM Runtime | 32 | 32 | 56 | INT4 | INT8 | 32 | 27.99 | 4808 | 121.65 | 989 | 28.15 | 92.8 |
-| LLM Runtime | 1024 | 32 | 56 | INT4 | INT8 | 32 | 31.22 | 4855 | 4805 | 5773 | 31.36 | 3326 |
-| GGML | 32 | 32 | 32 | INT4 | INT8 | 32 | 29.78 | 4035 | 426.64 | 1349 | 30.07 | 303 |
-| GGML | 1024 | 32 | 32 | INT4 | INT8 | 32 | 34.07 | 4789 | 14561 | 15617 | 34.57 | 10058 |
-| GGML | 32 | 32 | 48 | INT4 | INT8 | 32 | 27.37 | 4776 | 309.31 | 1157 | 27.52 | 222 |
-| GGML | 1024 | 32 | 48 | INT4 | INT8 | 32 | 30.6 | 4811 | 10653 | 11601 | 30.85 | 7360 |
-| GGML | 32 | 32 | 56 | INT4 | INT8 | 32 | 27.2 | 4803 | 282.86 | 1126 | 27.42 | 203 |
-| GGML | 1024 | 32 | 56 | INT4 | INT8 | 32 | 30.06 | 4827 | 9677 | 10609 | 30.24 | 6688 |
-
-
-### LLama2-7B-chat
-
-| Backend | Input | Output | Cores/Instance | Precision | Compute Type | Group Size | Next Token(ms) | Memory mean used (Top 50%) MB | First Token(ms) | Total Latency(ms) | P90 Latency(ms) | P99 Latency(ms) |
-| ---------- | ----- | ------ | -------------- | --------- | ------------ | ---------- | -------------- | ----------------------------- | --------------- | ----------------- | --------------- | --------------- |
-| LLM Runtime | 32 | 32 | 32 | INT4 | INT8 | 128 | 26.2 | 4320 | 103.62 | 915 | 26.38 | 79.73 |
-| LLM Runtime | 1024 | 32 | 32 | INT4 | INT8 | 128 | 30.74 | 4516 | 4102 | 5055 | 31.19 | 2842 |
-| LLM Runtime | 32 | 32 | 48 | INT4 | INT8 | 128 | 24.34 | 4772 | 91.57 | 846 | 24.64 | 70.99 |
-| LLM Runtime | 1024 | 32 | 48 | INT4 | INT8 | 128 | 27.52 | 4743 | 4575 | 5428 | 27.58 | 3166 |
-| LLM Runtime | 32 | 32 | 56 | INT4 | INT8 | 128 | 24.55 | 4784 | 95.24 | 856 | 24.75 | 73.58 |
-| LLM Runtime | 1024 | 32 | 56 | INT4 | INT8 | 128 | 27.7 | 4762 | 4185 | 5043 | 27.82 | 2896 |
-| LLM Runtime | 32 | 32 | 32 | INT4 | INT8 | 32 | 29.37 | 4163 | 130 | 1040 | 29.55 | 98.94 |
-| LLM Runtime | 1024 | 32 | 32 | INT4 | INT8 | 32 | 32.89 | 4952 | 4812 | 5831 | 33.06 | 3330 |
-| LLM Runtime | 32 | 32 | 48 | INT4 | INT8 | 32 | 27.66 | 4771 | 113 | 970 | 28.54 | 86.89 |
-| LLM Runtime | 1024 | 32 | 48 | INT4 | INT8 | 32 | 31.24 | 4857 | 9884 | 10852 | 31.33 | 6829 |
-| LLM Runtime | 32 | 32 | 56 | INT4 | INT8 | 32 | 27.87 | 4782 | 120.6 | 984 | 28.1 | 92.17 |
-| LLM Runtime | 1024 | 32 | 56 | INT4 | INT8 | 32 | 31.13 | 4819 | 4507 | 5472 | 31.24 | 3119 |
-| GGML | 32 | 32 | 32 | INT4 | INT8 | 32 | 32.2 | 3970 | 426.65 | 1424 | 32.49 | 304.4 |
-| GGML | 1024 | 32 | 32 | INT4 | INT8 | 32 | 33.99 | 4774 | 14457 | 15511 | 34.2 | 9986 |
-| GGML | 32 | 32 | 48 | INT4 | INT8 | 32 | 28.14 | 4768 | 309.57 | 1181 | 28.38 | 222.4 |
-| GGML | 1024 | 32 | 48 | INT4 | INT8 | 32 | 31.49 | 4786 | 16512 | 17488 | 31.65 | 11404 |
-| GGML | 32 | 32 | 56 | INT4 | INT8 | 32 | 27.71 | 4779 | 283.91 | 1142 | 27.89 | 204.6 |
-| GGML | 1024 | 32 | 56 | INT4 | INT8 | 32 | 31.53 | 4780 | 15819 | 16797 | 31.59 | 10926 |


### MPT-7B

@@ -361,4 +307,4 @@ PyTorch: 2.0.1+cpu
| --------- | ----------- | ----------------- | ------------- | ----- | --- | --------- | ---- | --------------- | ------- | ---------- | ---------- | -------------------- | ----------------- | ------------- |
| PyTorch | 4096 | 13K | Yes | 1 | 1 | BF16 | Yes | 8/16 | 3 | 3.2 Hour | 9.6 Hours | 0.30/0.45 | 128 | 1.00E-04 |
| PyTorch | 4096 | 13K | Yes | 2 | 2 | BF16 | Yes | 8/16 | 3 | 1.2 Hour | 3.6 Hours | 0.30/0.45 | 128 | 1.00E-04 |
-| PyTorch | 4096 | 13K | Yes | 4 | 2 | BF16 | Yes | 8/16 | 3 | 0.67 Hour | 2 Hours | 0.30/0.45 | 128 | 1.00E-04 |
\ No newline at end of file
+| PyTorch | 4096 | 13K | Yes | 4 | 2 | BF16 | Yes | 8/16 | 3 | 0.67 Hour | 2 Hours | 0.30/0.45 | 128 | 1.00E-04 |
17 changes: 2 additions & 15 deletions workflows/chatbot/fine_tuning/dpo_pipeline/README.md
@@ -17,13 +17,13 @@ We select 12k examples from [Orca](https://arxiv.org/abs/2306.02707) style datas
## 3. Training

```
python dpo_clm.py --model_name_or_path "meta-llama/Llama-2-7b-hf" --output_dir "llama2_7b-dpo" --per_device_train_batch_size 1 --gradient_accumulation_steps 8 --learning_rate 5e-4 --max_steps 1000 --save_steps 10 --lora_alpha 16 --lora_rank 16 --lora_dropout 0.05 --dataset_name Intel/orca_dpo_pairs --bf16 --use_auth_token True
python dpo_clm.py --model_name_or_path "mosaicml/mpt-7b" --output_dir "mpt_7b-dpo" --per_device_train_batch_size 1 --gradient_accumulation_steps 8 --learning_rate 5e-4 --max_steps 1000 --save_steps 10 --lora_alpha 16 --lora_rank 16 --lora_dropout 0.05 --dataset_name Intel/orca_dpo_pairs --bf16 --use_auth_token True
```
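
Note: with `--per_device_train_batch_size 1` and `--gradient_accumulation_steps 8`, the effective batch size is 8 per device, and the `--lora_*` flags apply DPO through a LoRA adapter (rank 16, alpha 16, dropout 0.05) rather than updating all model weights, which keeps memory use modest.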


## 4. Evaluation

-We verify DPO training on our finetuned `mpt-7b` model [Intel/neural-chat-7b-v1-1](https://huggingface.co/Intel/neural-chat-7b-v1-1), our finetuned `llama-2-7b` model, and a finetuned `llama-2-7b` model [pankajmathur/orca_mini_v3_7b](https://huggingface.co/pankajmathur/orca_mini_v3_7b) that scores relatively high on the [open_llm_leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), which shows that DPO can significantly improve model performance. The evaluation metrics are the same as those of the [open_llm_leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), which uses the [Eleuther AI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/master), a unified framework for testing generative language models on a large number of different evaluation tasks.
+We verify DPO training on our finetuned `mpt-7b` model [Intel/neural-chat-7b-v1-1](https://huggingface.co/Intel/neural-chat-7b-v1-1). The evaluation metrics are the same as those of the [open_llm_leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), which uses the [Eleuther AI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/master), a unified framework for testing generative language models on a large number of different evaluation tasks.
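
As a hedged illustration of such a run, a single leaderboard task can be scored with the harness CLI; the flags below follow the master-branch interface current around this commit, and the few-shot count matches the 25-shot ARC column (task names and flags may differ in newer harness versions):

```
python main.py \
  --model hf-causal \
  --model_args pretrained=Intel/neural-chat-7b-v1-1,trust_remote_code=True \
  --tasks arc_challenge \
  --num_fewshot 25 \
  --batch_size 8
```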

#### mpt architecture
| Model | Average ⬆️| ARC (25-s) ⬆️ | HellaSwag (10-s) ⬆️ | MMLU (5-s) ⬆️| TruthfulQA (MC) (0-s) ⬆️ | Evaluation by |
@@ -34,16 +34,3 @@ We verify DPO training on our finetuned `mpt-7b` model [Intel/neural-chat-7b-v1-
| **[Intel/neural-chat-7b-v1-1](https://huggingface.co/Intel/neural-chat-7b-v1-1) with DPO** | **52.39** | 51.54 | 76.45 | 39.47 | 42.10 | ours |


-#### llama-2 architecture
-
-| Model | Average ⬆️ | ARC (25-s) ⬆️ | HellaSwag (10-s) ⬆️ | MMLU (5-s) ⬆️ | TruthfulQA (MC) (0-s) ⬆️ | Evaluation by |
-| --- | --- | --- | --- | --- | --- | --- |
-| [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) | 54.275 | 52.90 | 78.63 | 46.61 | 38.96 | ours |
-| [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) | 55.81 | 53.50 | 78.60 | 46.53 | 44.60 | ours |
-| Our Finetuned | **57.4** | 54.78 | 78.77 | 51.2 | 44.85 | ours |
-| **Our Finetuned with DPO** | **59.58** | 57.34 | 78.61 | 50.8 | 51.6 | ours |
-| [pankajmathur/orca_mini_v3_7b](https://huggingface.co/pankajmathur/orca_mini_v3_7b) | **59.86** | 56.91 | 79.64 | 52.37 | 50.51 | [open_llm_leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) |
-| **[pankajmathur/orca_mini_v3_7b](https://huggingface.co/pankajmathur/orca_mini_v3_7b) with DPO** | **60.92** | 59.22 | 79.92 | 51.84 | 52.71 | ours |


