Simplified Chinese (简体中文)
New project: AI-Enhancement-Filter powered by onnx-tool

onnx-tool

A tool for ONNX model:

Supported Models:


Build LLM model and profile

Profile 10 Hugging Face models within one second. Saving the ONNX models is as simple as it is with llama.cpp. code ref

| Model name (1k input) | MACs (G) | Parameters (G) | KV Cache (G) |
|---|---|---|---|
| gpt-j-6b | 6277 | 6.05049 | 0.234881 |
| yi-1.5-34B | 35862 | 34.3889 | 0.125829 |
| microsoft/phi-2 | 2948 | 2.77944 | 0.167772 |
| Phi-3-mini-4k | 4083 | 3.82108 | 0.201327 |
| Phi-3-small-8k-instruct | 7912 | 7.80167 | 0.0671089 |
| Phi-3-medium-4k-instruct | 14665 | 13.9602 | 0.104858 |
| Llama3-8B | 8029 | 8.03026 | 0.0671089 |
| Llama-3.1-70B-Japanese-Instruct-2407 | 72888 | 70.5537 | 0.167772 |
| QWen-7B | 7509 | 7.61562 | 0.0293601 |
| Qwen2_72B_Instruct | 74895 | 72.7062 | 0.167772 |
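The KV Cache column can be cross-checked by hand. The sketch below assumes the column counts cached elements (not bytes) for a 1024-token input, and it uses layer and head counts taken from the public model configurations rather than from this README; the function name is illustrative only.

```python
def kv_cache_elements(num_layers, num_kv_heads, head_dim, seq_len):
    """Number of cached key/value elements for one sequence."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len


# Assumed configurations (from the public model cards, not from this README).
gptj_6b = kv_cache_elements(num_layers=28, num_kv_heads=16, head_dim=256, seq_len=1024)
llama3_8b = kv_cache_elements(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=1024)

print(gptj_6b)    # 234881024
print(llama3_8b)  # 67108864
```

Divided by 1e9, these counts agree with the 0.234881 and 0.0671089 entries in the table. Multiply by the bytes per element (for example 2 for a 16-bit KV cache) to get a size in bytes.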

Get first-token latency and next-token latency from hardware specs.

| model_type_4bit_kv16bit | memory_size(GB) | Ultra-155H_first_latency | Ultra-155H_next_latency | Arc-A770_first_latency | Arc-A770_next_latency | H100-PCIe_first_latency | H100-PCIe_next_latency |
|---|---|---|---|---|---|---|---|
| gpt-j-6b | 3.75678 | 1.0947 | 0.041742 | 0.0916882 | 0.00670853 | 0.0164015 | 0.00187839 |
| yi-1.5-34B | 19.3369 | 5.77095 | 0.214854 | 0.45344 | 0.0345302 | 0.0747854 | 0.00966844 |
| microsoft/phi-2 | 1.82485 | 0.58361 | 0.0202761 | 0.0529628 | 0.00325866 | 0.010338 | 0.000912425 |
| Phi-3-mini-4k | 2.49649 | 0.811173 | 0.0277388 | 0.0745356 | 0.00445802 | 0.0147274 | 0.00124825 |
| Phi-3-small-8k-instruct | 4.2913 | 1.38985 | 0.0476811 | 0.117512 | 0.00766303 | 0.0212535 | 0.00214565 |
| Phi-3-medium-4k-instruct | 7.96977 | 2.4463 | 0.088553 | 0.198249 | 0.0142317 | 0.0340576 | 0.00398489 |
| Llama3-8B | 4.35559 | 1.4354 | 0.0483954 | 0.123333 | 0.00777784 | 0.0227182 | 0.00217779 |
| Llama-3.1-70B-Japanese-Instruct-2407 | 39.4303 | 11.3541 | 0.438114 | 0.868475 | 0.0704112 | 0.137901 | 0.0197151 |
| QWen-7B | 4.03576 | 1.34983 | 0.0448417 | 0.11722 | 0.00720671 | 0.0218461 | 0.00201788 |
| Qwen2_72B_Instruct | 40.5309 | 11.6534 | 0.450343 | 0.890816 | 0.0723766 | 0.14132 | 0.0202654 |
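The next-token columns are consistent with a simple memory-bound estimate: the whole model memory is read once per generated token, so latency is roughly memory size divided by memory bandwidth. The sketch below is not the tool's code. The bandwidth values are assumptions inferred by dividing the table's memory sizes by its latencies, and the first-token (compute-bound) estimate is not reproduced here.

```python
def next_token_latency(memory_size_gb, bandwidth_gb_per_s):
    """Memory-bound estimate: every weight is read once per generated token."""
    return memory_size_gb / bandwidth_gb_per_s


# Assumed bandwidths in GB/s, inferred from the table above (not stated in it).
BANDWIDTH_GB_PER_S = {"Ultra-155H": 90.0, "Arc-A770": 560.0, "H100-PCIe": 2000.0}

gptj_memory_gb = 3.75678  # memory_size(GB) of gpt-j-6b from the table
estimates = {
    device: next_token_latency(gptj_memory_gb, bandwidth)
    for device, bandwidth in BANDWIDTH_GB_PER_S.items()
}

for device, latency in estimates.items():
    print(f"{device}: {latency:.6f}")
```

Under these assumptions the estimates agree with the gpt-j-6b row to within the rounding of the table's memory size.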

Basic Parse and Edit

You can load any ONNX file with onnx_tool.Model:
Change the graph structure with onnx_tool.Graph;
Change op attributes and IO tensors with onnx_tool.Node;
Change tensor data or type with onnx_tool.Tensor.
To apply your changes, call the save_model method of onnx_tool.Model or onnx_tool.Graph.

Please refer to benchmark/examples.py.


Shape Inference & Profile Model

All profiling data is built on the shape inference results.
ONNX graph with tensor shapes:

Regular model profiling table:



Sparse profiling table:



Introduction: data/Profile.md.
pytorch usage: data/PytorchUsage.md.
tensorflow usage: data/TensorflowUsage.md.
examples: benchmark/examples.py.
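To illustrate why profiling depends on shape inference, here is a minimal sketch for a single 2D convolution: the output shape has to be inferred first, and the MAC count follows from it. This is a generic textbook formula, not onnx-tool's implementation, and both function names are hypothetical.

```python
def conv2d_output_size(in_size, kernel, stride=1, pad=0, dilation=1):
    """Spatial output size of a convolution along one axis."""
    return (in_size + 2 * pad - dilation * (kernel - 1) - 1) // stride + 1


def conv2d_profile(batch, in_ch, out_ch, in_hw, kernel_hw, stride=1, pad=0, groups=1):
    """Return (output_shape, macs) for a 2D convolution without bias."""
    out_h = conv2d_output_size(in_hw[0], kernel_hw[0], stride, pad)
    out_w = conv2d_output_size(in_hw[1], kernel_hw[1], stride, pad)
    # Each output element needs (in_ch / groups) * kh * kw multiply-accumulates.
    macs_per_output = (in_ch // groups) * kernel_hw[0] * kernel_hw[1]
    macs = batch * out_ch * out_h * out_w * macs_per_output
    return (batch, out_ch, out_h, out_w), macs


# A 7x7, stride-2 convolution on a 1x3x224x224 input with 64 output channels.
shape, macs = conv2d_profile(1, 3, 64, (224, 224), (7, 7), stride=2, pad=3)
print(shape)  # (1, 64, 112, 112)
print(macs)   # 118013952
```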


Compute Graph with Shape Engine

From a raw graph to a compute graph:

Remove the shape calculation layers (created by ONNX export) to get a Compute Graph. Use the Shape Engine to update tensor shapes at runtime.
Examples: benchmark/shape_regress.py and benchmark/examples.py.
Integrate Compute Graph and Shape Engine into a cpp inference engine: data/inference_engine.md
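The idea of updating tensor shapes at runtime, without running shape calculation layers, can be sketched as follows. This is a toy illustration only: `TinyShapeEngine` and its methods are hypothetical names, not onnx-tool's Shape Engine API. Each tensor dimension is either a constant or a function of the input variables.

```python
class TinyShapeEngine:
    """Toy shape engine: each dimension is an int or a function of the variables."""

    def __init__(self):
        self.variables = {}
        self.tensor_dims = {}

    def add_tensor(self, name, dims):
        self.tensor_dims[name] = dims

    def update_variable(self, name, value):
        self.variables[name] = value

    def get_shape(self, name):
        # Evaluate symbolic dimensions against the current variable values.
        return [d(self.variables) if callable(d) else d for d in self.tensor_dims[name]]


def conv_out(var, kernel=7, stride=2, pad=3):
    """Build a dimension expression for a convolution output along one axis."""
    return lambda v: (v[var] + 2 * pad - kernel) // stride + 1


engine = TinyShapeEngine()
engine.add_tensor("input", [1, 3, lambda v: v["h"], lambda v: v["w"]])
engine.add_tensor("conv_out", [1, 64, conv_out("h"), conv_out("w")])

engine.update_variable("h", 224)
engine.update_variable("w", 224)
shape_224 = engine.get_shape("conv_out")

# A new input resolution only requires updating the variables.
engine.update_variable("h", 320)
engine.update_variable("w", 480)
shape_320x480 = engine.get_shape("conv_out")

print(shape_224)      # [1, 64, 112, 112]
print(shape_320x480)  # [1, 64, 160, 240]
```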


Memory Compression

Activation Compression

Activation memory, also called temporary memory, is created by each op's output. Only the final activations marked as model outputs need to be kept, so you don't have to reserve separate memory for every activation tensor. Instead, activations can reuse a shared, optimized memory pool.
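The reuse idea can be sketched with a small lifetime-based planner. This is a generic greedy-by-size heuristic shown for illustration, not the algorithm onnx-tool uses; the tensor names, sizes, and op indices are made up.

```python
def plan_offsets(tensors):
    """Assign pool offsets so tensors with overlapping lifetimes never overlap.

    tensors: list of (name, size, first_use, last_use), with op indices as lifetimes.
    Returns (offsets, peak_size).
    """
    placed = []  # (offset, size, first_use, last_use)
    offsets = {}
    # Place the largest tensors first.
    for name, size, first, last in sorted(tensors, key=lambda t: -t[1]):
        # Only blocks that are alive at the same time constrain this tensor.
        busy = sorted((o, s) for o, s, f, l in placed if not (l < first or last < f))
        offset = 0
        for o, s in busy:
            if offset + size <= o:
                break  # fits in the gap before this block
            offset = max(offset, o + s)
        offsets[name] = offset
        placed.append((offset, size, first, last))
    peak = max((o + s for o, s, _, _ in placed), default=0)
    return offsets, peak


# A chain of ops: each activation is produced at one op and consumed by the next.
activations = [
    ("a", 400, 0, 1),
    ("b", 100, 1, 2),
    ("c", 400, 2, 3),
    ("d", 50, 3, 3),  # model output
]
native = sum(size for _, size, _, _ in activations)
offsets, peak = plan_offsets(activations)
print(native, peak)  # 950 500
```

Here "a" and "c" share the same address range because "a" is dead before "c" is produced, which is where the saving comes from.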

For large language models and high-resolution CV models, activation memory compression is key to saving memory.
The compression method reduces activation memory to about 5% of its native size on most models.
For example:

| Model | Native Memory Size (MB) | Compressed Memory Size (MB) | Compression Ratio (%) |
|---|---|---|---|
| StableDiffusion(VAE_encoder) | 14,245 | 540 | 3.7 |
| StableDiffusion(VAE_decoder) | 25,417 | 1,140 | 4.48 |
| StableDiffusion(Text_encoder) | 215 | 5 | 2.5 |
| StableDiffusion(UNet) | 36,135 | 2,232 | 6.2 |
| GPT2 | 40 | 2 | 6.9 |
| BERT | 2,170 | 27 | 1.25 |

Code example: benchmark/compression.py

Weight Compression

An fp32 model with 7B parameters takes 28 GB of disk space and memory. You cannot even run the model if your device doesn't have that much memory, so weight compression is critical for running large language models. As a reference, a 7B model with int4 symmetric per-block (block size 32) quantization (llama.cpp's q4_0 quantization method) is only ~0.156x the size of the fp32 model.
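The figures above can be reproduced with a few lines of arithmetic. The sketch assumes each block of 32 int4 weights stores one 32-bit scale, which is the assumption under which the ratio comes out at ~0.156; a 16-bit scale would give a smaller ratio. The function name is illustrative.

```python
def bits_per_weight(weight_bits, block_size, scale_bits, zero_point_bits=0):
    """Average storage per weight, including the per-block scale (and zero point)."""
    return weight_bits + (scale_bits + zero_point_bits) / block_size


params = 7e9
fp32_size_gb = params * 32 / 8 / 1e9  # 32 bits per weight

# int4, symmetric (no zero point), block size 32, one 32-bit scale per block (assumed).
ratio_scale32 = bits_per_weight(4, 32, scale_bits=32) / 32
# Same scheme with a 16-bit scale per block, for comparison.
ratio_scale16 = bits_per_weight(4, 32, scale_bits=16) / 32

print(fp32_size_gb)   # 28.0
print(ratio_scale32)  # 0.15625
print(ratio_scale16)  # 0.140625
```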

Current support:

  • [fp16]
  • [int8]x[symmetric/asymmetric]x[per tensor/per channel/per block]
  • [int4]x[symmetric/asymmetric]x[per tensor/per channel/per block]

Code examples: benchmark/examples.py.
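As a rough illustration of the symmetric per-block case, the sketch below quantizes a flat list of weights to int4 in blocks of 32 using plain Python. It is a generic scheme (scale = max absolute value / 7) written for clarity; it is not onnx-tool's implementation and does not reproduce llama.cpp's exact q4_0 layout. All function names are hypothetical.

```python
def quantize_block_symmetric(block, bits=4):
    """Quantize one block to signed integers with a single shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for int4
    amax = max(abs(x) for x in block)
    scale = amax / qmax if amax > 0 else 1.0
    # Symmetric: zero maps to zero, so no zero point is stored.
    q = [max(-qmax - 1, min(qmax, round(x / scale))) for x in block]
    return q, scale


def dequantize_block(q, scale):
    return [v * scale for v in q]


def quantize_per_block(weights, block_size=32, bits=4):
    """Split a flat weight list into blocks and quantize each independently."""
    return [
        quantize_block_symmetric(weights[i:i + block_size], bits)
        for i in range(0, len(weights), block_size)
    ]


weights = [((i * 37) % 101 - 50) / 25.0 for i in range(64)]  # made-up values
blocks = quantize_per_block(weights)

restored = [x for q, scale in blocks for x in dequantize_block(q, scale)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(len(blocks), max_error)
```

The rounding error of each weight is bounded by half of its block's scale, which is why smaller blocks (more scales) trade size for accuracy.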


How to install

pip install onnx-tool

OR

pip install --upgrade git+https://github.com/ThanatosShinji/onnx-tool.git

python>=3.6

If pip install onnx-tool fails while installing onnx, try installing a lower version of onnx first, for example pip install onnx==1.8.1.
Then run pip install onnx-tool again.


Known Issues

  • Loop op is not supported
  • Sequence type is not supported

Results of ONNX Model Zoo and SOTA models

Some models have dynamic input shapes, and their MACs vary with the input shapes. The input shapes used for these results are written in data/public/config.py. These ONNX models, with all tensor shapes included, can be downloaded from: baidu drive(code: p91k) google drive

| Model | Params (M) | MACs (M) |
|---|---|---|
| GPT-J 1 layer | 464 | 173,398 |
| MPT 1 layer | 261 | 79,894 |
| text_encoder | 123.13 | 6,782 |
| UNet2DCondition | 859.52 | 888,870 |
| VAE_encoder | 34.16 | 566,371 |
| VAE_decoder | 49.49 | 1,271,959 |
| SqueezeNet 1.0 | 1.23 | 351 |
| AlexNet | 60.96 | 665 |
| GoogleNet | 6.99 | 1,606 |
| googlenet_age | 5.98 | 1,605 |
| LResNet100E-IR | 65.22 | 12,102 |
| BERT-Squad | 113.61 | 22,767 |
| BiDAF | 18.08 | 9.87 |
| EfficientNet-Lite4 | 12.96 | 1,361 |
| Emotion | 12.95 | 877 |
| Mask R-CNN | 46.77 | 92,077 |
| LLaMa 1 layer | 618 | 211,801 |
| BEVFormer Tiny | 33.7 | 210,838 |
| rvm_mobilenetv3 | 3.73 | 4,289 |
| yolov4 | 64.33 | 3,319 |
| ConvNeXt-L | 229.79 | 34,872 |
| edgenext_small | 5.58 | 1,357 |
| SSD | 19.98 | 216,598 |
| RealESRGAN | 16.69 | 73,551 |
| ShuffleNet | 2.29 | 146 |
| GPT-2 | 137.02 | 1,103 |
| T5-encoder | 109.62 | 686 |
| T5-decoder | 162.62 | 1,113 |
| RoBERTa-BASE | 124.64 | 688 |
| Faster R-CNN | 44.10 | 46,018 |
| FCN ResNet-50 | 35.29 | 37,056 |
| ResNet50 | 25 | 3,868 |