New project: AI-Enhancement-Filter powered by onnx-tool

onnx-tool

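A tool for ONNX models:

Build LLM models.
Parse and edit ONNX models: constant folding, op fusion.
Model profiling: tensor shape inference and per-op MACs statistics.
Compute Graph and Shape Engine.
Memory compression: compression of activation tensor memory and of weight memory.
Support for quantized and sparse models.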

Supported models include:

NLP: BERT, T5, GPT, LLaMa, MPT(TransformerModel)
Diffusion: Stable Diffusion(TextEncoder, VAE, UNET)
CV: Detic, BEVFormer, SSD300_VGG16, ...
Audio: sovits, LPCNet

Build and profile LLM models

Profile 10 Hugging Face models in under one second. Save each model in an ONNX format that is as simple as llama.cpp's. code ref

model name(1k input)	MACs(G)	Parameters(G)	KV Cache(G)
gpt-j-6b	6277	6.05049	0.234881
yi-1.5-34B	35862	34.3889	0.125829
microsoft/phi-2	2948	2.77944	0.167772
Phi-3-mini-4k	4083	3.82108	0.201327
Phi-3-small-8k-instruct	7912	7.80167	0.0671089
Phi-3-medium-4k-instruct	14665	13.9602	0.104858
Llama3-8B	8029	8.03026	0.0671089
Llama-3.1-70B-Japanese-Instruct-2407	72888	70.5537	0.167772
QWen-7B	7509	7.61562	0.0293601
Qwen2_72B_Instruct	74895	72.7062	0.167772
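
The KV Cache column can be reproduced by hand for some rows. The sketch below assumes the column counts cached K and V elements (in units of 1e9) for a 1024-token input, and it takes the layer count, KV head count, and head dimension from the publicly released model configs. Both are assumptions, not statements about how onnx-tool computes the figure internally.

```python
def kv_cache_elements(num_layers, num_kv_heads, head_dim, seq_len):
    """Number of cached K and V elements for one sequence."""
    # Factor 2: one K tensor and one V tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len


# Config values are assumptions taken from the public model configs.
llama3_8b = kv_cache_elements(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=1024)
gpt_j_6b = kv_cache_elements(num_layers=28, num_kv_heads=16, head_dim=256, seq_len=1024)

print("Llama3-8B KV cache (G elements):", llama3_8b / 1e9)
print("gpt-j-6b KV cache (G elements):", gpt_j_6b / 1e9)
```

Under these assumptions the two results agree with the Llama3-8B and gpt-j-6b rows to the digits shown in the table. Grouped-query attention explains why Llama3-8B caches far less than gpt-j-6b despite having more parameters.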

Quickly estimate each model's first-token latency and next-token latency from hardware parameters

model_type_4bit_kv16bit	memory_size(GB)	Ultra-155H_first_latency	Ultra-155H_next_latency	Arc-A770_first_latency	Arc-A770_next_latency	H100-PCIe_first_latency	H100-PCIe_next_latency
gpt-j-6b	3.75678	1.0947	0.041742	0.0916882	0.00670853	0.0164015	0.00187839
yi-1.5-34B	19.3369	5.77095	0.214854	0.45344	0.0345302	0.0747854	0.00966844
microsoft/phi-2	1.82485	0.58361	0.0202761	0.0529628	0.00325866	0.010338	0.000912425
Phi-3-mini-4k	2.49649	0.811173	0.0277388	0.0745356	0.00445802	0.0147274	0.00124825
Phi-3-small-8k-instruct	4.2913	1.38985	0.0476811	0.117512	0.00766303	0.0212535	0.00214565
Phi-3-medium-4k-instruct	7.96977	2.4463	0.088553	0.198249	0.0142317	0.0340576	0.00398489
Llama3-8B	4.35559	1.4354	0.0483954	0.123333	0.00777784	0.0227182	0.00217779
Llama-3.1-70B-Japanese-Instruct-2407	39.4303	11.3541	0.438114	0.868475	0.0704112	0.137901	0.0197151
QWen-7B	4.03576	1.34983	0.0448417	0.11722	0.00720671	0.0218461	0.00201788
Qwen2_72B_Instruct	40.5309	11.6534	0.450343	0.890816	0.0723766	0.14132	0.0202654
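
The next-token columns are consistent with a simple memory-bound estimate: memory size divided by memory bandwidth. The sketch below illustrates that reading. The bandwidth figures are assumptions recovered by dividing memory_size by next_latency in the table, and the result is in seconds only if the table's latencies are in seconds. It does not attempt to reproduce the first-token columns.

```python
# Memory bandwidth in GB/s. Assumed values: they are what the table implies
# when memory_size(GB) is divided by the next_latency column.
BANDWIDTH_GBPS = {"Ultra-155H": 90.0, "Arc-A770": 560.0, "H100-PCIe": 2000.0}


def next_token_latency(memory_size_gb, bandwidth_gbps):
    """Time per decoded token if every step must read all weights once."""
    return memory_size_gb / bandwidth_gbps


gpt_j_memory_gb = 3.75678  # memory_size(GB) of gpt-j-6b, copied from the table

for device, bandwidth in BANDWIDTH_GBPS.items():
    print(device, next_token_latency(gpt_j_memory_gb, bandwidth))
```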

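Parse and edit

You can load any ONNX model with the onnx_tool.Model class, which turns it into an easy-to-edit Python object. You can then:
use the onnx_tool.Graph class to change the graph structure;
use the onnx_tool.Node class to change each op's attributes and its input and output tensors;
use onnx_tool.Tensor to change the data type and contents of any tensor.
When you are done, call the save_model method of the Graph or Model class to save all changes to a new ONNX model.

See benchmark/examples.py.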

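Shape inference and model profiling

Every profiling report is based on a specific input tensor shape, so shape inference must be run once before profiling a model.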

Floating-point multiply-accumulate count (MACs; 1 MAC = 2 FLOPs), memory usage (bytes), parameter count (number of elements)
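
As a worked illustration of these three quantities, the sketch below counts them by hand for a single convolution layer. It uses the common convention that one multiply-accumulate is two floating-point operations and it ignores the bias; whether onnx-tool counts bias the same way is not stated here, so treat the numbers as illustrative.

```python
def conv2d_macs(out_h, out_w, out_c, in_c, k_h, k_w, groups=1):
    """Multiply-accumulates of a 2D convolution, bias excluded."""
    return out_h * out_w * out_c * (in_c // groups) * k_h * k_w


def matmul_macs(m, k, n):
    """Multiply-accumulates of an (m x k) @ (k x n) matrix product."""
    return m * k * n


# A 7x7 convolution from 3 to 64 channels that produces a 112x112 output map.
macs = conv2d_macs(out_h=112, out_w=112, out_c=64, in_c=3, k_h=7, k_w=7)
flops = 2 * macs                 # one multiply plus one add per MAC
params = 64 * 3 * 7 * 7          # weight elements, bias excluded
weight_bytes = params * 4        # assuming float32 storage

print(macs, flops, params, weight_bytes)
```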

Sparse block shape, block sparsity ratio (fraction of blocks that are entirely zero), parameter sparsity ratio (fraction of values that are zero)
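
The difference between the two sparsity ratios is easiest to see on a small matrix. The following is a minimal pure-Python sketch of the definitions above; the function name and the list-of-lists weight layout are hypothetical and are not part of onnx-tool.

```python
def sparsity_stats(weight, block_h, block_w):
    """Return (block_sparsity, element_sparsity) for a 2D list of numbers."""
    rows, cols = len(weight), len(weight[0])
    total_blocks = zero_blocks = zero_elems = 0
    for r in range(0, rows, block_h):
        for c in range(0, cols, block_w):
            # Collect the values of one block, clipping at the matrix edge.
            block = [weight[i][j]
                     for i in range(r, min(r + block_h, rows))
                     for j in range(c, min(c + block_w, cols))]
            total_blocks += 1
            zero_blocks += all(v == 0 for v in block)
            zero_elems += sum(v == 0 for v in block)
    return zero_blocks / total_blocks, zero_elems / (rows * cols)


weight = [
    [0, 0, 1, 0],
    [0, 0, 0, 2],
    [3, 0, 0, 0],
    [0, 4, 0, 0],
]
block_ratio, element_ratio = sparsity_stats(weight, block_h=2, block_w=2)
print(block_ratio, element_ratio)
```

Here three quarters of the values are zero, but only half of the 2x2 blocks are entirely zero, which is why the two ratios are reported separately.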

how to use: data/Profile.md.
pytorch usage: data/PytorchUsage.md.
tensorflow usage: data/TensorflowUsage.md.
examples: benchmark/examples.py.

Compute Graph with Shape Engine

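All tensor shape computation ops are removed from the graph; the Shape Engine takes over updating the shapes of dynamic tensors. The inference engine then only has to execute the compute graph and does not need to handle tensor shape updates.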
examples:
benchmark/shape_regress.py.
benchmark/examples.py.
How to integrate Compute Graph and Shape Engine into a C++ inference engine: data/inference_engine.md
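
The idea of separating shape updates from computation can be sketched with a toy example. Everything below is hypothetical (class name, methods, and rules) and is unrelated to onnx-tool's actual Shape Engine API; it only shows how output shapes can be expressed as functions of a few dynamic input dimensions so that no shape-computing ops need to run at inference time.

```python
class ToyShapeEngine:
    """Toy illustration only, not onnx-tool's API."""

    def __init__(self):
        self.variables = {}   # dynamic dimension name -> current value
        self.rules = {}       # tensor name -> function(variables) -> shape

    def add_rule(self, tensor_name, rule):
        self.rules[tensor_name] = rule

    def update_variable(self, name, value):
        self.variables[name] = value

    def get_shape(self, tensor_name):
        return self.rules[tensor_name](self.variables)


engine = ToyShapeEngine()
# A 3x3 convolution with stride 2 and padding 1, followed by a flatten.
engine.add_rule("conv_out", lambda v: (1, 16, (v["h"] + 1) // 2, (v["w"] + 1) // 2))
engine.add_rule("flat_out", lambda v: (1, 16 * ((v["h"] + 1) // 2) * ((v["w"] + 1) // 2)))

engine.update_variable("h", 224)
engine.update_variable("w", 224)
shape_224 = engine.get_shape("conv_out")

# Only the variable changes when a new input size arrives; the rules stay fixed.
engine.update_variable("h", 320)
shape_320 = engine.get_shape("conv_out")
flat_320 = engine.get_shape("flat_out")
print(shape_224, shape_320, flat_320)
```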

Fuse multiple ops into a new op

MHA and Layernorm Fusion for Transformers

Resnet18 fusion

how to use: data/Subgraph.md.
BERT examples: benchmark/examples.py.
Pattern fusion: benchmark/do_fusion.py.

Extract a sub-model from a model

This can help implement model parallelism.

how to use: data/Subgraph.md.

Memory Compression

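For LLMs and high-resolution CV models, compressing activation memory helps reduce the memory usage of the whole model.
On most models the compression method achieves a compression ratio of about 5%, meaning the compressed size is about 5% of the native size.
For example: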

model	Native Memory Size(MB)	Compressed Memory Size(MB)	Compression Ratio(%)
StableDiffusion(VAE_encoder)	14,245	540	3.7
StableDiffusion(VAE_decoder)	25,417	1,140	4.48
StableDiffusion(Text_encoder)	215	5	2.5
StableDiffusion(UNet)	36,135	2,232	6.2
GPT2	40	2	6.9
BERT	2,170	27	1.25

code example: benchmark/compression.py
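
This section does not describe the compression algorithm itself. One common way to shrink activation memory is to reuse buffers across tensors whose lifetimes do not overlap, and the sketch below assumes that approach purely for illustration; it is not taken from benchmark/compression.py. It compares the native size (every activation gets its own buffer) with the peak of simultaneously live activations, which is a lower bound for any reuse scheme. The ratio is computed like the Compression Ratio column: compressed size divided by native size, in percent.

```python
def native_and_peak_bytes(tensors):
    """tensors: list of (size_bytes, first_step, last_step), steps inclusive."""
    native = sum(size for size, _, _ in tensors)
    last = max(end for _, _, end in tensors)
    # Peak memory: the largest sum of sizes of tensors alive at the same step.
    peak = max(
        sum(size for size, start, end in tensors if start <= step <= end)
        for step in range(last + 1)
    )
    return native, peak


# A chain of ops: each activation is produced at step i and consumed at i + 1.
chain = [(400, 0, 1), (400, 1, 2), (100, 2, 3), (100, 3, 4)]
native, peak = native_and_peak_bytes(chain)
ratio_percent = 100 * peak / native
print(native, peak, ratio_percent)
```

Deep models with many short-lived activations have a much smaller peak relative to the sum than this four-tensor chain does.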

How to install

pip install onnx-tool

OR

pip install --upgrade git+https://github.com/ThanatosShinji/onnx-tool.git

python>=3.6

If pip install onnx-tool fails because onnx itself cannot be installed, try installing a lower onnx version first, for example pip install onnx==1.8.1.
Then run pip install onnx-tool again.

Known Issues

Loop op is not supported

Results of ONNX Model Zoo and SOTA models

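Note that for models that support dynamic input shapes, the MACs change with the input shape. The MACs in the tables below are based on the input shapes configured in data/public/config.py. Models annotated with all tensor shapes, together with their profiling reports, can be downloaded from: baidu drive(code: p91k) google drive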

Model	Params(M)	MACs(M)
GPT-J 1 layer	464	173,398
MPT 1 layer	261	79,894
text_encoder	123.13	6,782
UNet2DCondition	859.52	888,870
VAE_encoder	34.16	566,371
VAE_decoder	49.49	1,271,959
SqueezeNet 1.0	1.23	351
AlexNet	60.96	665
GoogleNet	6.99	1,606
googlenet_age	5.98	1,605
LResNet100E-IR	65.22	12,102
BERT-Squad	113.61	22,767
BiDAF	18.08	9.87
EfficientNet-Lite4	12.96	1,361
Emotion	12.95	877
Mask R-CNN	46.77	92,077

Model	Params(M)	MACs(M)
LLaMa 1 layer	618	211,801
BEVFormer Tiny	33.7	210,838
rvm_mobilenetv3	3.73	4,289
yolov4	64.33	3,319
ConvNeXt-L	229.79	34,872
edgenext_small	5.58	1,357
SSD	19.98	216,598
RealESRGAN	16.69	73,551
ShuffleNet	2.29	146
GPT-2	137.02	1,103
T5-encoder	109.62	686
T5-decoder	162.62	1,113
RoBERTa-BASE	124.64	688
Faster R-CNN	44.10	46,018
FCN ResNet-50	35.29	37,056
ResNet50	25	3,868
