diff --git a/README.md b/README.md
index 84968a22..e93e568f 100644
--- a/README.md
+++ b/README.md
@@ -47,15 +47,6 @@ DashInfer is a highly optimized LLM inference engine with the following core fea
- **Multi-Programming-Language API**: Both C++ and Python interfaces are provided. It is possible to extend C++ interface to Java, Rust and other programming languages, via standard cross-language interfaces.
-
-## Documentation
-- [Release Note](https://dashinfer.readthedocs.io/en/latest/#release-note)
-- [User Manual](https://dashinfer.readthedocs.io/en/latest/)
-- [Installation](docs/EN/installation.md)
-- [C++ Examples](docs/EN/examples_cpp.md)
-- [Python Examples](docs/EN/examples_python.md)
-- [Performance](docs/EN/performance.md)
-
# Supported Hardware and Data Types
## Hardware
@@ -94,86 +85,6 @@ In terms of quantization granularity, there are two types:
- **Per-Channel**: AllSpark's quantization techniques at least adopt the Per-Channel (also known as Per-Token) quantization granularity, and some also provide Sub-Channel quantization granularity. Generally speaking, Per-Channel quantization can meet most accuracy requirements due to its simple implementation and optimal performance. Only when the accuracy of Per-Channel quantization is insufficient should the Sub-Channel quantization strategy be considered.
- **Sub-Channel**: Compared to Per-Channel quantization, Sub-Channel refers to dividing a channel into N groups, and calculating quantization parameters within each group. This quantization granularity typically provides better accuracy, but due to increased implementation complexity, it comes with many limitations. For example, performance may be slightly slower than Per-Channel quantization, and Activation quantization is difficult to implement Sub-Channel quantization due to computational formula constraints (AllSpark's Activation quantization is all Per-Channel).
-# Supported Models
-
-DashInfer support two kind of model load method:
-1. HF format: directly load model from Hugging Face, which provides most convenient method, the model can be downloaded from huggingface or modelscope.
-2. DashInfer format: serialized model file by DashInfer, which provided less python dependency and can be loaded by c++ library.
-
-| Architecture | Models | HuggingFace Models | ModelScope Models |
-|:------------:|:---------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------:|
-| QWenLMHeadModel | Qwen | [Qwen/Qwen-1_8B-Chat](https://huggingface.co/Qwen/Qwen-1_8B-Chat), [Qwen/Qwen-7B-Chat](https://huggingface.co/Qwen/Qwen-7B-Chat), [Qwen/Qwen-14B-Chat](https://huggingface.co/Qwen/Qwen-14B-Chat), etc. | [qwen/Qwen-1_8B-Chat](https://modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary), [qwen/Qwen-7B-Chat](https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary), [qwen/Qwen-14B-Chat](https://modelscope.cn/models/qwen/Qwen-14B-Chat/summary), etc. |
-| Qwen2ForCausalLM | Qwen1.5-Qwen2.5 | [Qwen/Qwen1.5-0.5B-Chat](https://huggingface.co/Qwen/Qwen1.5-0.5B-Chat), [Qwen/Qwen1.5-1.8B-Chat](https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat), [Qwen/Qwen1.5-4B-Chat](https://huggingface.co/Qwen/Qwen1.5-4B-Chat), [Qwen/Qwen1.5-7B-Chat](https://huggingface.co/Qwen/Qwen1.5-7B-Chat), [Qwen/Qwen1.5-14B-Chat](https://huggingface.co/Qwen/Qwen1.5-14B-Chat), etc. | [qwen/Qwen1.5-0.5B-Chat](https://modelscope.cn/models/qwen/Qwen1.5-0.5B-Chat/summary), [qwen/Qwen1.5-1.8B-Chat](https://modelscope.cn/models/qwen/Qwen1.5-1.8B-Chat/summary), [qwen/Qwen1.5-4B-Chat](https://modelscope.cn/models/qwen/Qwen1.5-4B-Chat/summary), [qwen/Qwen1.5-7B-Chat](https://modelscope.cn/models/qwen/Qwen1.5-7B-Chat/summary), [qwen/Qwen1.5-14B-Chat](https://modelscope.cn/models/qwen/Qwen1.5-14B-Chat/summary), etc. |
-| Qwen2VLForConditionalGeneration | QwenVL | [Qwen/Qwen-1_8B-Chat](https://huggingface.co/Qwen/Qwen-1_8B-Chat), [Qwen/Qwen-7B-Chat](https://huggingface.co/Qwen/Qwen-7B-Chat), [Qwen/Qwen-14B-Chat](https://huggingface.co/Qwen/Qwen-14B-Chat), etc. | [qwen/Qwen-1_8B-Chat](https://modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary), [qwen/Qwen-7B-Chat](https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary), [qwen/Qwen-14B-Chat](https://modelscope.cn/models/qwen/Qwen-14B-Chat/summary), etc. |
-| ChatGLMModel | ChatGLM | [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat) | [ZhipuAI/glm-4-9b-chat](https://modelscope.cn/models/ZhipuAI/glm-4-9b-chat/summary) |
-| LlamaForCausalLM | LLaMA-2 | [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), [meta-llama/Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) | [modelscope/Llama-2-7b-chat-ms](https://modelscope.cn/models/modelscope/Llama-2-7b-chat-ms/summary), [modelscope/Llama-2-13b-chat-ms](https://modelscope.cn/models/modelscope/Llama-2-13b-chat-ms/summary) |
-| LlamaForCausalLM | LLaMA-3 | [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | [modelscope/Meta-Llama-3-8B-Instruct](https://modelscope.cn/models/modelscope/Meta-Llama-3-8B-Instruct/summary) |
-| BaichuanForCausalLM | Baichuan2 | [baichuan-inc/Baichuan2-7B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat), [baichuan-inc/Baichuan2-13B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat) | [baichuan-inc/Baichuan2-7B-Chat](https://modelscope.cn/models/baichuan-inc/Baichuan2-7B-Chat), [baichuan-inc/Baichuan2-13B-Chat](https://modelscope.cn/models/baichuan-inc/Baichuan2-13B-Chat) |
-
-# Software Architecture
-
-## Workflow
-
-![Workflow and Dependency](docs/resources/image/workflow-deps.jpg?row=true)
-
-1. **Model Loading**: This procedure involves loading model weights, setting up transformation parameters, and quantization settings. Based on this information, the model is serialized and converted into the DashInfer format (.dimodel, .ditensors, or .asparams, .asmodel). This functionality is accessible exclusively through a Python interface and relies on the PyTorch and transformers libraries to access the weights. The version requirements for PyTorch and transformers may vary from model to model. DashInfer itself does not impose any specific version constraints.
-
-2. **Model Inference**: This step is responsible for executing the model inference using the serialized model with DashInfer, without depending on components like PyTorch. DashInfer employs [DLPack](https://github.com/dmlc/dlpack) format tensors to facilitate interaction with external frameworks, such as PyTorch. Tensors in DLPack format can be manually created or generated through tensor conversion functions provided by deep learning frameworks. Regarding the C++ interface, since most dependencies have been statically linked, it primarily relies on the OpenMP runtime library and C++ system libraries. We applied [control over symbol exports](https://anadoxin.org/blog/control-over-symbol-exports-in-gcc.html/) to ensure that only DashInfer's API interface symbols are visible, thereby preventing version conflicts with existing libraries in the user's system, such as protobuf.
-
-> Note:
-> - After 2.0 version, user rarely needs to care about the model type, which will detected by DashInfer Runtime automatically.
-> - ~~.dimodel, .ditensors is a special model format defined by DashInfer kernel.~~
-> - When utilizing the Python interface, you can combine the code from steps 1 and 2. However, due to the lack of functionality for loading Huggingface models at the C++ level, the C++ interface is limited to conducting inferences with models in the DashInfer format. Therefore, it's essential to serialize the model first using the Python interface before proceeding with the C++ interface.
-
-## GPU and Single-NUMA Architecture
-
-![Single-NUMA Arch](docs/resources/image/arch-single-numa.jpg?row=true)
-
-GPU and Single NUMA CPU Inference share same interface and architecture, in the model inference phase, an inference request can be initiated by passing in input tokens and generation parameters via `StartRequest`, and when the request is successful, the DashInfer engine will return an output queue `ResultQueue` and a control handle `RequestHandle`.
-
-- The `ResultQueue` is used to get output tokens and the status of the generation. DashInfer will **asynchronously** put the generated token into the queue, and tokens in the queue can be fetched either in a blocking (`ResultQueue.Get()`) or non-blocking (`ResultQueue.GetNoWait()`) way.
-
-- The `RequestHandle` is the handle used to manage the request. DashInfer `engine` provides Sync, Stop, and Release primitives for the request specified by the `RequestHandle`. The `SyncRequest` primitive, which returns at the end of generation (when the number of generated tokens reaches the limit, or when an EOS has been generated), is used to simulate the behavior of the synchronous interface.
-
-In GPU and single-NUMA mode, DashInfer Runtime uses multi-threading and a thread pool for scheduling.
-
-## Multi-NUMA Architecture
-
-![Multi-NUMA Arch](docs/resources/image/arch-multi-numa.jpg?row=true)
-
-Due to the inability of some Linux kernels to control CPU affinity at the thread level, running engine on multi-NUMA CPUs may result in remote memory node access, thereby causing a decline in performance. To enable precise control of a thread's CPU affinity, DashInfer multi-NUMA solution employs a multi-process client-server architecture to achieve tensor parallel model inference. On each NUMA node, an independent process runs the server, with each server handling a part of the tensor parallel inference, and the processes use OpenMPI to collaborate (e.g., via the allreduce operation). The client interacts with the servers via gRPC, providing a unique external interface to avoid the need to manage multiple processes when invoking the DashInfer interface.
-
-In terms of API, multi-NUMA and single-NUMA inference need to use different header files and .so libraries (or call different python interfaces). Except for the header and the library, the rest of the interface is consistent and no code changes are required. For details, you can refer to the examples.
-
-- Single-NUMA
- - header: allspark/allspark.h
- - .so library: liballspark_framework.so
- - python API: allspark.Engine()
-- MultiNUMA
- - header: allspark/allspark_client.h
- - .so library: liballspark_client.so
- - python API: allspark.ClientEngine()
-
-> Note: C++ liballspark_framework.so (called for single-NUMA inference) and liballspark_client.so (called for multi-NUMA inference) are mutually exclusive, you cannot link both libraries.
-
-# Performance Test
-
-Please refer to [documentation](docs/EN/performance.md) for detailed performance test results.
-
-The results of this performance test can be reproduced with the scripts in `/examples/python/1_performance`.
-
-# Inference Accuracy
-
-Tested model: [Qwen/Qwen-7B-Chat](https://huggingface.co/Qwen/Qwen-7B-Chat)
-
-| Engine | DataType | MMLU | C-Eval | GSM8K | HumanEval |
-|:------:|:--------:|:----:|:------:|:-----:|:---------:|
-| transformers | BF16 | 55.8 | 59.7 | 50.3 | 37.2 |
-| DashInfer | A16W8 | 55.78 | 61.10 | 51.25 | 37.19 |
-
-- A16W8: The model weight is quantized to 8-bit and is recovered as bfloat16 for matrix multiplication during inference.
-- The results of this accuracy evaluation can be reproduced with the scripts in `/examples/python/2_evaluation`.
-
# Examples
In `/examples` there are examples for C++ and Python interfaces, and please refer to the documentation in `/documents/EN` to run the examples.
@@ -182,36 +93,9 @@ In `/examples` there are examples for C++ and Python interfac
- [Documentation for All Python Examples](docs/EN/examples_python.md)
- [Documentation for C++ Examples](docs/EN/examples_cpp.md)
-## Multi-Modal Model(VLMs)) Support
-
-The VLM Support in [multimodal](multimodal/) folder,
-it's a toolkit to support Vision Language Models (VLMs) inference based on the DashInfer engine. It's compatible with the OpenAI Chat Completion API, supporting text and image/video inputs.
-
-
-# Third-party Dependencies
-
-This subsection lists the third-party dependencies for the different stages of DashInfer.
-
-> Note: These dependency packages are managed through conan and are automatically downloaded when compiling DashInfer.
-
-## Code Compilation Phase
-
-- [conan](https://conan.io/) (1.60.0): For managing C++ third-party dependencies.
-- [cmake](https://cmake.org/) (3.18+): Build system.
-
-## Model Conversion Phase
-
-- [PyTorch](https://pytorch.org/) (CPU): For loading model files, no special version requirements.
-- [transformers](https://github.com/huggingface/transformers): For loading model parameters and tokenizer.
-
-## Model Inference Phase
+## Multi-Modal Model (VLM) Support
-- [protobuf](https://protobuf.dev/)(3.18.3): For parsing model files.
-- [pybind11](https://github.com/pybind/pybind11)(2.8): For binding python interfaces.
-- [onednn](https://github.com/oneapi-src/oneDNN), [mkl](https://www.intel.com/content/www/us/en/docs/onemkl/get-started-guide/2023-0/overview.html): BLAS libraries, for accelerating GEMM calculations.
-- [openmp](https://www.openmp.org/): A standard parallel programming library.
-- [openmpi](https://www.open-mpi.org/): For implementing multi-NUMA service architecture.
-- [grpc](https://grpc.io/): For implementing multi-NUMA service architecture.
+The [multimodal](multimodal/) folder contains a toolkit for Vision Language Model (VLM) inference based on the DashInfer engine. It is compatible with the OpenAI Chat Completion API and supports text, image, and video inputs.
# Future Plans
- [x] GPU Support
diff --git a/README_CN.md b/README_CN.md
index 1cbabf9d..459b2439 100644
--- a/README_CN.md
+++ b/README_CN.md
@@ -35,14 +35,6 @@ DashInfer 是一个高度优化的 LLM 推理引擎,具有以下核心特性
- **OpenAI API 服务器**: DashInfer 可以轻松与 fastChat 配合使用,实现兼容 OpenAI 的 API 服务器。
- **多编程语言 API**: 提供 C++ 和 Python 接口。通过标准的跨语言接口,可以将 C++ 接口扩展到 Java、Rust 等编程语言。
-## 文档
-- [Release Note](https://dashinfer.readthedocs.io/en/latest/#release-note)
-- [User Manual](https://dashinfer.readthedocs.io/en/latest/)
-- [安装](docs/CN/installation.md)
-- [C++示例](docs/CN/examples_cpp.md)
-- [Python示例](docs/CN/examples_python.md)
-- [性能测试](docs/EN/performance.md)
-- [使用魔搭notebook部署](docs/CN/modelscope_notebook.md)
# 硬件支持和数据类型
@@ -71,83 +63,6 @@ DashInfer 为 LLM 权重提供了多种量化技术,例如 int{8,4} 仅权重
- **每通道量化**: AllSpark 的量化技术至少采用了每通道(也称为每 Token)量化粒度,有些还提供了子通道量化粒度。一般而言,每通道量化由于实现简单且性能最佳,通常能满足大多数准确性需求。只有当每通道量化的准确性不足时,才应考虑子通道量化策略。
- **子通道量化**: 与每通道量化相比,子通道量化是指将一个通道划分为 N 组,并在每组内计算量化参数。这种量化粒度通常能提供更好的准确性,但由于实现复杂度增加,带来了许多限制。例如,性能可能比每通道量化稍慢,并且由于计算公式限制,激活量化难以实现子通道量化(AllSpark 的激活量化都是每通道量化)。
-# 模型支持
-DashInfer 支持两种模型加载方式:
-1. **HF 格式**:直接从 Hugging Face 加载模型,这是最方便的方法,模型可以从 Hugging Face 或 ModelScope 下载。
-2. **DashInfer 格式**:由 DashInfer 序列化的模型文件,依赖更少的 Python 组件,可以通过 C++ 库加载。
-
-| Architecture | Models | HuggingFace Models | ModelScope Models |
-|:------------:|:---------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------:|
-| QWenLMHeadModel | Qwen | [Qwen/Qwen-1_8B-Chat](https://huggingface.co/Qwen/Qwen-1_8B-Chat), [Qwen/Qwen-7B-Chat](https://huggingface.co/Qwen/Qwen-7B-Chat), [Qwen/Qwen-14B-Chat](https://huggingface.co/Qwen/Qwen-14B-Chat), etc. | [qwen/Qwen-1_8B-Chat](https://modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary), [qwen/Qwen-7B-Chat](https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary), [qwen/Qwen-14B-Chat](https://modelscope.cn/models/qwen/Qwen-14B-Chat/summary), etc. |
-| Qwen2ForCausalLM | Qwen1.5-Qwen2.5 | [Qwen/Qwen1.5-0.5B-Chat](https://huggingface.co/Qwen/Qwen1.5-0.5B-Chat), [Qwen/Qwen1.5-1.8B-Chat](https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat), [Qwen/Qwen1.5-4B-Chat](https://huggingface.co/Qwen/Qwen1.5-4B-Chat), [Qwen/Qwen1.5-7B-Chat](https://huggingface.co/Qwen/Qwen1.5-7B-Chat), [Qwen/Qwen1.5-14B-Chat](https://huggingface.co/Qwen/Qwen1.5-14B-Chat), etc. | [qwen/Qwen1.5-0.5B-Chat](https://modelscope.cn/models/qwen/Qwen1.5-0.5B-Chat/summary), [qwen/Qwen1.5-1.8B-Chat](https://modelscope.cn/models/qwen/Qwen1.5-1.8B-Chat/summary), [qwen/Qwen1.5-4B-Chat](https://modelscope.cn/models/qwen/Qwen1.5-4B-Chat/summary), [qwen/Qwen1.5-7B-Chat](https://modelscope.cn/models/qwen/Qwen1.5-7B-Chat/summary), [qwen/Qwen1.5-14B-Chat](https://modelscope.cn/models/qwen/Qwen1.5-14B-Chat/summary), etc. |
-| Qwen2VLForConditionalGeneration | QwenVL | [Qwen/Qwen-1_8B-Chat](https://huggingface.co/Qwen/Qwen-1_8B-Chat), [Qwen/Qwen-7B-Chat](https://huggingface.co/Qwen/Qwen-7B-Chat), [Qwen/Qwen-14B-Chat](https://huggingface.co/Qwen/Qwen-14B-Chat), etc. | [qwen/Qwen-1_8B-Chat](https://modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary), [qwen/Qwen-7B-Chat](https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary), [qwen/Qwen-14B-Chat](https://modelscope.cn/models/qwen/Qwen-14B-Chat/summary), etc. |
-| ChatGLMModel | ChatGLM | [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat) | [ZhipuAI/glm-4-9b-chat](https://modelscope.cn/models/ZhipuAI/glm-4-9b-chat/summary) |
-| LlamaForCausalLM | LLaMA-2 | [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), [meta-llama/Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) | [modelscope/Llama-2-7b-chat-ms](https://modelscope.cn/models/modelscope/Llama-2-7b-chat-ms/summary), [modelscope/Llama-2-13b-chat-ms](https://modelscope.cn/models/modelscope/Llama-2-13b-chat-ms/summary) |
-| LlamaForCausalLM | LLaMA-3 | [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | [modelscope/Meta-Llama-3-8B-Instruct](https://modelscope.cn/models/modelscope/Meta-Llama-3-8B-Instruct/summary) |
-| BaichuanForCausalLM | Baichuan2 | [baichuan-inc/Baichuan2-7B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat), [baichuan-inc/Baichuan2-13B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat) | [baichuan-inc/Baichuan2-7B-Chat](https://modelscope.cn/models/baichuan-inc/Baichuan2-7B-Chat), [baichuan-inc/Baichuan2-13B-Chat](https://modelscope.cn/models/baichuan-inc/Baichuan2-13B-Chat) |
-
-# 软件框架
-
-## 推理流程
-
-![Workflow and Dependency](documents/resources/image/workflow-deps.jpg?row=true)
-
-1. **模型加载**:该过程包括加载模型权重、设置转换参数和量化设置。基于这些信息,模型会被序列化并转换成 DashInfer 格式(.dimodel, .ditensors 或 .asparams, .asmodel) 。此功能仅通过 Python 接口访问,并依赖于 PyTorch 和 transformers 库来访问权重。PyTorch 和 transformers 的版本要求可能因模型而异。DashInfer 本身没有具体的版本限制。
-2. **模型推理**:此步骤负责使用 DashInfer 执行序列化模型的推理,而不依赖于 PyTorch 等组件。DashInfer 采用 [DLPack](https://github.com/dmlc/dlpack) 格式的张量,以便与外部框架(如 PyTorch)进行交互。DLPack 格式的张量可以手动创建,也可以通过深度学习框架提供的张量转换函数生成。对于 C++ 接口,由于大多数依赖项已经被静态链接,它主要依赖于 OpenMP 运行时库和 C++ 系统库。我们应用了 [控制符号导出](https://anadoxin.org/blog/control-over-symbol-exports-in-gcc.html/) 技术,以确保只有 DashInfer 的 API 接口符号是可见的,从而防止与用户系统中的现有库(如 protobuf)发生版本冲突。
-
-> 注意:
-> - 版本 2.0 之后,用户很少需要关心模型类型(在1.0中),它会被 DashInfer Runtime 自动检测。
-> - ~~.dimodel, .ditensors 是 DashInfer 内核定义的一种特殊模型格式。~~
-> - 使用 Python 接口时,可以将步骤 1 和步骤 2 的代码结合起来。然而,由于在 C++ 层面缺乏加载 Huggingface 模型的功能,C++ 接口仅限于使用 DashInfer 格式的模型进行推理。因此,必须先使用 Python 接口序列化模型,然后再进行 C++ 接口的推理。
-## GPU 和 CPU 单NUMA架构图
-
-![Single-NUMA Arch](docs/resources/image/arch-single-numa.jpg?row=true)
-
-GPU 和单 NUMA CPU 推理共享相同的接口和架构。在模型推理阶段,可以通过 `StartRequest` 传入输入标记和生成参数来启动推理请求,当请求成功时,DashInfer 引擎将返回一个输出队列 `ResultQueue` 和一个控制句柄 `RequestHandle`。
-
-- `ResultQueue`用来获取输出token以及生成的状态,推理引擎会**异步**地把生成的token放到该队列中,可以阻塞(`ResultQueue.Get()`)或非阻塞(`ResultQueue.GetNoWait()`)地获取队列中的token。
-
-- `RequestHandle`是用来管理请求的句柄,DashInfer `engine`根据传入的`RequestHandle`实现对指定request的同步(Sync)、停止(Stop)和释放(Release)操作。其中`SyncRequest`操作,会在生成结束(生成的token数达到上限,或产生结束符)后返回,用来模拟同步接口的行为。
-
-在GPU 和 单NUMA的模式下,DashInfer Runtime采用多线程和线程池的结构做调度。
-
-## 多NUMA架构图
-
-![Multi-NUMA Arch](docs/resources/image/arch-multi-numa.jpg?row=true)
-
-由于部分Linux内核无法在线程级别控制CPU亲和性,在多NUMA的CPU上采用单进程推理可能会出现跨NUMA访问内存访问,从而导致性能下降。为了能够精确地控制程序的CPU亲和性,DashInfer的多NUMA方案采用了多进程的client-server架构,实现tensor parallel的模型推理。在每个NUMA节点上,都有一个独立的进程运行DashInfer server,每个server负责一部分的tensor parallel推理,进程间使用OpenMPI进行协同(例如allreduce操作)。DashInfer client通过gRPC与server交互,提供唯一的对外接口,避免在调用DashInfer接口时,需要对多进程进行管理。
-
-在API使用上,多NUMA和单NUMA的推理需要引用不同的头文件、.so库(或调用不同的python接口)。除了引用阶段外,其余接口一致,无需修改代码。具体可以参考examples中的示例。
-
-- 单NUMA
- - 头文件:allspark/allspark.h
- - .so库:liballspark_framework.so
- - python接口:allspark.Engine()
-- 多NUMA
- - 头文件:allspark/allspark_client.h
- - .so库:liballspark_client.so
- - python接口:allspark.ClientEngine()
-
-> 注意:C++的liballspark_framework.so(单NUMA推理时调用)和liballspark_client.so(多NUMA推理时调用)是互斥的,不能同时链接两个库。
-
-# 性能测试
-
-详细的性能测试结果请参考[文档](docs/EN/performance.md)。
-
-该性能测试结果可用`/examples/python/1_performance`中的脚本复现。
-
-# 精度测试
-
-测试模型:[Qwen/Qwen-7B-Chat](https://huggingface.co/Qwen/Qwen-7B-Chat)
-
-| Engine | DataType | MMLU | C-Eval | GSM8K | HumanEval |
-|:------:|:--------:|:----:|:------:|:-----:|:---------:|
-| transformers | BF16 | 55.8 | 59.7 | 50.3 | 37.2 |
-| DashInfer | A16W8 | 55.78 | 61.10 | 51.25 | 37.19 |
-
-- A16W8:指weight采用8-bit量化,在推理过程中恢复为bfloat16进行矩阵乘法计算;
-- 该精度评测结果,可用`/examples/python/2_evaluation`中的脚本复现。
-
# 示例代码
在`/examples`下提供了C++、python接口的调用示例,请参考`/documents/CN`目录下的文档运行示例。
@@ -156,30 +71,9 @@ GPU 和单 NUMA CPU 推理共享相同的接口和架构。在模型推理阶段
- [所有Python示例文档](docs/CN/examples_python.md)
- [C++示例文档](docs/CN/examples_cpp.md)
-# 依赖库
-
-本小节列出了DashInfer不同阶段的第三方依赖。
-
-> 注:这些依赖包通过conan管理,在编译DashInfer时自动下载。
-
-## 代码编译阶段
-
-- [conan](https://conan.io/) (1.60.0): For managing C++ third-party dependencies.
-- [cmake](https://cmake.org/) (3.18+): Build system.
-
-## 模型转换阶段
-
-- [PyTorch](https://pytorch.org/) (CPU): For reading model files, no special version requirements.
-- [transformers](https://github.com/huggingface/transformers): For loading model parameters and tokenizer.
-
-## 模型推理阶段
+## 多模态模型支持
-- [protobuf](https://protobuf.dev/)(3.18.3): For parsing model files.
-- [pybind11](https://github.com/pybind/pybind11)(2.8): For binding python interfaces.
-- [onednn](https://github.com/oneapi-src/oneDNN), [mkl](https://www.intel.com/content/www/us/en/docs/onemkl/get-started-guide/2023-0/overview.html): BLAS libraries, for accelerating GEMM calculations.
-- [openmp](https://www.openmp.org/): A standard parallel programming library.
-- [openmpi](https://www.open-mpi.org/): For implementing multi-NUMA service architecture.
-- [grpc](https://grpc.io/): For implementing multi-NUMA service architecture.
+[multimodal](multimodal/) 目录下是基于DashInfer实现的多模态模型推理工具,兼容OpenAI Chat Completion API,支持文字、图片、视频输入。
# 未来规划