Nexa SDK is a comprehensive toolkit for supporting ONNX and GGML models. It supports text generation, image generation, vision-language models (VLM), and speech-to-text (ASR), and text-to-speech (TTS) capabilities. Additionally, it offers an OpenAI-compatible API server with JSON schema mode for function calling and streaming support, and a user-friendly Streamlit UI. Users can run Nexa SDK in any device with Python environment, and GPU acceleration is supported, including CUDA, Metal, and ROCm. An executable version is also available.
- [2024/10] Support embedding model:
nexa embed <model_path> <prompt>
- [2024/10] Support pull and run supported Computer Vision models in GGUF format from HuggingFace:
nexa run -hf <model_id> -mt COMPUTER_VISION
- [2024/10] Support VLM in local server.
- [2024/10] Added option to customize maximum context window for NLP and VLM models.
- [2024/10] Support running model from user's local path
- [2024/10] Added LoRA support for NLP models.
- [2024/10] Added support for whisper-large-v3-turbo:
nexa run faster-whisper-large-turbo
- [2024/10] Added support for AMD-Llama-135m:
nexa run AMD-Llama-135m:fp16
- [2024/09] Nexa now has executables for easy installation: Install Nexa SDK β¨
- [2024/09] Added support for Llama 3.2 models:
nexa run llama3.2
- [2024/09] Added support for Qwen2.5, Qwen2.5-coder and Qwen2.5-Math models:
nexa run qwen2.5
- [2024/09] Support pull and run NLP models in GGUF format from HuggingFace:
nexa run -hf <model_id> -mt NLP
- [2024/09] Added support for ROCm
- [2024/09] Added support for Phi-3.5 models:
nexa run phi3.5
- [2024/09] Added support for OpenELM models:
nexa run openelm
- [2024/09] Introduced logits API support for more advanced model interactions
- [2024/09] Added support for Flux models:
nexa run flux
- [2024/09] Added support for Stable Diffusion 3 model:
nexa run sd3
- [2024/09] Added support for Stable Diffusion 2.1 model:
nexa run sd2-1
Welcome to submit your requests through issues, we ship weekly.
curl -fsSL https://public-storage.nexa4ai.com/install.sh | sh
Coming soon. Install with Python package below π
We have released pre-built wheels for various Python versions, platforms, and backends for convenient installation on our index page.
Note
- If you want to use ONNX model, just replace
pip install nexaai
withpip install "nexaai[onnx]"
in provided commands. - If you want to convert and quantize huggingface models to GGUF models, just replace
pip install nexaai
withpip install "nexaai[nexa-gguf]"
. - For Chinese developers, we recommend you to use Tsinghua Open Source Mirror as extra index url, just replace
--extra-index-url https://pypi.org/simple
with--extra-index-url https://pypi.tuna.tsinghua.edu.cn/simple
in provided commands.
pip install nexaai --prefer-binary --index-url https://nexaai.github.io/nexa-sdk/whl/cpu --extra-index-url https://pypi.org/simple --no-cache-dir
For the GPU version supporting Metal (macOS):
CMAKE_ARGS="-DGGML_METAL=ON -DSD_METAL=ON" pip install nexaai --prefer-binary --index-url https://nexaai.github.io/nexa-sdk/whl/metal --extra-index-url https://pypi.org/simple --no-cache-dir
FAQ: cannot use Metal/GPU on M1
Try the following command:
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh
conda create -n nexasdk python=3.10
conda activate nexasdk
CMAKE_ARGS="-DGGML_METAL=ON -DSD_METAL=ON" pip install nexaai --prefer-binary --index-url https://nexaai.github.io/nexa-sdk/whl/metal --extra-index-url https://pypi.org/simple --no-cache-dir
For Linux:
CMAKE_ARGS="-DGGML_CUDA=ON -DSD_CUBLAS=ON" pip install nexaai --prefer-binary --index-url https://nexaai.github.io/nexa-sdk/whl/cu124 --extra-index-url https://pypi.org/simple --no-cache-dir
For Windows PowerShell:
$env:CMAKE_ARGS="-DGGML_CUDA=ON -DSD_CUBLAS=ON"; pip install nexaai --prefer-binary --index-url https://nexaai.github.io/nexa-sdk/whl/cu124 --extra-index-url https://pypi.org/simple --no-cache-dir
For Windows Command Prompt:
set CMAKE_ARGS="-DGGML_CUDA=ON -DSD_CUBLAS=ON" & pip install nexaai --prefer-binary --index-url https://nexaai.github.io/nexa-sdk/whl/cu124 --extra-index-url https://pypi.org/simple --no-cache-dir
For Windows Git Bash:
CMAKE_ARGS="-DGGML_CUDA=ON -DSD_CUBLAS=ON" pip install nexaai --prefer-binary --index-url https://nexaai.github.io/nexa-sdk/whl/cu124 --extra-index-url https://pypi.org/simple --no-cache-dir
FAQ: Building Issues for llava
If you encounter the following issue while building:
try the following command:
CMAKE_ARGS="-DCMAKE_CXX_FLAGS=-fopenmp" pip install nexaai
For Linux:
CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install nexaai --prefer-binary --index-url https://nexaai.github.io/nexa-sdk/whl/rocm621 --extra-index-url https://pypi.org/simple --no-cache-dir
How to clone this repo
git clone --recursive https://github.com/NexaAI/nexa-sdk
If you forget to use --recursive
, you can use below command to add submodule
git submodule update --init --recursive
Then you can build and install the package
pip install -e .
-
Model Support:
- ONNX & GGML models
- Conversion Engine
- Inference Engine:
- Text Generation
- Image Generation
- Vision-Language Models (VLM)
- Speech-to-Text (ASR)
Detailed API documentation is available here.
- Server:
- OpenAI-compatible API
- JSON schema mode for function calling
- Streaming support
- Streamlit UI for interactive model deployment and testing
Below is our differentiation from other similar tools:
Feature | Nexa SDK | ollama | Optimum | LM Studio |
---|---|---|---|---|
GGML Support | β | β | β | β |
ONNX Support | β | β | β | β |
Text Generation | β | β | β | β |
Image Generation | β | β | β | β |
Vision-Language Models | β | β | β | β |
Text-to-Speech | β | β | β | β |
Server Capability | β | β | β | β |
User Interface | β | β | β | β |
Our on-device model hub offers all types of quantized models (text, image, audio, multimodal) with filters for RAM, file size, Tasks, etc. to help you easily explore models with UI. Explore on-device models at On-device Model Hub
Supported models (full list at Model Hub):
Model | Type | Format | Command |
---|---|---|---|
octopus-v2 | NLP | GGUF | nexa run octopus-v2 |
octopus-v4 | NLP | GGUF | nexa run octopus-v4 |
gpt2 | NLP | GGUF | nexa run gpt2 |
tinyllama | NLP | GGUF | nexa run tinyllama |
llama2 | NLP | GGUF/ONNX | nexa run llama2 |
llama2-uncensored | NLP | GGUF | nexa run llama2-uncensored |
llama2-function-calling | NLP | GGUF | nexa run llama2-function-calling |
llama3 | NLP | GGUF/ONNX | nexa run llama3 |
llama3.1 | NLP | GGUF/ONNX | nexa run llama3.1 |
llama3.2 | NLP | GGUF | nexa run llama3.2 |
llama3-uncensored | NLP | GGUF | nexa run llama3-uncensored |
gemma | NLP | GGUF/ONNX | nexa run gemma |
gemma2 | NLP | GGUF | nexa run gemma2 |
qwen1.5 | NLP | GGUF | nexa run qwen1.5 |
qwen2 | NLP | GGUF/ONNX | nexa run qwen2 |
qwen2.5 | NLP | GGUF | nexa run qwen2.5 |
mathqwen | NLP | GGUF | nexa run mathqwen |
codeqwen | NLP | GGUF | nexa run codeqwen |
mistral | NLP | GGUF/ONNX | nexa run mistral |
dolphin-mistral | NLP | GGUF | nexa run dolphin-mistral |
codegemma | NLP | GGUF | nexa run codegemma |
codellama | NLP | GGUF | nexa run codellama |
deepseek-coder | NLP | GGUF | nexa run deepseek-coder |
phi2 | NLP | GGUF | nexa run phi2 |
phi3 | NLP | GGUF/ONNX | nexa run phi3 |
phi3.5 | NLP | GGUF | nexa run phi3.5 |
openelm | NLP | GGUF | nexa run openelm |
AMD-Llama-135m | NLP | GGUF | nexa run AMD-Llama-135m:fp16 |
nanollava | Multimodal | GGUF | nexa run nanollava |
llava-phi3 | Multimodal | GGUF | nexa run llava-phi3 |
llava-llama3 | Multimodal | GGUF | nexa run llava-llama3 |
llava1.6-mistral | Multimodal | GGUF | nexa run llava1.6-mistral |
llava1.6-vicuna | Multimodal | GGUF | nexa run llava1.6-vicuna |
stable-diffusion-v1-4 | Computer Vision | GGUF | nexa run sd1-4 |
stable-diffusion-v1-5 | Computer Vision | GGUF/ONNX | nexa run sd1-5 |
stable-diffusion-v2-1 | Computer Vision | GGUF | nexa run sd2-1 |
stable-diffusion-3-medium | Computer Vision | GGUF | nexa run sd3 |
FLUX.1-schnell | Computer Vision | GGUF | nexa run flux |
lcm-dreamshaper | Computer Vision | GGUF/ONNX | nexa run lcm-dreamshaper |
hassaku-lcm | Computer Vision | GGUF | nexa run hassaku-lcm |
anything-lcm | Computer Vision | GGUF | nexa run anything-lcm |
faster-whisper-tiny | Audio | BIN | nexa run faster-whisper-tiny |
faster-whisper-small | Audio | BIN | nexa run faster-whisper-small |
faster-whisper-medium | Audio | BIN | nexa run faster-whisper-medium |
faster-whisper-base | Audio | BIN | nexa run faster-whisper-base |
faster-whisper-large | Audio | BIN | nexa run faster-whisper-large |
whisper-large-v3-turbo | Audio | BIN | nexa run faster-whisper-large-turbo |
whisper-tiny.en | Audio | ONNX | nexa run whisper-tiny.en |
whisper-tiny | Audio | ONNX | nexa run whisper-tiny |
whisper-small.en | Audio | ONNX | nexa run whisper-small.en |
whisper-small | Audio | ONNX | nexa run whisper-small |
whisper-base.en | Audio | ONNX | nexa run whisper-base.en |
whisper-base | Audio | ONNX | nexa run whisper-base |
mxbai-embed-large-v1 | Embedding | GGUF | nexa embed mxbai |
nomic-embed-text-v1.5 | Embedding | GGUF | nexa embed nomic |
all-MiniLM-L6-v2 | Embedding | GGUF | nexa embed all-MiniLM-L6-v2:fp16 |
all-MiniLM-L12-v2 | Embedding | GGUF | nexa embed all-MiniLM-L12-v2:fp16 |
Here's a brief overview of the main CLI commands:
nexa run
: Run inference for various tasks using GGUF models.nexa onnx
: Run inference for various tasks using ONNX models.nexa convert
: Convert and quantize huggingface models to GGUF models.nexa server
: Run the Nexa AI Text Generation Service.nexa eval
: Run the Nexa AI Evaluation Tasks.nexa pull
: Pull a model from official or hub.nexa remove
: Remove a model from local machine.nexa clean
: Clean up all model files.nexa list
: List all models in the local machine.nexa login
: Login to Nexa API.nexa whoami
: Show current user information.nexa logout
: Logout from Nexa API.
For detailed information on CLI commands and usage, please refer to the CLI Reference document.
To start a local server using models on your local computer, you can use the nexa server
command.
For detailed information on server setup, API endpoints, and usage examples, please refer to the Server Reference document.
We would like to thank the following projects: