Triton Inference Server with different Deep Learning inference backends for AI product deployment

kyle-paul/triton

Triton Documentation

Triton is an inference server that enables teams to deploy AI models from multiple deep learning and machine learning frameworks, including TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton supports inference across cloud, data center, edge, and embedded devices on NVIDIA GPUs, x86 and ARM CPUs, or AWS Inferentia. Triton Inference Server delivers optimized performance for many query types, including real-time, batched, ensemble, and audio/video streaming. Triton Inference Server is part of NVIDIA AI Enterprise, a software platform that accelerates the data science pipeline and streamlines the development and deployment of production AI.

This is a tree summary of all the Triton documents; the main documentation is here.

Triton Deployment Documentation

Triton Tutorials

This tutorial provides a quick-start process for beginners to start using Triton through many examples, and does not go into much technical detail.

Discussion

Model max_batch_size

Triton Inference Server

User guide

Architecture
Decoupled models
Model configuration
Model management
Model repository
Ragged batching
Optimization
Perf analyzer
Performance tuning

Customization guide

Inference protocols
Repository Agent

Triton Client

Perf Analyzer

Triton Performance Analyzer

GRPC Protocol

Ensemble image client
GRPC client
GRPC byte content client
GRPC explicit int8 content client
GRPC explicit int content client
GRPC image client
Image client
Memory growth test
Reuse infer objects client
Simple GRPC AIO infer client
Simple GRPC AIO sequence stream
Simple GRPC async infer client
Simple GRPC cudashm client
Simple GRPC custom args client
Simple GRPC custom repeat
Simple GRPC health metadata
Simple GRPC infer client
Simple GRPC keepalive client
Simple GRPC model control
GRPC sequence stream infer client
GRPC sequence sync infer client
Simple GRPC shm client
Simple GRPC shm string client
Simple GRPC string infer client

HTTP Protocol

Simple HTTP aio infer client
Simple HTTP async infer client
Simple HTTP cudashm client
Simple HTTP health metadata
Simple HTTP infer client
Simple HTTP model control
Simple HTTP sequence sync infer client
Simple HTTP shm client
Simple HTTP shm string client
Simple HTTP string infer client

Model Analyzer

Install
CLI
Config
Config search
Ensemble quick start
Kubernetes Deploy
Launch mode
Metrics
Reports
Multi-model quick start
BLS model quick start
Checkpointing in Model Analyzer

Model Navigator

An inference toolkit designed for optimizing and deploying Deep Learning models with a focus on NVIDIA GPUs. The Triton Model Navigator streamlines the process of moving models and pipelines implemented in PyTorch, TensorFlow, and/or ONNX to TensorRT.

Optimize Torch linear model
Optimize and verify model
Python Profile function

Backends

A Triton backend is the implementation that executes a model. A backend can be a wrapper around a deep-learning framework, like PyTorch, TensorFlow, TensorRT or ONNX Runtime. Or a backend can be custom C/C++ logic performing any operation (for example, image pre-processing).

Python Backend

Business logic scripting
Add sub BLS decoupled
Custom metrics
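
As a rough illustration of what a Python backend model looks like, the sketch below shows a `model.py` that doubles its input. The tensor names `INPUT0` and `OUTPUT0` are hypothetical and would have to match the model's `config.pbtxt`; the `triton_python_backend_utils` module is only available inside the Triton server, so the import is guarded here.

```python
# Minimal sketch of a Python backend model (model.py).
# Assumption: tensor names INPUT0/OUTPUT0 are illustrative, not from this repository.
try:
    import triton_python_backend_utils as pb_utils  # provided by Triton at runtime
except ImportError:
    pb_utils = None  # allows the file to be imported outside Triton


class TritonPythonModel:
    def initialize(self, args):
        # args is a dict supplied by Triton; "model_name" is one of its keys.
        self.model_name = args["model_name"]

    def execute(self, requests):
        # Triton passes a list of requests; one response is returned per request.
        responses = []
        for request in requests:
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            doubled = in_tensor.as_numpy() * 2
            out_tensor = pb_utils.Tensor("OUTPUT0", doubled)
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[out_tensor])
            )
        return responses

    def finalize(self):
        # Called once when the model is unloaded; nothing to clean up here.
        pass
```

The class name `TritonPythonModel` and the three method names are what the Python backend looks for; everything inside `execute` is free-form Python.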

ONNX Runtime Backend

ONNX Runtime with TensorRT EP
ONNX Runtime with CUDA EP
ONNX Runtime with OpenVINO
Other optimization with ONNX

TensorRT Backend

Intro to Notebooks
Semantic segmentation
Deploy to Triton
Quantization tutorial
Torch-TensorRT with Triton

Dali Backend

Training to inference
Examples
Dali plugin
Efficient net
Inception ensemble
Perf Analyzer
ResNet50 TRT

Pytorch Backend

Docs

Paddle Paddle Backend

Quick start
Model configuration
Examples

Faster Transformer Backend

Faster Transformer
Faster Transformer backend

PyTriton

A Flask/FastAPI-like framework designed to streamline the use of NVIDIA's Triton Inference Server within Python environments. PyTriton enables serving Machine Learning models with ease, supporting direct deployment from Python.

Quick Start

Add sub notebook
Hugging face Resnet Pytorch

Quick start

Setup docker images

Pull the NVIDIA Triton server image

docker pull nvcr.io/nvidia/tritonserver:24.06-py3

Pull the NVIDIA Triton client (SDK) image for inference

docker pull nvcr.io/nvidia/tritonserver:24.06-py3-sdk

Triton server

Create and run container for Triton server

docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /home/dev/triton/model-repository:/models nvcr.io/nvidia/tritonserver:24.06-py3 tritonserver --model-repository=/models
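
The command above publishes port 8000 (HTTP), 8001 (gRPC), and 8002 (metrics). Before sending inference requests, it can help to confirm that the server has finished loading. The sketch below polls the HTTP readiness endpoint `/v2/health/ready`; the host and port defaults are assumptions that match the port mapping above, and the helper names are illustrative.

```python
import urllib.error
import urllib.request


def health_url(host="localhost", port=8000):
    """Build the HTTP readiness URL for a Triton server."""
    return f"http://{host}:{port}/v2/health/ready"


def is_server_ready(host="localhost", port=8000, timeout=2.0):
    """Return True if the readiness probe answers with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an error status: treat as not ready.
        return False


# Usage (requires a running server):
#   print("Triton ready:", is_server_ready())
```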

Using a docker-compose file is recommended:

services:
  triton-server:
    image: nvcr.io/nvidia/tritonserver:24.06-py3
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: tritonserver --model-repository=/models --model-control-mode=explicit --load-model=densenet_onnx
    ports:
      - "8000:8000"
      - "8001:8001"
      - "8002:8002"
    volumes:
      - ../model_repository:/models
    environment:
      - NVIDIA_VISIBLE_DEVICES=1

Then open a bash shell in the container, or attach a shell in VS Code, to follow the logs:

docker-compose ps
docker-compose exec triton-server bash
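
Because the compose file starts the server with `--model-control-mode=explicit` and loads only `densenet_onnx`, any other model in the repository has to be loaded on request. A minimal sketch of doing that over HTTP with Triton's model repository endpoints follows. The base URL assumes the port mapping from the compose file, and the helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: HTTP port published as above


def repository_request(path, payload=None, base_url=BASE_URL):
    """Build a POST request for a model repository endpoint."""
    body = json.dumps(payload or {}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v2/repository/{path}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_request():
    """List the models in the repository and their load state."""
    return repository_request("index")


def load_model_request(name):
    return repository_request(f"models/{name}/load")


def unload_model_request(name):
    return repository_request(f"models/{name}/unload")


def send(request, timeout=5.0):
    """Send a request and decode the JSON body, if the server returned one."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        text = resp.read().decode("utf-8")
        return json.loads(text) if text else None


# Usage (requires a running server):
#   print(send(index_request()))
#   send(load_model_request("densenet_onnx"))
```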

Triton client

Create and run container for Triton client:

docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:24.06-py3-sdk

Then, in the container's bash shell, run the prebuilt image_client binary:

/workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg
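
In the command above, `-s INCEPTION` selects the pixel scaling applied before inference and `-c 3` asks for the three highest-scoring classes. The sketch below illustrates both steps in plain Python, assuming INCEPTION scaling maps 8-bit pixel values to the range [-1, 1]; the function names are illustrative and not part of the client.

```python
import heapq


def inception_scale(pixel):
    """Map an 8-bit pixel value in [0, 255] to the range [-1, 1]."""
    return pixel / 127.5 - 1.0


def top_k(scores, k=3):
    """Return the k (class_index, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])


# Example with made-up scores for five classes.
scores = [0.1, 0.7, 0.05, 0.9, 0.3]
print(top_k(scores, k=3))  # [(3, 0.9), (1, 0.7), (4, 0.3)]
```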

However, we can also write our own client using the HTTP or gRPC protocol:

import numpy as np
import requests

# Define the server URL (HTTP inference endpoint of the densenet_onnx model)
url = "http://localhost:8000/v2/models/densenet_onnx/infer"

# Create input data (example: an array of zeros)
input_data = np.zeros((3, 224, 224), dtype=np.float32)

# Prepare the data in JSON format; "data_0" is the model's input tensor name
inputs = [
    {
        "name": "data_0",
        "shape": list(input_data.shape),
        "datatype": "FP32",
        "data": input_data.tolist()
    }
]

# Request the model's output tensor by name
outputs = [
    {
        "name": "fc6_1"
    }
]

request_payload = {
    "inputs": inputs,
    "outputs": outputs
}

# Send the request to the Triton server
response = requests.post(url, json=request_payload, timeout=30)

# Check the response status
if response.status_code == 200:
    response_json = response.json()
    print(response_json.keys())
    # Rebuild the output array from the flat data and the reported shape
    output = response_json["outputs"][0]
    output_data = np.array(output["data"]).reshape(output["shape"])
    print("Output Data: ", output_data)
else:
    print("Request failed with status code: ", response.status_code)
    print("Response: ", response.text)
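
The same request can be sent over gRPC with the `tritonclient` package that ships in the SDK image. The sketch below mirrors the HTTP example: it reuses the tensor names `data_0` and `fc6_1` from above and assumes the gRPC port 8001 is published on localhost. Imports are done inside the function so the file can be loaded on a machine without the client installed.

```python
MODEL_NAME = "densenet_onnx"
GRPC_URL = "localhost:8001"  # assumption: gRPC port published as in the server setup


def infer_grpc(url=GRPC_URL, model_name=MODEL_NAME):
    """Send one all-zeros image to the model over gRPC and return the output array."""
    # Lazy imports: both packages are available in the Triton SDK container.
    import numpy as np
    import tritonclient.grpc as grpcclient

    client = grpcclient.InferenceServerClient(url=url)

    # Describe the input tensor and attach the data.
    input_data = np.zeros((3, 224, 224), dtype=np.float32)
    infer_input = grpcclient.InferInput("data_0", list(input_data.shape), "FP32")
    infer_input.set_data_from_numpy(input_data)

    # Ask for the output tensor by name.
    requested_output = grpcclient.InferRequestedOutput("fc6_1")

    result = client.infer(
        model_name=model_name,
        inputs=[infer_input],
        outputs=[requested_output],
    )
    return result.as_numpy("fc6_1")


# Usage (requires a running server and the tritonclient package):
#   print(infer_grpc().shape)
```

Compared with the JSON request, the gRPC client sends the tensor as binary data, which avoids converting the array to a Python list.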

To run the client in a container, use this docker-compose.yml in the client directory:

services:
  triton-client:
    image: nvcr.io/nvidia/tritonserver:24.06-py3-sdk
    network_mode: host
    tty: true
    stdin_open: true
    restart: unless-stopped
    volumes:
      - ../:/workspace/inference/

Model analyzer

Create the output directory first to avoid an error:

mkdir -p output_model/output

Start the Triton server with the docker-compose file above, then run the SDK container on the host network so that it can reach the server:

docker run -it --gpus all -v /var/run/docker.sock:/var/run/docker.sock -v d/Documents/GitHub/MY-REPO/triton/model_analyzer:/workspace/model_analyzer --net=host nvcr.io/nvidia/tritonserver:24.06-py3-sdk

Now, inside the container, run Triton Model Analyzer:

model-analyzer profile \
    --model-repository /workspace/model_analyzer/ \
    --profile-models densenet_onnx --triton-launch-mode=remote \
    --output-model-repository-path /workspace/model_analyzer/model-output/output \
    --export-path /workspace/model_analyzer/profile_results \
    --override-output-model-repository

If you just want to test with a limited number of experiments, add these flags:

--run-config-search-max-concurrency 2
--run-config-search-max-model-batch-size 2
--run-config-search-max-instance-count 2
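
When the profile run is scripted, the limits above can be assembled programmatically. The following sketch builds the argument list for a reduced search; the function name and its parameters are illustrative, and the repository and export path flags from the full command would still need to be appended.

```python
import shlex


def build_profile_command(model, max_concurrency=2, max_batch_size=2, max_instances=2):
    """Return the model-analyzer argument list for a small configuration search."""
    return [
        "model-analyzer", "profile",
        "--profile-models", model,
        "--triton-launch-mode=remote",
        "--run-config-search-max-concurrency", str(max_concurrency),
        "--run-config-search-max-model-batch-size", str(max_batch_size),
        "--run-config-search-max-instance-count", str(max_instances),
    ]


command = build_profile_command("densenet_onnx")
# Print a shell-ready version; pass `command` to subprocess.run to execute it.
print(shlex.join(command))
```

Lower limits shrink the number of configurations Model Analyzer tries, which shortens the run at the cost of a less thorough search.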
