Images, audio, and video content in augmented reality (AR) applications must be generated within milliseconds. AR applications therefore generate digital content on-device, but quality is limited by device capabilities. Content created on a remote server with sufficient resources, by contrast, can be served in sub-second time. As on-device models grow richer, this trend pushes inference capabilities back to the cloud, within the single-digit-millisecond latency that cloud edge services such as CDNs and AWS Local Zones offer.
This example shows how AR app developers can decouple content quality from device hardware by hosting models such as Stable Diffusion by Stability AI on NVIDIA GPU or AWS Neuron-based AI accelerators as close to the user device as possible.
You compile and deploy Stable Diffusion 2.1 on EKS in a Local Zone to: 1/ reduce deploy time by caching the 20 GB model graph artifacts in the Local Zone, storing the compiled model in S3 and loading it with an init container before the endpoint starts; 2/ simplify a secured network path between the user device and the remote server with a Kubernetes NodePort service; and 3/ run the model on any compatible and available AI accelerator.
[build-time] This sample starts with a build pipeline that compiles the PyTorch code into optimized, lower-level, hardware-specific code to accelerate inference on GPU and Neuron-enabled instances. The model compiler uses Neuron (torch_neuronx) or GPU-specific features such as mixed-precision support, performance-optimized kernels, and minimized communication between the CPU and the AI accelerator. The output Docker images are stored in regional image registries (Amazon ECR), ready to deploy. We use Volcano, a Kubernetes-native batch scheduler, to improve inference pipeline orchestration.
/* The build phase compiles the model and stores it in S3. In Dockerfile-assets, models are pulled from S3 and stored as Docker image layers, i.e., the Neuron model is pulled for Inf2 images and the CUDA model for GPU images, using the same Dockerfile. Note that selecting the model with an if statement inside the RUN instruction would not cache the model layer; in our case that would be the line
RUN wget https://sdinfer.s3.us-west-2.amazonaws.com/sd2_compile_dir_512_${VAR}.tar.gz -O /model.tar.gz
# Target accelerator for this build: amd64-cuda or amd64-neuron
ARG ai_chip
FROM public.ecr.aws/docker/library/python:latest as base

# Stage holding the CUDA-compiled model artifacts
FROM base AS assets-amd64-cuda
ENV VAR=cuda

# Stage holding the Neuron (XLA)-compiled model artifacts
FROM base AS assets-amd64-neuron
ENV VAR=xla

# Select the stage that matches the requested accelerator and pull its model from S3
FROM assets-${ai_chip} AS final
RUN wget https://sdinfer.s3.us-west-2.amazonaws.com/sd2_compile_dir_512_${VAR}.tar.gz -O /model.tar.gz
*/
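For example, the assets image can be built per accelerator by passing the target stage name through the ai_chip build argument (the image tags below are illustrative):
docker build -f Dockerfile-assets --build-arg ai_chip=amd64-cuda -t sd2-assets:cuda .
docker build -f Dockerfile-assets --build-arg ai_chip=amd64-neuron -t sd2-assets:neuron .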
In the next stage, the SDK binaries are loaded into the relevant AWS Deep Learning Containers. Specifically, we used:
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.0.1-gpu-py310-cu118-ubuntu20.04-ec2 for G5 instances, and
763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference-neuronx:1.13.1-transformers4.34.1-neuronx-py310-sdk2.15.0-ubuntu20.04 for Inf2 instances.
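The deep learning containers registry must be authenticated before these base images can be pulled, e.g.:
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com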
[deploy-time] Next, EKS instantiates the Docker image on EC2 instances launched by Karpenter according to availability, performance, and cost policies. The inference endpoint is exposed as a NodePort-based Kubernetes service behind an EC2 security group. Each available endpoint is published to an inference-endpoint inventory that the user device pulls for ad-hoc inference.
[run-time] KEDA controls the Kubernetes deployment size based on AI accelerator usage at run-time. Karpenter terminates underutilized nodes to reclaim compute capacity.
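This assumes KEDA is already installed in the cluster; if it is not, one way to install it is via its Helm chart:
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace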
- Install cdk8s (Cloud Development Kit for Kubernetes)
npm install -g cdk8s-cli
- Use the Service Quotas console to request increases for the Amazon Elastic Compute Cloud (Amazon EC2) "Running On-Demand Inf instances" and "Running On-Demand G and VT instances" limits.
- Deploy the NVIDIA device plugin for Kubernetes
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
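You can verify the plugin is running; the DaemonSet name below is the one created by the manifest above:
kubectl get daemonset -n kube-system nvidia-device-plugin-daemonset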
The build process creates OCI images for x86-based instances. You add another build step to create OCI images for Graviton-based instances. This new build process creates an OCI image manifest list that references both OCI images, and the container runtime (Docker Engine or containerd) pulls the correct platform-specific image at deployment time. To automate the OCI image build process, we use AWS CodePipeline: it builds an OCI image from the code in AWS CodeBuild and pushes it to Amazon Elastic Container Registry (Amazon ECR).
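A minimal sketch of publishing such a manifest list manually, assuming the two per-architecture images were already pushed (the repository URI and tags are placeholders):
docker manifest create <account>.dkr.ecr.<region>.amazonaws.com/sd2:latest <account>.dkr.ecr.<region>.amazonaws.com/sd2:amd64 <account>.dkr.ecr.<region>.amazonaws.com/sd2:arm64
docker manifest push <account>.dkr.ecr.<region>.amazonaws.com/sd2:latest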
- Deploy Karpenter NodePools for Inf2 and G instances
kubectl apply -f amd-nvidia-provisioner.yaml
kubectl apply -f amd-neuorn-provisioner.yaml
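To confirm which Karpenter provisioning API your cluster exposes (Provisioner in older releases, NodePool in newer ones) and, later, to see which instance types were launched:
kubectl api-resources --api-group=karpenter.sh
kubectl get nodes -L node.kubernetes.io/instance-type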
The compiled model is stored in S3 between the compile step and the deploy step, where it is packaged into the Docker asset image, so the pods need access to S3 via a Kubernetes service account:
kubectl apply -f appsimulator_sa.yaml
Grant the service account access to S3 using EKS Pod Identities or IRSA (IAM Roles for Service Accounts); the IRSA option is shown here:
aws iam create-policy --policy-name allow-access-to-model-assets --policy-document file://allow-access-to-model-assets.json
eksctl create iamserviceaccount --name appsimulator --namespace default --cluster tlvsummit-demo --role-name appsimulator \
--attach-policy-arn arn:aws:iam::891377065549:policy/allow-access-to-model-assets --approve
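Once eksctl finishes, the service account should carry the eks.amazonaws.com/role-arn annotation pointing at the role above:
kubectl describe sa appsimulator -n default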
- Compile the model in a region (batch/v1 Job)
kubectl apply -f sd2-512-cuda-compile-job.yaml
kubectl apply -f sd2-512-xla-compile-job.yaml
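The compile jobs can take several minutes; you can follow them with the commands below (the job name is whatever the job manifests define):
kubectl get jobs --watch
kubectl logs -f job/<compile-job-name>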
- Deploy the model in a region (apps/v1 Deployment)
kubectl apply -f sd2-512-xla-serve-deploy.yaml
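Provisioning an Inf2 node and pulling the image can take a few minutes; watch the rollout with:
kubectl get pods --watch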
- Discover the inference endpoint
kubectl get svc
e.g.,
$ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.100.0.1 <none> 443/TCP 64d
stablediffusion-serve-inf-56dbffc68c-zcphj-svc-18-246-11-46 NodePort 10.100.228.62 <none> 7860:32697/TCP 2d20h
The endpoint is http://18.246.11.46:32697/.
- Observe the AI chip utilization, e.g., with neuron-top:
kubectl exec -it stablediffusion-serve-inf-56dbffc68c-zcphj -- neuron-top
Enter a prompt and enjoy the generated images. Note the processing time; we will need it for the Local Zone case.
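You can also confirm the endpoint is reachable from outside the cluster, e.g., by checking the HTTP status code it returns:
curl -s -o /dev/null -w "%{http_code}\n" http://18.246.11.46:32697/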
- Deploy the inference endpoint on NVIDIA G5 instances (G4dn is not supported by this Stable Diffusion deployment)
kubectl apply -f sd2-512-cuda-serve-deploy.yaml
Wait a few minutes for node provisioning and pod startup, then discover the new service
kubectl get svc
e.g.,
kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.100.0.1 <none> 443/TCP 66d
stablediffusion-serve-gpu-857c86776d-2wpb6-svc-35-90-0-175 NodePort 10.100.117.207 <none> 7860:31071/TCP 9m18s
stablediffusion-serve-inf-56dbffc68c-zcphj-svc-18-246-11-46 NodePort 10.100.228.62 <none> 7860:32697/TCP 4d17h
The relevant service is stablediffusion-serve-gpu-857c86776d-2wpb6-svc-35-90-0-175; the endpoint is http://35.90.0.175:31071.
Observe the NVIDIA GPU utilization while generating an image (watch does not allocate a TTY, so run exec without -it):
watch kubectl exec stablediffusion-serve-gpu-857c86776d-2wpb6 -- nvidia-smi
Fri Dec 1 16:50:41 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1B.0 Off | 0 |
| 0% 33C P0 222W / 300W | 3930MiB / 23028MiB | 99% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A10G On | 00000000:00:1C.0 Off | 0 |
| 0% 16C P8 18W / 300W | 7MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A10G On | 00000000:00:1D.0 Off | 0 |
| 0% 17C P8 15W / 300W | 7MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 16C P8 9W / 300W | 7MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
Note the utilization and memory usage of the first GPU.
- Deploy node pools in the Local Zone (TBD)