Images, audio, and video content in augmented reality (AR) applications must be generated within milliseconds. AR applications therefore generate digital content on-device, but quality is limited by device capabilities. Content created on a remote server with sufficient resources, by contrast, can be served in sub-second time. As on-device models grow richer, this trend pushes inference capabilities back to the cloud, within the single-digit-millisecond latency that cloud edge services such as CDNs and AWS Local Zones offer.
This example shows how AR app developers can decouple content quality from device hardware by hosting models such as Stable Diffusion by Stability AI on NVIDIA GPU or AWS Neuron-based AI accelerators as close to the user device as possible.
You compile and deploy Stable Diffusion 2.1 on EKS in a Local Zone to: 1/ reduce deploy time by caching the 20 GB model graph artifacts in the Local Zone, storing the compiled model in S3 and loading it with an init container before the endpoint starts; 2/ simplify a secured network path between the user device and the remote server with a Kubernetes NodePort service; and 3/ run the model on any compatible and available AI accelerator.
[build-time] This sample starts with a build pipeline that compiles the PyTorch code into optimized, lower-level, hardware-specific code to accelerate inference on GPU and Neuron-enabled instances. The model compiler uses Neuron (torch_neuronx) or GPU-specific features such as mixed-precision support, performance-optimized kernels, and minimized communication between the CPU and the AI accelerator. The output Docker images are stored in regional image registries (Amazon ECR), ready to deploy. We use Volcano, a Kubernetes-native batch scheduler, to improve inference pipeline orchestration.
/* The build phase compiles the model and stores it in S3. In Dockerfile-assets, models are pulled from S3 and stored as Docker image layers, i.e., the Neuron model is pulled for Inf2 images and the CUDA model for GPU images, using the same Dockerfile. Note that selecting the model with an if statement inside the RUN instruction would not cache the model layer; in our case that would be the line
RUN wget https://sdinfer.s3.us-west-2.amazonaws.com/sd2_compile_dir_512_${VAR}.tar.gz -O /model.tar.gz
# Target accelerator for this build: amd64-cuda or amd64-neuron
ARG ai_chip
FROM public.ecr.aws/docker/library/python:latest as base

# Stage holding the CUDA-compiled model artifacts
FROM base AS assets-amd64-cuda
ENV VAR=cuda

# Stage holding the Neuron (XLA)-compiled model artifacts
FROM base AS assets-amd64-neuron
ENV VAR=xla

# Select the stage that matches the requested accelerator and pull its model from S3
FROM assets-${ai_chip} AS final
RUN wget https://sdinfer.s3.us-west-2.amazonaws.com/sd2_compile_dir_512_${VAR}.tar.gz -O /model.tar.gz
*/
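For example, the assets image can be built per accelerator by passing the target stage name through the ai_chip build argument (the image tags below are illustrative):
docker build -f Dockerfile-assets --build-arg ai_chip=amd64-cuda -t sd2-assets:cuda .
docker build -f Dockerfile-assets --build-arg ai_chip=amd64-neuron -t sd2-assets:neuron .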
In the next stage, the SDK binaries are loaded into the relevant AWS Deep Learning Containers. Specifically, we used:
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.0.1-gpu-py310-cu118-ubuntu20.04-ec2 for G5 instances, and
763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference-neuronx:1.13.1-transformers4.34.1-neuronx-py310-sdk2.15.0-ubuntu20.04 for Inf2 instances.
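The deep learning containers registry must be authenticated before these base images can be pulled, e.g.:
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com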
[deploy-time] Next, EKS instantiates the Docker image on EC2 instances launched by Karpenter according to availability, performance, and cost policies. The inference endpoint is exposed as a NodePort-based Kubernetes service behind an EC2 security group. Each available endpoint is published to an inference-endpoint inventory that the user device pulls for ad-hoc inference.
[run-time] KEDA controls the Kubernetes deployment size based on AI accelerator usage at run-time. Karpenter terminates underutilized nodes to reclaim compute capacity.
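This assumes KEDA is already installed in the cluster; if it is not, one way to install it is via its Helm chart:
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace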
- Install cdk8s (Cloud Development Kit for Kubernetes)
npm install -g cdk8s-cli
- Use the Service Quotas console to request increases for the Amazon Elastic Compute Cloud (Amazon EC2) "Running On-Demand Inf instances" and "Running On-Demand G and VT instances" limits.
- Deploy the NVIDIA device plugin for Kubernetes
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
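You can verify the plugin is running; the DaemonSet name below is the one created by the manifest above:
kubectl get daemonset -n kube-system nvidia-device-plugin-daemonset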
The build process creates OCI images for x86-based instances. You add another build step to create OCI images for Graviton-based instances. This new build process creates an OCI image manifest list that references both OCI images, and the container runtime (Docker Engine or containerd) pulls the correct platform-specific image at deployment time. To automate the OCI image build process, we use AWS CodePipeline: it builds an OCI image from the code in AWS CodeBuild and pushes it to Amazon Elastic Container Registry (Amazon ECR).
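A minimal sketch of publishing such a manifest list manually, assuming the two per-architecture images were already pushed (the repository URI and tags are placeholders):
docker manifest create <account>.dkr.ecr.<region>.amazonaws.com/sd2:latest <account>.dkr.ecr.<region>.amazonaws.com/sd2:amd64 <account>.dkr.ecr.<region>.amazonaws.com/sd2:arm64
docker manifest push <account>.dkr.ecr.<region>.amazonaws.com/sd2:latest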
- Deploy Karpenter NodePools for Inf2 and G instances
kubectl apply -f amd-nvidia-provisioner.yaml
kubectl apply -f amd-neuorn-provisioner.yaml
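To confirm which Karpenter provisioning API your cluster exposes (Provisioner in older releases, NodePool in newer ones) and, later, to see which instance types were launched:
kubectl api-resources --api-group=karpenter.sh
kubectl get nodes -L node.kubernetes.io/instance-type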
The compiled model is stored in S3 between the compile step and the deploy step, where it is packaged into the Docker asset image, so the pods need access to S3 via a Kubernetes service account:
kubectl apply -f appsimulator_sa.yaml
Grant the service account access to S3 using EKS Pod Identities or IRSA (IAM Roles for Service Accounts); the IRSA option is shown here:
aws iam create-policy --policy-name allow-access-to-model-assets --policy-document file://allow-access-to-model-assets.json
eksctl create iamserviceaccount --name appsimulator --namespace default --cluster tlvsummit-demo --role-name appsimulator \
--attach-policy-arn arn:aws:iam::891377065549:policy/allow-access-to-model-assets --approve
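Once eksctl finishes, the service account should carry the eks.amazonaws.com/role-arn annotation pointing at the role above:
kubectl describe sa appsimulator -n default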
- Compile the model in a region (batch/v1 Job)
kubectl apply -f sd2-512-cuda-compile-job.yaml
kubectl apply -f sd2-512-xla-compile-job.yaml
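The compile jobs can take several minutes; you can follow them with the commands below (the job name is whatever the job manifests define):
kubectl get jobs --watch
kubectl logs -f job/<compile-job-name>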
- Deploy the model in a region (apps/v1 Deployment)
kubectl apply -f sd2-512-xla-serve-deploy.yaml
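Provisioning an Inf2 node and pulling the image can take a few minutes; watch the rollout with:
kubectl get pods --watch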
- Discover the inference endpoint
kubectl get svc
e.g.,
$ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.100.0.1 <none> 443/TCP 64d
stablediffusion-serve-inf-56dbffc68c-zcphj-svc-18-246-11-46 NodePort 10.100.228.62 <none> 7860:32697/TCP 2d20h
The endpoint is http://18.246.11.46:32697/.
- Observe the AI chip utilization, e.g., with neuron-top:
kubectl exec -it stablediffusion-serve-inf-56dbffc68c-zcphj -- neuron-top
Enter a prompt and enjoy the generated images. Note the processing time; we will need it for the Local Zone case.
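You can also confirm the endpoint is reachable from outside the cluster, e.g., by checking the HTTP status code it returns:
curl -s -o /dev/null -w "%{http_code}\n" http://18.246.11.46:32697/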
- Deploy the inference endpoint on NVIDIA G5 instances (G4dn is not supported by this Stable Diffusion deployment)
kubectl apply -f sd2-512-cuda-serve-deploy.yaml
Wait a few minutes for node provisioning and pod startup, then discover the new service
kubectl get svc
e.g.,
kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.100.0.1 <none> 443/TCP 66d
stablediffusion-serve-gpu-857c86776d-2wpb6-svc-35-90-0-175 NodePort 10.100.117.207 <none> 7860:31071/TCP 9m18s
stablediffusion-serve-inf-56dbffc68c-zcphj-svc-18-246-11-46 NodePort 10.100.228.62 <none> 7860:32697/TCP 4d17h
The relevant service is stablediffusion-serve-gpu-857c86776d-2wpb6-svc-35-90-0-175; the endpoint is http://35.90.0.175:31071.
Observe the NVIDIA GPU utilization while generating an image (watch does not allocate a TTY, so run exec without -it):
watch kubectl exec stablediffusion-serve-gpu-857c86776d-2wpb6 -- nvidia-smi
Fri Dec 1 16:50:41 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1B.0 Off | 0 |
| 0% 33C P0 222W / 300W | 3930MiB / 23028MiB | 99% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A10G On | 00000000:00:1C.0 Off | 0 |
| 0% 16C P8 18W / 300W | 7MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A10G On | 00000000:00:1D.0 Off | 0 |
| 0% 17C P8 15W / 300W | 7MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 16C P8 9W / 300W | 7MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
Note the utilization and memory usage of the first GPU.
- Deploy node pools in the Local Zone (TBD)