This document demonstrates how to run the FasterTransformer HuggingFace BERT example with TorchServe in a Kubernetes setup.
Refer: FasterTransformer_HuggingFace_Bert
Once the cluster and the PVCs are ready, we can generate the MAR file.
Follow the steps in the FasterTransformer_HuggingFace_Bert example to generate the MAR file, then copy it out of the container:
docker cp <container-id>:/workspace/serve/examples/FasterTransformer_HuggingFace_Bert/BERTSeqClassification.mar ./BERTSeqClassification.mar
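A .mar file is a standard zip archive, so it can be sanity-checked locally before uploading; this assumes unzip is available on the machine the file was copied to:
# List the contents of the model archive to confirm the model and handler files are present
unzip -l BERTSeqClassification.mar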
Create a config.properties file with the following settings:
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
NUM_WORKERS=1
number_of_gpu=1
install_py_dep_per_model=true
number_of_netty_threads=32
job_queue_size=1000
model_store=/home/model-server/shared/model-store
model_snapshot={"name":"startup.cfg","modelCount":1,"models":{"bert":{"1.0":{"defaultVersion":true,"marName":"BERTSeqClassification.mar","minWorkers":2,"maxWorkers":3,"batchSize":1,"maxBatchDelay":100,"responseTimeout":120}}}}
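The model_snapshot entry makes TorchServe register the bert model automatically at startup. For reference, roughly the same registration could be done later through the management API; the query parameters below mirror the snapshot fields (initial workers, batch size, batch delay, response timeout), and the call assumes TorchServe is already running with the MAR file in its model store:
# Register the model through the management API (port 8081); mirrors the model_snapshot settings
curl -X POST "http://127.0.0.1:8081/models?url=BERTSeqClassification.mar&initial_workers=2&batch_size=1&max_batch_delay=100&response_timeout=120"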
Copy the MAR file and config.properties to the persistent volume:
kubectl exec --tty pod/model-store-pod -- mkdir /pv/model-store/
kubectl cp BERTSeqClassification.mar model-store-pod:/pv/model-store/BERTSeqClassification.mar
kubectl exec --tty pod/model-store-pod -- mkdir /pv/config/
kubectl cp config.properties model-store-pod:/pv/config/config.properties
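To confirm that both files landed on the persistent volume, list the directories from the model-store pod:
# Verify the MAR file and config.properties were copied to the PV
kubectl exec --tty pod/model-store-pod -- ls -lR /pv/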
- Clone the TorchServe repo
git clone https://github.com/pytorch/serve.git
cd serve/docker
- Modify the Python and pip paths in the Dockerfile as below
sed -i 's#/usr/bin/python3#/opt/conda/bin/python3#g' Dockerfile
sed -i 's#/usr/local/bin/pip3#/opt/conda/bin/pip3#g' Dockerfile
- Change the GPU check in the Dockerfile for the nvcr.io image
sed -i 's#grep -q "cuda:"#grep -q "nvidia:"#g' Dockerfile
- Add transformers==2.5.1 to the Dockerfile
sed -i 's#pip install --no-cache-dir captum torchtext torchserve torch-model-archiver#& transformers==2.5.1#g' Dockerfile
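The three sed edits above can be spot-checked before building; if any of these greps prints nothing, the corresponding substitution did not apply:
# Confirm the Dockerfile now uses the conda python/pip, checks for "nvidia:", and installs transformers==2.5.1
grep -n "/opt/conda/bin/python3" Dockerfile
grep -n "/opt/conda/bin/pip3" Dockerfile
grep -n "nvidia:" Dockerfile
grep -n "transformers==2.5.1" Dockerfile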
- Build the image
DOCKER_BUILDKIT=1 docker build -f Dockerfile -t <image-name> --build-arg BASE_IMAGE=nvcr.io/nvidia/pytorch:20.12-py3 --build-arg CUDA_VERSION=cu102 .
- Push image
docker push <image-name>
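Optionally, the image can be smoke-tested locally before pushing; this assumes the build host has the NVIDIA container toolkit and at least one GPU, and simply checks that the container starts and can see the GPU:
# Run nvidia-smi inside the freshly built image as a quick GPU visibility check
docker run --rm --gpus all <image-name> nvidia-smi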
- Navigate to the TorchServe Helm chart folder
cd ../kubernetes/Helm
- Modify values.yaml with the image and memory settings
torchserve_image: <image built in previous step>
namespace: torchserve
torchserve:
  management_port: 8081
  inference_port: 8080
  metrics_port: 8082
  pvd_mount: /home/model-server/shared/
  n_gpu: 1
  n_cpu: 4
  memory_limit: 32Gi
  memory_request: 32Gi
deployment:
  replicas: 1
persitant_volume:
  size: 1Gi
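Before installing, the chart can be validated and the rendered manifests inspected to confirm the image and resource values were picked up; helm lint and helm template run entirely locally:
# Validate the chart and render the manifests to check image and memory values
helm lint .
helm template . | grep -E "image:|memory:"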
- Install TS
helm install torchserve .
- Check TS installation
kubectl get pods -n default
kubectl logs <pod-name> -n default
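Instead of opening a shell inside the pod as in the next step, you can also port-forward the inference and management ports to your workstation; <pod-name> below is the pod reported by kubectl get pods above:
# Forward TorchServe's inference (8080) and management (8081) ports to localhost
kubectl port-forward pod/<pod-name> 8080:8080 8081:8081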
- Start a shell session into the TS pod
kubectl exec -it <pod-name> -- bash
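From inside the pod (or through the forwarded ports), the management API on port 8081 can confirm that the bert model is registered and its workers are up before sending traffic:
# List registered models and inspect the bert model's workers
curl http://127.0.0.1:8081/models
curl http://127.0.0.1:8081/models/bert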
- Create the input file sample_text_captum_input.txt
{
"text": "Bloomberg has decided to publish a new report on the global economy.",
"target": 1
}
- Run inference
curl -X POST http://127.0.0.1:8080/predictions/bert -T ../Huggingface_Transformers/Seq_classification_artifacts/sample_text_captum_input.txt
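After a few requests, the metrics endpoint on port 8082 can be queried to confirm that request counts and latencies are being recorded; the output is in Prometheus text format:
# Check TorchServe metrics after running some inferences
curl http://127.0.0.1:8082/metrics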