This project is no longer actively maintained. While existing releases remain available, there are no planned updates, bug fixes, new features, or security patches. Users should be aware that vulnerabilities may not be addressed.
This document demonstrates running the FasterTransformer HuggingFace BERT example with TorchServe in a Kubernetes setup.

For details on the example itself, refer to FasterTransformer_HuggingFace_Bert.
Once the cluster and the PVCs are ready, we can generate the MAR file.

Follow the steps from here to generate the MAR file, then copy it out of the container:

```bash
docker cp <container-id>:/workspace/serve/examples/FasterTransformer_HuggingFace_Bert/BERTSeqClassification.mar ./BERTSeqClassification.mar
```
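A MAR file is a standard zip archive, so you can optionally sanity-check its contents before uploading it to the cluster (a quick check, assuming `unzip` is installed locally):

```bash
# List the archive contents; the serialized model, handler, and
# MANIFEST.json should all appear in the listing.
unzip -l BERTSeqClassification.mar
```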
Create a `config.properties` file with the following contents:

```properties
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
NUM_WORKERS=1
number_of_gpu=1
install_py_dep_per_model=true
number_of_netty_threads=32
job_queue_size=1000
model_store=/home/model-server/shared/model-store
model_snapshot={"name":"startup.cfg","modelCount":1,"models":{"bert":{"1.0":{"defaultVersion":true,"marName":"BERTSeqClassification.mar","minWorkers":2,"maxWorkers":3,"batchSize":1,"maxBatchDelay":100,"responseTimeout":120}}}}
```
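TorchServe restores this snapshot at startup, so a malformed `model_snapshot` value will keep the `bert` model from loading. A minimal sanity check, assuming `python` is on your PATH:

```bash
# Extract the model_snapshot value and verify that it parses as JSON.
grep '^model_snapshot=' config.properties | cut -d'=' -f2- | python -m json.tool
```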
Copy the MAR file and `config.properties` onto the persistent volume through the model-store pod:

```bash
kubectl exec --tty pod/model-store-pod -- mkdir /pv/model-store/
kubectl cp BERTSeqClassification.mar model-store-pod:/pv/model-store/BERTSeqClassification.mar
kubectl exec --tty pod/model-store-pod -- mkdir /pv/config/
kubectl cp config.properties model-store-pod:/pv/config/config.properties
```
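Optionally, confirm that both files landed in the expected locations (assuming the pod image provides `ls`):

```bash
# Recursively list the persistent volume; BERTSeqClassification.mar and
# config.properties should appear under model-store/ and config/ respectively.
kubectl exec --tty pod/model-store-pod -- ls -R /pv/
```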
- Clone the TorchServe repo

```bash
git clone https://github.com/pytorch/serve.git
cd serve/docker
```
- Modify the Python and pip paths in the `Dockerfile` as below

```bash
sed -i 's#/usr/bin/python3#/opt/conda/bin/python3#g' Dockerfile
sed -i 's#/usr/local/bin/pip3#/opt/conda/bin/pip3#g' Dockerfile
```
- Change the GPU check in the `Dockerfile` for the nvcr.io base image

```bash
sed -i 's#grep -q "cuda:"#grep -q "nvidia:"#g' Dockerfile
```
- Add `transformers==2.5.1` to the `Dockerfile`'s pip install step

```bash
sed -i 's#pip install --no-cache-dir captum torchtext torchserve torch-model-archiver#& transformers==2.5.1#g' Dockerfile
```
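Before building, you can verify that all three `sed` edits took effect (each `grep` should print at least one matching line):

```bash
grep -n '/opt/conda/bin' Dockerfile        # Python/pip path changes
grep -n 'nvidia:' Dockerfile               # GPU check change
grep -n 'transformers==2.5.1' Dockerfile   # added pip dependency
```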
- Build the image

```bash
DOCKER_BUILDKIT=1 docker build -f Dockerfile -t <image-name> --build-arg BASE_IMAGE=nvcr.io/nvidia/pytorch:20.12-py3 --build-arg CUDA_VERSION=cu102 .
```
- Push the image

```bash
docker push <image-name>
```
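If you are pushing to a registry other than Docker Hub, tag the image with the registry prefix first. The registry host and tag below are placeholders, not values from this guide:

```bash
# Substitute your own registry host and tag.
docker tag <image-name> <registry-host>/<image-name>:latest
docker push <registry-host>/<image-name>:latest
```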
- Navigate to the Kubernetes TorchServe Helm chart folder

```bash
cd ../kubernetes/Helm
```
- Modify `values.yaml` with the image and memory requirements

```yaml
torchserve_image: <image built in previous step>

namespace: torchserve

torchserve:
  management_port: 8081
  inference_port: 8080
  metrics_port: 8082
  pvd_mount: /home/model-server/shared/
  n_gpu: 1
  n_cpu: 4
  memory_limit: 32Gi
  memory_request: 32Gi

deployment:
  replicas: 1

persitant_volume:
  size: 1Gi
```
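Before installing, you can validate that the chart renders cleanly with the edited values (standard Helm commands; the release name `torchserve` matches the install step below):

```bash
# Lint the chart, then render its manifests locally without installing.
helm lint .
helm template torchserve .
```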
- Install TorchServe

```bash
helm install torchserve .
```
- Check the TorchServe installation

```bash
kubectl get pods -n default
kubectl logs <pod-name> -n default
```
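The pod may take several minutes to pull the image and load the model. To block until it reports ready (assuming the default namespace, as above):

```bash
# Wait up to 10 minutes for the TorchServe pod to become Ready.
kubectl wait --for=condition=Ready pod/<pod-name> -n default --timeout=600s
```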
- Start a shell session into the TorchServe pod

```bash
kubectl exec -it <pod-name> -- bash
```
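From inside the pod, TorchServe's inference API exposes a `ping` health endpoint that confirms the server is up:

```bash
# Returns {"status": "Healthy"} once the server is ready to serve requests.
curl http://localhost:8080/ping
```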
- Create the input file `sample_text_captum_input.txt`

```json
{
  "text": "Bloomberg has decided to publish a new report on the global economy.",
  "target": 1
}
```
- Run inference

```bash
curl -X POST http://127.0.0.1:8080/predictions/bert -T ../Huggingface_Transformers/Seq_classification_artifacts/sample_text_captum_input.txt
```
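To inspect the deployed model and its workers, query the management API on port 8081 (as configured in `config.properties`):

```bash
# List registered models, then show worker details for "bert".
curl http://localhost:8081/models
curl http://localhost:8081/models/bert
```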