Downloading large models from Hugging Face can take a significant amount of time. If a PVC containing the model files is already pre-populated, then mounting this path and supplying that to vLLM can drastically shorten the engine's warm up time.
There are some requirements for the PV or StorageClass. In particular, select a storage class that provides
- a retention policy of
Retain, to ensure that the model downloaded remains on the PV despite no PVCs attached - the desired access: ReadWriteMany RWX (preferred) vs. ReadWriteOnce RWO
- this is to ensure that at least one pod can mount to the storage volume and download the model, which can be later read by another pod
- RWX will also support multinode and multiple replicas, while RWO won't
You should then ask your administrator for a PVC that is available in your cluster. If such PVC is not present, the following example PVC spec is provided.
You may use a RWO PVC, but after the pod downloads the model, you must delete the pod so that the vllm pods can claim the PVC. Also, a RWO PVC may not work for multinode examples and for models deployed with multiple replicas.
You can apply the following PVC manifest, which is the bare minimal spec.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: model-pvc
spec:
accessModes:
- ReadWriteMany # <--- Change to ReadWriteOnce if your StorageClass only provides RWO
resources:
requests:
storage: 5Gi # <--- make sure your PV has enough storage for this claim
volumeName: model-pv # <--- change this to reflect the name of the PV
storageClass: standard # <--- change this to reflect the storage class of your PValias k=kubectl
k apply -f examples/pvc/pvc.yaml
Assuming that a RWX PVC is now available, which allows many pods across nodes to read-write to the volume, create a pod using the following spec which downloads your desire model onto the PVC through an InitContainer. We will need to fetch the path of the model later by exec into the running container. Check out this download-model.yaml for such manifest. You may need to edit the Python script which downloads the model with the name of your desired model. Also, some models require a token, which you must supply. Modify the Python script for your usecase.
k apply -f examples/pvc/download-model.yaml
Wait for the pod to be in the Running state.
NAME READY STATUS RESTARTS AGE
model-downloader 1/1 Running 0 106s
Then, exec into the pod and cd into the cache dir path containing the model. The Python script downloads the model into the /models directory. Confirm that a config.json is present. models is the path you should use in the ModelService uri.
$ k exec -it model-downloader -- /bin/bash
$ cd models && ls
LICENSE.md config.json generation_config.json pytorch_model.bin tf_model.h5 vocab.json
README.md flax_model.msgpack merges.txt special_tokens_map.json tokenizer_config.json
If using RWO: delete this pod so the pods created in the next step can claim this PVC. If you have a RWX PVC, then you do not need to delete the
model-downloaderpod.
Examine this values file for an example of how to use a PVC. The only change is the URI, which should follow the following format.
modelArtifacts:
uri: pvc://pvc-name/path/to/modelYou can install the ModelService quickly using this command:
helm install pvc-example llm-d-modelservice/llm-d-modelservice \
-f https://raw.githubusercontent.com/llm-d-incubation/llm-d-modelservice/refs/heads/main/examples/values-pd.yaml \
--set modelArtifacts.uri="pvc://pvc-name/path/to/model"
Examine output-pvc.yaml to view the Kubernetes resources that will be applied upon the above command.
Note that the path after the <pvc-name> is the path on the PVC which the downloaded files can be found. If you don't know the path, create a debug pod (see an example manifest here) and exec (k exec -it pvc-debugger -- bin/bash) into it to find out. The path should not contain the mountPath of that debug pod. For example, if inside the pod, the path is which model files can be found is /mnt/huggingface/cache/models/, then use just huggingface/cache/models/ as the <path/to/model> because /mnt is specific to the mountPath of that debug pod.
Make sure that for the container of your interst in prefill.containers or decode.containers, there's a field called mountModelVolume: true (see example) for the volume mounts to be created correctly.
- A PVC volume with the name
model-storageis created for the deployment (read-only by default) - A volumeMount with the mountPath:
model-cacheis created for each container wheremountModelVolume: true(read-only by default; setmodelArtifacts.readOnly: falseto allow writes) --modelarg for that container is set tomodel-cache/<path/to/model>wheremountModelVolume: true
mountModelVolume: true. ModelService will automatically populate the pod specification and mount the model files.
However, if you want to add your own volume specifications, you may do so under decode.volumes. If you would like to add more volumeMounts to a container, regardless whether if mountModelVolume is true, you may do so under decode.containers.
💡 You may optionally set the --served-model-name in your container to be used for the OpenAI request, otherwise the request name must be a long string like "model": "model-cache/<path/to/model>". Note that this argument is added automatically using the option modelCommand: vllmServe or imageDefault, using routing.modelName as the value to the --served-model-name argument.
For security purposes, the default is a read-only volume mount to the pods to prevent a pod from deleting the model files in case another model service installation uses the same PVC. If you would like to write to the PVC, you should not do so through ModelService, but rather through your own pod like the download-model/pvc-debugger without the read-only restriction. Alternatively, set
modelArtifacts.readOnly: falseif write access is required (for example, Hugging Face cache lock files withpvc+hf://).
The above steps work for PVCs regardless of model format. If you know that the PVC contains models that are specifically downloaded from Hugging Face, and would like vLLM to support HF model formats specifically, you should use the pvc+hf:// prefix so ModelService will set the appropriate HF_* environment variables.
Set the URI like the following
modelArtifacts:
uri: pvc+hf://pvc-name/path/to/hf_hub_cache/namespace/modelIDYou can install the ModelService quickly using this command:
helm install pvc-hf-example llm-d-modelservice/llm-d-modelservice \
-f https://raw.githubusercontent.com/llm-d-incubation/llm-d-modelservice/refs/heads/main/examples/values-pd.yaml \
--set modelArtifacts.uri="pvc+hf://pvc-name/path/to/hf_hub_cache/facebook/opt-125m"
Make sure that for the container of your interst in prefill.containers or decode.containers, there's a field called mountModelVolume: true (see example) for the volume mounts to be created correctly.
- A PVC volume with the name
model-storageis created for the deployment (read-only by default) - A volumeMount with the mountPath:
model-cacheis created for each container wheremountModelVolume: true(read-only by default; setmodelArtifacts.readOnly: falseto allow writes) HF_HUB_CACHEenvironment variable for that container is set tomodel-cache/path/to/hf_hub_cachewheremountModelVolume: true--modelarugment is set tofacebook/opt-125m
Note: Hugging Face Hub needs to write
.lockfiles and cache metadata. SetmodelArtifacts.readOnly: falseto allow writes:modelArtifacts: uri: pvc+hf://pvc-name/path/to/hf_hub_cache/namespace/modelID readOnly: false