Currently, in the GenAI examples, whenever tensorrt-llm or hugging-face-local models are deployed, the service has to download the model before it can start. To speed up deployment, we should save the downloaded model files as artifacts and load them from disk whenever the model is served:
- For hugging-face-local, the model is saved in the HF cache folder configured in utils.py. The path to the model folder should be logged as an artifact, and load should take this artifact as its parameter instead of the repository name (see the first sketch after this list).
- For tensorrt-llm, the model engine should be built, exported, and saved as an artifact, so that serving loads the prebuilt engine instead of compiling it on startup (see the second sketch after this list).
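
A minimal sketch of the hugging-face-local change, assuming an MLflow-style pyfunc wrapper (the issue doesn't name the serving framework, and the repo id, artifact key `model_dir`, and class name are placeholders, not the examples' actual code). The idea is to download the snapshot once at logging time, log the local folder as an artifact, and have `load_context` read from that folder instead of the repository name:

```python
# Sketch: log the downloaded HF snapshot as an artifact and load it from
# disk at serving time. Repo id and artifact key are placeholders.
import mlflow
from huggingface_hub import snapshot_download

REPO_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder repository

class LocalHFModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        from transformers import AutoModelForCausalLM, AutoTokenizer
        # Load from the logged artifact directory instead of the repo
        # name, so serving never has to hit the Hugging Face Hub.
        model_dir = context.artifacts["model_dir"]
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        self.model = AutoModelForCausalLM.from_pretrained(model_dir)

    def predict(self, context, model_input):
        prompts = model_input["prompt"].tolist()
        inputs = self.tokenizer(prompts, return_tensors="pt", padding=True)
        outputs = self.model.generate(**inputs, max_new_tokens=64)
        return self.tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Download once (this populates the HF cache configured in utils.py)
# and log the resulting snapshot path as the model artifact.
local_dir = snapshot_download(repo_id=REPO_ID)
with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=LocalHFModel(),
        artifacts={"model_dir": local_dir},
    )
```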
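
And a sketch of the tensorrt-llm change under the same assumptions, using the TensorRT-LLM LLM API to build and save the engine once at logging time (the engine directory, repo id, and artifact key `engine_dir` are placeholders). Because `LLM` can be pointed at a saved engine directory, the serving path skips engine compilation entirely:

```python
# Sketch: build and export the TensorRT-LLM engine once, log the engine
# directory as an artifact, and load the prebuilt engine when serving.
import mlflow
from tensorrt_llm import LLM

ENGINE_DIR = "trtllm_engine"  # placeholder output directory

# Build the engine from the HF checkpoint (slow, done once at logging
# time) and save it so serving can skip the build step.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
llm.save(ENGINE_DIR)

class TRTLLMModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        from tensorrt_llm import LLM
        # Pointing LLM at the saved engine directory loads the prebuilt
        # engine instead of rebuilding it on startup.
        self.llm = LLM(model=context.artifacts["engine_dir"])

    def predict(self, context, model_input):
        prompts = model_input["prompt"].tolist()
        return [out.outputs[0].text for out in self.llm.generate(prompts)]

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=TRTLLMModel(),
        artifacts={"engine_dir": ENGINE_DIR},
    )
```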