PyTorch quantization and inference optimization #890
Hello! I am using a PyTorch model in my Kotlin application, but I need to make inference time faster. I noticed that there is a quantize method in the Model class, but it is not implemented. I also tried to load a dynamically quantized model into DJL, but could not see any improvement. So I have 2 questions:
Replies: 1 comment
To make predictions faster, you can take a look at our inference performance optimization document for some ideas.
DJL doesn't affect the weights of an imported PyTorch model. The model is loaded entirely inside the native C++ PyTorch engine (the same one underlying the PyTorch Python code), and DJL simply relies on it.
It may be possible to use static quantization, but I haven't looked into it too much. If you can quantize your model and then save it in the quantized format, executing the model through DJL may run it quantized.
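If it helps, here is a minimal sketch of what the DJL side of that workflow could look like. It assumes you have already quantized the model in Python (for example with torch.quantization.quantize_dynamic) and exported it as a TorchScript file (model.pt) into a local directory; the model path, input shape, and the NoopTranslator NDList-in/NDList-out setup are placeholders for illustration, not part of any specific model.

```kotlin
import ai.djl.ndarray.NDList
import ai.djl.ndarray.NDManager
import ai.djl.ndarray.types.Shape
import ai.djl.repository.zoo.Criteria
import ai.djl.translate.NoopTranslator
import java.nio.file.Paths

fun main() {
    // Hypothetical path to a TorchScript model that was quantized and saved in Python.
    val criteria = Criteria.builder()
        .setTypes(NDList::class.java, NDList::class.java)
        .optModelPath(Paths.get("build/quantized_model")) // directory containing model.pt
        .optEngine("PyTorch")                             // force the native PyTorch engine
        .optTranslator(NoopTranslator())                  // raw NDList in, raw NDList out
        .build()

    criteria.loadModel().use { model ->
        model.newPredictor().use { predictor ->
            NDManager.newBaseManager().use { manager ->
                // Replace with your real input shape; the native engine runs the
                // TorchScript graph exactly as it was saved from Python.
                val input = NDList(manager.ones(Shape(1, 3, 224, 224)))
                val output = predictor.predict(input)
                println(output)
            }
        }
    }
}
```

With this approach DJL never touches the weights, so any speedup depends on whether the quantized operators in your saved graph are supported by the libtorch build that DJL bundles for your platform.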