While going through vllm_llm.py, I noticed the quantization parameter is commented out with a TODO:
#quantization=self.quantization # TODO need to align with vllm API
Looked into why this was disabled: BaseLLM._parse_kwargs() never parses quantization from kwargs, so self.quantization would be undefined if VllmLLM._load() tried to use it.
Other examples like examples/PIPL/edge-cloud_collaborative_learning_bench already use quantization in their configs, and the original proposal also mentions it as a planned feature.
Checked the vLLM docs - the LLM constructor accepts a quantization argument directly, with values such as "bitsandbytes", "awq", and "gptq".
Fix:
Parse quantization in BaseLLM._parse_kwargs(), defaulting to None so the attribute is always defined
Conditionally pass it to vLLM in VllmLLM._load() only when it is set
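The two steps above could look roughly like this. This is a sketch against stripped-down stand-ins for BaseLLM and VllmLLM - the real classes parse many more kwargs and actually construct a vllm.LLM engine, so attribute and method bodies here are assumptions, not the actual implementation:

```python
class BaseLLM:
    def _parse_kwargs(self, **kwargs):
        # Step 1: parse quantization alongside the other kwargs.
        # Defaulting to None means self.quantization is always defined,
        # which is what currently breaks VllmLLM._load().
        self.quantization = kwargs.get("quantization", None)


class VllmLLM(BaseLLM):
    def _load(self, model_path):
        engine_kwargs = {"model": model_path}
        # Step 2: forward quantization only when it was configured,
        # so unquantized models keep today's behavior unchanged.
        if self.quantization is not None:
            engine_kwargs["quantization"] = self.quantization
        # In the real code these kwargs would go to vllm.LLM(**engine_kwargs);
        # returned here so the logic is visible in isolation.
        return engine_kwargs
```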
Happy to submit a PR for this.