feat: enable quantization support for vLLM backend #374
yashasviyadav30 wants to merge 1 commit into kubeedge:main
Conversation
Signed-off-by: YASHASVIYADAV30 <yashasviydv30@gmail.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: yashasviyadav30. Needs approval from an approver in each of these files.
Summary of Changes: Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces support for model quantization within the vLLM backend, addressing a previously commented-out feature.
Activity
Code Review
This pull request successfully enables quantization support for the vLLM backend. The changes are well-implemented, parsing the quantization parameter in the base class and conditionally applying it when initializing the vLLM model. This is a great feature addition. My review includes a few suggestions to improve flexibility and security by making some previously hardcoded parameters in the vLLM initialization configurable. These are not regressions introduced by this PR but rather opportunities for improvement in the code touched by your refactoring.
    llm_kwargs = {
        "model": model,
        "trust_remote_code": True,
        "dtype": "float16",
The dtype is hardcoded to "float16". While this is a common default, some models perform better with "bfloat16" (if supported by the hardware), and certain quantization methods might have specific dtype requirements. To improve flexibility, consider making this a configurable parameter, which could be parsed in BaseLLM with a default of "auto" or "float16".
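One possible shape for that suggestion, sketched with assumed names (BaseLLM._parse_kwargs and the llm_kwargs dict come from this PR; the dtype plumbing itself is hypothetical, not code from the repository):

```python
# Sketch: make dtype configurable instead of hardcoding "float16".
# "auto" lets vLLM pick a dtype based on the model config and hardware.

class BaseLLM:
    def _parse_kwargs(self, **kwargs):
        # hypothetical addition alongside the existing parsed parameters
        self.dtype = kwargs.get("dtype", "auto")


def build_llm_kwargs(base, model):
    """Assemble the keyword arguments later passed to vLLM's LLM()."""
    return {
        "model": model,
        "trust_remote_code": True,
        "dtype": base.dtype,  # was: "dtype": "float16"
    }


base = BaseLLM()
base._parse_kwargs()                         # no dtype in the user config
print(build_llm_kwargs(base, "m")["dtype"])  # auto

base._parse_kwargs(dtype="bfloat16")         # user opts into bfloat16
print(build_llm_kwargs(base, "m")["dtype"])  # bfloat16
```

Defaulting to "auto" keeps quantization methods with specific dtype requirements working, since vLLM can then resolve the dtype itself.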
        "dtype": "float16",
        "tensor_parallel_size": self.tensor_parallel_size,
        "gpu_memory_utilization": self.gpu_memory_utilization,
        "max_model_len": 8192
The max_model_len is hardcoded to 8192. This could be restrictive for models with larger context windows or inefficient for models with smaller ones. To improve flexibility, consider making this a configurable parameter. You could add max_model_len to _parse_kwargs in base_llm.py with a sensible default, and then use self.max_model_len here.
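A minimal sketch of the reviewer's suggestion, with the current value (8192) kept as the default so behavior is unchanged for existing configs (method name follows the PR; the exact signature is an assumption):

```python
# Sketch: parse max_model_len in BaseLLM._parse_kwargs with a sensible
# default, then use self.max_model_len in vllm_llm.py instead of 8192.

class BaseLLM:
    def _parse_kwargs(self, **kwargs):
        # int() guards against string values coming from YAML/JSON configs
        self.max_model_len = int(kwargs.get("max_model_len", 8192))


base = BaseLLM()
base._parse_kwargs()
print(base.max_model_len)                # 8192 (unchanged default)

base._parse_kwargs(max_model_len=32768)  # e.g. a long-context model
print(base.max_model_len)                # 32768
```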
Those were already hardcoded in the original code; I kept them as-is to keep this PR focused on just the quantization support. Can open a follow-up for making them configurable.
/kind feature
What this PR does / why we need it:
The quantization parameter in vllm_llm.py was commented out with a TODO (# TODO need to align with vllm API). Traced the issue: BaseLLM._parse_kwargs() wasn't parsing quantization from kwargs, so self.quantization stayed undefined.

Fixed both sides:
- Parse quantization in BaseLLM._parse_kwargs() (defaults to None)
- Pass it to LLM() only when set, using a dict-based approach instead of hardcoded args

Backward compatible: existing configs without quantization work exactly as before. Users who want quantized inference can now add quantization: bitsandbytes (or awq, gptq) to their config and it just works. Cross-checked with the vLLM docs and the PIPL example, which already uses quantization in its configs.
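The two-sided fix described above might look roughly like this. A minimal sketch under stated assumptions: class and method names mirror the PR description, but the surrounding code is simplified, and the final vllm.LLM(**llm_kwargs) call is left as a comment so the example runs without a GPU:

```python
# Sketch of the fix: parse `quantization` in the base class, then add it
# to the kwargs dict only when the user actually configured it.

class BaseLLM:
    def _parse_kwargs(self, **kwargs):
        # Previously missing: without this, self.quantization was undefined
        self.quantization = kwargs.get("quantization", None)


class VllmLLM(BaseLLM):
    def build_llm_kwargs(self, model):
        llm_kwargs = {
            "model": model,
            "trust_remote_code": True,
            "dtype": "float16",
        }
        # Only include quantization when set, so configs without the key
        # behave exactly as before.
        if self.quantization is not None:
            llm_kwargs["quantization"] = self.quantization
        return llm_kwargs  # real code would call vllm.LLM(**llm_kwargs)


llm = VllmLLM()
llm._parse_kwargs()                                 # old config, no key
print("quantization" in llm.build_llm_kwargs("m"))  # False

llm._parse_kwargs(quantization="awq")               # quantization: awq
print(llm.build_llm_kwargs("m")["quantization"])    # awq
```

The dict-based approach also makes follow-ups (configurable dtype, max_model_len) a one-line change each, since every optional argument can be conditionally inserted the same way.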
Which issue(s) this PR fixes:
Fixes #372