feat: defer activation qparam calculation to sequential epoch end #2455
dzhengAP wants to merge 2 commits into vllm-project:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request refactors the activation quantization parameter (qparam) calculation within the QuantizationModifier.
Code Review
This pull request introduces a deferred activation quantization parameter calculation, which is a significant improvement for efficiency. The approach of accumulating statistics during calibration and calculating qparams once at the end of the sequential epoch is well-implemented. The code changes are clear and well-documented. My review includes a couple of minor suggestions to improve code style by consolidating imports, which will enhance maintainability.
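The accumulate-then-flush pattern the review describes can be sketched without any library dependencies. Everything below (the `DeferredMinMaxObserver` class, its `update`/`flush` methods, and the asymmetric int8 math) is an illustrative stand-in for the PR's observer code, not the actual llm-compressor API:

```python
class DeferredMinMaxObserver:
    """Toy observer: per-batch stats only, qparams computed once at the end."""

    def __init__(self):
        self._min = None
        self._max = None

    def update(self, values):
        """Per-batch hook work: accumulate running min/max only."""
        lo, hi = min(values), max(values)
        self._min = lo if self._min is None else min(self._min, lo)
        self._max = hi if self._max is None else max(self._max, hi)

    def flush(self, num_bits=8):
        """Epoch-end work: compute asymmetric qparams once from stats."""
        if self._min is None:
            return None  # nothing accumulated (e.g. module never ran)
        qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
        scale = (self._max - self._min) / (qmax - qmin)
        zero_point = round(qmin - self._min / scale)
        return scale, zero_point

obs = DeferredMinMaxObserver()
for batch in ([0.1, 0.5, -0.2], [1.5, -0.4], [0.9]):
    obs.update(batch)        # calibration batches: stats only
scale, zp = obs.flush()      # at sequential epoch end: qparams once
```

The key property is that `flush` is called once per observer rather than once per batch, which is the efficiency gain the review highlights.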
```python
# Disable quantization during calibration batches so that fp32 activations
# flow through the model unmodified while hooks accumulate running stats.
# Re-enable once after epoch end when qparams have been flushed.
from compressed_tensors.quantization import disable_quantization
```
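A library-free sketch of the flow this diff comment documents: quantization is toggled off so calibration inputs pass through untouched, then toggled back on after the epoch-end flush. The `FakeQuantLinear` class and `quantization_enabled` flag are hypothetical stand-ins for the real compressed-tensors mechanism:

```python
class FakeQuantLinear:
    """Stand-in module whose forward optionally 'quantizes' its input."""

    def __init__(self):
        self.quantization_enabled = True

    def forward(self, x):
        # With quantization disabled, the input passes through unmodified;
        # rounding here is a crude placeholder for fake-quantization.
        return round(x, 1) if self.quantization_enabled else x

mod = FakeQuantLinear()
mod.quantization_enabled = False       # on_start: calibrate in fp32
calib_out = mod.forward(0.123456)      # unmodified during calibration
mod.quantization_enabled = True        # re-enabled after epoch-end flush
quant_out = mod.forward(0.123456)      # now quantized
```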
```python
    :param base_name: one of "input", "output", "q", "k", "v"
    :return: True if qparams were updated, False if observer had no accumulated stats
    """
    from compressed_tensors.utils import align_module_device, update_offload_parameter
```
This local import is partially redundant and should be moved to the top of the file for consistency. align_module_device is already imported at the top of the file. You can add update_offload_parameter to that existing import statement and remove this local import. This will improve code readability and maintainability.
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Local Validation

Ran a smoke test on facebook/opt-125m. ✅ Result: all 72 Linear modules across 12 decoder layers have input_scale — deferred qparam flush working! Full pipeline ran in ~1 min on CPU (13 subgraphs × 32 batches).
Local Validation (updated)

Model: facebook/opt-125m, 32 calibration samples, macOS CPU
✅ All checks pass. Full lm_eval regression tests pending GPU access. @kylesayrs could you add the ready label?
Hi @kylesayrs, is there any more work we want to continue, or do we need to run more experiments and an ablation study before merging or closing this PR?
Hey, you have fp32 and quant(deferred); we'd primarily compare with quant(main), which is missing.
Force-pushed 233d180 to ada51e6 (compare)
Updated Local Validation

Added quant(main) baseline and Qwen2-0.5B per @HDCharles's feedback.

facebook/opt-125m (32 calibration samples, macOS CPU)

Deferred vs main: -0.47 PPL ✅ better

Qwen/Qwen2-0.5B (32 calibration samples, macOS CPU)

Deferred vs main: -1.51 PPL ✅ better

Note: both methods show large degradation on Qwen2-0.5B due to activation outliers (max scale=15.0). This is pre-existing behavior unrelated to this PR — per-tensor INT8 activation quantization without SmoothQuant is known to degrade on models with outlier activations. Deferred still outperforms main by 1.51 PPL. All correctness checks pass (168/168 modules have input_scale, no observer leaks). @kylesayrs @HDCharles could you add the ready label?
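A toy illustration of the outlier effect noted above, assuming symmetric per-tensor int8: a single large activation inflates the shared scale, so ordinary-magnitude values all collapse to zero after quantize/dequantize. The helper function and numbers are made up for illustration, not taken from the PR's runs:

```python
def quantize_dequantize(values, scale):
    """Round-trip each value through symmetric int8 at a shared scale."""
    return [max(-128, min(127, round(v / scale))) * scale for v in values]

normal = [0.01, -0.02, 0.03]          # typical activation magnitudes
with_outlier = normal + [15.0]        # one outlier, as on Qwen2-0.5B
scale = max(abs(v) for v in with_outlier) / 127   # per-tensor symmetric

# The outlier-dominated scale wipes out all the small values.
recovered = quantize_dequantize(normal, scale)
```

This is why techniques like SmoothQuant, which migrate outlier magnitude from activations into weights, are needed before per-tensor activation quantization on such models.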
Fixes vllm-project#2446

Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>
MemorylessMinMaxObserver has no past_min_vals, so get_accumulated_min_max() always returned None, causing scale to remain 0.

Fix: add update_deferred_stats() to the Observer base class, which maintains _deferred_min/_deferred_max independently of the subclass implementation. calibrate_activations(stats_only=True) now calls this instead of observer(value).

Local validation on opt-125m (CPU, 32 calibration samples):
- 72/72 modules have input_scale
- Perplexity: 28.86 (FP32) -> 30.78 (INT8), 6.7% degradation
- No observer stats leaked after calibration

Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>
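The fix in this commit can be sketched with stdlib stand-ins (plain lists instead of tensors; class and method names follow the commit message, bodies are illustrative): the base class tracks `_deferred_min`/`_deferred_max` itself, so a memoryless subclass still accumulates stats for the deferred flush.

```python
class Observer:
    """Base class owns the deferred stats, independent of subclass state."""

    def __init__(self):
        self._deferred_min = None
        self._deferred_max = None

    def update_deferred_stats(self, values):
        """Track running min/max regardless of how the subclass observes."""
        lo, hi = min(values), max(values)
        self._deferred_min = lo if self._deferred_min is None else min(self._deferred_min, lo)
        self._deferred_max = hi if self._deferred_max is None else max(self._deferred_max, hi)

    def get_accumulated_min_max(self):
        if self._deferred_min is None:
            return None  # never updated
        return self._deferred_min, self._deferred_max


class MemorylessMinMaxObserver(Observer):
    """Computes per-call min/max only; stores nothing of its own."""

    def __call__(self, values):
        return min(values), max(values)


obs = MemorylessMinMaxObserver()
for batch in ([0.2, -1.0], [3.0, 0.5]):
    obs.update_deferred_stats(batch)   # base-class path, not obs(batch)
```

Before the fix, calling only `obs(batch)` would leave the deferred stats empty and `get_accumulated_min_max()` returning `None`, which is how scale stayed at 0.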
Force-pushed ada51e6 to 26c29ad (compare)
Fixes #2446
Ready for review. A smoke test on a local Mac CPU, with all checks passed, is attached. Happy to run the lm_eval regression tests (fp8_static_per_tensor, w4a16_awq_sym, w4a4_nvfp4) if access to the test infrastructure can be arranged, or point me to the right hardware setup.
Summary

Switches `QuantizationModifier` from per-batch activation qparam calculation to a deferred model where qparams are computed once at `SEQUENTIAL_EPOCH_END` from accumulated running statistics.

Changes
observers/base.py

- `Observer.get_accumulated_min_max()`: returns stored `past_min`/`max` without observing a new tensor. Memoryless observers return `None`.
- `Observer.clear_accumulated_stats()`: deletes `past_*` attrs to free memory after qparams have been written.
- `calibrate_module_from_observer()`: module-level helper that flushes one observer's accumulated stats into the parent module's scale/zero_point.
modifiers/quantization/calibration.py

- `calibrate_activations()`: new `stats_only` kwarg; when `True`, skips `calculate_qparams`/`gparam` and only accumulates running min/max. All activation hooks (`input`, `output`, `q`, `k`, `v`) now pass `stats_only=True`.
- `flush_activation_qparams()`: iterates over all activation observer base_names for a module and calls `calibrate_module_from_observer`.

modifiers/quantization/quantization/base.py

- `on_start()`: disables quantization after weight calibration so calibration forward passes run in fp32.
- `on_event()`: handles `SEQUENTIAL_EPOCH_END` to call `flush_activation_qparams` on all targeted modules.

Local Validation
Ran a smoke test on facebook/opt-125m with explicit static int8 activations (dynamic=False) using the sequential pipeline on macOS CPU. Full pipeline ran in ~1 min on CPU (13 subgraphs × 32 batches).

Notes
- `calibrate_output_hook` previously called `forward_quantize` after updating stats. This call is intentionally removed: quantization is disabled during calibration batches, so there is nothing to quantize.
Pending lm_eval regression tests:

- fp8_static_per_tensor
- w4a16_awq_sym
- w4a4_nvfp4