feat: defer activation qparam calculation to sequential epoch end #2455
dzhengAP wants to merge 2 commits into vllm-project:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request refactors the activation quantization parameter (qparam) calculation within the QuantizationModifier.
Code Review
This pull request introduces a deferred activation quantization parameter calculation, which is a significant improvement for efficiency. The approach of accumulating statistics during calibration and calculating qparams once at the end of the sequential epoch is well-implemented. The code changes are clear and well-documented. My review includes a couple of minor suggestions to improve code style by consolidating imports, which will enhance maintainability.
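The accumulate-then-flush pattern the review describes can be sketched without any library dependencies. Everything below (the `DeferredMinMaxObserver` class, its `update`/`flush` methods, and the asymmetric int8 math) is an illustrative stand-in for the PR's observer code, not the actual llm-compressor API:

```python
class DeferredMinMaxObserver:
    """Toy observer: per-batch stats only, qparams computed once at the end."""

    def __init__(self):
        self._min = None
        self._max = None

    def update(self, values):
        """Per-batch hook work: accumulate running min/max only."""
        lo, hi = min(values), max(values)
        self._min = lo if self._min is None else min(self._min, lo)
        self._max = hi if self._max is None else max(self._max, hi)

    def flush(self, num_bits=8):
        """Epoch-end work: compute asymmetric qparams once from stats."""
        if self._min is None:
            return None  # nothing accumulated (e.g. module never ran)
        qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
        scale = (self._max - self._min) / (qmax - qmin)
        zero_point = round(qmin - self._min / scale)
        return scale, zero_point

obs = DeferredMinMaxObserver()
for batch in ([0.1, 0.5, -0.2], [1.5, -0.4], [0.9]):
    obs.update(batch)        # calibration batches: stats only
scale, zp = obs.flush()      # at sequential epoch end: qparams once
```

The key property is that `flush` is called once per observer rather than once per batch, which is the efficiency gain the review highlights.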
```python
# Disable quantization during calibration batches so that fp32 activations
# flow through the model unmodified while hooks accumulate running stats.
# Re-enable once after epoch end when qparams have been flushed.
from compressed_tensors.quantization import disable_quantization
```
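A library-free sketch of the flow this diff comment documents: quantization is toggled off so calibration inputs pass through untouched, then toggled back on after the epoch-end flush. The `FakeQuantLinear` class and `quantization_enabled` flag are hypothetical stand-ins for the real compressed-tensors mechanism:

```python
class FakeQuantLinear:
    """Stand-in module whose forward optionally 'quantizes' its input."""

    def __init__(self):
        self.quantization_enabled = True

    def forward(self, x):
        # With quantization disabled, the input passes through unmodified;
        # rounding here is a crude placeholder for fake-quantization.
        return round(x, 1) if self.quantization_enabled else x

mod = FakeQuantLinear()
mod.quantization_enabled = False       # on_start: calibrate in fp32
calib_out = mod.forward(0.123456)      # unmodified during calibration
mod.quantization_enabled = True        # re-enabled after epoch-end flush
quant_out = mod.forward(0.123456)      # now quantized
```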
```python
    :param base_name: one of "input", "output", "q", "k", "v"
    :return: True if qparams were updated, False if observer had no accumulated stats
    """
    from compressed_tensors.utils import align_module_device, update_offload_parameter
```
This local import is partially redundant and should be moved to the top of the file for consistency. align_module_device is already imported at the top of the file. You can add update_offload_parameter to that existing import statement and remove this local import. This will improve code readability and maintainability.
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Local Validation

Ran a smoke test on facebook/opt-125m. ✅ Result: all 72 Linear modules across 12 decoder layers have input_scale — deferred qparam flush working! Full pipeline ran in ~1 min on CPU (13 subgraphs × 32 batches).
Local Validation (updated)

Model: facebook/opt-125m, 32 calibration samples, macOS CPU
✅ All checks pass. Full lm_eval regression tests pending GPU access. @kylesayrs could you add the ready label?
Hi @kylesayrs, is there any more work we want to continue, or do we need to run more experiments and an ablation study before merging or closing this PR?
Hey, you have fp32 and quant(deferred); we'd primarily compare with quant(main), which is missing.
Force-pushed 233d180 to ada51e6 (compare)
Updated Local Validation

Added quant(main) baseline and Qwen2-0.5B per @HDCharles's feedback.

facebook/opt-125m (32 calibration samples, macOS CPU)

Deferred vs main: -0.47 PPL ✅ better

Qwen/Qwen2-0.5B (32 calibration samples, macOS CPU)

Deferred vs main: -1.51 PPL ✅ better

Note: both methods show large degradation on Qwen2-0.5B due to activation outliers (max scale=15.0). This is pre-existing behavior unrelated to this PR — per-tensor INT8 activation quantization without SmoothQuant is known to degrade on models with outlier activations. Deferred still outperforms main by 1.51 PPL. All correctness checks pass (168/168 modules have input_scale, no observer leaks). @kylesayrs @HDCharles could you add the ready label?
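A toy illustration of the outlier effect noted above, assuming symmetric per-tensor int8: a single large activation inflates the shared scale, so ordinary-magnitude values all collapse to zero after quantize/dequantize. The helper function and numbers are made up for illustration, not taken from the PR's runs:

```python
def quantize_dequantize(values, scale):
    """Round-trip each value through symmetric int8 at a shared scale."""
    return [max(-128, min(127, round(v / scale))) * scale for v in values]

normal = [0.01, -0.02, 0.03]          # typical activation magnitudes
with_outlier = normal + [15.0]        # one outlier, as on Qwen2-0.5B
scale = max(abs(v) for v in with_outlier) / 127   # per-tensor symmetric

# The outlier-dominated scale wipes out all the small values.
recovered = quantize_dequantize(normal, scale)
```

This is why techniques like SmoothQuant, which migrate outlier magnitude from activations into weights, are needed before per-tensor activation quantization on such models.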
Fixes vllm-project#2446

Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>
MemorylessMinMaxObserver has no past_min_vals, so get_accumulated_min_max() always returned None, causing scale to remain 0.

Fix: add update_deferred_stats() to the Observer base class, which maintains _deferred_min/_deferred_max independently of the subclass implementation. calibrate_activations(stats_only=True) now calls this instead of observer(value).

Local validation on opt-125m (CPU, 32 calibration samples):
- 72/72 modules have input_scale
- Perplexity: 28.86 (FP32) -> 30.78 (INT8), 6.7% degradation
- No observer stats leaked after calibration

Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>
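The fix in this commit can be sketched with stdlib stand-ins (plain lists instead of tensors; class and method names follow the commit message, bodies are illustrative): the base class tracks `_deferred_min`/`_deferred_max` itself, so a memoryless subclass still accumulates stats for the deferred flush.

```python
class Observer:
    """Base class owns the deferred stats, independent of subclass state."""

    def __init__(self):
        self._deferred_min = None
        self._deferred_max = None

    def update_deferred_stats(self, values):
        """Track running min/max regardless of how the subclass observes."""
        lo, hi = min(values), max(values)
        self._deferred_min = lo if self._deferred_min is None else min(self._deferred_min, lo)
        self._deferred_max = hi if self._deferred_max is None else max(self._deferred_max, hi)

    def get_accumulated_min_max(self):
        if self._deferred_min is None:
            return None  # never updated
        return self._deferred_min, self._deferred_max


class MemorylessMinMaxObserver(Observer):
    """Computes per-call min/max only; stores nothing of its own."""

    def __call__(self, values):
        return min(values), max(values)


obs = MemorylessMinMaxObserver()
for batch in ([0.2, -1.0], [3.0, 0.5]):
    obs.update_deferred_stats(batch)   # base-class path, not obs(batch)
```

Before the fix, calling only `obs(batch)` would leave the deferred stats empty and `get_accumulated_min_max()` returning `None`, which is how scale stayed at 0.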
Force-pushed ada51e6 to 26c29ad (compare)
Fixes #2446
Ready for review. A smoke test on a local Mac CPU, with all checks passed, is attached. Happy to run the lm_eval regression tests (fp8_static_per_tensor, w4a16_awq_sym, w4a4_nvfp4) if access to the test infrastructure can be arranged, or point me to the right hardware setup.
Summary

Switches `QuantizationModifier` from per-batch activation qparam calculation to a deferred model where qparams are computed once at `SEQUENTIAL_EPOCH_END` from accumulated running statistics.

Changes
observers/base.py

- `Observer.get_accumulated_min_max()`: returns stored `past_min`/`max` without observing a new tensor. Memoryless observers return `None`.
- `Observer.clear_accumulated_stats()`: deletes `past_*` attrs to free memory after qparams have been written.
- `calibrate_module_from_observer()`: module-level helper that flushes one observer's accumulated stats into the parent module's scale/zero_point.
modifiers/quantization/calibration.py

- `calibrate_activations()`: new `stats_only` kwarg; when `True`, skips `calculate_qparams`/`gparam` and only accumulates running min/max. All activation hooks (`input`, `output`, `q`, `k`, `v`) now pass `stats_only=True`.
- `flush_activation_qparams()`: iterates over all activation observer base_names for a module and calls `calibrate_module_from_observer`.

modifiers/quantization/quantization/base.py

- `on_start()`: disables quantization after weight calibration so calibration forward passes run in fp32.
- `on_event()`: handles `SEQUENTIAL_EPOCH_END` to call `flush_activation_qparams` on all targeted modules.

Local Validation
Ran a smoke test on facebook/opt-125m with explicit static int8 activations (dynamic=False) using the sequential pipeline on macOS CPU. Full pipeline ran in ~1 min on CPU (13 subgraphs × 32 batches).

Notes
- `calibrate_output_hook` previously called `forward_quantize` after updating stats. This call is intentionally removed: quantization is disabled during calibration batches, so there is nothing to quantize.
Pending lm_eval regression tests:

- fp8_static_per_tensor
- w4a16_awq_sym
- w4a4_nvfp4