
feat: defer activation qparam calculation to sequential epoch end#2455

Open
dzhengAP wants to merge 2 commits into vllm-project:main from dzhengAP:feat/deferred-activation-qparams

Conversation

@dzhengAP

@dzhengAP dzhengAP commented Mar 9, 2026

Fixes #2446
Ready for review; a passing smoke test on a local Mac CPU is attached. Happy to run the lm_eval regression tests (fp8_static_per_tensor, w4a16_awq_sym, w4a4_nvfp4) if access to the test infrastructure can be arranged, or point me to the right hardware setup.

Summary

Switches QuantizationModifier from per-batch activation qparam
calculation to a deferred model where qparams are computed once at
SEQUENTIAL_EPOCH_END from accumulated running statistics.

Changes

observers/base.py

  • Observer.get_accumulated_min_max(): returns stored past_min/max
    without observing a new tensor. Memoryless observers return None.
  • Observer.clear_accumulated_stats(): deletes past_* attrs to free
    memory after qparams have been written.
  • calibrate_module_from_observer(): module-level helper that flushes
    one observer's accumulated stats into the parent module's scale/zero_point.
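The observer additions above can be sketched as follows. This is an illustrative stand-in, not the llm-compressor implementation: the real Observer operates on torch tensors, and the `past_min`/`past_max` attribute names are assumptions based on the PR description.

```python
class Observer:
    """Illustrative sketch of the accumulated-stats API described above."""

    def __init__(self):
        self.past_min = None
        self.past_max = None

    def __call__(self, value):
        # Normal observation path: fold a new batch of values into the
        # running min/max (plain floats stand in for torch tensors here).
        lo, hi = min(value), max(value)
        self.past_min = lo if self.past_min is None else min(self.past_min, lo)
        self.past_max = hi if self.past_max is None else max(self.past_max, hi)

    def get_accumulated_min_max(self):
        # Return stored stats without observing a new tensor.
        # An observer with no history (memoryless) returns None.
        if self.past_min is None:
            return None
        return self.past_min, self.past_max

    def clear_accumulated_stats(self):
        # Free memory once qparams have been written.
        self.past_min = self.past_max = None
```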

modifiers/quantization/calibration.py

  • calibrate_activations(): new stats_only kwarg; when True,
    skips calculate_qparams/gparam — only accumulates running min/max.
  • All activation hooks (input, output, q, k, v) now pass
    stats_only=True.
  • flush_activation_qparams(): iterates over all activation observer
    base_names for a module and calls calibrate_module_from_observer.

modifiers/quantization/quantization/base.py

  • on_start(): disables quantization after weight calibration so
    calibration forward passes run in fp32.
  • on_event(): handles SEQUENTIAL_EPOCH_END to call
    flush_activation_qparams on all targeted modules.
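The lifecycle change can be illustrated with a toy modifier. `EventType` and the class internals here are stand-ins, not the actual llm-compressor classes; only the event name and the disable-then-flush ordering come from the PR.

```python
from enum import Enum, auto

class EventType(Enum):
    BATCH_START = auto()
    SEQUENTIAL_EPOCH_END = auto()

class QuantizationModifierSketch:
    def __init__(self, targeted_modules, flush_fn):
        self.targeted_modules = targeted_modules
        self.flush_fn = flush_fn  # e.g. flush_activation_qparams
        self.quantization_enabled = True

    def on_start(self):
        # After weight calibration, disable quantization so calibration
        # forward passes run in fp32 while activation stats accumulate.
        self.quantization_enabled = False

    def on_event(self, event_type):
        # At SEQUENTIAL_EPOCH_END, flush accumulated stats into final
        # activation qparams for every targeted module, then re-enable.
        if event_type is EventType.SEQUENTIAL_EPOCH_END:
            for module in self.targeted_modules:
                self.flush_fn(module)
            self.quantization_enabled = True
```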

Local Validation

Ran a smoke test on facebook/opt-125m with explicit static int8 activations (dynamic=False) using the sequential pipeline on macOS CPU. Full pipeline ran in ~1 min on CPU (13 subgraphs × 32 batches).

Notes

  • calibrate_output_hook previously called forward_quantize after
    updating stats. This call is intentionally removed — quantization is
    disabled during calibration batches so there is nothing to quantize.
  • Regression tests from the issue still need to be run:
    • fp8_static_per_tensor
    • w4a16_awq_sym
    • w4a4_nvfp4

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the activation quantization parameter (qparam) calculation within the QuantizationModifier to improve efficiency and potentially accuracy. Instead of computing qparams for activations on a per-batch basis, the system now accumulates running statistics throughout the calibration process. At the end of a sequential epoch, these accumulated statistics are used to compute and apply the final qparams, after which the temporary statistics are cleared to conserve memory. This change ensures that calibration forward passes run in full precision (FP32) while statistics are gathered, preventing premature quantization from affecting the calibration process.

Highlights

  • Deferred Activation Qparam Calculation: Switched QuantizationModifier to compute activation quantization parameters (qparams) once at SEQUENTIAL_EPOCH_END from accumulated running statistics, rather than per-batch.
  • Observer Enhancements: Added get_accumulated_min_max to retrieve stored statistics and clear_accumulated_stats to free memory in observers, along with a new helper calibrate_module_from_observer.
  • Calibration Logic Update: Modified calibrate_activations to support a stats_only mode for accumulating statistics without immediate qparam computation, and introduced flush_activation_qparams to finalize qparams.
  • Quantization Control: Disabled quantization during calibration batches in on_start to ensure FP32 forward passes, and added SEQUENTIAL_EPOCH_END event handling to trigger flush_activation_qparams.


Changelog
  • src/llmcompressor/modifiers/quantization/calibration.py
    • Imported calibrate_module_from_observer for deferred qparam flushing.
    • Exported flush_activation_qparams in __all__.
    • Modified calibrate_activations to accept a stats_only boolean parameter, allowing it to accumulate statistics without immediately calculating scale/zero_point.
    • Updated calibrate_input_hook, calibrate_output_hook, calibrate_query_hook, calibrate_key_hook, and calibrate_value_hook to call calibrate_activations with stats_only=True.
    • Removed forward_quantize call from calibrate_output_hook as quantization is disabled during calibration.
    • Added flush_activation_qparams function to compute and write final activation qparams from accumulated observer statistics and clear those statistics.
  • src/llmcompressor/modifiers/quantization/quantization/base.py
    • Imported flush_activation_qparams.
    • Updated on_start to disable quantization during calibration batches, ensuring FP32 forward passes while statistics are accumulated.
    • Added handling for SEQUENTIAL_EPOCH_END event to iterate over targeted modules and call flush_activation_qparams to finalize activation qparams.
  • src/llmcompressor/observers/base.py
    • Exported calibrate_module_from_observer in __all__.
    • Added get_accumulated_min_max method to Observer to retrieve stored min/max statistics without new observation.
    • Added clear_accumulated_stats method to Observer to delete accumulated running statistics and free memory.
    • Added calibrate_module_from_observer function to compute and apply qparams from an observer's accumulated statistics and then clear them.
Activity
  • No human activity (comments, reviews) has occurred on this pull request yet.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a deferred activation quantization parameter calculation, which is a significant improvement for efficiency. The approach of accumulating statistics during calibration and calculating qparams once at the end of the sequential epoch is well-implemented. The code changes are clear and well-documented. My review includes a couple of minor suggestions to improve code style by consolidating imports, which will enhance maintainability.

# Disable quantization during calibration batches so that fp32 activations
# flow through the model unmodified while hooks accumulate running stats.
# Re-enable once after epoch end when qparams have been flushed.
from compressed_tensors.quantization import disable_quantization

Severity: medium

For consistency and better maintainability, it's recommended to move this import to the top of the file. Local imports can make it harder to track dependencies and are generally discouraged unless there's a specific reason like avoiding circular dependencies, which doesn't seem to be the case here.

:param base_name: one of "input", "output", "q", "k", "v"
:return: True if qparams were updated, False if observer had no accumulated stats
"""
from compressed_tensors.utils import align_module_device, update_offload_parameter

Severity: medium

This local import is partially redundant and should be moved to the top of the file for consistency. align_module_device is already imported at the top of the file. You can add update_offload_parameter to that existing import statement and remove this local import. This will improve code readability and maintainability.

@github-actions

github-actions bot commented Mar 9, 2026

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.

@dzhengAP
Author

dzhengAP commented Mar 9, 2026

Local Validation

Ran a smoke test on facebook/opt-125m with explicit static int8
activations (dynamic=False) using the sequential pipeline on macOS CPU.

Result: all 72 Linear modules across 12 decoder layers have input_scale
correctly populated after calibration via the deferred flush at
SEQUENTIAL_EPOCH_END.

✅ 72 modules have input_scale — deferred qparam flush working!

Full pipeline ran in ~1 min on CPU (13 subgraphs × 32 batches).
Regression tests (fp8_static, w4a16_awq, w4a4_nvfp4) still pending GPU access.


@dzhengAP
Author

dzhengAP commented Mar 9, 2026

Local Validation (updated)

Model: facebook/opt-125m, 32 calibration samples, macOS CPU

| Metric | Value |
| --- | --- |
| FP32 perplexity | 28.86 |
| Deferred INT8 perplexity | 30.78 |
| Perplexity delta | +1.92 (6.7%) |
| Modules with input_scale | 72/72 |
| Observer stats leaked | None |
| Calibration time | 51.3 s |

✅ All checks pass. Full lm_eval regression tests pending GPU access. @kylesayrs could you add the ready label to trigger CI?

@dzhengAP
Author

Hi @kylesayrs, is there any more work to continue, do we need to run more experiments and an ablation study before merging, or should we close this PR?

@HDCharles
Collaborator

Hey, you have fp32 and quant(deferred); we'd primarily compare against quant(main), which is missing.

@dzhengAP dzhengAP force-pushed the feat/deferred-activation-qparams branch from 233d180 to ada51e6 Compare March 11, 2026 21:12
@mergify mergify bot added the documentation Improvements or additions to documentation label Mar 11, 2026
@dzhengAP
Author

Updated Local Validation

Added quant(main) baseline and Qwen2-0.5B per @HDCharles's feedback.

facebook/opt-125m (32 calibration samples, macOS CPU)

| Mode | Perplexity | Delta vs FP32 |
| --- | --- | --- |
| FP32 baseline | 28.86 | |
| quant(main), per-batch | 30.85 | +1.99 (6.9%) |
| quant(deferred), this PR | 30.38 | +1.52 (5.3%) |

Deferred vs main: -0.47 ✅ better

Qwen/Qwen2-0.5B (32 calibration samples, macOS CPU)

| Mode | Perplexity | Delta vs FP32 |
| --- | --- | --- |
| FP32 baseline | 12.04 | |
| quant(main), per-batch | 38.81 | +222% |
| quant(deferred), this PR | 37.30 | +210% |

Deferred vs main: -1.51 ✅ better

Note: both methods show large degradation on Qwen2-0.5B due to activation outliers (max scale=15.0). This is pre-existing behavior unrelated to this PR — per-tensor INT8 activation quantization without SmoothQuant is known to degrade on models with outlier activations. Deferred still outperforms main by 1.51 PPL.

All correctness checks pass (168/168 modules have input_scale, no observer leaks).

@kylesayrs @HDCharles could you add the ready label to trigger CI? DCO has been fixed.

Fixes vllm-project#2446

Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>
MemorylessMinMaxObserver has no past_min_vals, so get_accumulated_min_max()
always returned None, causing scale to remain 0.

Fix: add update_deferred_stats() to Observer base class which maintains
_deferred_min/_deferred_max independently of subclass implementation.
calibrate_activations(stats_only=True) now calls this instead of observer(value).

Local validation on opt-125m (CPU, 32 calibration samples):
  - 72/72 modules have input_scale
  - Perplexity: 28.86 (FP32) -> 30.78 (INT8), 6.7% degradation
  - No observer stats leaked after calibration

Signed-off-by: dqzhengAP <dqzheng1996@gmail.com>
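The fix described in the commit message above can be sketched as follows, with plain Python in place of torch tensors. The `update_deferred_stats` and `_deferred_min`/`_deferred_max` names come from the commit message; the bodies are illustrative assumptions. The point is that the base class tracks deferred stats itself, so a memoryless subclass still supports the epoch-end flush.

```python
class Observer:
    def update_deferred_stats(self, value):
        # Base-class bookkeeping, independent of any subclass's own
        # past_* history: fold a batch into the deferred min/max.
        lo, hi = min(value), max(value)
        if getattr(self, "_deferred_min", None) is None:
            self._deferred_min, self._deferred_max = lo, hi
        else:
            self._deferred_min = min(self._deferred_min, lo)
            self._deferred_max = max(self._deferred_max, hi)

    def get_accumulated_min_max(self):
        # Reads the deferred stats, so this no longer returns None for
        # memoryless observers once update_deferred_stats has been called.
        if getattr(self, "_deferred_min", None) is None:
            return None
        return self._deferred_min, self._deferred_max

class MemorylessMinMaxObserver(Observer):
    pass  # keeps no per-call history, but inherits the deferred stats
```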
@dzhengAP dzhengAP force-pushed the feat/deferred-activation-qparams branch from ada51e6 to 26c29ad Compare March 11, 2026 21:16

Labels

documentation Improvements or additions to documentation

Development

Successfully merging this pull request may close these issues.

[Research] Investigate the effect of calculating activation qparams on sequential epoch end, not for every batch

2 participants