
Observers refactor #2585

Open

HDCharles wants to merge 2 commits into main from 90_refactor_obs

Conversation

Collaborator

@HDCharles HDCharles commented Apr 8, 2026

Summary:

Previously, observers would always calculate qparams whenever they were run; now they only accumulate statistics.

They also perform qparam and gparam calculations at the same time, so there is no need to separate the two.

All statistics are stored in a statistics dict.

Qparams are returned as a dict. Observers have to implement _update_statistics, compute_gparams_from_statistics, and compute_qparams_from_statistics.

This is a lead-up to the lifecycle refactor, where we will change when quantization is applied during calibration.
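To make the contract concrete, here is a minimal standalone sketch of the accumulate-then-compute split described above. It is not code from this PR: the class, the toy asymmetric int8 scheme, and the method bodies are illustrative assumptions; the real Observer base class, QuantizationArgs handling, and tensor shapes differ.

import torch

class MinMaxStatsSketch:
    """Illustrative only: accumulate statistics first, derive qparams later."""

    def __init__(self, num_bits: int = 8):
        self.num_bits = num_bits
        self.statistics: dict = {}  # mirrors the statistics dict from the summary

    def _update_statistics(self, observed: torch.Tensor) -> None:
        # accumulate running min/max instead of producing qparams on every call
        mins, maxs = observed.min(), observed.max()
        if "min_vals" in self.statistics:
            mins = torch.minimum(mins, self.statistics["min_vals"])
            maxs = torch.maximum(maxs, self.statistics["max_vals"])
        self.statistics["min_vals"], self.statistics["max_vals"] = mins, maxs

    def compute_qparams_from_statistics(self) -> dict:
        # derive scale/zero_point only when qparams are actually requested
        qmin, qmax = 0, 2 ** self.num_bits - 1
        min_v, max_v = self.statistics["min_vals"], self.statistics["max_vals"]
        scale = (max_v - min_v).clamp(min=1e-8) / (qmax - qmin)
        zero_point = (qmin - torch.round(min_v / scale)).clamp(qmin, qmax).to(torch.int32)
        return {"scale": scale, "zero_point": zero_point, "global_scale": None}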

Summary by CodeRabbit

  • Refactor

    • Reworked quantization calibration flow: observation and qparam computation are separated, with improved activation-stat syncing and clearer weight vs. activation phases.
    • Observer API now returns qparams via dictionary access and exposes min/max statistics attributes.
    • Distributed sync and fused-weight handling streamlined for consistent global-scale behavior.
  • New Features

    • Added histogram-based observer and a utility to fuse sibling weight observers.
  • Tests

    • Updated tests to follow the new observer/qparams contract.

@github-actions

github-actions Bot commented Apr 8, 2026

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 8, 2026

Warning

.coderabbit.yaml has a parsing error

The CodeRabbit configuration file in this repository has a parsing error and default settings were used instead. Please fix the error(s) in the configuration file. You can initialize chat with CodeRabbit to get help with the configuration file.

💥 Parsing errors (1)
duplicated mapping key in ".coderabbit.yaml" (56:7)

 53 |     - label: "fp8"
 54 |       instructions: "Issues or PRs re ...
 55 | 
 56 |       instructions: "Issues or PRs re ...
------------^
 57 |     - label: "w4a16"
 58 |       instructions: "Issues or PRs re ...
⚙️ Configuration instructions
  • Please see the configuration documentation for more information.
  • You can also validate your configuration using the online YAML validator.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
📝 Walkthrough

Refactors observer and calibration flows from immediate tuple-return semantics to a two-phase statistics-then-qparams design. Observers now accumulate stats via observe(...)/forward() and expose quantization parameters through get_qparams(); update_qparams(...) materializes scales/zero-points. Adds observer fusion and DDP stat synchronization APIs.
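As a rough sketch of that two-phase flow (names follow the API described in this walkthrough; calibration_data and the exact call sites are placeholders, not code from the PR):

observer = getattr(module, f"{base_name}_observer")

# Phase 1: accumulate statistics only; no qparams are written yet
for batch in calibration_data:
    observer(batch)  # forward() updates observer.statistics and returns the observer

# Optional DDP step: all-reduce accumulated activation stats across ranks
observer.sync_activation_stats()

# Phase 2: materialize quantization parameters from the accumulated statistics
qparams = observer.get_qparams()  # dict with "scale", "zero_point", optional "global_scale"
# update_qparams(...) then writes these onto the module's scale / zero_point tensors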

Changes

Cohort / File(s) Summary
Observer Core
src/llmcompressor/observers/base.py
Replaced min/max getter API with statistics accumulation. forward() updates stats and returns self. New get_qparams()/compute_qparams_from_statistics()/sync_activation_stats()/fuse() added; constructor no longer takes module.
MinMax & Moving Avg Observers
src/llmcompressor/observers/min_max.py, src/llmcompressor/observers/moving_base.py
Switched to update_statistics() storing min_vals/max_vals. Static/min-moving behavior unified; _act_sync_dict used for DDP reductions.
MSE & IMatrix Observers
src/llmcompressor/observers/mse.py, src/llmcompressor/observers/imatrix.py
Moved to update_statistics() flow, removed in-search global-scale optimization; imatrix accumulates _imatrix_sum/_imatrix_count and defers importance/global-scale handling.
Histogram Observer
src/llmcompressor/observers/hist.py
New HistogramObserver stub registered as "histogram" that begins per-batch xmax/histogram accumulation (histogram accumulation currently unimplemented).
Observer Helpers & Fusion
src/llmcompressor/observers/helpers.py
Added FUSED_LAYER_NAMES and fuse_weight_observers(model) to link/fuse weight observers for shared global_scale. Removed g_idx-based flatten reordering.
Calibration API
src/llmcompressor/modifiers/quantization/calibration.py
Replaced call_observer/calibrate_activations/weight-specific helpers with observe(...) (accumulate) and update_qparams(...) (materialize qparams). Dynamic activations skip qparam writes.
Quantization Mixin
src/llmcompressor/modifiers/quantization/quantization/mixin.py
Fuses weight observers at calibration start. Split sync into sync_obs_act_stats() (stat-only DDP sync) and update_activation_qparams() (calls update_qparams).
GPTQ & Quantization Modifiers
src/llmcompressor/modifiers/gptq/gptq_quantize.py, .../gptq/base.py, .../quantization/base.py
Use module weight_observer directly; remove prior conv weight-shaping and old global-scale helpers. Replace activation sync with sync_obs_act_stats() + update_activation_qparams() and run observe/update_qparams for weights.
AWQ Modifier & IMatrix Gatherer
src/llmcompressor/modifiers/transform/awq/base.py, .../imatrix/base.py
AWQ adopts observe/update_qparams() flow and drops fused-global-scale recompute helpers. IMatrixGatherer registers/attaches module weight_observer directly and clears _imatrix_sum/_imatrix_count on finalize.
Entrypoints & Processing
src/llmcompressor/entrypoints/model_free/lifecycle.py, .../process.py
Consolidated weight calibration into calibrate_weight() calling observe + update_qparams. Fused-module second pass now fuses observers, runs observe/update_qparams, and freezes quantization per member.
Pipelines & DisableQuantization
src/llmcompressor/pipelines/basic/pipeline.py, .../sequential/pipeline.py, src/llmcompressor/utils/helpers.py
Calibration forward pass now runs inside DisableQuantization(model) unconditionally; removed DISABLE_QAC_MODIFIERS constant and conditional branches.
Removed Utilities
src/llmcompressor/modifiers/utils/helpers.py, src/llmcompressor/modifiers/utils/__init__.py
Deleted helpers.py (removed update_fused_layer_weight_global_scales). Stopped wildcard re-export from modifiers.utils.
Tests
tests/.../observers/*, tests/.../calibration/*, tests/.../utils/*
Updated tests to new observer contract: use observer(tensor).get_qparams() for qparams, read observer.min_vals/max_vals for stats, call observer.sync_activation_stats() for DDP sync, and validate fused-observer behavior. Removed g_idx usage and old qparam shims.

Sequence Diagram(s)

sequenceDiagram
  participant Model
  participant Observer
  participant Quantizer
  participant DDP

  Note over Model,Observer: Calibration pass (batches)
  Model->>Observer: forward(observed) / observe(...)  (Observer.update_statistics)
  Observer->>DDP: sync_activation_stats() (all-reduce per _act_sync_dict)
  DDP-->>Observer: reduced stats
  Observer->>Quantizer: get_qparams()
  Quantizer-->>Observer: qparams dict (scale, zero_point, [global_scale])
  Observer->>Model: update_qparams(...) writes qparams to module tensors
  Model->>Quantizer: fake_quantize(... using qparams)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • #2526: Similar AWQ and quantization integration changes; overlaps on AWQ modifier adjustments.
  • #2491: Related fused-weight/global-scale handling changes that this PR replaces with observer fusion.

Suggested labels

Refactor, gptq

Suggested reviewers

  • dsikka
  • kylesayrs
  • dbarbuzzi

Poem

🐰
I gathered stats in twilight’s hush,
Then stitched qparams without the rush.
Fused friends share a single scale,
Observers sync across the gale.
One observe, one update — carrot cake to hail! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 44.93%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title 'Observers refactor' accurately and concisely summarizes the primary objective of this changeset, which is a comprehensive refactoring of the observer architecture to separate statistics accumulation from qparam computation.
  • Linked Issues Check (✅ Passed): check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check (✅ Passed): check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the observer API to unify quantization parameter calculations. It introduces a statistics dictionary for internal state management and a get_qparams() method that automatically handles global_scale for TENSOR_GROUP strategies. Implementation details were updated across various observers (MinMax, MSE, IMatrix) and modifiers (AWQ, GPTQ), along with corresponding test updates. Feedback identifies a critical bug in MovingAverageObserverBase where statistics are divided by the world size twice during distributed synchronization and suggests moving a local import in the base observer to the top of the file for better maintainability.

for key in ("min_vals", "max_vals", "global_min_vals", "global_max_vals"):
    val = self.statistics.get(key)
    if val is not None:
        val.div_(world_size)
Contributor

critical

The dist.ReduceOp.AVG operation already computes the average of the tensors across all processes by summing them and dividing by the world size. By calling val.div_(world_size) before the all_reduce with AVG, you are effectively dividing by the world size twice, which will result in incorrect statistics in a distributed setting. This line should be removed.
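For reference, a small sketch of the difference (this assumes torch.distributed is imported as dist and a process group is already initialized; it is not code from the PR):

# val holds this rank's accumulated statistic
dist.all_reduce(val, op=dist.ReduceOp.AVG)  # result: sum over ranks / world_size

# equivalent manual form on backends without ReduceOp.AVG:
dist.all_reduce(val, op=dist.ReduceOp.SUM)
val.div_(dist.get_world_size())             # divide exactly once, after the sum

Pre-dividing by world_size and then reducing with AVG yields sum(val) / world_size**2, which is the double scaling flagged here.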

Collaborator

hmm, is this a bug?

Comment thread src/llmcompressor/observers/base.py Outdated
param_name = f"{self.base_name}_global_scale"
if hasattr(module, param_name):
    # Parameter already exists, update it
    from compressed_tensors.utils import update_offload_parameter
Contributor

medium

Local imports can obscure dependencies and make code harder to maintain. It's best practice to move this import statement to the top of the file with the other imports.

Please add from compressed_tensors.utils import update_offload_parameter to the file's import section and remove this line.

@mergify
Contributor

mergify Bot commented Apr 8, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/llmcompressor/observers/mse.py (2)

165-178: ⚠️ Potential issue | 🟠 Major

MovingAverageMSEObserver is now one gparam update behind.

src/llmcompressor/observers/moving_base.py calls get_current_min_max() before it computes global stats. Because this method still pulls module.global_scale, tensor-group EMA stats are optimized using the previous pass’s gparam, or the initializer on the first pass.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/mse.py` around lines 165 - 178,
get_current_min_max pulls module.global_scale via
_get_module_param("global_scale") which makes MovingAverageMSEObserver use the
previous pass's global param; stop reading module.global_scale inside
get_current_min_max and instead let the grid search optimize using current
stats: remove the _get_module_param call and call _grid_search_mse with
global_scale=None (or not passed) and set optimize_global_scale=True so the
routine will compute the current-pass gparam; update get_current_min_max
accordingly (refer to function get_current_min_max and helper _grid_search_mse
and note that moving_base.py calls get_current_min_max before global stats are
computed).

58-89: ⚠️ Potential issue | 🟠 Major

Recompute tensor-group global_scale before the per-group MSE search.

Line 62 reads the module’s current global_scale, but Line 78 only derives the new global min/max after the per-group search has already finished. For TENSOR_GROUP, that means min_vals/max_vals are optimized against the previous gparam while get_qparams() later combines them with the freshly generated one from this observation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/mse.py` around lines 58 - 89, Run the global MSE
grid search first and use its computed global scale when doing the per-group
search: call _grid_search_mse on global_observed with optimize_global_scale=True
to obtain the global min/max (global_min/global_max), derive the corresponding
global_scale from those results, then pass that global_scale into the subsequent
per-group call to _grid_search_mse (instead of reading the old module value via
_get_module_param("global_scale")). Update self.statistics keys the same way
(statistics['global_min_vals'], ['global_max_vals'], ['min_vals'], ['max_vals'])
after the two searches.
src/llmcompressor/observers/moving_base.py (1)

125-133: ⚠️ Potential issue | 🟠 Major

Remove manual division before ReduceOp.AVG to avoid double scaling.

Line 129 divides by world_size, but all_reduce with ReduceOp.AVG (line 132) already divides by world_size automatically. This causes EMA stats to shrink by an extra factor of 1 / world_size on every multi-rank sync. Remove the manual division.

Note: ReduceOp.AVG requires the NCCL backend (version ≥ 2.10); if using a different backend, ReduceOp.SUM would require manual division.

Proposed fix
-                val.div_(world_size)
                 comms.append(
                     dist.all_reduce(
                         as_broadcastable(val), op=dist.ReduceOp.AVG, async_op=True
                     )
                 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/moving_base.py` around lines 125 - 133, The loop
in MovingBase (use self.statistics and keys
"min_vals","max_vals","global_min_vals","global_max_vals") is manually dividing
val by world_size before calling dist.all_reduce with ReduceOp.AVG, causing
double-scaling; remove the val.div_(world_size) call so all_reduce alone
performs the averaging. If your environment does not support ReduceOp.AVG
(non-NCCL backend), instead switch the all_reduce op to ReduceOp.SUM and keep a
single division by world_size after the sum, but do not perform both.
🧹 Nitpick comments (1)
src/llmcompressor/observers/base.py (1)

165-170: The synchronize_statistics() hard-coding is architecturally inconsistent but currently non-violating.

The method assumes observers use only four keys (min_vals, max_vals, global_min_vals, global_max_vals), yet the contract says observers own a generic self.statistics dictionary. Current implementations (imatrix, min_max, moving_base, mse) all conform—no extra keys detected—but the design does not gracefully support future extensions. Adding a histogram, count, or sum statistic would silently skip synchronization across ranks.

Consider an overridable method that returns a {key: ReduceOp} map, allowing observers to declare their reduce strategy without modifying base.py.
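A minimal sketch of that suggestion, assuming torch.distributed is imported as dist; get_statistics_reduce_map is the hypothetical hook name from this comment, not an existing method, and the wrapper class below exists only to make the sketch self-contained:

import torch.distributed as dist

class _ObserverSyncSketch:
    statistics: dict  # per-observer accumulated statistics

    def get_statistics_reduce_map(self) -> dict:
        # default map matching the keys the current implementation hard-codes;
        # subclasses could add entries such as {"histogram": dist.ReduceOp.SUM}
        return {
            "min_vals": dist.ReduceOp.MIN,
            "max_vals": dist.ReduceOp.MAX,
            "global_min_vals": dist.ReduceOp.MIN,
            "global_max_vals": dist.ReduceOp.MAX,
        }

    def synchronize_statistics(self) -> list:
        handles = []
        for key, op in self.get_statistics_reduce_map().items():
            val = self.statistics.get(key)
            if val is not None:
                handles.append(dist.all_reduce(val, op=op, async_op=True))
        return handles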

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/base.py` around lines 165 - 170,
synchronize_statistics() currently hard-codes four keys; change it to call an
overridable method (e.g., get_statistics_reduce_map()) on the observer to obtain
a dict mapping statistic keys to torch.distributed ReduceOp values and iterate
that map instead of the fixed list, so any observer can declare which keys to
synchronize and with which ReduceOp; update base class to provide a default
implementation returning the existing {"min_vals": ReduceOp.MIN, "max_vals":
ReduceOp.MAX, "global_min_vals": ReduceOp.MIN, "global_max_vals": ReduceOp.MAX}
and have synchronize_statistics() use that map to perform the reductions on
self.statistics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/llmcompressor/modifiers/quantization/calibration.py`:
- Around line 92-124: The current calibration code ignores the
should_calculate_qparams flag and always persists scale/zero_point; change it so
that after observer.get_qparams() you still update global_scale when
qparams["global_scale"] is present, but only call
update_offload_parameter(module, f"{base_name}_scale", qparams["scale"]) and the
zero-point update (hasattr check + update_offload_parameter(module,
f"{base_name}_zero_point", qparams["zero_point"])) when should_calculate_qparams
is True; keep the existing observer(value) and qparams = observer.get_qparams()
flow and only gate the scale/zero_point writes on the should_calculate_qparams
parameter.

In `@src/llmcompressor/observers/base.py`:
- Around line 49-51: The accumulates_statistics flag is currently only
descriptive; update the base observer lifecycle so memoryless observers opt out:
in methods that reset state (where self.statistics is cleared) add a guard
checking accumulates_statistics and skip clearing when False, and in
synchronization/sync paths (the block handling distributed/DDP sync)
short-circuit or return early when accumulates_statistics is False so those
observers neither carry stale values nor participate in DDP sync; reference the
class attribute accumulates_statistics, the instance attribute self.statistics,
and the lifecycle/sync methods in the base observer to implement these guards.
- Around line 112-129: The TENSOR_GROUP path computes global_scale but only
stores it on the module when present, leaving _compute_qparams_from_statistics()
blind when module is None; ensure global_scale is always made available by
assigning it to the observer instance before calling
_compute_qparams_from_statistics() (e.g. set self._global_scale = global_scale)
or by changing _compute_qparams_from_statistics() to accept a global_scale
argument and pass the computed value; update references to read from
self._global_scale (or the new parameter) so TENSOR_GROUP observers work
correctly even when self.module is None, using the existing symbols
_compute_gparams_from_statistics, _compute_qparams_from_statistics,
args.strategy, self.module, and base_name to locate the code.

In `@src/llmcompressor/observers/imatrix.py`:
- Around line 119-154: The weighted min/max can be computed with a stale
tensor-group gparam because the code reads module_global_scale only once
earlier; update _update_statistics to recompute/derive the current tensor-group
gparam (call self._get_module_param("global_scale") or get_qparams() for
TENSOR_GROUP) immediately before performing the importance-weighted _grid_search
so the returned weighted min_vals/max_vals correspond to the same gparam used
later for global_min/global_max; ensure the same gparam value is used for both
the weighted and subsequent global searches (symbols: _update_statistics,
_get_module_param, _grid_search, get_qparams).

In `@tests/llmcompressor/modifiers/quantization/test_quantization_ddp.py`:
- Around line 84-88: The first test test_observer_synchronize_reduces_min_max is
using observer.past_min_vals and observer.past_max_vals but the
StaticMinMaxObserver expects/uses the statistics dict; update that test to set
observer.statistics['min_vals'] and observer.statistics['max_vals'] (same
shape/values you previously assigned) so it matches the second test
test_synced_qparams_are_identical_across_ranks and the implementation, then run
the same synchronize/stat retrieval (observer.synchronize_statistics(),
wait_for_comms, observer.get_qparams()) to assert the reduced min/max and
qparams.

---

Outside diff comments:
In `@src/llmcompressor/observers/moving_base.py`:
- Around line 125-133: The loop in MovingBase (use self.statistics and keys
"min_vals","max_vals","global_min_vals","global_max_vals") is manually dividing
val by world_size before calling dist.all_reduce with ReduceOp.AVG, causing
double-scaling; remove the val.div_(world_size) call so all_reduce alone
performs the averaging. If your environment does not support ReduceOp.AVG
(non-NCCL backend), instead switch the all_reduce op to ReduceOp.SUM and keep a
single division by world_size after the sum, but do not perform both.

In `@src/llmcompressor/observers/mse.py`:
- Around line 165-178: get_current_min_max pulls module.global_scale via
_get_module_param("global_scale") which makes MovingAverageMSEObserver use the
previous pass's global param; stop reading module.global_scale inside
get_current_min_max and instead let the grid search optimize using current
stats: remove the _get_module_param call and call _grid_search_mse with
global_scale=None (or not passed) and set optimize_global_scale=True so the
routine will compute the current-pass gparam; update get_current_min_max
accordingly (refer to function get_current_min_max and helper _grid_search_mse
and note that moving_base.py calls get_current_min_max before global stats are
computed).
- Around line 58-89: Run the global MSE grid search first and use its computed
global scale when doing the per-group search: call _grid_search_mse on
global_observed with optimize_global_scale=True to obtain the global min/max
(global_min/global_max), derive the corresponding global_scale from those
results, then pass that global_scale into the subsequent per-group call to
_grid_search_mse (instead of reading the old module value via
_get_module_param("global_scale")). Update self.statistics keys the same way
(statistics['global_min_vals'], ['global_max_vals'], ['min_vals'], ['max_vals'])
after the two searches.

---

Nitpick comments:
In `@src/llmcompressor/observers/base.py`:
- Around line 165-170: synchronize_statistics() currently hard-codes four keys;
change it to call an overridable method (e.g., get_statistics_reduce_map()) on
the observer to obtain a dict mapping statistic keys to torch.distributed
ReduceOp values and iterate that map instead of the fixed list, so any observer
can declare which keys to synchronize and with which ReduceOp; update base class
to provide a default implementation returning the existing {"min_vals":
ReduceOp.MIN, "max_vals": ReduceOp.MAX, "global_min_vals": ReduceOp.MIN,
"global_max_vals": ReduceOp.MAX} and have synchronize_statistics() use that map
to perform the reductions on self.statistics.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 49f2adf3-6d0f-4dea-bfa0-a042745b45fb

📥 Commits

Reviewing files that changed from the base of the PR and between 82b31ec and 755bc04.

📒 Files selected for processing (17)
  • src/llmcompressor/modifiers/awq/base.py
  • src/llmcompressor/modifiers/gptq/gptq_quantize.py
  • src/llmcompressor/modifiers/quantization/calibration.py
  • src/llmcompressor/modifiers/quantization/quantization/mixin.py
  • src/llmcompressor/observers/base.py
  • src/llmcompressor/observers/imatrix.py
  • src/llmcompressor/observers/min_max.py
  • src/llmcompressor/observers/moving_base.py
  • src/llmcompressor/observers/mse.py
  • tests/llmcompressor/modifiers/calibration/test_lifecycle.py
  • tests/llmcompressor/modifiers/calibration/test_observers.py
  • tests/llmcompressor/modifiers/quantization/test_quantization_ddp.py
  • tests/llmcompressor/observers/test_imatrix.py
  • tests/llmcompressor/observers/test_min_max.py
  • tests/llmcompressor/observers/test_mse.py
  • tests/llmcompressor/observers/test_mse_vs_minmax.py
  • tests/llmcompressor/utils/test_distributed.py

Comment on lines +92 to +124
should_calculate_gparam: bool = False, # Kept for backward compatibility, ignored
should_calculate_qparams: bool = True, # Kept for backward compatibility, ignored
):
"""
Call a module's attached input/weight/output observer using a provided value.
Update the module's scale and zp using the observer's return values.
Update the module's scale, zero_point, and global_scale using the observer's return values.

The observer's get_qparams() method automatically computes global_scale for TENSOR_GROUP strategy.

:param module: torch.nn.Module
:param base_name: substring used to fetch the observer, scales, and zp
:param value: torch.Tensor to be passed to the observer for activations. If
base_name is "weight", then the module's weight tensor will be used
:param should_calculate_gparam: Kept for backward compatibility, ignored
:param should_calculate_qparams: Kept for backward compatibility, ignored
"""
with align_module_device(module):
if value is None and base_name == "weight":
value = module.weight
observer: Observer = getattr(module, f"{base_name}_observer")

if should_calculate_gparam:
global_scale = observer.get_global_scale(value)
update_offload_parameter(module, f"{base_name}_global_scale", global_scale)
# Update statistics then get qparams
observer(value)
qparams = observer.get_qparams()

# Update global_scale if computed (TENSOR_GROUP strategy)
if qparams["global_scale"] is not None:
update_offload_parameter(module, f"{base_name}_global_scale", qparams["global_scale"])

if should_calculate_qparams:
scale, zero_point = observer(value)
update_offload_parameter(module, f"{base_name}_scale", scale)
if hasattr(module, f"{base_name}_zero_point"):
update_offload_parameter(module, f"{base_name}_zero_point", zero_point)
# Update scale and zero_point
update_offload_parameter(module, f"{base_name}_scale", qparams["scale"])
if hasattr(module, f"{base_name}_zero_point"):
update_offload_parameter(module, f"{base_name}_zero_point", qparams["zero_point"])
Contributor

⚠️ Potential issue | 🟠 Major

Keep should_calculate_qparams on the write path.

Callers at Line 141 and Line 200 still pass should_calculate_qparams=False to suppress persisted qparam updates. With the flag ignored here, DynamicType.LOCAL activation calibration and the “global-scale-only” weight path now write *_scale/*_zero_point anyway.

💡 Proposed fix
         # Update statistics then get qparams
         observer(value)
         qparams = observer.get_qparams()
 
         # Update global_scale if computed (TENSOR_GROUP strategy)
         if qparams["global_scale"] is not None:
             update_offload_parameter(module, f"{base_name}_global_scale", qparams["global_scale"])
 
-        # Update scale and zero_point
-        update_offload_parameter(module, f"{base_name}_scale", qparams["scale"])
-        if hasattr(module, f"{base_name}_zero_point"):
-            update_offload_parameter(module, f"{base_name}_zero_point", qparams["zero_point"])
+        if should_calculate_qparams:
+            update_offload_parameter(module, f"{base_name}_scale", qparams["scale"])
+            if hasattr(module, f"{base_name}_zero_point"):
+                update_offload_parameter(module, f"{base_name}_zero_point", qparams["zero_point"])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/modifiers/quantization/calibration.py` around lines 92 -
124, The current calibration code ignores the should_calculate_qparams flag and
always persists scale/zero_point; change it so that after observer.get_qparams()
you still update global_scale when qparams["global_scale"] is present, but only
call update_offload_parameter(module, f"{base_name}_scale", qparams["scale"])
and the zero-point update (hasattr check + update_offload_parameter(module,
f"{base_name}_zero_point", qparams["zero_point"])) when should_calculate_qparams
is True; keep the existing observer(value) and qparams = observer.get_qparams()
flow and only gate the scale/zero_point writes on the should_calculate_qparams
parameter.

Comment thread src/llmcompressor/observers/base.py Outdated
Comment on lines +49 to +51
# Class attribute indicating whether this observer accumulates statistics across calls
# Memoryless observers should override this to False
accumulates_statistics: bool = True
Contributor

⚠️ Potential issue | 🟠 Major

Make accumulates_statistics affect the base lifecycle.

Right now the flag is descriptive only. Lines 143-145 never clear self.statistics, and Lines 155-176 never short-circuit synchronization, so a memoryless observer still carries stale values and participates in DDP sync unless every subclass reimplements both paths.

🧹 Small base-class fix
     g_idx = self._get_module_param("g_idx")
     observed = flatten_for_calibration(observed, self.base_name, self.args, g_idx)
+    if not self.accumulates_statistics:
+        self.statistics.clear()
     self._update_statistics(observed)
     return self

 def synchronize_statistics(self) -> List[dist.Work]:
+    if not self.accumulates_statistics:
+        return []
     """All-reduce accumulated min/max statistics across DDP ranks.

Also applies to: 143-145, 155-176

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/base.py` around lines 49 - 51, The
accumulates_statistics flag is currently only descriptive; update the base
observer lifecycle so memoryless observers opt out: in methods that reset state
(where self.statistics is cleared) add a guard checking accumulates_statistics
and skip clearing when False, and in synchronization/sync paths (the block
handling distributed/DDP sync) short-circuit or return early when
accumulates_statistics is False so those observers neither carry stale values
nor participate in DDP sync; reference the class attribute
accumulates_statistics, the instance attribute self.statistics, and the
lifecycle/sync methods in the base observer to implement these guards.

Comment thread src/llmcompressor/observers/base.py Outdated
Comment thread src/llmcompressor/observers/imatrix.py Outdated
Comment on lines +119 to +154
def _update_statistics(self, observed: torch.Tensor) -> None:
"""Compute and store min/max statistics using importance-weighted MSE grid search.

Note: Importance weights are collected via module hooks during attach().
"""
# Get existing global_scale from module if available
module_global_scale = self._get_module_param("global_scale")

# Perform importance-weighted MSE grid search for per-group/channel min/max
importance_weights = self._prepare_importance(observed)
min_vals, max_vals = _grid_search(
observed,
self.args,
self.maxshrink,
self.patience,
self.grid,
self.norm,
global_scale=self._get_module_param("global_scale"),
importance_weights=self._prepare_importance(observed),
global_scale=module_global_scale,
importance_weights=importance_weights,
)
self.statistics['min_vals'] = min_vals
self.statistics['max_vals'] = max_vals

def get_global_min_max(self, observed: torch.Tensor) -> MinMaxTuple:
# TODO: support importance weights here by deferring the reshape
# to the grid search call. Currently the base class reshapes
# observed to (1, 1, -1) which loses the channel layout needed
# for importance broadcasting.
return _grid_search(
observed,
# Perform MSE grid search for global min/max (uniform weighting)
global_observed = observed.reshape((1, 1, -1))
global_min, global_max = _grid_search(
global_observed,
self.args,
self.maxshrink,
self.patience,
self.grid,
self.norm,
optimize_global_scale=True,
)
self.statistics['global_min_vals'] = global_min
self.statistics['global_max_vals'] = global_max
Contributor

⚠️ Potential issue | 🟠 Major

Derive the new tensor-group gparam before the weighted search.

Line 125 reads module_global_scale, but Line 144 does not recompute global_min_vals/global_max_vals until after the weighted _grid_search() has already completed. For TENSOR_GROUP, the stored weighted min/max can therefore correspond to an older gparam than the global_scale returned later by get_qparams().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/imatrix.py` around lines 119 - 154, The weighted
min/max can be computed with a stale tensor-group gparam because the code reads
module_global_scale only once earlier; update _update_statistics to
recompute/derive the current tensor-group gparam (call
self._get_module_param("global_scale") or get_qparams() for TENSOR_GROUP)
immediately before performing the importance-weighted _grid_search so the
returned weighted min_vals/max_vals correspond to the same gparam used later for
global_min/global_max; ensure the same gparam value is used for both the
weighted and subsequent global searches (symbols: _update_statistics,
_get_module_param, _grid_search, get_qparams).

Comment thread tests/llmcompressor/modifiers/quantization/test_quantization_ddp.py Outdated
Collaborator

@brian-dellabetta brian-dellabetta left a comment

Things look good to me; the side effects that occurred in observers always made me nervous. I have one comment on the new observer API (it looks like the abstractmethods could be cleaned up a bit) and a few nits.

Comment thread src/llmcompressor/observers/base.py Outdated
else:
    global_scale = None

qparams = self._compute_qparams_from_statistics()
Collaborator

I like the QParamsDict, but it seems weird that self._compute_qparams_from_statistics() returns a QParamsDict whose global_scale will always be overwritten by the output of self._compute_gparams_from_statistics() on the next line. Can the API be reduced to having each observer satisfy a single self._compute_qparams_from_statistics() abstractmethod, which returns a QParamsDict with global_scale set if the quant strategy requires one? It looks like they're always called in conjunction.

Collaborator Author

yeah

for key in ("min_vals", "max_vals", "global_min_vals", "global_max_vals"):
    val = self.statistics.get(key)
    if val is not None:
        val.div_(world_size)
Collaborator

hmm, is this a bug?

Comment thread src/llmcompressor/observers/min_max.py Outdated

def get_min_max(self, observed: torch.Tensor) -> MinMaxTuple:
    return _get_min_max(observed)
accumulates_statistics = False
Collaborator

nit -- if it's the case that accumulates_statistics is False if and only if it is a "Memoryless" observer, consider renaming this to is_memoryless

Comment thread src/llmcompressor/observers/base.py Outdated
self.module = ref(module) if module is not None else None
self.base_name = base_name
self.args = args
self.statistics = {} # Dictionary to store observer statistics
Collaborator

nit -- consider a typed dict like you have for QParamsDict. It looks like there are only ever 4 keys -- min_vals, max_vals, global_min_vals, global_max_vals

Collaborator Author

Currently there are only 4 values, but it can theoretically expand.

@mergify
Contributor

mergify Bot commented Apr 10, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

Collaborator

@brian-dellabetta brian-dellabetta left a comment

Thanks for the updates, things LGTM. I just have some nits, the biggest being the re-scaling of weight_scale you are doing after fusing weight_global_scale. We should either split that out into a separate PR or add the changes to update_fused_layer_weight_global_scales.

module_name, _ = name.rsplit(".", 1)
old_global_scale = module.weight_global_scale
scale_adjustment = fused_global_scale / old_global_scale
module.weight_scale.data.mul_(scale_adjustment)
Collaborator

@brian-dellabetta brian-dellabetta Apr 10, 2026

Discussed with @HDCharles that we believe this is an oversight both in model_free_ptq and in update_fused_layer_weight_global_scales. If we want to change in this PR, we should probably propagate the changes for that function as well

if observer is None:
    continue
pending_comms.extend(observer.synchronize())
pending_comms.extend(observer.synchronize_observer())
Collaborator

nit -- any reason for the rename here? observer.synchronize() is a little more elegant, it's implicit what we're synchronizing

Comment on lines +92 to +93
should_calculate_gparam: bool = False, # Kept for backward compatibility, ignored
should_calculate_qparams: bool = True, # Kept for backward compatibility, ignored
Collaborator

nit -- consider raising a deprecation warning if these are set by the user to something other than the default value, or by using a sentinel
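One way the sentinel suggestion could look. The two flag names come from the excerpt above; the function name call_observer is assumed from the surrounding context, and the sentinel object, warning text, and remaining parameters are illustrative, not the actual signature in calibration.py:

import warnings

_UNSET = object()  # sentinel: distinguishes "not passed" from an explicit value

def call_observer(module, base_name, value=None,
                  should_calculate_gparam=_UNSET,
                  should_calculate_qparams=_UNSET):
    # warn only when a caller explicitly passes either deprecated flag
    if should_calculate_gparam is not _UNSET or should_calculate_qparams is not _UNSET:
        warnings.warn(
            "should_calculate_gparam / should_calculate_qparams are deprecated and ignored",
            DeprecationWarning,
            stacklevel=2,
        )
    ...  # existing observe / get_qparams flow continues here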

@mergify
Contributor

mergify Bot commented Apr 10, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

@mergify
Contributor

mergify Bot commented Apr 10, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

HDCharles added a commit that referenced this pull request Apr 11, 2026
…utes

- Add early return in Observer.synchronize_observer() for memoryless observers
- Skip memoryless observers in QuantizationMixin.sync_activation_observers()
- Fix test_quantization_ddp.py to use min_vals/max_vals instead of past_min_vals/past_max_vals

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
HDCharles added a commit that referenced this pull request Apr 11, 2026
- Add is_memoryless guard in Observer.synchronize_observer() and sync_activation_observers()
- Refactor call_observer() with incidence matrix pattern for qparam flags
- Add update_global_scale and update_scale_zp flags to control parameter storage
- Implement use_module_global_scale()/use_computed_global_scale() methods
- Remove auto-storage of global_scale from Observer.get_qparams()
- Fix dynamic quantization: skip scale/zp storage during calibration
- Fix DDP test: update attribute names from past_min_vals to min_vals
- Clean up test code: remove stale observer.statistics references
- Fix test compatibility: use tensor strategy for activation tests
- Remove unused variable assignments in tests

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@mergify mergify Bot removed the quality-failed label Apr 11, 2026
@mergify
Contributor

mergify Bot commented Apr 11, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

@mergify
Contributor

mergify Bot commented Apr 13, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

@mergify
Contributor

mergify Bot commented Apr 14, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

@mergify
Contributor

mergify Bot commented Apr 14, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

@mergify
Contributor

mergify Bot commented Apr 14, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

@mergify
Contributor

mergify Bot commented Apr 14, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

@mergify
Contributor

mergify Bot commented Apr 22, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @HDCharles.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Apr 22, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/llmcompressor/observers/hist.py (1)

1-49: ⚠️ Potential issue | 🔴 Critical

This file is unfinished and will fail at import time — please remove from the PR or guard the registration.

HistogramObserver is registered via @Observer.register("histogram") at module load, but the module cannot be imported successfully. There are several blocker-level defects:

  1. Line 19: ScaleZpTuple is referenced as a return-type annotation but is never imported nor defined in the module. Without from __future__ import annotations, the annotation is evaluated at function-definition time, causing NameError when the module is imported.
  2. Line 28: forward() returns (None, None). The Observer.forward contract requires returning self for chaining (e.g., observer(tensor).get_qparams()).
  3. Line 31: self._get_module_param(...) does not exist on Observer or its base classes; this raises AttributeError at runtime.
  4. Line 32: flatten_for_calibration(observed, self.base_name, self.args, g_idx) passes 4 arguments; the function signature accepts only 3 (value, base_name, args), raising TypeError.
  5. Line 34: xmin(xmax <= -xmin) calls xmin as a callable instead of performing multiplication. The expression should be xmin * (xmax <= -xmin) if mirroring the pattern from the previous line.
  6. Unimplemented contract: update_statistics() is not overridden; the base implementation raises NotImplementedError, causing failure whenever the observer processes a batch.
  7. Incomplete implementation: Lines 38–48 contain only comments; the histogram accumulation logic does not exist.

Until this is fleshed out, either remove the file, drop the @Observer.register("histogram") decorator, or move it behind a clearly-marked experimental flag so it cannot be selected via config.
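If the experimental-flag route is taken, one possible shape is sketched below. The environment variable name is purely hypothetical; Observer and the register decorator are the registry API already used in this file, and the assumed import path follows the repo layout:

import os

from llmcompressor.observers.base import Observer  # assumed path from the repo layout

if os.environ.get("LLMCOMPRESSOR_ENABLE_HISTOGRAM_OBSERVER"):
    # only register the unfinished observer when explicitly opted in,
    # so "histogram" cannot be selected from a config by accident
    @Observer.register("histogram")
    class HistogramObserver(Observer):
        ...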

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/hist.py` around lines 1 - 49, This module
registers HistogramObserver but is incomplete and will crash; either remove or
guard the `@Observer.register`("histogram") decorator so it cannot be loaded, and
then fix the implementation: import or use a string literal for the return type
ScaleZpTuple in HistogramObserver.forward (or add from __future__ import
annotations) so the annotation doesn't raise NameError; change forward to return
self (not (None, None)) to satisfy Observer chaining; in _add_batch replace the
nonexistent self._get_module_param call with the correct mechanism for group
index retrieval (use existing Observer attributes or pass g_idx from args), call
flatten_for_calibration with the correct 3-arg signature (value, base_name,
args), fix the xmin/xmax expression to use multiplication (xmin * (xmax <=
-xmin)) instead of calling xmin as a function, and implement or override
update_statistics plus the histogram initialization/accumulation logic (or stub
it to no-op) so the observer meets the base-class contract and does not raise at
runtime.
🧹 Nitpick comments (9)
src/llmcompressor/observers/min_max.py (1)

41-44: Nit: missing blank line between top-level class definitions.

PEP 8 recommends two blank lines between top-level classes — there's only one between StaticMinMaxObserver and MinMaxObserver. Easy formatting fix:

📝 Suggested fix
         else:
             self.min_vals = min_vals
             self.max_vals = max_vals

+
 `@Observer.register`("minmax")
 class MinMaxObserver(MovingAverageObserverBase):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/min_max.py` around lines 41 - 44, Add a blank
line to separate the top-level class definitions: ensure there are two blank
lines between the end of StaticMinMaxObserver and the start of MinMaxObserver
(the class registered as "minmax") so the file follows PEP 8; update the spacing
around the StaticMinMaxObserver class end and the MinMaxObserver class
declaration to include one additional empty line.
src/llmcompressor/observers/base.py (1)

143-150: Docstring formatting is broken.

The opening line is followed by stray indentation and a newline that breaks the rendered docstring (and Sphinx output). Also, the body says list of async communication handles but the class-level annotation already names them dist.Work — fine, but please clean up the indentation:

📝 Suggested fix
     def sync_activation_stats(self) -> List[dist.Work]:
-        """All-reduce accumulated activation statistics across DDP ranks.
-            
-            note: weight statistics don't need to be synced since weights
-        are synced across ranks, only data differs by rank.
-
-        :return: list of async communication handles
-        """
+        """
+        All-reduce accumulated activation statistics across DDP ranks.
+
+        Note: weight statistics don't need to be synced since weights
+        are synced across ranks; only data differs by rank.
+
+        :return: list of async communication handles
+        """
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/base.py` around lines 143 - 150, The docstring
for sync_activation_stats is broken by extra indentation and a stray newline;
reformat it into a proper triple-quoted docstring with a single-line summary, a
separating blank line, and then the longer description (e.g., "All-reduce
accumulated activation statistics across DDP ranks." followed by an indented
explanatory paragraph) and update the return line to read clearly (e.g.,
":return: list of dist.Work async communication handles") so there are no stray
indents or wrapped lines inside sync_activation_stats().
src/llmcompressor/modifiers/gptq/base.py (1)

215-228: Extract the duplicated activation-sync + observe + compress sequence.

The two event branches (lines 216–219 and 221–224) share an identical 4-step sequence (sync_obs_act_stats → update_activation_qparams → observe(...) → compress_modules), differing only in the trailing on_end call. Extracting a helper reduces maintenance risk as the two-phase calibration pathway evolves.

♻️ Proposed refactor
+    def _finalize_epoch(self, state: State):
+        self.sync_obs_act_stats(state.model)
+        self.update_activation_qparams(state.model)
+        observe(self._num_samples.keys(), base_name="weight")
+        self.compress_modules()
+
     def on_event(self, state: State, event: Event, **kwargs):
         if event.type_ == EventType.CALIBRATION_EPOCH_START:
             if not self.started_:
                 self.on_start(state, None)
 
         if event.type_ == EventType.SEQUENTIAL_EPOCH_END:
-            self.sync_obs_act_stats(state.model)
-            self.update_activation_qparams(state.model)
-            observe(self._num_samples.keys(), base_name="weight")
-            self.compress_modules()
+            self._finalize_epoch(state)
 
         if event.type_ == EventType.CALIBRATION_EPOCH_END:
-            self.sync_obs_act_stats(state.model)
-            self.update_activation_qparams(state.model)
-            observe(self._num_samples.keys(), base_name="weight")
-            self.compress_modules()
+            self._finalize_epoch(state)
 
             if not self.ended_:
                 self.on_end(state, None)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/modifiers/gptq/base.py` around lines 215 - 228, The two
event handlers for EventType.SEQUENTIAL_EPOCH_END and
EventType.CALIBRATION_EPOCH_END duplicate the four-step sequence
sync_obs_act_stats(state.model) → update_activation_qparams(state.model) →
observe(self._num_samples.keys(), base_name="weight") → compress_modules();
extract that sequence into a private helper (e.g.
_sync_update_observe_compress(state)) and call it from both branches to remove
duplication, then keep the extra on_end call only in the CALIBRATION_EPOCH_END
branch invoking on_end(state, None) when not self.ended_. Ensure the helper
calls the same methods in the same order and uses state.model and
self._num_samples.keys() so behavior is preserved.
src/llmcompressor/observers/imatrix.py (2)

35-38: Annotate _act_sync_dict as ClassVar (Ruff RUF012).

It's effectively a class-level constant, so giving it a ClassVar annotation silences RUF012 and documents intent.

♻️ Suggested change
+from typing import ClassVar
@@
-    _act_sync_dict = {
-        "_imatrix_sum": dist.ReduceOp.SUM,
-        "_imatrix_count": dist.ReduceOp.SUM,
-    }
+    _act_sync_dict: ClassVar[dict] = {
+        "_imatrix_sum": dist.ReduceOp.SUM,
+        "_imatrix_count": dist.ReduceOp.SUM,
+    }

As per static analysis hints (Ruff RUF012).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/imatrix.py` around lines 35 - 38, The
_act_sync_dict is a class-level constant and should be annotated as a ClassVar
to satisfy Ruff RUF012; update the declaration of _act_sync_dict in the imatrix
observer class to use typing.ClassVar (e.g., annotate its type as
ClassVar[dict[str, dist.ReduceOp]] or an appropriate mapping type) and import
ClassVar from typing if not already imported so the linter recognizes it as a
class variable.

115-118: Trivial: en-dashes in docstring (Ruff RUF002).

Case 1 – / Case 2 – use EN DASH (–); replace with - to silence RUF002.

As per static analysis hints (Ruff RUF002).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/imatrix.py` around lines 115 - 118, Docstring
uses EN DASH characters in the lines "Case 1 –" and "Case 2 –"; replace those EN
DASHes with ASCII hyphens ("-") in src/llmcompressor/observers/imatrix.py so
Ruff RUF002 is silenced—edit the docstring near the reference to
observer.attach() (the block that reads "Case 1 – accumulators..." / "Case 2 –
no accumulators...") to use "Case 1 - ..." and "Case 2 - ...".
src/llmcompressor/modifiers/gptq/gptq_quantize.py (1)

92-92: Direct attribute access — confirm weight_observer is always attached at this call site.

module.weight_observer will AttributeError if start_calibration / _initialize_observers did not run for this module (e.g. compression path invoked outside the standard GPTQ lifecycle, or a target that didn't match). Consider getattr(module, "weight_observer", None) with a clearer error message, or assert it explicitly so the failure mode is obvious.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/modifiers/gptq/gptq_quantize.py` at line 92, Replace the
direct attribute access to module.weight_observer in gptq_quantize.py with a
safe check: use getattr(module, "weight_observer", None) (or assert its
presence) before assigning to observer and raise or log a clear, actionable
error mentioning the module and that start_calibration/_initialize_observers may
not have run if it's missing; this ensures module.weight_observer is present or
fails with an explicit message instead of an AttributeError.
tests/llmcompressor/modifiers/utils/test_update_fused_layer_weights.py (1)

176-176: Asserting on _fused_observers reaches into observer internals.

If Observer.fuse([]) exposes a public way to inspect the fused set (e.g. a length helper or a is_fused flag), prefer that over poking at _fused_observers. Not blocking — just keeps the test resilient if the internal name changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/llmcompressor/modifiers/utils/test_update_fused_layer_weights.py` at
line 176, The test is reaching into observer internals by asserting on
layer.weight_observer._fused_observers; update the test to use a public observer
API instead (e.g., call Observer.fuse([]) helper or a public accessor on
weight_observer such as fused_count() or is_fused(...) rather than
_fused_observers). If the public accessor does not exist, add a small public
helper on the Observer class (e.g., fused_count() returning the fused set size
or is_fused(item) boolean) and replace the assertion with something like assert
layer.weight_observer.fused_count() == 0 or assert not
layer.weight_observer.is_fused(<relevant_key>) to avoid poking at
_fused_observers.
src/llmcompressor/modifiers/quantization/quantization/base.py (1)

79-92: Tighten the epoch-end branch.

A few small improvements:

  • Comment on line 79 is incomplete (# after calibration,) and line 84 is a blank-with-indent.
  • Use event.type_ in (...) instead of two == clauses joined with or.
  • sync_obs_act_stats and update_activation_qparams already iterate via match_named_modules internally, so building a separate modules list and passing state.model to the mixin methods is mildly inconsistent. Either pass the precomputed iterable down or always re-match — your call.
♻️ Suggested cleanup
-        # after calibration, 
-        if (
-            event.type_ == EventType.SEQUENTIAL_EPOCH_END or 
-            event.type_ == EventType.CALIBRATION_EPOCH_END
-        ):
-            
-            modules = [m for _, m in match_named_modules(
-                state.model, self.resolved_targets, self.ignore
-            )]
-
-            self.sync_obs_act_stats(state.model)
-            self.update_activation_qparams(state.model)
-            observe(modules, base_name="weight")
-            update_qparams(modules, base_name="weight")
+        # On epoch end, sync activation stats and recompute weight qparams.
+        if event.type_ in (
+            EventType.SEQUENTIAL_EPOCH_END,
+            EventType.CALIBRATION_EPOCH_END,
+        ):
+            modules = [
+                m for _, m in match_named_modules(
+                    state.model, self.resolved_targets, self.ignore
+                )
+            ]
+
+            self.sync_obs_act_stats(state.model)
+            self.update_activation_qparams(state.model)
+            observe(modules, base_name="weight")
+            update_qparams(modules, base_name="weight")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/modifiers/quantization/quantization/base.py` around lines
79 - 92, Fix the incomplete comment and tighten the epoch-end branch: replace
the comment "# after calibration," with a short complete sentence, change the
condition to use membership testing (if event.type_ in
(EventType.SEQUENTIAL_EPOCH_END, EventType.CALIBRATION_EPOCH_END):), compute the
module list once using match_named_modules (modules = [m for _, m in
match_named_modules(state.model, self.resolved_targets, self.ignore)]), and then
pass that precomputed modules iterable into self.sync_obs_act_stats(...) and
self.update_activation_qparams(...) (instead of passing state.model) before
calling observe(modules, base_name="weight") and update_qparams(modules,
base_name="weight"); this avoids redundant matching and keeps the flow
consistent (refer to match_named_modules, sync_obs_act_stats,
update_activation_qparams, observe, update_qparams).
tests/llmcompressor/observers/test_imatrix.py (1)

73-89: Stale docstring after refactor.

The test name and docstring (test_global_scale_then_forward_tensor_group, "Full flow: global_scale -> forward must produce valid qparams") describe a two-call sequence that no longer exists — both global_scale and scale are now produced by a single get_qparams() call. This test now overlaps almost entirely with test_global_scale_tensor_group_does_not_crash above (same module, same args, same strategy); the only difference is which key is asserted on. Consider either merging the two cases or updating the name/docstring to reflect what's actually being verified (e.g., that scale is finite for TENSOR_GROUP).

♻️ Possible consolidation
-    def test_global_scale_then_forward_tensor_group(self):
-        """Full flow: global_scale -> forward must produce valid qparams."""
-        module = _make_linear_with_importance(in_features=8, out_features=4)
-        args = QuantizationArgs(
-            num_bits=4,
-            symmetric=True,
-            strategy="tensor_group",
-            group_size=4,
-            observer="imatrix_mse",
-        )
-        observer = Observer.load_from_registry(
-            "imatrix_mse", base_name="weight", args=args
-        )
-        observer.attach(module)
-
-        qparams = observer(module.weight).get_qparams()
-        assert torch.isfinite(qparams["scale"]).all()
+    def test_tensor_group_scale_is_finite(self):
+        """Scale (in addition to global_scale) must be finite for TENSOR_GROUP."""
+        module = _make_linear_with_importance(in_features=8, out_features=4)
+        observer = _make_observer(module, strategy="tensor_group", group_size=4)
+        qparams = observer(module.weight).get_qparams()
+        assert torch.isfinite(qparams["scale"]).all()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/llmcompressor/observers/test_imatrix.py` around lines 73 - 89, Rename
or update the test function test_global_scale_then_forward_tensor_group to
reflect the current behavior (both scale and global_scale come from a single
get_qparams() call): either merge this test with the existing
test_global_scale_tensor_group_does_not_crash (since they use the same module,
args, and strategy) or change the function name and docstring to state that it
verifies that get_qparams() returns a finite "scale" for the TENSOR_GROUP
strategy; update references to observer(module.weight).get_qparams() and the
asserted key "scale" accordingly so the test description matches the actual
check.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/llmcompressor/modifiers/awq/base.py`:
- Around line 253-263: The AWQ modifier calls the wrong method name
sync_obs_act_status in the SEQUENTIAL_EPOCH_END and CALIBRATION_EPOCH_END
handlers; replace both calls with the correct QuantizationMixin method name
sync_obs_act_stats so the synchronization step runs (keep the surrounding calls
to update_activation_qparams and _apply_smoothing unchanged).

In `@src/llmcompressor/modifiers/gptq/gptq_quantize.py`:
- Around line 86-95: The quantize_weight routine is missing upfront shape
standardization for Conv2d and transformers.Conv1D, causing mismatched indexing;
modify quantize_weight to, when module is torch.nn.Conv2d, call W = W.flatten(1)
and when module is transformers.Conv1D call W.transpose_(0,1) before converting
W to GPTQ_PRECISION and computing num_rows/num_columns, and then apply the
inverse transforms (for Conv1D transpose back and for Conv2d reshape back to
original 4-D) just before the final reshape/assignment step around where the
current inverse Conv1D.transpose_(0,1) is applied so all column/row operations
in the GPTQ loop operate on a 2-D (out x in) matrix.

In `@src/llmcompressor/modifiers/quantization/calibration.py`:
- Around line 87-127: The observe and update_qparams helpers currently branch on
isinstance(module, Iterable), which treats container nn.Module subclasses (e.g.,
nn.Sequential, nn.ModuleList/Dict) as iterables and recurses into their
children; change the check to only treat explicit sequence/collection types (for
example isinstance(module, (list, tuple)) or isinstance(module, Sequence)) so
container modules are handled as single modules; update both observe and
update_qparams to use the narrower type check and ensure any required import
(e.g., from collections.abc import Sequence) is added if you choose Sequence.

In `@src/llmcompressor/observers/base.py`:
- Around line 84-95: Replace the runtime-only asserts with explicit exceptions
so missing preconditions always fail: in the method that begins with "assert
self.has_statistics" (the get_qparams / calculate_qparams block), check if not
self.has_statistics and raise a clear exception (e.g., RuntimeError or
ValueError) with a message instructing to call observer(value) first; similarly,
inside the loop over self._fused_observers replace "assert obs.has_statistics"
with a conditional that raises a descriptive exception referencing the fused
observer (e.g., RuntimeError("Fused observer X has no statistics; run
observer(...) on all fused partners first")), so failures are deterministic even
under PYTHONOPTIMIZE and before calling generate_gparam.

In `@src/llmcompressor/observers/helpers.py`:
- Around line 167-195: fuse_weight_observers can mistakenly match non-linear
attributes; add a concrete layer-type guard (e.g., ensure each layer in layers
is an instance of torch.nn.Linear or the expected Linear subclass) before
reading layer.weight_observer in fuse_weight_observers, and then prevent
duplicate fusion by checking Observer._fused_observers (or a provided API on
Observer) to skip fusing the same observer objects twice—ensure you reference
FUSED_LAYER_NAMES, fuse_weight_observers, weight_observer, and Observer.fuse
when locating where to add the isinstance check and the de-duplication guard;
optionally also consider clearing or resetting Observer._fused_observers at the
start/end of calibration (freeze_module_quantization / start_calibration /
end_calibration) to avoid accumulation across repeated calibrations.

In `@src/llmcompressor/pipelines/sequential/pipeline.py`:
- Around line 103-105: The sequential pipeline unconditionally disables
quantization during calibration by always entering DisableQuantization(model)
inside the calibration context, so the DatasetArguments/oneshot() parameter
quantization_aware_calibration is ignored; update the pipeline.py logic to
respect that parameter (or deprecate it). Specifically, locate the calibration
block that uses calibration_forward_context(model) and
DisableQuantization(model) and change it so that it checks the
quantization_aware_calibration flag passed into the pipeline/oneshot call (or
via DatasetArguments) and only enters DisableQuantization when
quantization_aware_calibration is False; if you choose to remove the parameter
instead, add a deprecation notice in oneshot() and DatasetArguments and delete
usages/readers of quantization_aware_calibration to avoid confusion.

In `@tests/llmcompressor/modifiers/calibration/test_lifecycle.py`:
- Line 242: The test is calling a non-existent attribute
linear.linear.input_observer which will raise AttributeError; change the
reference to the observer attached by initialize_observer by using
linear.input_observer.get_qparams() (i.e., call input_observer directly on the
Linear instance) so the hook uses the correct symbol (linear,
initialize_observer, input_observer, get_qparams).

In `@tests/llmcompressor/modifiers/utils/test_update_fused_layer_weights.py`:
- Around line 200-219: The tests call module.q_proj.weight_observer() and
module.k_proj.weight_observer() with no arguments which raises TypeError because
Observer.forward requires the tensor; update those calls to pass the weight
tensors (i.e., call module.q_proj.weight_observer(module.q_proj.weight) and
module.k_proj.weight_observer(module.k_proj.weight)) so that the observers
receive the observed tensor before calling update_qparams (references:
weight_observer on q_proj and k_proj, and update_qparams).

---

Outside diff comments:
In `@src/llmcompressor/observers/hist.py`:
- Around line 1-49: This module registers HistogramObserver but is incomplete
and will crash; either remove or guard the `@Observer.register`("histogram")
decorator so it cannot be loaded, and then fix the implementation: import or use
a string literal for the return type ScaleZpTuple in HistogramObserver.forward
(or add from __future__ import annotations) so the annotation doesn't raise
NameError; change forward to return self (not (None, None)) to satisfy Observer
chaining; in _add_batch replace the nonexistent self._get_module_param call with
the correct mechanism for group index retrieval (use existing Observer
attributes or pass g_idx from args), call flatten_for_calibration with the
correct 3-arg signature (value, base_name, args), fix the xmin/xmax expression
to use multiplication (xmin * (xmax <= -xmin)) instead of calling xmin as a
function, and implement or override update_statistics plus the histogram
initialization/accumulation logic (or stub it to no-op) so the observer meets
the base-class contract and does not raise at runtime.

---

Nitpick comments:
In `@src/llmcompressor/modifiers/gptq/base.py`:
- Around line 215-228: The two event handlers for EventType.SEQUENTIAL_EPOCH_END
and EventType.CALIBRATION_EPOCH_END duplicate the four-step sequence
sync_obs_act_stats(state.model) → update_activation_qparams(state.model) →
observe(self._num_samples.keys(), base_name="weight") → compress_modules();
extract that sequence into a private helper (e.g.
_sync_update_observe_compress(state)) and call it from both branches to remove
duplication, then keep the extra on_end call only in the CALIBRATION_EPOCH_END
branch invoking on_end(state, None) when not self.ended_. Ensure the helper
calls the same methods in the same order and uses state.model and
self._num_samples.keys() so behavior is preserved.

In `@src/llmcompressor/modifiers/gptq/gptq_quantize.py`:
- Line 92: Replace the direct attribute access to module.weight_observer in
gptq_quantize.py with a safe check: use getattr(module, "weight_observer", None)
(or assert its presence) before assigning to observer and raise or log a clear,
actionable error mentioning the module and that
start_calibration/_initialize_observers may not have run if it's missing; this
ensures module.weight_observer is present or fails with an explicit message
instead of an AttributeError.

In `@src/llmcompressor/modifiers/quantization/quantization/base.py`:
- Around line 79-92: Fix the incomplete comment and tighten the epoch-end
branch: replace the comment "# after calibration," with a short complete
sentence, change the condition to use membership testing (if event.type_ in
(EventType.SEQUENTIAL_EPOCH_END, EventType.CALIBRATION_EPOCH_END):), compute the
module list once using match_named_modules (modules = [m for _, m in
match_named_modules(state.model, self.resolved_targets, self.ignore)]), and then
pass that precomputed modules iterable into self.sync_obs_act_stats(...) and
self.update_activation_qparams(...) (instead of passing state.model) before
calling observe(modules, base_name="weight") and update_qparams(modules,
base_name="weight"); this avoids redundant matching and keeps the flow
consistent (refer to match_named_modules, sync_obs_act_stats,
update_activation_qparams, observe, update_qparams).

In `@src/llmcompressor/observers/base.py`:
- Around line 143-150: The docstring for sync_activation_stats is broken by
extra indentation and a stray newline; reformat it into a proper triple-quoted
docstring with a single-line summary, a separating blank line, and then the
longer description (e.g., "All-reduce accumulated activation statistics across
DDP ranks." followed by an indented explanatory paragraph) and update the return
line to read clearly (e.g., ":return: list of dist.Work async communication
handles") so there are no stray indents or wrapped lines inside
sync_activation_stats().

In `@src/llmcompressor/observers/imatrix.py`:
- Around line 35-38: The _act_sync_dict is a class-level constant and should be
annotated as a ClassVar to satisfy Ruff RUF012; update the declaration of
_act_sync_dict in the imatrix observer class to use typing.ClassVar (e.g.,
annotate its type as ClassVar[dict[str, dist.ReduceOp]] or an appropriate
mapping type) and import ClassVar from typing if not already imported so the
linter recognizes it as a class variable.
- Around line 115-118: Docstring uses EN DASH characters in the lines "Case 1 –"
and "Case 2 –"; replace those EN DASHes with ASCII hyphens ("-") in
src/llmcompressor/observers/imatrix.py so Ruff RUF002 is silenced—edit the
docstring near the reference to observer.attach() (the block that reads "Case 1
– accumulators..." / "Case 2 – no accumulators...") to use "Case 1 - ..." and
"Case 2 - ...".

In `@src/llmcompressor/observers/min_max.py`:
- Around line 41-44: Add a blank line to separate the top-level class
definitions: ensure there are two blank lines between the end of
StaticMinMaxObserver and the start of MinMaxObserver (the class registered as
"minmax") so the file follows PEP 8; update the spacing around the
StaticMinMaxObserver class end and the MinMaxObserver class declaration to
include one additional empty line.

In `@tests/llmcompressor/modifiers/utils/test_update_fused_layer_weights.py`:
- Line 176: The test is reaching into observer internals by asserting on
layer.weight_observer._fused_observers; update the test to use a public observer
API instead (e.g., call Observer.fuse([]) helper or a public accessor on
weight_observer such as fused_count() or is_fused(...) rather than
_fused_observers). If the public accessor does not exist, add a small public
helper on the Observer class (e.g., fused_count() returning the fused set size
or is_fused(item) boolean) and replace the assertion with something like assert
layer.weight_observer.fused_count() == 0 or assert not
layer.weight_observer.is_fused(<relevant_key>) to avoid poking at
_fused_observers.

In `@tests/llmcompressor/observers/test_imatrix.py`:
- Around line 73-89: Rename or update the test function
test_global_scale_then_forward_tensor_group to reflect the current behavior
(both scale and global_scale come from a single get_qparams() call): either
merge this test with the existing test_global_scale_tensor_group_does_not_crash
(since they use the same module, args, and strategy) or change the function name
and docstring to state that it verifies that get_qparams() returns a finite
"scale" for the TENSOR_GROUP strategy; update references to
observer(module.weight).get_qparams() and the asserted key "scale" accordingly
so the test description matches the actual check.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e2242870-00ac-472e-ba37-645807e42076

📥 Commits

Reviewing files that changed from the base of the PR and between 755bc04 and 5076faa.

📒 Files selected for processing (31)
  • src/llmcompressor/entrypoints/model_free/lifecycle.py
  • src/llmcompressor/entrypoints/model_free/process.py
  • src/llmcompressor/modifiers/awq/base.py
  • src/llmcompressor/modifiers/gptq/base.py
  • src/llmcompressor/modifiers/gptq/gptq_quantize.py
  • src/llmcompressor/modifiers/quantization/calibration.py
  • src/llmcompressor/modifiers/quantization/quantization/base.py
  • src/llmcompressor/modifiers/quantization/quantization/mixin.py
  • src/llmcompressor/modifiers/transform/imatrix/base.py
  • src/llmcompressor/modifiers/utils/__init__.py
  • src/llmcompressor/modifiers/utils/helpers.py
  • src/llmcompressor/observers/base.py
  • src/llmcompressor/observers/helpers.py
  • src/llmcompressor/observers/hist.py
  • src/llmcompressor/observers/imatrix.py
  • src/llmcompressor/observers/min_max.py
  • src/llmcompressor/observers/moving_base.py
  • src/llmcompressor/observers/mse.py
  • src/llmcompressor/pipelines/basic/pipeline.py
  • src/llmcompressor/pipelines/sequential/pipeline.py
  • src/llmcompressor/utils/helpers.py
  • tests/llmcompressor/modifiers/calibration/test_lifecycle.py
  • tests/llmcompressor/modifiers/calibration/test_observers.py
  • tests/llmcompressor/modifiers/quantization/test_quantization_ddp.py
  • tests/llmcompressor/modifiers/utils/test_update_fused_layer_weights.py
  • tests/llmcompressor/observers/test_helpers.py
  • tests/llmcompressor/observers/test_imatrix.py
  • tests/llmcompressor/observers/test_min_max.py
  • tests/llmcompressor/observers/test_mse.py
  • tests/llmcompressor/observers/test_mse_vs_minmax.py
  • tests/llmcompressor/utils/test_distributed.py
💤 Files with no reviewable changes (3)
  • src/llmcompressor/modifiers/utils/__init__.py
  • src/llmcompressor/utils/helpers.py
  • src/llmcompressor/modifiers/utils/helpers.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/llmcompressor/observers/test_min_max.py
  • src/llmcompressor/modifiers/quantization/quantization/mixin.py
  • tests/llmcompressor/utils/test_distributed.py

Comment thread src/llmcompressor/modifiers/gptq/gptq_quantize.py
Comment on lines 86 to 95
    actorder = quant_args.actorder
    global_scale = getattr(module, "weight_global_scale", None)
    final_shape = module.weight.shape
    final_dtype = module.weight.dtype
    W = module.weight.clone()
    H = hessian

    # create observer for calculating quantization parameters
    observer = Observer.load_from_registry(
        quant_args.observer if quant_args.observer else "memoryless_minmax",
        base_name="weight",
        args=quant_args,
        module=module,
    )
    observer = module.weight_observer

    # standardize shape and dtype
    match module:
        case torch.nn.Conv2d():
            W = W.flatten(1)
        case transformers.Conv1D():
            W.transpose_(0, 1)
    W = W.to(dtype=GPTQ_PRECISION)
    num_rows = W.shape[0]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Confirm whether Conv1D / Conv2d are still routed through quantize_weight
rg -nP -C3 '\b(Conv1D|Conv2d)\b' src/llmcompressor/modifiers/gptq/
rg -nP -C3 'quantize_weight\s*\(' --type=py
# Look for any guards that exclude Conv layers before calling quantize_weight
rg -nP -C3 'isinstance\([^)]+,\s*(transformers\.)?Conv1D' --type=py

Repository: vllm-project/llm-compressor

Length of output: 8173


🏁 Script executed:

# Read the quantize_weight function from start to verify shape handling
sed -n '68,160p' src/llmcompressor/modifiers/gptq/gptq_quantize.py | cat -n

Repository: vllm-project/llm-compressor

Length of output: 4064


🏁 Script executed:

sed -n '38,66p' src/llmcompressor/modifiers/gptq/gptq_quantize.py | cat -n

Repository: vllm-project/llm-compressor

Length of output: 1055


🏁 Script executed:

# Check sparsegpt_sparsify.py to see how it handles shape standardization
sed -n '80,110p' src/llmcompressor/modifiers/pruning/sparsegpt/sgpt_sparsify.py | cat -n

Repository: vllm-project/llm-compressor

Length of output: 1372


🏁 Script executed:

# Check if there are any guards/filters that prevent Conv2d/Conv1D from reaching quantize_weight
rg -B5 'quantize_weight\(' src/llmcompressor/modifiers/gptq/base.py | head -40

Repository: vllm-project/llm-compressor

Length of output: 328


🏁 Script executed:

# Get more context around the quantize_weight call in base.py
sed -n '290,320p' src/llmcompressor/modifiers/gptq/base.py | cat -n

Repository: vllm-project/llm-compressor

Length of output: 1592


🏁 Script executed:

# Check the logic for which modules get quantized - look for module type checks
rg -B10 'compress_module_list\(' src/llmcompressor/modifiers/gptq/base.py | head -60

Repository: vllm-project/llm-compressor

Length of output: 1011


🏁 Script executed:

# Find where accumulate_hessian is called to see module filtering
rg -B5 'accumulate_hessian\(' src/llmcompressor/modifiers/gptq/ | head -80

Repository: vllm-project/llm-compressor

Length of output: 1103


🏁 Script executed:

# Check weight_observer implementation to verify global_scale key
rg -A10 'class.*Observer' src/llmcompressor/modifiers/gptq/ | head -80

Repository: vllm-project/llm-compressor

Length of output: 53


🏁 Script executed:

# Search for weight_observer setup and get_qparams implementation
rg 'weight_observer' src/llmcompressor/modifiers/gptq/ -A3 | head -50

Repository: vllm-project/llm-compressor

Length of output: 371


🏁 Script executed:

# Find where weight_observer is created
rg 'weight_observer\s*=' src/llmcompressor/ -B3 -A3 | head -100

Repository: vllm-project/llm-compressor

Length of output: 53


🏁 Script executed:

# Search for observer setup in the quantization scheme
rg 'Observer' src/llmcompressor/modifiers/gptq/ --type=py -l

Repository: vllm-project/llm-compressor

Length of output: 53


🏁 Script executed:

# Find where get_qparams is defined
rg 'def get_qparams' src/llmcompressor/ --type=py -B2 -A10 | head -100

Repository: vllm-project/llm-compressor

Length of output: 916


🏁 Script executed:

# Find compute_qparams_from_statistics to verify it returns global_scale
rg 'def compute_qparams_from_statistics' src/llmcompressor/ --type=py -A30 | head -100

Repository: vllm-project/llm-compressor

Length of output: 2580


🏁 Script executed:

# Check test to see if Conv2d is tested with GPTQ
cat -n src/llmcompressor/modifiers/gptq/test_gptq_quantize.py | head -100

Repository: vllm-project/llm-compressor

Length of output: 158


🏁 Script executed:

# Find test files for GPTQ
find . -name '*test*gptq*' -o -name '*gptq*test*' 2>/dev/null | head -20

Repository: vllm-project/llm-compressor

Length of output: 191


🏁 Script executed:

# Check test_gptq_quantize.py to see what module types are tested
cat -n tests/llmcompressor/modifiers/gptq/test_gptq_quantize.py | head -80

Repository: vllm-project/llm-compressor

Length of output: 1422


Restore upfront shape standardization for Conv1D/Conv2d weights.

The code path removed the initial Conv2d.flatten(1) and Conv1D.transpose_(0, 1) operations that convert weights to 2D before quantization, but only applies the inverse Conv1D.transpose_(0, 1) at the end (line 268). This causes shape mismatches in the quantization loop:

  • Conv1D: Weight remains (in, out) so W.shape[1] = out, but H.shape[0] = in from accumulate_hessian. Column indexing in the GPTQ loop will be misaligned.
  • Conv2d: Weight is 4-D (out, in, kH, kW), but line 101 executes W[:, dead] = 0 expecting 2-D, and num_columns = W.shape[1] gives incorrect results.

Compare with sparsegpt_sparsify.py (lines 89-93), which preserves this standardization:

if isinstance(module, torch.nn.Conv2d):
    W = W.flatten(1)
elif isinstance(module, transformers.Conv1D):
    W.transpose_(0, 1)

Add equivalent guards at the start of quantize_weight before line 88 where num_rows and num_columns are calculated, with matching inverse operations before reshaping at line 269.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/modifiers/gptq/gptq_quantize.py` around lines 86 - 95, The
quantize_weight routine is missing upfront shape standardization for Conv2d and
transformers.Conv1D, causing mismatched indexing; modify quantize_weight to,
when module is torch.nn.Conv2d, call W = W.flatten(1) and when module is
transformers.Conv1D call W.transpose_(0,1) before converting W to GPTQ_PRECISION
and computing num_rows/num_columns, and then apply the inverse transforms (for
Conv1D transpose back and for Conv2d reshape back to original 4-D) just before
the final reshape/assignment step around where the current inverse
Conv1D.transpose_(0,1) is applied so all column/row operations in the GPTQ loop
operate on a 2-D (out x in) matrix.

Comment on lines +87 to 127
def observe(
module: Module | Iterable[Module],
base_name: str,
value: Optional[torch.Tensor] = None,
should_calculate_gparam: bool = False,
should_calculate_qparams: bool = True,
):
"""
Call a module's attached input/weight/output observer using a provided value.
Update the module's scale and zp using the observer's return values.
Run observers to accumulate statistics on modules.
Must be called before update_qparams.

:param module: torch.nn.Module
:param base_name: substring used to fetch the observer, scales, and zp
:param value: torch.Tensor to be passed to the observer for activations. If
base_name is "weight", then the module's weight tensor will be used
:param module: module or iterable of modules with observer attributes
:param base_name: substring used to fetch the observer and value to observe
"""
with align_module_device(module):
if value is None and base_name == "weight":
value = module.weight
observer: Observer = getattr(module, f"{base_name}_observer")

if should_calculate_gparam:
global_scale = observer.get_global_scale(value)
update_offload_parameter(module, f"{base_name}_global_scale", global_scale)

if should_calculate_qparams:
scale, zero_point = observer(value)
update_offload_parameter(module, f"{base_name}_scale", scale)
if hasattr(module, f"{base_name}_zero_point"):
update_offload_parameter(module, f"{base_name}_zero_point", zero_point)


def update_weight_global_scale(module: Module):
if getattr_chain(module, "quantization_scheme.weights", None) is None:
if isinstance(module, Iterable):
for m in module:
observe(m, base_name)
return

if (
getattr_chain(module, "quantization_scheme.weights.strategy", None)
!= QuantizationStrategy.TENSOR_GROUP
):
observer = getattr(module, f"{base_name}_observer", None)
if observer is None:
return
with align_module_device(module):
observer(getattr(module, base_name))

call_observer(
module,
base_name="weight",
should_calculate_gparam=True,
should_calculate_qparams=False,
)


def update_weight_zp_scale(module: Module):
"""
marks a layer as ready for calibration which activates observers
to update scales and zero points on each forward pass

apply to full model with `model.apply(update_weight_zp_scale)`

:param module: module to set for calibration
:param quantize_weights_upfront: whether to automatically
run weight quantization at the start of calibration
def update_qparams(
module: Module | Iterable[Module],
base_name: str,
):
"""
if getattr_chain(module, "quantization_scheme.weights", None) is None:
return

if getattr(module, "quantization_status", None) != QuantizationStatus.CALIBRATION:
logger.warning(
"Attempting to calibrate weights of a module not in calibration mode"
)
Compute quantization parameters from observer statistics and store on module.

call_observer(module=module, base_name="weight")
For dynamic quantization, scale/zp updates are skipped (scale/zp are
computed at inference time). For non-TENSOR_GROUP strategies, global_scale
is None and naturally skipped.


def calibrate_activations(module: Module, value: torch.Tensor, base_name: str):
"""
Calibrate input or output activations by calling the a module's attached
observer.

:param module: torch.nn.Module
:param module: torch.nn.Module with attached observer (or iterable of modules)
:param base_name: substring used to fetch the observer, scales, and zp
:param value: torch.Tensor to be passed to the observer

"""
# If empty tensor, can't update zp/scale
# Case for MoEs
if value.numel() == 0:
if isinstance(module, Iterable):
for m in module:
update_qparams(m, base_name)
return
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

isinstance(module, Iterable) will treat nn.Sequential / nn.ModuleList as a list of modules.

nn.Sequential, nn.ModuleList, and nn.ModuleDict all implement __iter__, so they satisfy isinstance(x, typing.Iterable). The current callers in this PR happen to pass either a single Linear or a list comprehension, but anyone passing a container module — or any future caller — will silently see the function recurse into the container's children rather than treat the container itself as the target.

Recommend dispatching on a narrower type to avoid surprising behavior. Same applies to update_qparams on Lines 124–127.

♻️ Suggested fix
-    if isinstance(module, Iterable):
+    if not isinstance(module, Module) and isinstance(module, Iterable):
         for m in module:
             observe(m, base_name)
         return

(or isinstance(module, (list, tuple)) if explicit support for arbitrary iterables isn't required)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/modifiers/quantization/calibration.py` around lines 87 -
127, The observe and update_qparams helpers currently branch on
isinstance(module, Iterable), which treats container nn.Module subclasses (e.g.,
nn.Sequential, nn.ModuleList/Dict) as iterables and recurses into their
children; change the check to only treat explicit sequence/collection types (for
example isinstance(module, (list, tuple)) or isinstance(module, Sequence)) so
container modules are handled as single modules; update both observe and
update_qparams to use the narrower type check and ensure any required import
(e.g., from collections.abc import Sequence) is added if you choose Sequence.

Comment on lines +84 to +95
        assert self.has_statistics, "No statistics available. Call observer(value) first."

        global_scale = None
        if self.args.strategy == QuantizationStrategy.TENSOR_GROUP:
            global_absmax = torch.max(-self.min_vals.min(), self.max_vals.max())
            for obs in self._fused_observers:
                assert obs.has_statistics, "All fused observers must be run before get_qparams."
                global_absmax = torch.max(global_absmax, -obs.min_vals.min())
                global_absmax = torch.max(global_absmax, obs.max_vals.max())
            global_scale = generate_gparam(
                -global_absmax.reshape(1), global_absmax.reshape(1)
            )
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Use exceptions instead of assert for invariants users can hit.

Both assert self.has_statistics (Line 84) and assert obs.has_statistics (Line 90) become no-ops when Python is run with -O/PYTHONOPTIMIZE. If a caller forgets to run the observer first or to observe all fused partners (a documented pre-condition on Lines 78–80), the failure mode under optimized mode would silently fall through into calculate_qparams with missing attributes (AttributeError) or an absent fused contribution (incorrect global_scale), rather than a clear error.

🛡️ Suggested fix
-        assert self.has_statistics, "No statistics available. Call observer(value) first."
+        if not self.has_statistics:
+            raise RuntimeError(
+                "No statistics available. Call observer(value) first."
+            )

         global_scale = None
         if self.args.strategy == QuantizationStrategy.TENSOR_GROUP:
             global_absmax = torch.max(-self.min_vals.min(), self.max_vals.max())
             for obs in self._fused_observers:
-                assert obs.has_statistics, "All fused observers must be run before get_qparams."
+                if not obs.has_statistics:
+                    raise RuntimeError(
+                        "All fused observers must be run before get_qparams."
+                    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/base.py` around lines 84 - 95, Replace the
runtime-only asserts with explicit exceptions so missing preconditions always
fail: in the method that begins with "assert self.has_statistics" (the
get_qparams / calculate_qparams block), check if not self.has_statistics and
raise a clear exception (e.g., RuntimeError or ValueError) with a message
instructing to call observer(value) first; similarly, inside the loop over
self._fused_observers replace "assert obs.has_statistics" with a conditional
that raises a descriptive exception referencing the fused observer (e.g.,
RuntimeError("Fused observer X has no statistics; run observer(...) on all fused
partners first")), so failures are deterministic even under PYTHONOPTIMIZE and
before calling generate_gparam.

Comment on lines +167 to +195
def fuse_weight_observers(model: Module):
    """
    Link weight observers across fused layer groups for shared global_scale.

    For TENSOR_GROUP quantization (e.g. NVFP4), vLLM requires that fused
    layers (Q/K/V attention, gate/up MLP) share the same global_scale.
    This function links their observers so that get_qparams() computes
    global_scale from the combined statistics of all observers in the group.

    :param model: model whose weight observers should be linked
    """
    from llmcompressor.observers import Observer

    for submodule in model.modules():
        for layers_to_fuse in FUSED_LAYER_NAMES:
            if not all(hasattr(submodule, name) for name in layers_to_fuse):
                continue

            layers = [getattr(submodule, name) for name in layers_to_fuse]
            observers = []
            for layer in layers:
                obs = getattr(layer, "weight_observer", None)
                if obs is None:
                    break
                if obs.args.strategy != QuantizationStrategy.TENSOR_GROUP:
                    break
                observers.append(obs)
            else:
                Observer.fuse(observers)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Confirm fuse_weight_observers is invoked only once and Observer.fuse has no dedup.
rg -nP --type=py -C3 'fuse_weight_observers\s*\('
ast-grep --pattern $'def fuse(observers: $_) -> None:
    $$$
'

Repository: vllm-project/llm-compressor

Length of output: 4311


🏁 Script executed:

#!/bin/bash
# Check FUSED_LAYER_NAMES definition
rg -nP --type=py 'FUSED_LAYER_NAMES\s*=' -A 5

# Check if _fused_observers is cleared anywhere
rg -nP --type=py '_fused_observers' -B 2 -A 2

# Check start_calibration and end_calibration flow
rg -nP --type=py 'def (start_calibration|end_calibration)' -A 10 | head -60

Repository: vllm-project/llm-compressor

Length of output: 6345


🏁 Script executed:

#!/bin/bash
# Check how many times calibration can be started in a typical workflow
rg -nP --type=py 'start_calibration' -B 3 -A 3 | head -80

Repository: vllm-project/llm-compressor

Length of output: 7769


🏁 Script executed:

#!/bin/bash
# Get complete FUSED_LAYER_NAMES definition
cat -n src/llmcompressor/observers/helpers.py | sed -n '155,165p'

# Check freeze_module_quantization implementation
rg -nP --type=py 'def freeze_module_quantization' -A 15

Repository: vllm-project/llm-compressor

Length of output: 1955


Heuristic matching of FUSED_LAYER_NAMES has low false-positive risk; duplication concern only applies to abnormal re-calibration.

The pattern names in FUSED_LAYER_NAMES are specific to known transformer architectures (standard MLPs, attention, DeepSeek variants, MoE experts) rather than generic. False positives are unlikely, though a layer type check (e.g., isinstance(layer, torch.nn.Linear)) would provide additional safety.

The duplication concern is valid but edge-case: Observer.fuse appends to _fused_observers without de-duplication, and _fused_observers is never cleared. However, normal calibration flow calls freeze_module_quantization (which deletes observers) before a new calibration can attach fresh observers. Duplication only occurs if start_calibration is called multiple times on the same model without end_calibration in between, which is not the typical usage pattern.
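
A sketch of the optional guards described above, dropped into the existing loop in fuse_weight_observers (the duplicate check relies on the private _fused_observers list, so treat it as illustrative rather than a final API):

            for layer in layers:
                if not isinstance(layer, torch.nn.Linear):
                    break  # only fuse observers attached to Linear-like layers
                obs = getattr(layer, "weight_observer", None)
                if obs is None or obs.args.strategy != QuantizationStrategy.TENSOR_GROUP:
                    break
                observers.append(obs)
            else:
                # skip groups whose observers are already linked to each other
                already_fused = any(o in observers[0]._fused_observers for o in observers[1:])
                if not already_fused:
                    Observer.fuse(observers)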

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/helpers.py` around lines 167 - 195,
fuse_weight_observers can mistakenly match non-linear attributes; add a concrete
layer-type guard (e.g., ensure each layer in layers is an instance of
torch.nn.Linear or the expected Linear subclass) before reading
layer.weight_observer in fuse_weight_observers, and then prevent duplicate
fusion by checking Observer._fused_observers (or a provided API on Observer) to
skip fusing the same observer objects twice—ensure you reference
FUSED_LAYER_NAMES, fuse_weight_observers, weight_observer, and Observer.fuse
when locating where to add the isinstance check and the de-duplication guard;
optionally also consider clearing or resetting Observer._fused_observers at the
start/end of calibration (freeze_module_quantization / start_calibration /
end_calibration) to avoid accumulation across repeated calibrations.

Comment on lines 103 to +105
    with contextlib.ExitStack() as stack:
        stack.enter_context(calibration_forward_context(model))
        # Optionally disable quantization
        if not dataset_args.quantization_aware_calibration or disable_qac:
            stack.enter_context(DisableQuantization(model))

        stack.enter_context(DisableQuantization(model))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find all remaining references to quantization_aware_calibration and DISABLE_QAC_MODIFIERS
rg -nP '\bquantization_aware_calibration\b' -C2
rg -nP '\bDISABLE_QAC_MODIFIERS\b' -C2

Repository: vllm-project/llm-compressor

Length of output: 2289


🏁 Script executed:

# Check actual usage of quantization_aware_calibration in oneshot.py - where is it passed/used?
rg -nP 'quantization_aware_calibration' -A 5 -B 5 src/llmcompressor/entrypoints/oneshot.py

Repository: vllm-project/llm-compressor

Length of output: 1340


🏁 Script executed:

# Check the sequential pipeline context around line 103-105 to understand the previous state
head -120 src/llmcompressor/pipelines/sequential/pipeline.py | tail -30

Repository: vllm-project/llm-compressor

Length of output: 1297


🏁 Script executed:

# Check if dataset_args is being created with quantization_aware_calibration and passed anywhere
rg -nP 'DatasetArguments|dataset_args' src/llmcompressor/entrypoints/oneshot.py | head -20

Repository: vllm-project/llm-compressor

Length of output: 830


🏁 Script executed:

# Search for where DisableQuantization is used to understand its role
rg -nP 'DisableQuantization' -B 3 -A 3

Repository: vllm-project/llm-compressor

Length of output: 7664


🏁 Script executed:

# Search for actual access/reading of the quantization_aware_calibration field
rg -nP 'dataset_args\.quantization_aware_calibration' -C 3

Repository: vllm-project/llm-compressor

Length of output: 53


🏁 Script executed:

# Check the parse_args function to see how kwargs are converted to dataset_args
rg -nP 'def parse_args' -A 30

Repository: vllm-project/llm-compressor

Length of output: 14451


🏁 Script executed:

# Search for any conditional logic checking quantization_aware_calibration
rg -nP 'if.*quantization_aware_calibration|quantization_aware_calibration.*if' -C 3

Repository: vllm-project/llm-compressor

Length of output: 53


🏁 Script executed:

# Check git history for changes to the sequential pipeline file
git log --oneline -20 src/llmcompressor/pipelines/sequential/pipeline.py

Repository: vllm-project/llm-compressor

Length of output: 144


🏁 Script executed:

# Check if there's recent changes showing what was removed from line 103-105 area
git diff HEAD~5 src/llmcompressor/pipelines/sequential/pipeline.py | head -100

Repository: vllm-project/llm-compressor

Length of output: 258


🏁 Script executed:

# Look at the full context around the calibration code in sequential pipeline
sed -n '90,120p' src/llmcompressor/pipelines/sequential/pipeline.py

Repository: vllm-project/llm-compressor

Length of output: 1334


🏁 Script executed:

# Check the DisableQuantization implementation to understand what it does
sed -n '78,100p' src/llmcompressor/utils/helpers.py

Repository: vllm-project/llm-compressor

Length of output: 637


🏁 Script executed:

# Look at the specific commit that modified this
git show 5076faa3 --stat | head -30

Repository: vllm-project/llm-compressor

Length of output: 1661


🏁 Script executed:

# Get more detail on what changed in that commit
git show 5076faa3 | grep -A 30 -B 5 "sequential/pipeline.py"

Repository: vllm-project/llm-compressor

Length of output: 4211


🏁 Script executed:

# Check if quantization_aware_calibration is still used anywhere by searching more broadly
rg -nP 'quantization.*calibration|calibration.*quantization' --no-heading | grep -v "quantization_aware_calibration: bool"

Repository: vllm-project/llm-compressor

Length of output: 9702


🏁 Script executed:

# Check the full docstring for quantization_aware_calibration to understand intended behavior
sed -n '387,400p' src/llmcompressor/entrypoints/oneshot.py

Repository: vllm-project/llm-compressor

Length of output: 1051


🏁 Script executed:

# Look at the DatasetArguments metadata for this field
sed -n '244,252p' src/llmcompressor/args/dataset_arguments.py

Repository: vllm-project/llm-compressor

Length of output: 480


quantization_aware_calibration parameter has no effect—quantization is always disabled during calibration.

The quantization_aware_calibration parameter is documented in oneshot() (lines 392–395) and defined in DatasetArguments (lines 244–251) with clear intended behavior: when True, quantization should be applied during calibration; when False, it should be disabled. However, this parameter is never actually read anywhere in the codebase. The sequential pipeline unconditionally calls DisableQuantization(model) regardless of the parameter value, effectively hardcoding the behavior to always disable quantization.

Users who set quantization_aware_calibration=True expecting quantization during calibration will see no effect. This is a silent API regression: the parameter appears functional but has zero impact.

Either:

  1. Remove the parameter with a clear deprecation message, or
  2. Implement the conditional logic to actually respect this parameter's value (a sketch follows below).
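
A minimal sketch of option 2, assuming dataset_args (or an equivalent flag) is still reachable at this point in the sequential pipeline:

    with contextlib.ExitStack() as stack:
        stack.enter_context(calibration_forward_context(model))
        # only disable quantization when quantization-aware calibration is off
        if not dataset_args.quantization_aware_calibration:
            stack.enter_context(DisableQuantization(model))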
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/pipelines/sequential/pipeline.py` around lines 103 - 105,
The sequential pipeline unconditionally disables quantization during calibration
by always entering DisableQuantization(model) inside the calibration context, so
the DatasetArguments/oneshot() parameter quantization_aware_calibration is
ignored; update the pipeline.py logic to respect that parameter (or deprecate
it). Specifically, locate the calibration block that uses
calibration_forward_context(model) and DisableQuantization(model) and change it
so that it checks the quantization_aware_calibration flag passed into the
pipeline/oneshot call (or via DatasetArguments) and only enters
DisableQuantization when quantization_aware_calibration is False; if you choose
to remove the parameter instead, add a deprecation notice in oneshot() and
DatasetArguments and delete usages/readers of quantization_aware_calibration to
avoid confusion.

Comment thread tests/llmcompressor/modifiers/calibration/test_lifecycle.py Outdated
- Remove absmax as stored attribute, compute inline from min_vals/max_vals
- Simplify observer API: remove module param, is_memoryless, already_observed
- Weight observers get empty _sync_dict (no DDP sync needed)
- IMatrixMSEObserver stores raw _imatrix_sum/_imatrix_count with DDP sync
- Separate observer sync from qparam update (sync_obs_act_stats)
- Move dynamic-quant skip logic into update_qparams, remove flags
- Disable quantization in basic pipeline to align with sequential
- Fuse weight observers for shared global_scale

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented Apr 25, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/llmcompressor/modifiers/gptq/gptq_quantize.py (1)

75-84: ⚠️ Potential issue | 🟡 Minor

Stale docstring: return type no longer matches.

The function now returns (loss, q_param_dict) (line 279), but the docstring still claims :return: loss, quantized_weight, scale, zero_point, g_idx and the type annotation is tuple[float, torch.Tensor, torch.Tensor, torch.Tensor | None, torch.Tensor].

🛠️ Proposed fix
 def quantize_weight(
     module: torch.nn.Module,
     quant_args: QuantizationArgs,
     hessian: torch.Tensor,
     blocksize: int = 128,
     percdamp: float = 0.01,
-) -> tuple[float, torch.Tensor, torch.Tensor, torch.Tensor | None, torch.Tensor]:
+) -> tuple[float, dict[str, torch.Tensor]]:
     """
     Quantize a module weight according to the GPTQ algorithm

     :param module: module with weight being quantized
     :param quant_args: quantization arguments used to find quantization parameters
     :param hessian_dict: dictionary containing preaccumulated hessian for quantization
     :param blocksize: chunk size of quantization updates
     :param percdamp: dampening factor on hessian diagonal
-    :return: loss, quantized_weight, scale, zero_point, g_idx
+    :return: tuple of (loss, q_param_dict) where q_param_dict contains
+        "weight", "weight_scale", "weight_zero_point", and optionally "weight_g_idx"
     """
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/modifiers/gptq/gptq_quantize.py` around lines 75 - 84, The
docstring and type annotation for gptq_quantize are stale: the function now
returns (loss, q_param_dict) rather than (loss, quantized_weight, scale,
zero_point, g_idx); update the docstring return description to state it returns
a float loss and a dict of quantization parameters (q_param_dict) and adjust the
function's type annotation from tuple[float, torch.Tensor, ...] to tuple[float,
dict] (or a more specific Mapping type if preferred), and mention the
contents/keys of q_param_dict (e.g., quantized_weight, scale, zero_point, g_idx)
so callers know what to expect.
♻️ Duplicate comments (3)
tests/llmcompressor/modifiers/utils/test_update_fused_layer_weights.py (1)

203-216: ⚠️ Potential issue | 🔴 Critical

weight_observer() on Lines 205 and 215 will raise TypeError — still unfixed from earlier review.

Observer.forward(observed: torch.Tensor) requires the observed tensor as a positional argument. Lines 205 and 215 call weight_observer() with no args, which will raise TypeError at runtime. Lines 97, 147, and 195 correctly pass layer.weight. This test will fail as written.

🐛 Proposed fix
     # Change q_proj weight to have much larger magnitude
     with torch.no_grad():
         module.q_proj.weight *= 10.0
-    module.q_proj.weight_observer()
+    module.q_proj.weight_observer(module.q_proj.weight)
     update_qparams(module.q_proj, base_name="weight")

     # global_scale should have changed (larger weight = smaller global_scale)
     new_gs = module.q_proj.weight_global_scale.data
     assert not torch.allclose(
         initial_gs, new_gs
     ), "global_scale should change after weight magnitude change"

     # Recompute k_proj — should see q_proj's new statistics via fusion
-    module.k_proj.weight_observer()
+    module.k_proj.weight_observer(module.k_proj.weight)
     update_qparams(module.k_proj, base_name="weight")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/llmcompressor/modifiers/utils/test_update_fused_layer_weights.py`
around lines 203 - 216, The test calls weight_observer() with no arguments which
raises a TypeError because Observer.forward expects the observed tensor; update
the two offending calls (the weight_observer invocations on module.q_proj and
module.k_proj) to pass the layer weight (module.q_proj.weight and
module.k_proj.weight respectively) so they match the other correct calls (e.g.,
the calls at lines where weight_observer(layer.weight) is used) before calling
update_qparams.
src/llmcompressor/modifiers/quantization/calibration.py (1)

87-107: ⚠️ Potential issue | 🟡 Minor

isinstance(module, Iterable) dispatches incorrectly for container nn.Modules.

nn.Sequential, nn.ModuleList, and nn.ModuleDict all implement __iter__ and will match isinstance(x, Iterable). If any caller passes one of these, observe will iterate over its children rather than treat the container as the target. The current callers in this PR pass either a Linear or a list/values view, so the bug is latent, but the dispatch is fragile. Recommend narrowing to (list, tuple) or collections.abc.Sequence, or prefer Module first.

♻️ Proposed fix
-    if isinstance(module, Iterable):
+    if not isinstance(module, Module) and isinstance(module, Iterable):
         for m in module:
             observe(m, base_name)
         return

Apply the same change to update_qparams on Lines 124-127.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/modifiers/quantization/calibration.py` around lines 87 -
107, The observe function incorrectly treats any Iterable (e.g., nn.Sequential,
nn.ModuleList/Dict) as a container to iterate, causing module containers to be
iterated instead of handled as Modules; change the dispatch to first check if
module is an instance of torch.nn.Module and handle it directly, and only treat
non-Module iterables as sequences (e.g., list/tuple or collections.abc.Sequence)
to recurse; apply the same fix to update_qparams so both functions use the same
Module-first logic and avoid mis-dispatching container Modules like
nn.Sequential, nn.ModuleList, or nn.ModuleDict.
src/llmcompressor/observers/base.py (1)

83-103: ⚠️ Potential issue | 🟡 Minor

CI Quality Checks fails on E501 (line too long) at lines 83 and 89.

Both assert lines exceed the 88-char limit. Separately, reiterating prior feedback: these assert statements become no-ops under python -O, so a caller forgetting to run the observer before get_qparams() will see a confusing AttributeError on self.min_vals instead of a clear error. Converting to explicit raise RuntimeError(...) fixes both concerns at once.

✏️ Proposed fix
-        assert self.has_statistics, "No statistics available. Call observer(value) first."
+        if not self.has_statistics:
+            raise RuntimeError(
+                "No statistics available. Call observer(value) first."
+            )
 
         global_scale = None
         if self.args.strategy == QuantizationStrategy.TENSOR_GROUP:
             global_absmax = torch.max(-self.min_vals.min(), self.max_vals.max())
             for obs in self._fused_observers:
-                assert obs.has_statistics, "All fused observers must be run before get_qparams."
+                if not obs.has_statistics:
+                    raise RuntimeError(
+                        "All fused observers must be run before get_qparams."
+                    )
                 global_absmax = torch.max(global_absmax, -obs.min_vals.min())
                 global_absmax = torch.max(global_absmax, obs.max_vals.max())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/base.py` around lines 83 - 103, Replace the two
assert statements in get_qparams with explicit RuntimeError raises to avoid E501
and the python -O no-op problem: check self.has_statistics and if false raise
RuntimeError("No statistics available. Call observer(value) before
get_qparams()") and similarly loop over self._fused_observers and for any obs
where not obs.has_statistics raise RuntimeError("All fused observers must be run
before get_qparams(). Missing statistics on fused observer"). Keep the rest of
get_qparams (global_scale generation, calculate_qparams call, and returned dict)
unchanged.
🧹 Nitpick comments (13)
tests/llmcompressor/observers/test_distributed.py (1)

29-30: Consider asserting the specific observer error type rather than AssertionError.

Matching on AssertionError with message "No statistics available" couples the test to a raw assert in the implementation. If the refactor tightens this into a dedicated exception (RuntimeError, ValueError, or a custom observer error), the test will break even though the behavior is still correct. If the contract is "calling get_qparams() before any observation is an error", consider relaxing to Exception/a dedicated type, or at minimum centralize the expected message to a constant so it's easy to update in lockstep with the implementation. Same applies on Line 111.
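
One way that could look if the implementation switches to a dedicated exception and a shared message constant (RuntimeError and NO_STATS_MSG are assumptions, not current code):

# e.g. exported from the observers package
NO_STATS_MSG = "No statistics available"

# in the test
with pytest.raises(RuntimeError, match=NO_STATS_MSG):
    observer.get_qparams()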

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/llmcompressor/observers/test_distributed.py` around lines 29 - 30, The
test currently asserts a raw AssertionError when calling observer.get_qparams()
which couples the test to an implementation assert; change the pytest.raises to
expect a stable, specific error type (e.g., ObserverError or
ValueError/RuntimeError) instead of AssertionError for observer.get_qparams()
(and the similar check later in the file), and centralize the expected message
into a shared constant (e.g., NO_STATS_MSG) used by both the implementation and
the test so message changes stay synchronized.
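A minimal sketch of the decoupled test, assuming the implementation adopts the explicit RuntimeError proposed above; NO_STATS_MSG is a hypothetical shared constant and `observer` is a fixture providing a fresh, un-run observer:

import pytest

# Hypothetical shared constant; in practice it would live next to the Observer
# implementation and be imported by both the code and this test.
NO_STATS_MSG = "No statistics available"


def test_get_qparams_before_observe_raises(observer):
    # get_qparams() before any observation should fail loudly
    with pytest.raises(RuntimeError, match=NO_STATS_MSG):
        observer.get_qparams()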
src/llmcompressor/observers/helpers.py (2)

178-178: Late import to avoid circular dependency.

The from llmcompressor.observers import Observer import inside the function avoids a circular import with base.py. This is fine but worth a one-line comment explaining why, since the pattern is unusual.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/helpers.py` at line 178, The inline import "from
llmcompressor.observers import Observer" is placed inside the function to avoid
a circular dependency with base.py; add a concise one-line comment immediately
above that import explaining this rationale (e.g., "Late import to avoid
circular dependency with base.py") so future readers understand why the unusual
import location is intentional and should not be moved; ensure the comment is
adjacent to the import statement referencing Observer.
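For instance, the annotated import inside the function could read (sketch only; comment text illustrative):

    # Late import to avoid a circular dependency with base.py
    from llmcompressor.observers import Observer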

155-195: FUSED_LAYER_NAMES is a static list — consider making it configurable.

Hardcoding fused layer names limits applicability to architectures with matching attribute names. For less common architectures (e.g., future MoE variants, custom transformer blocks), users would need to modify this module directly. Consider exposing this as a user-configurable parameter (e.g., via quantization config) so the mechanism extends naturally.

Additionally, the early-break on non-TENSOR_GROUP (Line 192) will silently skip fusion if any one layer in the group is not TENSOR_GROUP. This is likely correct but worth documenting, since a mixed-strategy group would silently fall through.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/helpers.py` around lines 155 - 195,
FUSED_LAYER_NAMES is hardcoded and fuse_weight_observers silently skips groups
if any member isn't using QuantizationStrategy.TENSOR_GROUP; make
FUSED_LAYER_NAMES configurable by adding an injectable/optional parameter or
reading from a quantization configuration object (e.g., pass a fused_layer_names
list into fuse_weight_observers or read from a QuantizationConfig used by your
quant pipeline) and use that instead of the module-level constant, and
explicitly document and/or log the current behavior in fuse_weight_observers
when a group is skipped due to a non-TENSOR_GROUP strategy; ensure you still
check each layer's weight_observer and obs.args.strategy ==
QuantizationStrategy.TENSOR_GROUP before calling Observer.fuse(observers) so
behavior remains correct while making the names configurable and visible to
users.
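One possible shape for the configurable variant, assuming fuse_weight_observers receives the parent module and looks up sibling attributes; the parameter name and the default layer-name groups below are illustrative, not the actual contents of FUSED_LAYER_NAMES:

from typing import Optional, Sequence

from compressed_tensors.quantization import QuantizationStrategy

from llmcompressor.observers import Observer

# overridable per call; entries are placeholders, not the real constant
DEFAULT_FUSED_LAYER_NAMES = [
    ["q_proj", "k_proj", "v_proj"],
    ["gate_proj", "up_proj"],
]


def fuse_weight_observers(
    parent_module,
    fused_layer_names: Optional[Sequence[Sequence[str]]] = None,
) -> None:
    fused_layer_names = fused_layer_names or DEFAULT_FUSED_LAYER_NAMES
    for group in fused_layer_names:
        layers = [getattr(parent_module, name, None) for name in group]
        observers = [
            layer.weight_observer
            for layer in layers
            if layer is not None and hasattr(layer, "weight_observer")
        ]
        if not observers:
            continue
        if any(o.args.strategy != QuantizationStrategy.TENSOR_GROUP for o in observers):
            # make the skip visible instead of silently falling through
            continue
        Observer.fuse(observers)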
src/llmcompressor/entrypoints/model_free/process.py (1)

205-213: Possible issue when fused members share a single parent module.

Line 208 collects weight_observer from each fused member; if the fused members were expected to sit at the same sibling level under a common parent (as fuse_weight_observers assumes in helpers.py), consider whether Observer.fuse here can be replaced by fuse_weight_observers(parent_module) for consistency across the codebase. The current approach works, but having two separate fusion paths (helpers-based for in-memory model flow, direct Observer.fuse call here for the model-free flow) increases the chance of these diverging over time.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/entrypoints/model_free/process.py` around lines 205 - 213,
The current loop over fused_modules calls Observer.fuse on a list of
weight_observer objects (Observer.fuse([mod.weight_observer for mod in
named_modules.values()])) which diverges from the helpers.py helper
fuse_weight_observers that expects siblings under a parent; replace the direct
Observer.fuse call with a call to fuse_weight_observers(parent_module) when the
fused members share a common parent (identify the parent of the entries in
named_modules and call helpers.fuse_weight_observers(parent)), keeping
observe(named_modules.values(), base_name="weight"), update_qparams(...), and
freeze_module_quantization(module) unchanged so the model-free flow uses the
same fusion path as the in-memory flow.
src/llmcompressor/modifiers/transform/awq/base.py (2)

705-706: Grid search re-observes per iteration — ensure observer is memoryless.

Inside the grid loop, observe(balance_layer, base_name="weight") is invoked for every (ratio, duo_scaling) combination on the same weight tensor (scaled differently each iteration). The patched observer is memoryless_minmax (Line 667-672), which correctly resets statistics each call, so qparams reflect the current scaled weight. Worth adding a brief comment explaining why the memoryless observer is required here (accumulating observers would leak stats across iterations and break grid search).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/modifiers/transform/awq/base.py` around lines 705 - 706,
The grid search loop calls observe(balance_layer, base_name="weight") repeatedly
for the same (rescaled) weight tensor, so ensure the observer used is memoryless
(the existing memoryless_minmax implementation) and add a concise in-line
comment near observe(balance_layer, base_name="weight") /
update_qparams(balance_layer, base_name="weight") explaining that
memoryless_minmax resets stats per call to avoid leaking accumulated statistics
across iterations; verify the observer assignment (memoryless_minmax at/around
where observers are patched) remains unchanged and document that accumulating
observers would break the grid search results.
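The added comment could be as short as (sketch, placed next to the existing calls in the grid loop):

# memoryless_minmax resets its statistics on every call, so each grid-search
# iteration's qparams reflect only the currently scaled weight; an accumulating
# observer would leak statistics across iterations and skew the search.
observe(balance_layer, base_name="weight")
update_qparams(balance_layer, base_name="weight")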

268-274: Possible duplicate observation for layers shared across mappings.

smoothed_layers is built by iterating every resolved mapping and appending each balance layer plus smooth layer. If the same layer appears in multiple mappings (e.g. a shared norm across several balance groups), observe and update_qparams will be invoked on the same observer multiple times. For the weight observers used here this is effectively idempotent (weight observers are memoryless per PR refactor), but it's wasted work and could become incorrect if stateful observers are swapped in. Also consider using unpacking per Ruff's RUF005 hint on Line 270.

♻️ Proposed refactor
-        smoothed_layers = []
+        smoothed_layers = []
+        seen: set[int] = set()
         for mapping in self._resolved_mappings:
-            for layer in mapping.balance_layers + [mapping.smooth_layer]:
-                if hasattr(layer, "weight_observer"):
-                    smoothed_layers.append(layer)
+            for layer in [*mapping.balance_layers, mapping.smooth_layer]:
+                if hasattr(layer, "weight_observer") and id(layer) not in seen:
+                    smoothed_layers.append(layer)
+                    seen.add(id(layer))
         observe(smoothed_layers, base_name="weight")
         update_qparams(smoothed_layers, base_name="weight")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/modifiers/transform/awq/base.py` around lines 268 - 274,
The loop that builds smoothed_layers can append the same layer multiple times
when a layer is shared across mappings (from self._resolved_mappings), causing
duplicate calls to observe(...) and update_qparams(...); change the collection
to deduplicate by identity (e.g., track seen ids) while iterating
mapping.balance_layers and mapping.smooth_layer (use unpacking like
(*mapping.balance_layers, mapping.smooth_layer) for clarity) so you only call
observe(smoothed_layers, ...) and update_qparams(smoothed_layers, ...) with
unique layers that have hasattr(layer, "weight_observer"); reference
smoothed_layers, self._resolved_mappings, mapping.balance_layers,
mapping.smooth_layer, observe, update_qparams, and weight_observer when making
the change.
src/llmcompressor/observers/mse.py (1)

41-43: Missing blank line between class definitions.

PEP 8 expects two blank lines between top-level class definitions. There is currently one blank line between the end of MemorylessMSEObserver.update_statistics and the @Observer.register("mse") decorator.

✏️ Proposed fix
             self.grid,
             self.norm,
         )
+
 
 @Observer.register("mse")
 class MovingAverageMSEObserver(MovingAverageObserverBase):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/mse.py` around lines 41 - 43, Add a blank line so
there are two blank lines separating the end of the MemorylessMSEObserver class
(specifically after MemorylessMSEObserver.update_statistics) and the next
top-level decorator/@Observer.register("mse"); ensure the file follows PEP8 by
having two blank lines between these top-level class definitions and decorators.
src/llmcompressor/observers/min_max.py (2)

42-58: Missing blank lines between class definitions (PEP 8).

Two blank lines should separate top-level class definitions. Currently StaticMinMaxObserver and MinMaxObserver (around line 42), and MinMaxObserver and _get_min_max (around line 58), are separated by only one.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/min_max.py` around lines 42 - 58, Add the missing
blank line separation to satisfy PEP8: ensure there are two blank lines before
the class MinMaxObserver and two blank lines after it (so there's a full blank
line gap between StaticMinMaxObserver and MinMaxObserver, and between
MinMaxObserver and the following function _get_min_max). Locate the top-level
class definitions StaticMinMaxObserver and MinMaxObserver and insert the extra
blank line(s) so each top-level class is separated by two blank lines from
adjacent top-level definitions.

23-23: StaticMinMaxObserver inheriting from MemorylessMinMaxObserver is semantically off.

"Static" here accumulates min/max across all observations (line 37-38 uses running torch.min/torch.max), whereas "Memoryless" overwrites on every call (line 19). The inheritance relationship is therefore misleading — Static is not a kind of Memoryless. Consider making both direct subclasses of Observer, factoring shared parts (there are effectively none at this point) or leaving them independent. Minor cleanup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/min_max.py` at line 23, The class
StaticMinMaxObserver should not inherit from MemorylessMinMaxObserver because
its behavior accumulates min/max across calls; change StaticMinMaxObserver to
directly subclass Observer instead, and ensure MemorylessMinMaxObserver also
directly subclasses Observer (or both explicitly subclass Observer) so the
hierarchy reflects behavior; update the class definitions for
StaticMinMaxObserver and MemorylessMinMaxObserver to inherit from Observer and
remove the misleading inheritance link, keeping existing methods (e.g., those
using torch.min/torch.max and the memoryless overwrite logic) unchanged.
src/llmcompressor/observers/base.py (2)

14-14: __all__ not sorted (Ruff RUF022).

Minor: apply isort-style ordering if you want a clean Ruff report.

✏️ Proposed fix
-__all__ = ["Observer", "MinMaxTuple", "QParamsDict"]
+__all__ = ["MinMaxTuple", "Observer", "QParamsDict"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/base.py` at line 14, The module-level __all__
list is not alphabetically sorted; update it to an isort-style ordering so Ruff
RUF022 is satisfied by replacing "__all__ = [\"Observer\", \"MinMaxTuple\",
\"QParamsDict\"]" with the entries sorted alphabetically (e.g., "__all__ =
[\"MinMaxTuple\", \"Observer\", \"QParamsDict\"]"), ensuring you reference the
same exported symbols (Observer, MinMaxTuple, QParamsDict) and preserve exact
spelling/casing.

19-24: global_scale: Optional[torch.Tensor] in a total=False TypedDict is redundant.

With total=False, every key is already optional (may be absent). Wrapping the value type in Optional[...] additionally allows the key to be present with value None, which has different semantics. Given the returned dict at line 103 always sets global_scale (to None for non-TENSOR_GROUP), the current typing is consistent, but consumers have to disambiguate "missing key" vs. "None value". If all callers expect the key to always be present, drop total=False; otherwise drop Optional.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/base.py` around lines 19 - 24, The TypedDict
QParamsDict currently uses total=False while also typing global_scale as
Optional[torch.Tensor], which conflates "missing key" vs "key present with
None"; since the factory at line 103 always sets the global_scale key (sometimes
to None), make the key required by removing total=False from the TypedDict
declaration (i.e., change class QParamsDict(TypedDict, total=False) to class
QParamsDict(TypedDict)) so global_scale stays Optional[torch.Tensor] and callers
no longer need to handle a missing key; update any type assumptions accordingly.
src/llmcompressor/observers/imatrix.py (1)

39-40: Stray extra blank line.

Minor formatting: the two blank lines between _act_sync_dict and __init__ are inconsistent with the rest of the file.

✏️ Proposed fix
     _act_sync_dict = {
         "_imatrix_sum": dist.ReduceOp.SUM,
         "_imatrix_count": dist.ReduceOp.SUM,
     }
-
-
+
     def __init__(self, *args, **kwargs):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/imatrix.py` around lines 39 - 40, Remove the
extra blank line between the class attribute _act_sync_dict and the constructor
__init__ in imatrix.py so spacing matches the file's style; locate the class
where _act_sync_dict is defined and ensure there's only a single blank line
before the def __init__(...) declaration.
src/llmcompressor/modifiers/quantization/quantization/mixin.py (1)

270-302: Consider consolidating the two match_named_modules passes.

update_activation_qparams and sync_obs_act_stats are frequently invoked back-to-back (see the call sequence in base.py:87-94), and each re-iterates all modules. For large models, a single pass that both launches the async all-reduces and then (after wait_for_comms) computes qparams would avoid the duplicate traversal. Defer if not a hot path in practice.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/modifiers/quantization/quantization/mixin.py` around lines
270 - 302, Both update_activation_qparams and sync_obs_act_stats iterate
match_named_modules separately causing duplicate traversals; refactor by
consolidating into a single pass that first collects observers and starts their
async syncs, calls wait_for_comms, then computes qparams for each module.
Implement a new method (e.g., sync_and_update_activation_qparams) or modify
sync_obs_act_stats to return the list of modules/observers it touched so
update_activation_qparams can operate on that list; use the existing symbols
match_named_modules, observer.sync_activation_stats(), wait_for_comms, and
update_qparams(module, base_name) to: 1) in one loop over match_named_modules
collect observers for base_names ("weight","input","output","q","k","v") and
extend pending_comms with their syncs, 2) call wait_for_comms(pending_comms),
then 3) in a second step over the same collected modules call update_qparams for
base_names ("input","output","q","k","v"); replace callers (e.g., base.py
sequence) to use the consolidated method.
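A rough sketch of the consolidated pass, reusing the helpers the mixin already imports (match_named_modules, wait_for_comms, update_qparams); whether weight observers should be excluded from the qparam step is a simplification here:

def sync_and_update_activation_qparams(self, model):
    # single traversal: launch async syncs, wait once, then compute qparams
    pending_comms, touched = [], []
    for _, module in match_named_modules(model, self.resolved_targets, self.ignore):
        for base_name in ("weight", "input", "output", "q", "k", "v"):
            observer = getattr(module, f"{base_name}_observer", None)
            if observer is None:
                continue
            pending_comms.extend(observer.sync_activation_stats())
            touched.append((module, base_name))
    wait_for_comms(pending_comms)
    for module, base_name in touched:
        if base_name != "weight":
            update_qparams(module, base_name=base_name)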
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/llmcompressor/entrypoints/model_free/process.py`:
- Around line 206-213: After freeze_module_quantization(module) deletes a
module's weight_observer, other fused observers still hold stale references in
their _fused_observers list (created by Observer.fuse), which can lead to
dangling references if qparams are ever recomputed; fix by ensuring observers
are mutually cleaned up: update freeze_module_quantization (or Observer.detach)
to also remove the removed observer from any peers' _fused_observers lists (or
clear the peer lists when fully frozen), so that after calling
freeze_module_quantization on modules in fused_modules the
Observer._fused_observers no longer contains references to deleted
weight_observer instances and future calls to update_qparams/observe cannot see
stale observers.

In `@src/llmcompressor/modifiers/quantization/quantization/mixin.py`:
- Line 443: Fix the typo in the comment inside quantization/mixin.py by
replacing "amd" with "and" and update the referenced function name from
`observer` to `observe` to match the existing API; ensure the comment now reads
something like "weight observers (used by child modifier or `observe` and
`update_qparams`)" so it correctly references the `observe` and `update_qparams`
methods.
- Around line 281-302: Remove the trailing whitespace in the docstring inside
sync_obs_act_stats (the docstring line that reads "Iterates all observers "
contains the stray space); edit the docstring for the sync_obs_act_stats method
to delete that trailing space (and optionally reflow the sentence if needed) so
Ruff W291 no longer triggers while keeping the existing explanation about
weight/input/output/q/k/v observers intact.

In `@src/llmcompressor/observers/base.py`:
- Around line 49-51: The current blanket reset in the weight-observer branch
replaces any subclass-declared _act_sync_dict and silences subclass DDP sync
policies; instead, if you must default for weight observers, only set
self._act_sync_dict = {} when the subclass hasn't declared its own policy (e.g.,
check if '_act_sync_dict' not in self.__class__.__dict__). Update the block
around base_name == "weight" in observers/base.py to avoid overwriting
class-defined _act_sync_dict (or remove the reset entirely) so subclasses like
IMatrixMSEObserver retain their declared accumulators; note weight-only min/max
observers can still explicitly set _act_sync_dict = {} in their class body.
- Around line 129-140: Observer.fuse currently appends peers into each
observer's _fused_observers every call, causing duplicate entries on repeated
fuse_weight_observers(model) or start_calibration runs; make fuse idempotent by
ensuring no duplicates are added: either clear/reset each
observer._fused_observers before populating or check "if other not in
obs._fused_observers" before appending, so Observer.fuse(observers:
Iterable["Observer"]) leaves each _fused_observers list with unique peers and
prevents repeated contributions to compute_qparams_from_statistics.

In `@src/llmcompressor/observers/hist.py`:
- Around line 1-48: HistogramObserver is currently broken and non-conformant:
change forward(self, observed: torch.Tensor) to return self (not ScaleZpTuple or
(None,None)) and ensure ScaleZpTuple is not referenced; fix _add_batch usage of
flatten_for_calibration by calling it with the correct signature (remove the
extra g_idx arg) and replace the erroneous xmin(xmax <= -xmin) expression with
torch.maximum(xmax, -xmin) (or equivalent) so xmin/xmax logic is valid; remove
or use the computed new_max (avoid unused variable) and implement the required
abstract method update_statistics(self, observed: torch.Tensor) to satisfy
Observer, plus restore/implement the histogram accumulation logic (or
gate/remove the file from the PR if you intend to land it later) so the observer
actually updates state instead of being a no-op.

In `@src/llmcompressor/observers/imatrix.py`:
- Around line 115-118: Replace the EN DASH characters in the imatrix.py
docstring lines that read "Case 1 – accumulators..." and "Case 2 – no
accumulators..." with regular hyphen-minus "-" characters to satisfy RUF002;
update the docstring text near the observer's attach() description (the block
mentioning "Case 1" and "Case 2") so both dashes are simple hyphens.
- Around line 35-38: IMatrixMSEObserver's class-level _act_sync_dict is dead
because Observer.__init__ overwrites self._act_sync_dict when base_name ==
"weight"; remove the misleading class attribute or implement explicit cross-rank
reduction. Either delete the _act_sync_dict declaration (remove _imatrix_sum and
_imatrix_count from the class-level map) to avoid confusion, or override
sync_activation_stats() in IMatrixMSEObserver to perform a dist.all_reduce
(using dist.ReduceOp.SUM) on the buffers named _imatrix_sum and _imatrix_count
so they are aggregated across ranks when using DDP; reference
IMatrixMSEObserver, _act_sync_dict, sync_activation_stats, _imatrix_sum,
_imatrix_count, and Observer.__init__ to locate the changes.

In `@src/llmcompressor/pipelines/sequential/pipeline.py`:
- Line 19: The import statement is flagged as unsorted by Ruff I001; reorder the
two imported names so DisableQuantization appears before
calibration_forward_context in the from-import line (i.e., import
DisableQuantization, calibration_forward_context) to satisfy alphabetical import
ordering and reformat the import block accordingly.

In `@tests/llmcompressor/observers/test_distributed.py`:
- Around line 58-69: The test test_static_synchronize_with_global_state
duplicates test_static_synchronize_issues_all_reduce and should either be
removed or updated to exercise the global-state path: modify the test to create
an observer configured with the TENSOR_GROUP / tensor_group strategy (instead of
StaticMinMaxObserver), seed its global statistic (e.g., global_scale or the
observer's tensor-group-specific state), call observer.sync_activation_stats(),
and assert that mock_dist.all_reduce was called for both per-tensor stats and
the global statistic (i.e., call_count increased by one more); ensure you
reference and exercise the sync_activation_stats method and the observer
configured for tensor_group/global_scale in the updated test.

---

Outside diff comments:
In `@src/llmcompressor/modifiers/gptq/gptq_quantize.py`:
- Around line 75-84: The docstring and type annotation for gptq_quantize are
stale: the function now returns (loss, q_param_dict) rather than (loss,
quantized_weight, scale, zero_point, g_idx); update the docstring return
description to state it returns a float loss and a dict of quantization
parameters (q_param_dict) and adjust the function's type annotation from
tuple[float, torch.Tensor, ...] to tuple[float, dict] (or a more specific
Mapping type if preferred), and mention the contents/keys of q_param_dict (e.g.,
quantized_weight, scale, zero_point, g_idx) so callers know what to expect.

---

Duplicate comments:
In `@src/llmcompressor/modifiers/quantization/calibration.py`:
- Around line 87-107: The observe function incorrectly treats any Iterable
(e.g., nn.Sequential, nn.ModuleList/Dict) as a container to iterate, causing
module containers to be iterated instead of handled as Modules; change the
dispatch to first check if module is an instance of torch.nn.Module and handle
it directly, and only treat non-Module iterables as sequences (e.g., list/tuple
or collections.abc.Sequence) to recurse; apply the same fix to update_qparams so
both functions use the same Module-first logic and avoid mis-dispatching
container Modules like nn.Sequential, nn.ModuleList, or nn.ModuleDict.

In `@src/llmcompressor/observers/base.py`:
- Around line 83-103: Replace the two assert statements in get_qparams with
explicit RuntimeError raises to avoid E501 and the python -O no-op problem:
check self.has_statistics and if false raise RuntimeError("No statistics
available. Call observer(value) before get_qparams()") and similarly loop over
self._fused_observers and for any obs where not obs.has_statistics raise
RuntimeError("All fused observers must be run before get_qparams(). Missing
statistics on fused observer"). Keep the rest of get_qparams (global_scale
generation, calculate_qparams call, and returned dict) unchanged.

In `@tests/llmcompressor/modifiers/utils/test_update_fused_layer_weights.py`:
- Around line 203-216: The test calls weight_observer() with no arguments which
raises a TypeError because Observer.forward expects the observed tensor; update
the two offending calls (the weight_observer invocations on module.q_proj and
module.k_proj) to pass the layer weight (module.q_proj.weight and
module.k_proj.weight respectively) so they match the other correct calls (e.g.,
the calls at lines where weight_observer(layer.weight) is used) before calling
update_qparams.

---

Nitpick comments:
In `@src/llmcompressor/entrypoints/model_free/process.py`:
- Around line 205-213: The current loop over fused_modules calls Observer.fuse
on a list of weight_observer objects (Observer.fuse([mod.weight_observer for mod
in named_modules.values()])) which diverges from the helpers.py helper
fuse_weight_observers that expects siblings under a parent; replace the direct
Observer.fuse call with a call to fuse_weight_observers(parent_module) when the
fused members share a common parent (identify the parent of the entries in
named_modules and call helpers.fuse_weight_observers(parent)), keeping
observe(named_modules.values(), base_name="weight"), update_qparams(...), and
freeze_module_quantization(module) unchanged so the model-free flow uses the
same fusion path as the in-memory flow.

In `@src/llmcompressor/modifiers/quantization/quantization/mixin.py`:
- Around line 270-302: Both update_activation_qparams and sync_obs_act_stats
iterate match_named_modules separately causing duplicate traversals; refactor by
consolidating into a single pass that first collects observers and starts their
async syncs, calls wait_for_comms, then computes qparams for each module.
Implement a new method (e.g., sync_and_update_activation_qparams) or modify
sync_obs_act_stats to return the list of modules/observers it touched so
update_activation_qparams can operate on that list; use the existing symbols
match_named_modules, observer.sync_activation_stats(), wait_for_comms, and
update_qparams(module, base_name) to: 1) in one loop over match_named_modules
collect observers for base_names ("weight","input","output","q","k","v") and
extend pending_comms with their syncs, 2) call wait_for_comms(pending_comms),
then 3) in a second step over the same collected modules call update_qparams for
base_names ("input","output","q","k","v"); replace callers (e.g., base.py
sequence) to use the consolidated method.

In `@src/llmcompressor/modifiers/transform/awq/base.py`:
- Around line 705-706: The grid search loop calls observe(balance_layer,
base_name="weight") repeatedly for the same (rescaled) weight tensor, so ensure
the observer used is memoryless (the existing memoryless_minmax implementation)
and add a concise in-line comment near observe(balance_layer,
base_name="weight") / update_qparams(balance_layer, base_name="weight")
explaining that memoryless_minmax resets stats per call to avoid leaking
accumulated statistics across iterations; verify the observer assignment
(memoryless_minmax at/around where observers are patched) remains unchanged and
document that accumulating observers would break the grid search results.
- Around line 268-274: The loop that builds smoothed_layers can append the same
layer multiple times when a layer is shared across mappings (from
self._resolved_mappings), causing duplicate calls to observe(...) and
update_qparams(...); change the collection to deduplicate by identity (e.g.,
track seen ids) while iterating mapping.balance_layers and mapping.smooth_layer
(use unpacking like (*mapping.balance_layers, mapping.smooth_layer) for clarity)
so you only call observe(smoothed_layers, ...) and
update_qparams(smoothed_layers, ...) with unique layers that have hasattr(layer,
"weight_observer"); reference smoothed_layers, self._resolved_mappings,
mapping.balance_layers, mapping.smooth_layer, observe, update_qparams, and
weight_observer when making the change.

In `@src/llmcompressor/observers/base.py`:
- Line 14: The module-level __all__ list is not alphabetically sorted; update it
to an isort-style ordering so Ruff RUF022 is satisfied by replacing "__all__ =
[\"Observer\", \"MinMaxTuple\", \"QParamsDict\"]" with the entries sorted
alphabetically (e.g., "__all__ = [\"MinMaxTuple\", \"Observer\",
\"QParamsDict\"]"), ensuring you reference the same exported symbols (Observer,
MinMaxTuple, QParamsDict) and preserve exact spelling/casing.
- Around line 19-24: The TypedDict QParamsDict currently uses total=False while
also typing global_scale as Optional[torch.Tensor], which conflates "missing
key" vs "key present with None"; since the factory at line 103 always sets the
global_scale key (sometimes to None), make the key required by removing
total=False from the TypedDict declaration (i.e., change class
QParamsDict(TypedDict, total=False) to class QParamsDict(TypedDict)) so
global_scale stays Optional[torch.Tensor] and callers no longer need to handle a
missing key; update any type assumptions accordingly.

In `@src/llmcompressor/observers/helpers.py`:
- Line 178: The inline import "from llmcompressor.observers import Observer" is
placed inside the function to avoid a circular dependency with base.py; add a
concise one-line comment immediately above that import explaining this rationale
(e.g., "Late import to avoid circular dependency with base.py") so future
readers understand why the unusual import location is intentional and should not
be moved; ensure the comment is adjacent to the import statement referencing
Observer.
- Around line 155-195: FUSED_LAYER_NAMES is hardcoded and fuse_weight_observers
silently skips groups if any member isn't using
QuantizationStrategy.TENSOR_GROUP; make FUSED_LAYER_NAMES configurable by adding
an injectable/optional parameter or reading from a quantization configuration
object (e.g., pass a fused_layer_names list into fuse_weight_observers or read
from a QuantizationConfig used by your quant pipeline) and use that instead of
the module-level constant, and explicitly document and/or log the current
behavior in fuse_weight_observers when a group is skipped due to a
non-TENSOR_GROUP strategy; ensure you still check each layer's weight_observer
and obs.args.strategy == QuantizationStrategy.TENSOR_GROUP before calling
Observer.fuse(observers) so behavior remains correct while making the names
configurable and visible to users.

In `@src/llmcompressor/observers/imatrix.py`:
- Around line 39-40: Remove the extra blank line between the class attribute
_act_sync_dict and the constructor __init__ in imatrix.py so spacing matches the
file's style; locate the class where _act_sync_dict is defined and ensure
there's only a single blank line before the def __init__(...) declaration.

In `@src/llmcompressor/observers/min_max.py`:
- Around line 42-58: Add the missing blank line separation to satisfy PEP8:
ensure there are two blank lines before the class MinMaxObserver and two blank
lines after it (so there's a full blank line gap between StaticMinMaxObserver
and MinMaxObserver, and between MinMaxObserver and the following function
_get_min_max). Locate the top-level class definitions StaticMinMaxObserver and
MinMaxObserver and insert the extra blank line(s) so each top-level class is
separated by two blank lines from adjacent top-level definitions.
- Line 23: The class StaticMinMaxObserver should not inherit from
MemorylessMinMaxObserver because its behavior accumulates min/max across calls;
change StaticMinMaxObserver to directly subclass Observer instead, and ensure
MemorylessMinMaxObserver also directly subclasses Observer (or both explicitly
subclass Observer) so the hierarchy reflects behavior; update the class
definitions for StaticMinMaxObserver and MemorylessMinMaxObserver to inherit
from Observer and remove the misleading inheritance link, keeping existing
methods (e.g., those using torch.min/torch.max and the memoryless overwrite
logic) unchanged.

In `@src/llmcompressor/observers/mse.py`:
- Around line 41-43: Add a blank line so there are two blank lines separating
the end of the MemorylessMSEObserver class (specifically after
MemorylessMSEObserver.update_statistics) and the next top-level
decorator/@Observer.register("mse"); ensure the file follows PEP8 by having two
blank lines between these top-level class definitions and decorators.

In `@tests/llmcompressor/observers/test_distributed.py`:
- Around line 29-30: The test currently asserts a raw AssertionError when
calling observer.get_qparams() which couples the test to an implementation
assert; change the pytest.raises to expect a stable, specific error type (e.g.,
ObserverError or ValueError/RuntimeError) instead of AssertionError for
observer.get_qparams() (and the similar check later in the file), and centralize
the expected message into a shared constant (e.g., NO_STATS_MSG) used by both
the implementation and the test so message changes stay synchronized.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 97abee88-c108-474d-9387-bf0e1609a2f7

📥 Commits

Reviewing files that changed from the base of the PR and between 5076faa and 107f63f.

📒 Files selected for processing (31)
  • src/llmcompressor/entrypoints/model_free/lifecycle.py
  • src/llmcompressor/entrypoints/model_free/process.py
  • src/llmcompressor/modifiers/gptq/base.py
  • src/llmcompressor/modifiers/gptq/gptq_quantize.py
  • src/llmcompressor/modifiers/quantization/calibration.py
  • src/llmcompressor/modifiers/quantization/quantization/base.py
  • src/llmcompressor/modifiers/quantization/quantization/mixin.py
  • src/llmcompressor/modifiers/transform/awq/base.py
  • src/llmcompressor/modifiers/transform/imatrix/base.py
  • src/llmcompressor/modifiers/utils/__init__.py
  • src/llmcompressor/modifiers/utils/helpers.py
  • src/llmcompressor/observers/base.py
  • src/llmcompressor/observers/helpers.py
  • src/llmcompressor/observers/hist.py
  • src/llmcompressor/observers/imatrix.py
  • src/llmcompressor/observers/min_max.py
  • src/llmcompressor/observers/moving_base.py
  • src/llmcompressor/observers/mse.py
  • src/llmcompressor/pipelines/basic/pipeline.py
  • src/llmcompressor/pipelines/sequential/pipeline.py
  • src/llmcompressor/utils/helpers.py
  • tests/llmcompressor/modifiers/calibration/test_lifecycle.py
  • tests/llmcompressor/modifiers/calibration/test_observers.py
  • tests/llmcompressor/modifiers/quantization/test_quantization_ddp.py
  • tests/llmcompressor/modifiers/utils/test_update_fused_layer_weights.py
  • tests/llmcompressor/observers/test_distributed.py
  • tests/llmcompressor/observers/test_helpers.py
  • tests/llmcompressor/observers/test_imatrix.py
  • tests/llmcompressor/observers/test_min_max.py
  • tests/llmcompressor/observers/test_mse.py
  • tests/llmcompressor/observers/test_mse_vs_minmax.py
💤 Files with no reviewable changes (3)
  • src/llmcompressor/modifiers/utils/__init__.py
  • src/llmcompressor/utils/helpers.py
  • src/llmcompressor/modifiers/utils/helpers.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/llmcompressor/pipelines/basic/pipeline.py
  • src/llmcompressor/modifiers/gptq/base.py
  • tests/llmcompressor/observers/test_mse.py
  • src/llmcompressor/modifiers/transform/imatrix/base.py
  • src/llmcompressor/modifiers/quantization/quantization/base.py

Comment on lines 206 to +213
for named_modules in fused_modules.values():
# 2.1. compute fused global scale across all members of the fused set
global_scales = [m.weight_global_scale for m in named_modules.values()]
fused_global_scale = torch.min(torch.cat(global_scales, dim=0))
# 2. fuse observers, observe weights, and get qparams
Observer.fuse([mod.weight_observer for mod in named_modules.values()])
observe(named_modules.values(), base_name="weight")
update_qparams(named_modules.values(), base_name="weight")

for name, module in named_modules.items():
module_name, _ = name.rsplit(".", 1)
module.weight_global_scale.data.copy_(fused_global_scale)

# 2.2. finish calibration with fused global scale
calibrate_scale_zp(module)
freeze_module_quantization(module)

⚠️ Potential issue | 🟡 Minor

Fused observer cleanup — dangling references after freeze.

After freeze_module_quantization(module) at Line 213 deletes each module's weight_observer, the remaining fused observers in other modules still hold references to the deleted observer via their _fused_observers list (populated by Observer.fuse). In this code path, update_qparams on the fused set has already completed before any freeze, so no incorrect results will occur. However, if the logic is ever refactored to recompute qparams after partial freezing, the stale references could surface as subtle bugs (e.g., referencing a detached observer's statistics).

Consider either clearing _fused_observers in Observer.detach/freeze_module_quantization, or documenting that freezing must only happen after all fused peers have completed qparam computation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/entrypoints/model_free/process.py` around lines 206 - 213,
After freeze_module_quantization(module) deletes a module's weight_observer,
other fused observers still hold stale references in their _fused_observers list
(created by Observer.fuse), which can lead to dangling references if qparams are
ever recomputed; fix by ensuring observers are mutually cleaned up: update
freeze_module_quantization (or Observer.detach) to also remove the removed
observer from any peers' _fused_observers lists (or clear the peer lists when
fully frozen), so that after calling freeze_module_quantization on modules in
fused_modules the Observer._fused_observers no longer contains references to
deleted weight_observer instances and future calls to update_qparams/observe
cannot see stale observers.
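If the cleanup route is preferred, the detach/freeze path could drop the peer links along these lines (sketch; assumes _fused_observers is a plain list on each observer, and the helper name is hypothetical):

def _unlink_fused_observer(observer) -> None:
    # remove this observer from its peers' fused lists, then clear its own
    for peer in list(observer._fused_observers):
        peer._fused_observers = [o for o in peer._fused_observers if o is not observer]
    observer._fused_observers = []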

Comment on lines +281 to 302
def sync_obs_act_stats(self, model: torch.nn.Module):
"""
Synchronize the activation statistics for observers
across DDP ranks. Iterates all observers
(weight, input, output, q, k, v);
note: No-op when not distributed and
most weight observers don't have activation
statistics and thus are no-ops as well.

:param model: model containing quantized modules
"""
if not is_distributed():
return

pending_comms = []
modules_to_update = []

for _, module in match_named_modules(model, self.resolved_targets, self.ignore):
for base_name in ("input", "output", "q", "k", "v"):
for base_name in ("weight", "input", "output", "q", "k", "v"):
observer = getattr(module, f"{base_name}_observer", None)
if observer is None:
continue
pending_comms.extend(observer.synchronize())
modules_to_update.append((module, base_name, observer))

pending_comms.extend(observer.sync_activation_stats())
wait_for_comms(pending_comms)

⚠️ Potential issue | 🟡 Minor

Fix Ruff W291 trailing-whitespace failure on line 284.

The CI Quality Checks pipeline fails with W291 Trailing whitespace at line 284. Separately, the docstring's note that "most weight observers don't have activation statistics" is consistent with including "weight" in the base-name iteration: sync is a no-op for those observers, and the whole method short-circuits via is_distributed() when not running distributed.

✏️ Proposed fix
     def sync_obs_act_stats(self, model: torch.nn.Module):
         """
         Synchronize the activation statistics for observers
-        across DDP ranks. Iterates all observers 
+        across DDP ranks. Iterates all observers
         (weight, input, output, q, k, v);
         note: No-op when not distributed and
         most weight observers don't have activation
         statistics and thus are no-ops as well.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def sync_obs_act_stats(self, model: torch.nn.Module):
"""
Synchronize the activation statistics for observers
across DDP ranks. Iterates all observers
(weight, input, output, q, k, v);
note: No-op when not distributed and
most weight observers don't have activation
statistics and thus are no-ops as well.
:param model: model containing quantized modules
"""
if not is_distributed():
return
pending_comms = []
modules_to_update = []
for _, module in match_named_modules(model, self.resolved_targets, self.ignore):
for base_name in ("input", "output", "q", "k", "v"):
for base_name in ("weight", "input", "output", "q", "k", "v"):
observer = getattr(module, f"{base_name}_observer", None)
if observer is None:
continue
pending_comms.extend(observer.synchronize())
modules_to_update.append((module, base_name, observer))
pending_comms.extend(observer.sync_activation_stats())
wait_for_comms(pending_comms)
def sync_obs_act_stats(self, model: torch.nn.Module):
"""
Synchronize the activation statistics for observers
across DDP ranks. Iterates all observers
(weight, input, output, q, k, v);
note: No-op when not distributed and
most weight observers don't have activation
statistics and thus are no-ops as well.
:param model: model containing quantized modules
"""
if not is_distributed():
return
pending_comms = []
for _, module in match_named_modules(model, self.resolved_targets, self.ignore):
for base_name in ("weight", "input", "output", "q", "k", "v"):
observer = getattr(module, f"{base_name}_observer", None)
if observer is None:
continue
pending_comms.extend(observer.sync_activation_stats())
wait_for_comms(pending_comms)
🧰 Tools
🪛 GitHub Actions: Quality Checks

[warning] 284-284: Ruff W291 Trailing whitespace

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/modifiers/quantization/quantization/mixin.py` around lines
281 - 302, Remove the trailing whitespace in the docstring inside
sync_obs_act_stats (the docstring line that reads "Iterates all observers "
contains the stray space); edit the docstring for the sync_obs_act_stats method
to delete that trailing space (and optionally reflow the sentence if needed) so
Ruff W291 no longer triggers while keeping the existing explanation about
weight/input/output/q/k/v observers intact.

initialize_observer(module, base_name="v")

# weight observers (used by `update_weight_zp_scale` or child modifier)
# weight observers (used by child modifier or `observer` amd `update_qparams`)

⚠️ Potential issue | 🟡 Minor

Typo in comment: "amd" → "and".

✏️ Proposed fix
-        # weight observers (used by child modifier or `observer` amd `update_qparams`)
+        # weight observers (used by child modifier or `observe` and `update_qparams`)

Also note the function name referenced should likely be observe, not observer, per the similar docstring on lines 75-76.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# weight observers (used by child modifier or `observer` amd `update_qparams`)
# weight observers (used by child modifier or `observe` and `update_qparams`)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/modifiers/quantization/quantization/mixin.py` at line 443,
Fix the typo in the comment inside quantization/mixin.py by replacing "amd" with
"and" and update the referenced function name from `observer` to `observe` to
match the existing API; ensure the comment now reads something like "weight
observers (used by child modifier or `observe` and `update_qparams`)" so it
correctly references the `observe` and `update_qparams` methods.

Comment on lines +49 to +51
# weight observers don't need to sync
if base_name == "weight":
self._act_sync_dict = {}

⚠️ Potential issue | 🟠 Major

Resetting _act_sync_dict by base_name == "weight" silently clobbers subclass declarations.

Any subclass that declares a class-level _act_sync_dict (e.g., IMatrixMSEObserver declares _imatrix_sum/_imatrix_count with SUM reductions in imatrix.py:35-38) will have that dict replaced with {} on the instance for weight observers — even when the subclass specifically needs those buffers to be DDP-synced. IMatrixMSEObserver is a weight-only observer, so its imatrix accumulators will never be all-reduced under DDP, yielding rank-local importance and inconsistent weighting.

Consider removing this blanket reset and instead letting each observer declare its own policy (weight-only min/max observers already override _act_sync_dict = {} explicitly).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/base.py` around lines 49 - 51, The current
blanket reset in the weight-observer branch replaces any subclass-declared
_act_sync_dict and silences subclass DDP sync policies; instead, if you must
default for weight observers, only set self._act_sync_dict = {} when the
subclass hasn't declared its own policy (e.g., check if '_act_sync_dict' not in
self.__class__.__dict__). Update the block around base_name == "weight" in
observers/base.py to avoid overwriting class-defined _act_sync_dict (or remove
the reset entirely) so subclasses like IMatrixMSEObserver retain their declared
accumulators; note weight-only min/max observers can still explicitly set
_act_sync_dict = {} in their class body.
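If a default is kept for weight observers, a guarded version that respects subclass declarations could look like this (sketch of the relevant __init__ lines only, mirroring the check suggested above):

# only default to "no sync" when the subclass hasn't declared its own policy
if base_name == "weight" and "_act_sync_dict" not in type(self).__dict__:
    self._act_sync_dict = {}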

Comment on lines +129 to +140
@staticmethod
def fuse(observers: Iterable["Observer"]) -> None:
"""
Link all observers in the list with each other for shared global_scale.

with align_module_device(module):
return getattr(module, f"{self.base_name}_{name}", None)
:param observers: list of observers to fuse together
"""
observers = list(observers)
for obs in observers:
for other in observers:
if other is not obs:
obs._fused_observers.append(other)

⚠️ Potential issue | 🟡 Minor

Observer.fuse is not idempotent; repeated calls duplicate peer references.

Each call to fuse appends to every observer's _fused_observers without checking existing membership. If fuse_weight_observers(model) is ever invoked more than once (e.g., re-entering start_calibration or a second run on the same model), peers will be duplicated, which in turn means compute_qparams_from_statistics at line 88-91 will redundantly fold the same fused observer's absmax multiple times — incorrect in principle, harmless for torch.max specifically, but a footgun.

✏️ Proposed fix
     @staticmethod
     def fuse(observers: Iterable["Observer"]) -> None:
         """
         Link all observers in the list with each other for shared global_scale.
 
         :param observers: list of observers to fuse together
         """
         observers = list(observers)
         for obs in observers:
             for other in observers:
-                if other is not obs:
+                if other is not obs and other not in obs._fused_observers:
                     obs._fused_observers.append(other)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/base.py` around lines 129 - 140, Observer.fuse
currently appends peers into each observer's _fused_observers every call,
causing duplicate entries on repeated fuse_weight_observers(model) or
start_calibration runs; make fuse idempotent by ensuring no duplicates are
added: either clear/reset each observer._fused_observers before populating or
check "if other not in obs._fused_observers" before appending, so
Observer.fuse(observers: Iterable["Observer"]) leaves each _fused_observers list
with unique peers and prevents repeated contributions to
compute_qparams_from_statistics.

Comment on lines +1 to +48
import torch

from llmcompressor.observers.base import Observer
from llmcompressor.observers.helpers import flatten_for_calibration

__all__ = ["HistogramObserver"]


@Observer.register("histogram")
class HistogramObserver(Observer):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
observer_kwargs = self.args.observer_kwargs
self.bins = observer_kwargs.get("bins", 2048)
self.histogram = None
self.xmax = -torch.inf

@torch.no_grad
def forward(self, observed: torch.Tensor) -> ScaleZpTuple:
"""
Calculate updated scales and zero points from observed value
(weight, activation, or attention state).

:param observed: value being observed
:return: calibrated scale and zero point
"""
self._add_batch(observed)
return (None, None)

def _add_batch(self, observed: torch.Tensor):
g_idx = self._get_module_param(f"{self.base_name}_g_idx")
observed = flatten_for_calibration(observed, self.base_name, self.args, g_idx)
xmin, xmax = torch.aminmax(observed, dim=[0, 1]) #
xmax = xmax * (xmax > -xmin) - xmin(xmax <= -xmin)

new_max = torch.max(self.xmax, xmax)

if self.histogram is None:
return
# init histogram
# hist = torch.histdd(observed,

# need the shape to have groups as final dimension for histogramdd

# probably just need to store the max of all the bins

# should probably just inherit from minmax observer and steal
# that functionality and tack this functionality on at the end

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify flatten_for_calibration signature has no g_idx
ast-grep --pattern $'def flatten_for_calibration($$$) -> $_:
    $$$
'

# Verify Observer.forward contract
rg -nP -C3 '@torch\.no_grad|def forward\b' src/llmcompressor/observers/base.py

# Check if HistogramObserver is referenced anywhere else (to assess blast radius)
rg -nP 'HistogramObserver|"histogram"' --type=py

Repository: vllm-project/llm-compressor

Length of output: 3651


🏁 Script executed:

# Check if ScaleZpTuple is defined or imported anywhere
rg -nP 'ScaleZpTuple' --type=py

# Check Observer base class for required abstract methods
rg -nP 'def _update_statistics|def compute_gparams_from_statistics|def compute_qparams_from_statistics|@abstractmethod' src/llmcompressor/observers/base.py -A 2

Repository: vllm-project/llm-compressor

Length of output: 434


🏁 Script executed:

# Get full view of Observer abstract methods
rg -nP '^class Observer|@abstractmethod' src/llmcompressor/observers/base.py -A 2 | head -60

# Check what methods HistogramObserver implements
rg -nP 'def \w+' src/llmcompressor/observers/hist.py

# Verify the line 34 and line 36 issues directly
sed -n '32,40p' src/llmcompressor/observers/hist.py

Repository: vllm-project/llm-compressor

Length of output: 816


🏁 Script executed:

# Get the full Observer class
sed -n '27,120p' src/llmcompressor/observers/base.py

Repository: vllm-project/llm-compressor

Length of output: 3599


HistogramObserver is incomplete/broken — should not ship in this PR.

Multiple critical issues make this observer non-functional and inconsistent with the new observer contract:

  1. Line 19: ScaleZpTuple is undefined (not imported or defined anywhere in the codebase). Also, forward must return the observer instance (self) per the Observer.forward contract in base.py, not a tuple or (None, None).
  2. Line 28: Returning (None, None) violates the refactored contract where forward must return self for method chaining (observer(x).get_qparams()), breaking any caller.
  3. Line 32: flatten_for_calibration(observed, self.base_name, self.args, g_idx) passes 4 positional args, but this function only accepts 3 parameters (no g_idx parameter exists in this PR) — will raise TypeError at runtime.
  4. Line 34: xmin(xmax <= -xmin) treats the tensor xmin as callable and will raise TypeError. Likely intended: torch.maximum(xmax, -xmin).
  5. Line 36: new_max is assigned but never used (Ruff F841 / pipeline failure).
  6. Missing required abstract method: HistogramObserver does not implement update_statistics, which is the only required abstract method per the Observer base class. This will raise TypeError on instantiation.
  7. The histogram accumulation/computation logic is entirely commented out (Lines 38-48), so even if the above bugs were fixed, the observer would do nothing useful.

Recommend removing hist.py from this PR and landing it separately when the implementation is complete, or clearly marking it as an in-progress stub behind a feature flag.

🧰 Tools
🪛 GitHub Actions: Quality Checks

[error] 19-19: Ruff F821 Undefined name ScaleZpTuple


[error] 36-36: Ruff F841 Local variable new_max is assigned to but never used

🪛 Ruff (0.15.11)

[error] 19-19: Undefined name ScaleZpTuple

(F821)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/hist.py` around lines 1 - 48, HistogramObserver
is currently broken and non-conformant: change forward(self, observed:
torch.Tensor) to return self (not ScaleZpTuple or (None,None)) and ensure
ScaleZpTuple is not referenced; fix _add_batch usage of flatten_for_calibration
by calling it with the correct signature (remove the extra g_idx arg) and
replace the erroneous xmin(xmax <= -xmin) expression with torch.maximum(xmax,
-xmin) (or equivalent) so xmin/xmax logic is valid; remove or use the computed
new_max (avoid unused variable) and implement the required abstract method
update_statistics(self, observed: torch.Tensor) to satisfy Observer, plus
restore/implement the histogram accumulation logic (or gate/remove the file from
the PR if you intend to land it later) so the observer actually updates state
instead of being a no-op.

Comment on lines +35 to +38
_act_sync_dict = {
"_imatrix_sum": dist.ReduceOp.SUM,
"_imatrix_count": dist.ReduceOp.SUM,
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Confirm that the base class unconditionally resets _act_sync_dict for weight observers
rg -nP --type=py -C3 '_act_sync_dict' src/llmcompressor/observers/
# And confirm imatrix is only used as a weight observer
rg -nP --type=py -C3 '"imatrix_mse"|IMatrixMSEObserver'

Repository: vllm-project/llm-compressor

Length of output: 17114


🏁 Script executed:

sed -n '150,170p' src/llmcompressor/observers/imatrix.py

Repository: vllm-project/llm-compressor

Length of output: 903


_act_sync_dict declaration in IMatrixMSEObserver is dead code due to base class override.

Observer.__init__ (base.py:49-51) unconditionally sets self._act_sync_dict = {} when base_name == "weight". Because IMatrixMSEObserver only supports weight observers—enforced by validation in _get_validated_importance() (lines 157-170)—the class-level _act_sync_dict at lines 35-38 is never used. Under DDP, _imatrix_sum and _imatrix_count will remain rank-local.

If cross-rank importance aggregation is required in multi-GPU training, override sync_activation_stats() to explicitly reduce these buffers. If not needed, remove the class-level _act_sync_dict declaration to avoid confusion.

🧰 Tools
🪛 Ruff (0.15.11)

[warning] 35-38: Mutable default value for class attribute

(RUF012)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/imatrix.py` around lines 35 - 38,
IMatrixMSEObserver's class-level _act_sync_dict is dead because
Observer.__init__ overwrites self._act_sync_dict when base_name == "weight";
remove the misleading class attribute or implement explicit cross-rank
reduction. Either delete the _act_sync_dict declaration (remove _imatrix_sum and
_imatrix_count from the class-level map) to avoid confusion, or override
sync_activation_stats() in IMatrixMSEObserver to perform a dist.all_reduce
(using dist.ReduceOp.SUM) on the buffers named _imatrix_sum and _imatrix_count
so they are aggregated across ranks when using DDP; reference
IMatrixMSEObserver, _act_sync_dict, sync_activation_stats, _imatrix_sum,
_imatrix_count, and Observer.__init__ to locate the changes.

Comment on lines +115 to +118
Case 1 – accumulators present on module: leave them for next
observer's ``attach()`` to pick up.

Case 2 – only ``_imatrix_importance`` present (no accumulators): this
is the final cleanup pass — delete it so it doesn't end up in the
checkpoint.
Case 2 – no accumulators (second-pass cleanup): nothing to do.

⚠️ Potential issue | 🟡 Minor

Replace en-dash with hyphen in docstring.

Ruff RUF002 flags lines 115 and 118: ambiguous – (EN DASH) in docstrings. Substitute with - (HYPHEN-MINUS) to avoid the warning and reduce the risk of downstream parsing surprises.

🧰 Tools
🪛 Ruff (0.15.11)

[warning] 115-115: Docstring contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF002)


[warning] 118-118: Docstring contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF002)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/observers/imatrix.py` around lines 115 - 118, Replace the
EN DASH characters in the imatrix.py docstring lines that read "Case 1 –
accumulators..." and "Case 2 – no accumulators..." with regular hyphen-minus "-"
characters to satisfy RUF002; update the docstring text near the observer's
attach() description (the block mentioning "Case 1" and "Case 2") so both dashes
are simple hyphens.

DisableQuantization,
calibration_forward_context,
)
from llmcompressor.utils.helpers import calibration_forward_context, DisableQuantization

⚠️ Potential issue | 🟡 Minor

Fix unsorted import flagged by Ruff I001.

The pipeline failure indicates the import block is unsorted/unformatted. Place DisableQuantization alphabetically before calibration_forward_context:

🛠️ Proposed fix
-from llmcompressor.utils.helpers import calibration_forward_context, DisableQuantization
+from llmcompressor.utils.helpers import DisableQuantization, calibration_forward_context
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-from llmcompressor.utils.helpers import calibration_forward_context, DisableQuantization
+from llmcompressor.utils.helpers import DisableQuantization, calibration_forward_context
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/pipelines/sequential/pipeline.py` at line 19, The import
statement is flagged as unsorted by Ruff I001; reorder the two imported names so
DisableQuantization appears before calibration_forward_context in the
from-import line (i.e., import DisableQuantization, calibration_forward_context)
to satisfy alphabetical import ordering and reformat the import block
accordingly.

Comment on lines 58 to +69
@pytest.mark.unit
@patch("llmcompressor.observers.base.dist")
def test_static_synchronize_with_global_state(mock_dist):
mock_dist.ReduceOp.MIN = "MIN"
mock_dist.ReduceOp.MAX = "MAX"
mock_dist.all_reduce.return_value = MagicMock()

observer = _make_observer(StaticMinMaxObserver)
observer.past_min_vals = torch.tensor([-1.0])
observer.past_max_vals = torch.tensor([1.0])
observer.past_global_min_vals = torch.tensor([-2.0])
observer.past_global_max_vals = torch.tensor([2.0])
observer.min_vals = torch.tensor([-1.0])
observer.max_vals = torch.tensor([1.0])

comms = observer.synchronize()
assert len(comms) == 4
assert mock_dist.all_reduce.call_count == 4
comms = observer.sync_activation_stats()
assert len(comms) == 2
assert mock_dist.all_reduce.call_count == 2
⚠️ Potential issue | 🟡 Minor

Duplicate test — test_static_synchronize_with_global_state is identical to test_static_synchronize_issues_all_reduce.

This test has the same setup and assertions as test_static_synchronize_issues_all_reduce above (lines 39–55) and does nothing to exercise "global state". Given the PR introduces TENSOR_GROUP/global_scale paths (see test_get_qparams_with_tensor_group_strategy below), this test was likely intended to verify that sync_activation_stats() also synchronizes a global-scale / global statistic but currently doesn't. Either remove it as a duplicate or extend it to actually cover the global-state path (e.g., configure a tensor_group observer, seed the global statistic, and assert an additional all_reduce call for that tensor).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/llmcompressor/observers/test_distributed.py` around lines 58 - 69, The
test test_static_synchronize_with_global_state duplicates
test_static_synchronize_issues_all_reduce and should either be removed or
updated to exercise the global-state path: modify the test to create an observer
configured with the TENSOR_GROUP / tensor_group strategy (instead of
StaticMinMaxObserver), seed its global statistic (e.g., global_scale or the
observer's tensor-group-specific state), call observer.sync_activation_stats(),
and assert that mock_dist.all_reduce was called for both per-tensor stats and
the global statistic (i.e., call_count increased by one more); ensure you
reference and exercise the sync_activation_stats method and the observer
configured for tensor_group/global_scale in the updated test.

- Set global_scale in lifecycle tests (required for tensor_group quant)
- Fix typo: linear.linear.input_observer -> linear.input_observer
- Pass weight arg to observer calls (forward() now requires observed)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mergify mergify Bot removed the quality-failed label Apr 25, 2026
@mergify

mergify Bot commented Apr 25, 2026

The quality checks have failed. Please run make style and make quality from the
repository root to address the lint failures. You will need to install the dev
optional dependencies to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
tests/llmcompressor/modifiers/utils/test_update_fused_layer_weights.py (2)

47-50: Optional: avoid mutating global RNG in __init__.

torch.manual_seed(42) (and 99 in MockMLPModule) re-seeds the process-wide RNG every time a mock module is instantiated, which can leak into and affect ordering-dependent outcomes of other tests run in the same process. Consider using a local torch.Generator and passing it to torch.randn to keep determinism scoped to this fixture.

♻️ Suggested change
-        torch.manual_seed(42)
-        self.q_proj.weight.data = torch.randn(64, 64) * 1.0
-        self.k_proj.weight.data = torch.randn(64, 64) * 2.0  # larger magnitude
-        self.v_proj.weight.data = torch.randn(64, 64) * 0.5  # smaller magnitude
+        gen = torch.Generator().manual_seed(42)
+        self.q_proj.weight.data = torch.randn(64, 64, generator=gen) * 1.0
+        self.k_proj.weight.data = torch.randn(64, 64, generator=gen) * 2.0  # larger magnitude
+        self.v_proj.weight.data = torch.randn(64, 64, generator=gen) * 0.5  # smaller magnitude

(apply the same pattern in MockMLPModule.__init__).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/llmcompressor/modifiers/utils/test_update_fused_layer_weights.py`
around lines 47 - 50, The test mutates the global RNG by calling
torch.manual_seed(42) in the fixture; instead create a local torch.Generator(),
call gen.manual_seed(42), and pass generator=gen into torch.randn used for
q_proj/k_proj/v_proj weights (and similarly replace torch.manual_seed(99) in
MockMLPModule.__init__ with a local Generator there) so the seeded randomness is
scoped to the mock and doesn't affect global test RNG state.

173-176: Minor: assertion reaches into a private attribute.

layer.weight_observer._fused_observers is a private field. While no public accessor exists to check fused observer state, this test will break silently if the internal implementation changes (e.g., renamed, replaced with a property). This is acceptable for an internal-API unit test, but consider advocating for a public method like get_fused_observers() if this pattern needs wider use elsewhere.
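
A public accessor along the lines this comment suggests could be as small as a read-only property. The stand-in class below only illustrates the shape of the change; the real Observer lives in llmcompressor.observers.base and is not reproduced here.

```python
from typing import List, Tuple


class Observer:  # illustrative stand-in, not the library's class
    def __init__(self) -> None:
        self._fused_observers: List["Observer"] = []

    @property
    def fused_observers(self) -> Tuple["Observer", ...]:
        """Read-only view of sibling observers fused into this one."""
        return tuple(self._fused_observers)


# the test assertion could then avoid the private field:
# assert layer.weight_observer.fused_observers == ()
```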

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/llmcompressor/modifiers/utils/test_update_fused_layer_weights.py`
around lines 173 - 176, The test currently inspects the private field
layer.weight_observer._fused_observers; change it to use a public accessor
instead by adding a small public method/property to the observer (e.g.,
get_fused_observers or fused_observers) and update the test to call that (for
the observers on module.q_proj, module.k_proj, module.v_proj) to assert the
returned collection is empty; reference the existing observer instance as
weight_observer and the private field _fused_observers when implementing the new
accessor so the test no longer reaches into private state.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ab913577-4c05-45d7-823a-dd5d6bb82e97

📥 Commits

Reviewing files that changed from the base of the PR and between 107f63f and aba3635.

📒 Files selected for processing (2)
  • tests/llmcompressor/modifiers/calibration/test_lifecycle.py
  • tests/llmcompressor/modifiers/utils/test_update_fused_layer_weights.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/llmcompressor/modifiers/calibration/test_lifecycle.py
