Phase 2: replace dict-style quant_config access with LayerQuantConfig #274

Open

thpereir wants to merge 1 commit into main from thpereir/quark_quant_layer_phase2

Conversation

@thpereir (Contributor) commented Mar 6, 2026


Motivation

Following up on the DeepSeek R1 enablement with per-layer quantization support, this PR converts all other models to use the same structured quantization config instead of a dict.

This PR depends on both #236 and #268.

Technical Details

  1. Clean type boundary — `LayerQuantConfig` is a frozen dataclass, so fields cannot be accidentally mutated or misspelled.
  2. Single resolution API — `resolve(prefix)` handles the exclude list, per-layer overrides, pattern matching, and global fallback in one call with a documented priority order. This is less error-prone than the previous approach of indexing into the dict: what was `quant_config.get_layer_quant_config(f"{prefix}.fused_qkv_a_proj")` becomes `quant_config.resolve(prefix)`.
  3. Parser registry — `@register_quant_parser("quark")` means new quantization formats are self-contained modules. With the current approach, adding a new quantization method means editing `QuantizationConfig.__init__` directly.


@thpereir thpereir force-pushed the thpereir/quark_quant_layer_phase2 branch from 66e940b to 873a969 Compare March 10, 2026 16:29
@thpereir thpereir changed the base branch from thpereir/quark_quant_layer to main March 10, 2026 16:30
@thpereir thpereir force-pushed the thpereir/quark_quant_layer_phase2 branch 2 times, most recently from 8801fb5 to 0c9fd54 Compare March 10, 2026 17:06
@thpereir thpereir changed the title Phase 2: replace dict-style quant_config access with LayerQuantSpec a… Phase 2: replace dict-style quant_config access with LayerQuantConfig Mar 10, 2026
@thpereir thpereir marked this pull request as ready for review March 11, 2026 16:35
@thiagocrepaldi

Is it possible to request a regression test for this PR or for the current main branch?

Copilot AI left a comment

Pull request overview

This PR completes the Phase 2 migration away from dict-style quantization config access by introducing a typed, immutable LayerQuantConfig and routing quant config parsing through a parser registry, then updating models/ops/docs/tests to use the new API.

Changes:

  • Introduce atom/quant_spec.py with LayerQuantConfig (frozen dataclass), ParsedQuantConfig, and a quant-config parser registry (QuarkParser, GenericParser).
  • Refactor QuantizationConfig to parse via the registry and expose typed per-layer resolution via get_layer_quant_config().
  • Update model/ops code and docs/tests to use attribute access (e.g., .quant_dtype) instead of dict keys.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.

Show a summary per file

| File | Description |
|---|---|
| `atom/quant_spec.py` | Adds typed quant spec + parser registry and built-in parsers. |
| `atom/config.py` | Switches quant parsing to registry + typed resolution; removes dict-based parsing paths. |
| `atom/models/llama.py` | Uses per-layer resolved quant type instead of global dict access. |
| `atom/models/deepseek_v2.py` | Replaces dict indexing with typed attribute access for resolved specs. |
| `atom/models/deepseek_mtp.py` | Replaces dict indexing with typed attribute access. |
| `atom/model_ops/moe.py` | Updates MoE methods to use typed LayerQuantConfig attributes. |
| `atom/model_ops/linear.py` | Updates LinearBase plumbing to use typed spec attributes. |
| `atom/model_ops/layernorm.py` | Updates RMSNorm to resolve per-layer spec via prefix. |
| `atom/model_ops/activation.py` | Updates SiluAndMul to resolve per-layer spec via prefix. |
| `tests/test_quant_config.py` | Updates/refactors tests for typed API and parser registry. |
| `docs/configuration_guide.md` | Updates docs to reflect LayerQuantConfig as a frozen dataclass and new parsing path. |
| `docs/architecture_guide.md` | Removes LayerQuantConfig from atom/config.py ownership list. |
Comments suppressed due to low confidence (1)

docs/configuration_guide.md:139

  • The LayerQuantConfig table documents quant_method as type str with default "", but the implementation is quant_method: str | None = None. Update the docs to reflect that the field may be None and clarify what None vs an empty string means for downstream logic.
| Field | Type | Default | Description |
|---|---|---|---|
| `quant_type` | `QuantType` | `QuantType.No` | Quantization granularity (see below) |
| `quant_dtype` | `torch.dtype` | `torch.bfloat16` | Data type for quantized weights |
| `is_dynamic` | `bool` | `True` | Use dynamic quantization (scales computed at runtime) |
| `quant_method` | `str` | `""` | Quantization method (e.g., `"quark"`, `"compressed-tensors"`) |


@@ -99,7 +99,7 @@ def __init__(
self.act_fn = SiluAndMul(
    fused_quant=self.fused_act_quant, quant_config=quant_config
Copilot AI Mar 12, 2026

SiluAndMul now resolves its quant spec from quant_config.get_layer_quant_config(prefix), but this call site does not pass prefix. In mixed per-layer quant setups (e.g., MXFP4 MLP), self.quant_type is resolved from the layer prefix and later used to decide whether to unpack (x, scale) from self.act_fn(...); if SiluAndMul was initialized with the wrong (global) spec it may return a tensor while the caller tries to unpack it, causing incorrect behavior or runtime errors. Pass the same layer prefix into SiluAndMul (or derive the condition from self.act_fn.quant_type) so the fused-quant return contract matches the caller logic.

Suggested change
fused_quant=self.fused_act_quant, quant_config=quant_config
fused_quant=self.fused_act_quant,
quant_config=quant_config,
prefix=prefix,

Comment on lines 271 to +274
ATOM_LLAMA_ENABLE_AITER_TRITON_FUSED_RMSNORM_QUANT
)

self.quant_type = quant_config.global_quant_config["quant_type"]
self.quant_type = quant_config.get_layer_quant_config(prefix).quant_type
Copilot AI Mar 12, 2026

self.quant_type is now resolved per-layer via get_layer_quant_config(prefix). Any fused ops in this decoder layer that conditionally return (tensor, scale) based on quant type (e.g., RMSNorm/activation) need to be initialized with the same prefix as well; otherwise the module’s internal resolved quant type can diverge from self.quant_type, and the subsequent tuple-unpacking logic can break. Ensure the prefix is plumbed consistently to the fused submodules that resolve quant specs.

quant_method_str = (
    layer_quant_config.quant_method if layer_quant_config else None
)
if layer_quant_config.quant_type == QuantType.No:
Copilot AI Mar 12, 2026

layer_quant_config can be None when quant_config is None, but the code unconditionally accesses layer_quant_config.quant_type / .quant_dtype below. This will raise an AttributeError for unquantized MoE usage. Initialize layer_quant_config to a default LayerQuantConfig() when quant_config is None, or guard all later accesses and treat the None case as QuantType.No.

Suggested change
if layer_quant_config.quant_type == QuantType.No:
if layer_quant_config is None or layer_quant_config.quant_type == QuantType.No:

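The None-safe fallback the reviewer suggests can be sketched as a small helper; names and the `QuantType`/`LayerQuantConfig` stand-ins below are illustrative, not the actual `atom` code:

```python
# Sketch: treat a missing quant_config as "unquantized" instead of None,
# so callers can access .quant_type unconditionally.
from dataclasses import dataclass
from enum import Enum

class QuantType(Enum):
    No = 0
    per_Token = 1

@dataclass(frozen=True)
class LayerQuantConfig:
    quant_type: QuantType = QuantType.No

def resolve_layer_config(quant_config, prefix: str) -> LayerQuantConfig:
    # Fall back to a default (QuantType.No) config rather than None,
    # avoiding AttributeError in unquantized MoE paths.
    if quant_config is None:
        return LayerQuantConfig()
    return quant_config.get_layer_quant_config(prefix)

cfg = resolve_layer_config(None, "model.layers.0.mlp.experts")
print(cfg.quant_type is QuantType.No)  # True
```

Either this default-object approach or the explicit `layer_quant_config is None or ...` guard in the suggestion resolves the crash; the default object keeps every downstream access branch-free.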
# 3. Pattern match
for pattern, spec in self._parsed.layer_pattern_specs:
    if "*" not in pattern:
        if layer_name in pattern:
Copilot AI Mar 12, 2026

In get_layer_quant_config, the non-glob pattern branch uses if layer_name in pattern:. This is reversed for the intended semantics (exact match or pattern being a substring/prefix of the layer name) and can both fail to match valid overrides and accidentally match short layer_name values (e.g., "l" matches "lm_head"). Replace this with an explicit equality check (layer_name == pattern) or a clear prefix/substring rule (pattern in layer_name) consistent with the documented resolution order.

Suggested change
if layer_name in pattern:
if pattern in layer_name:

Comment on lines +212 to +229
def parse(self, hf_quant_config: dict) -> ParsedQuantConfig:
    quant_method = hf_quant_config.get("quant_method", "")
    config_str = str(hf_quant_config).lower()

    quant_dtype = self._infer_dtype(hf_quant_config, config_str)
    quant_type = self._infer_qtype(hf_quant_config, config_str)
    is_dynamic = hf_quant_config.get("is_dynamic", True)
    exclude = list(hf_quant_config.get("exclude", []) or [])

    global_spec = LayerQuantConfig(
        quant_type=quant_type,
        quant_dtype=quant_dtype,
        is_dynamic=is_dynamic,
        quant_method=quant_method or None,
    )

    return ParsedQuantConfig(global_spec=global_spec, exclude_layers=exclude)

Copilot AI Mar 12, 2026

GenericParser.parse() only reads excluded layers from the exclude key. The previous implementation (and current docs) reference ignore as the exclude-list key for compressed-tensors/other formats, so this change will silently stop honoring excludes for those configs. Consider supporting both keys (e.g., exclude with fallback to ignore) for backward compatibility.

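A backward-compatible lookup along the lines the reviewer suggests could look like this sketch. The key names come from the PR discussion (`"exclude"` for Quark, `"ignore"` for compressed-tensors, `"modules_to_not_convert"` for gpt-oss/HF); the priority order and helper name are assumptions:

```python
# Sketch: try each known exclude-list key in priority order so excluded
# layers are never silently treated as quantized.
EXCLUDE_KEYS = ("exclude", "ignore", "modules_to_not_convert")

def read_exclude_layers(hf_quant_config: dict) -> list[str]:
    """Return the first non-empty exclude list found, else []."""
    for key in EXCLUDE_KEYS:
        value = hf_quant_config.get(key)
        if value:
            return list(value)
    return []

print(read_exclude_layers({"ignore": ["lm_head"]}))   # ['lm_head']
print(read_exclude_layers({"quant_method": "gptq"}))  # []
```

The commit message later in this PR adopts exactly this kind of multi-key fallback for GenericParser.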
Comment on lines 187 to 193

**For quark models** (`quant_method == "quark"`):

1. Parses `global_quant_config` dict via `parse_quark_config_dict()` to produce the global `LayerQuantConfig`.
1. Parses `global_quant_config` dict via `QuarkParser` to produce the global `LayerQuantConfig`.
2. Parses each entry in `layer_quant_config` dict to produce per-layer overrides.
3. Reads the `"exclude"` list for excluded layers.
4. Within each config dict, `weight.qscheme` determines `quant_type` (`"per_channel"` → `per_Token`, `"per_tensor"` → `per_Tensor`, `"per_group"` → `per_1x32`), and `weight.dtype` determines `quant_dtype`.
Copilot AI Mar 12, 2026

This quark parsing section was updated to reference QuarkParser, but the immediately following “For other models” section in this doc still describes the old parse_other_config heuristics and says excludes come from the "ignore" key. Since QuantizationConfig now delegates non-quark parsing to GenericParser (which currently reads exclude, not ignore), the documentation is now internally inconsistent—please update the non-quark section (or adjust GenericParser) to match.

Comment on lines +212 to +220
def parse(self, hf_quant_config: dict) -> ParsedQuantConfig:
    quant_method = hf_quant_config.get("quant_method", "")
    config_str = str(hf_quant_config).lower()

    quant_dtype = self._infer_dtype(hf_quant_config, config_str)
    quant_type = self._infer_qtype(hf_quant_config, config_str)
    is_dynamic = hf_quant_config.get("is_dynamic", True)
    exclude = list(hf_quant_config.get("exclude", []) or [])

Copilot AI Mar 12, 2026

GenericParser.parse() introduces new heuristics for non-quark quant configs, but current tests only cover registry lookup and Quark parsing. Add unit tests that exercise GenericParser parsing (dtype/qtype inference and exclude-list handling) to prevent regressions for compressed-tensors/GPTQ/AWQ-style configs.

@haoyangli0109 (Contributor) commented Mar 13, 2026

@thpereir
Thank you very much.
It's great to see that several of the directions we discussed were implemented quickly. If we modify the code for other quantization formats, we should also select the corresponding models for accuracy testing.

@thpereir thpereir force-pushed the thpereir/quark_quant_layer_phase2 branch 2 times, most recently from dffe673 to 3389cee Compare March 14, 2026 02:14
… for per-layer quant config

Introduce atom/quant_spec.py with:
- LayerQuantConfig: frozen dataclass with typed attribute access
  (quant_type, quant_dtype, is_dynamic, quant_method) replacing
  the old dict-based LayerQuantConfig(dict) subclass
- ParsedQuantConfig: structured output of HF config parsing
- Parser registry (@register_quant_parser) with QuarkParser and
  GenericParser (fallback for compressed-tensors, GPTQ, AWQ, etc.)

Refactor QuantizationConfig (atom/config.py):
- Internal storage now uses ParsedQuantConfig via parser registry
- get_layer_quant_config(prefix) -> LayerQuantConfig (frozen dataclass)
- global_quant_config property -> LayerQuantConfig
- Convenience properties: quant_type, quant_dtype, is_dynamic
- compute_hash() uses typed internal structures

Migrate all consumers to typed attribute access:
- linear.py: layer_quant_config.quant_type instead of ["quant_type"]
- moe.py: all MoE method classes use LayerQuantConfig type hints
- activation.py, layernorm.py: accept prefix param, use
  get_layer_quant_config() instead of bypassing with global_quant_config
- deepseek_mtp.py, deepseek_v2.py, llama.py: use get_layer_quant_config()

Fix GenericParser exclude-layer key handling (atom/quant_spec.py):
- Different quantizers use different keys for excluded layers:
  compressed-tensors uses "ignore", gpt-oss/HF uses "modules_to_not_convert",
  Quark uses "exclude"
- GenericParser now tries all three keys in priority order so excluded
  layers are never silently treated as quantized

Fix hard-coded quant_config=None across models:
- gpt_oss.py OAIAttention: qkv_proj and o_proj were passing quant_config=None,
  preventing fp8/mxfp4 quantization on attention projections in Quark
  gpt-oss models (e.g. fp8 qkv + mxfp4 MoE); both now receive quant_config
- deepseek_v2.py Indexer: weights_proj passed quant_config=None while sibling
  linears wq_b and wk correctly used quant_config; fixed for consistency
- qwen3_next.py GatedDeltaNet: conv1d ColumnParallelLinear omitted
  quant_config while other linears in the same class passed it; fixed
@thpereir thpereir force-pushed the thpereir/quark_quant_layer_phase2 branch from 3389cee to ca3e23e Compare March 14, 2026 02:17