26 changes: 24 additions & 2 deletions auto_round/compressors/utils.py
@@ -866,8 +866,30 @@ def get_fp_layer_names(model: torch.nn.Module, ignore_layers: str):
"""
from auto_round.utils import SUPPORTED_LAYER_TYPES

if not ignore_layers:
return []
not_to_quantized_layers = []

for n, m in model.name_modules():
if is_fp8_linear(m):
not_to_quantized_layers.append(n)
logger.trace(f"Auto-detected FP8 layer to ignore : {n}")

if ignore_layers:
ignore_list = ignore_layers.replace(" ", "").split(",")

Contributor:
Hi @scopophobic, thanks for your interest in fixing this issue! I think there might be a bit of a misunderstanding.
We don’t want to skip all FP8 layers. The idea is that we start with an FP8 model and want to requantize it to another format, like W4A16. However, we don’t want certain layers—such as those inside the attention module—to be quantized to W4A16.
The fix here is aligned with what we’re aiming for. #1286
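
For readers following along, a minimal sketch of that intent, using the helper touched by this diff. The ToyBlock/ToyModel classes, the "self_attn" fragment, and the expected output are illustrative assumptions, not part of the PR:

import torch.nn as nn
from auto_round.compressors.utils import get_fp_layer_names

class ToyBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.self_attn = nn.Linear(8, 8)  # stands in for the attention projections
        self.mlp = nn.Linear(8, 8)

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([ToyBlock() for _ in range(2)])

model = ToyModel()
# Expand the user-facing "self_attn" fragment into the concrete module names
# that should stay out of the W4A16 requantization pass.
print(get_fp_layer_names(model, "self_attn"))
# Expected to include names like 'layers.0.self_attn' and 'layers.1.self_attn';
# the exact output depends on the auto_round version and the real model.

The returned names would then be handed to the quantizer's layer-exclusion mechanism, while everything else proceeds to the low-bit format.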

Contributor (@yiliu30, Jan 16, 2026):
Hi @scopophobic, would you be interested in working on the remaining part of this issue? #1283 (comment)

Author:

Hi @yiliu30, thanks a lot for the clarification; that cleared up a misunderstanding I had 👍

I now understand that the goal is not to skip all FP8 layers, but to start from an FP8 model and re-quantize it (e.g., to W4A16), while keeping specific submodules (like attention) from being quantized.

I’m definitely interested in working on the remaining part of #1283. My current thought is to make FP8 detection more robust by moving away from class-name checks (like "FP8Linear") and instead relying on explicit FP8 characteristics (e.g., presence of FP8 scale metadata used during dequantization). This would allow supporting multiple FP8 layer implementations without brittle heuristics.

Does this approach sound aligned with what you had in mind for this issue?
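
A rough sketch of what such a structural check could look like. The helper name, the scale attribute names, and the reliance on PyTorch >= 2.1 FP8 dtypes are assumptions for illustration, not auto_round's actual API:

import torch
import torch.nn as nn

# FP8 dtypes introduced in PyTorch 2.1.
_FP8_DTYPES = {torch.float8_e4m3fn, torch.float8_e5m2}

def looks_like_fp8_linear(module: nn.Module) -> bool:
    """Detect FP8 linear layers from their tensors instead of their class name."""
    weight = getattr(module, "weight", None)
    if not isinstance(weight, torch.Tensor):
        return False
    # Primary signal: the weight itself is stored in an FP8 dtype.
    if weight.dtype in _FP8_DTYPES:
        return True
    # Weaker secondary signal: many FP8 checkpoints keep a dequantization scale
    # next to the weight; these attribute names are assumptions and differ
    # between implementations, so a real check would need to be stricter.
    return any(hasattr(module, name) for name in ("weight_scale", "weight_scale_inv"))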

        for fp_layer in ignore_list:
            if not fp_layer:
                continue
            for n, _ in model.named_modules():
                match_name = fp_layer
                # A trailing digit (e.g. "layers.1") needs a "." appended so it
                # does not also match "layers.10", "layers.11", ...
                if fp_layer[-1].isdigit():
                    match_name += "."
                if match_name in n:
                    if n not in not_to_quantized_layers:
                        not_to_quantized_layers.append(n)
                        logger.trace(f"User-specified ignore layer matched: {n}")

    logger.trace(f"not_to_quantized_layers: {not_to_quantized_layers}")
    return not_to_quantized_layers

    ignore_layers = ignore_layers.replace(" ", "").split(",")
    all_layer_names = []
    for n, m in model.named_modules():
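
As a side note on the matching logic in the hunk above, here is a tiny standalone illustration of why a trailing digit gets a "." appended: a pattern like "layers.1" should select block 1 but not block 10. The sample names are made up for the example.

module_names = [
    "model.layers.1.self_attn.q_proj",
    "model.layers.10.self_attn.q_proj",
]

def matches(pattern: str, name: str) -> bool:
    # Mirror of the diff's rule: append "." when the pattern ends in a digit.
    match_name = pattern + "." if pattern[-1].isdigit() else pattern
    return match_name in name

print([n for n in module_names if matches("layers.1", n)])
# ['model.layers.1.self_attn.q_proj'] -- "layers.10" is correctly left out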