fix(quant): load W8A8 int8 checkpoints with per-shard static input scales#29530

Open
Sunt-ing wants to merge 1 commit into
sgl-project:mainfrom
Sunt-ing:6

@Sunt-ing Sunt-ing commented Jun 27, 2026

Contributor

Motivation

Loading a compressed-tensors W8A8 int8 checkpoint with static activation scales (e.g. nm-testing/tinyllama-w8a8-compressed) crashes during model load:

IndexError: index 1 is out of bounds for dimension 0 with size 1

CompressedTensorsW8A8Int8.create_weights allocates the static input_scale / input_zero_point as one-slot PerTensorScaleParameters. A fused layer such as Llama gate_up_proj has separate gate_proj.input_scale and up_proj.input_scale shards; the first loads into shard 0, the second into shard 1, but the parameter has length 1, so the load fails.
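The failure mode described above can be sketched without any GPU or sglang dependency. The snippet below is a minimal pure-Python model, not the loader's real code: `ToyScaleParameter`, `load_into_shard`, and `load_fused_input_scales` are hypothetical names, and the scale values are made up for illustration.

```python
# Toy stand-in for a per-tensor scale parameter. This is NOT sglang's
# PerTensorScaleParameter; it only mimics "one slot per logical shard".
class ToyScaleParameter:
    def __init__(self, num_slots):
        self.data = [float("inf")] * num_slots

    def load_into_shard(self, shard_id, value):
        # Mirror the bounds check that surfaces as the IndexError in the report.
        if shard_id >= len(self.data):
            raise IndexError(
                f"index {shard_id} is out of bounds for dimension 0 "
                f"with size {len(self.data)}"
            )
        self.data[shard_id] = value


def load_fused_input_scales(num_slots, shard_scales):
    """Load one scale per logical shard (e.g. gate_proj, then up_proj)."""
    param = ToyScaleParameter(num_slots)
    for shard_id, scale in enumerate(shard_scales):
        param.load_into_shard(shard_id, scale)
    return param.data


# Illustrative values only: gate_proj.input_scale and up_proj.input_scale.
gate_up_scales = [0.02, 0.05]

# One slot (the allocation described for main): the second shard does not fit.
try:
    load_fused_input_scales(1, gate_up_scales)
    one_slot_error = None
except IndexError as exc:
    one_slot_error = str(exc)

# One slot per logical shard (the allocation described for the fix).
per_shard = load_fused_input_scales(len(gate_up_scales), gate_up_scales)
```

With a single slot, shard 0 loads and shard 1 raises; with one slot per shard, both values land in their own position.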

Modifications

  • Allocate the static input_scale and input_zero_point with len(output_partition_sizes) slots, like weight_scale, so every logical shard of a fused layer loads.
  • For the TENSOR strategy, convert the per-tensor weight scale to channelwise (convert_to_channelwise) so the int8 GEMM kernel receives one scales_b per output channel; otherwise forward fails with RuntimeError: size of scales_b is not matched. The activation input_scale stays scalar.
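As a rough illustration of the second bullet, the sketch below expands one weight scale per logical shard into one scale per output channel, which is the shape the int8 GEMM is said to expect for scales_b. The helper name `expand_scales_to_channelwise` is hypothetical and uses plain lists; the real convert_to_channelwise operates on tensors and may differ in signature and details.

```python
def expand_scales_to_channelwise(shard_scales, output_partition_sizes):
    """Repeat each per-shard scale across that shard's output channels.

    Hypothetical helper for illustration; not the sglang implementation.
    """
    if len(shard_scales) != len(output_partition_sizes):
        raise ValueError("expected exactly one scale per logical shard")
    channel_scales = []
    for scale, width in zip(shard_scales, output_partition_sizes):
        channel_scales.extend([scale] * width)
    return channel_scales


# Two fused shards of 4 output channels each (made-up scale values).
scales_b = expand_scales_to_channelwise([0.5, 0.25], [4, 4])
```

Here `scales_b` has 8 entries, matching the 8 output channels of the fused layer, instead of the 2 per-shard values that would trip a per-channel size check.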

Testing

Real-engine E2E on a single GPU, loading the public nm-testing/tinyllama-w8a8-compressed checkpoint.

Reproduction (env, script, before/after)
# quant_engine_matrix.py
import sglang as sgl

engine = sgl.Engine(
    model_path="nm-testing/tinyllama-w8a8-compressed",
    dtype="float16",
    mem_fraction_static=0.28,
    disable_cuda_graph=True,
    attention_backend="triton",
    sampling_backend="pytorch",
    skip_server_warmup=True,
)
out = engine.generate("The capital of France is",
                      {"temperature": 0, "max_new_tokens": 8})
print(out[0]["text"])
engine.shutdown()

main:  IndexError: index 1 is out of bounds for dimension 0 with size 1
       (parameter.py _load_into_shard_id, model fails to load)
fix:   status ok, generates " Paris.\n\n### 1"

negative control nm-testing/tinyllama-w4a16-compressed (unaffected path):
       loads and generates on both main and fix

A TestW8A8Int8StaticInputScale case was added to test/registered/quant/test_quant_config_parsing.py (CPU): calling create_weights with output_partition_sizes=[4, 4] must allocate input_scale and input_zero_point with shape (2,). The test fails on main (torch.Size([1]) != torch.Size([2])) and passes with the fix. ruff, black, and isort are clean.
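The shape of that regression check can be sketched with the standard library alone. This is not the test added in the PR: `allocate_static_input_params` is a hypothetical stand-in for the allocation done in create_weights, and list lengths stand in for tensor shapes.

```python
import unittest


def allocate_static_input_params(output_partition_sizes):
    # Hypothetical stand-in for the allocation in create_weights:
    # one slot per logical shard, mirroring how weight_scale is sized.
    num_shards = len(output_partition_sizes)
    return {
        "input_scale": [0.0] * num_shards,
        "input_zero_point": [0] * num_shards,
    }


class ToyStaticInputScaleTest(unittest.TestCase):
    def test_one_slot_per_logical_shard(self):
        params = allocate_static_input_params([4, 4])
        # Two logical shards must yield two slots, not one.
        self.assertEqual(len(params["input_scale"]), 2)
        self.assertEqual(len(params["input_zero_point"]), 2)
```

Saved as a module, the case can be run with `python -m unittest`; an allocation that always returned a single slot would fail both assertions.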

Checklist

  • Format your code according to "Format code with pre-commit".
  • Add unit tests according to "Run and add unit tests".

CI States

Latest PR Test (Base): ❌ Run #28302208073
Latest PR Test (Extra): ❌ Run #28302208028

…ales

CompressedTensorsW8A8Int8 allocated the static input_scale and
input_zero_point as one-slot PerTensorScaleParameters. A fused layer
(e.g. gate_up_proj) has one activation scale per logical shard, so
loading the second shard raised "IndexError: index 1 is out of bounds
for dimension 0 with size 1" and the model failed to load. Allocate one
slot per logical shard like weight_scale, and convert the per-tensor
weight scale to channelwise so the int8 GEMM kernel gets one scale per
output channel.

Signed-off-by: Ting Sun <suntcrick@gmail.com>
@gemini-code-assist

Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions Bot added the quant LLM Quantization label Jun 27, 2026