fix(quant): load W8A8 int8 checkpoints with per-shard static input scales by Sunt-ing · Pull Request #29530 · sgl-project/sglang

Sunt-ing · 2026-06-27T21:22:24Z

Motivation

Loading a compressed-tensors W8A8 int8 checkpoint with static activation scales (e.g. nm-testing/tinyllama-w8a8-compressed) crashes during model load:

IndexError: index 1 is out of bounds for dimension 0 with size 1

CompressedTensorsW8A8Int8.create_weights allocates the static input_scale / input_zero_point as one-slot PerTensorScaleParameters. A fused layer such as Llama gate_up_proj has separate gate_proj.input_scale and up_proj.input_scale shards; the first loads into shard 0, the second into shard 1, but the parameter has length 1, so the load fails.
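The failure mode can be illustrated without the engine. This is a minimal pure-Python sketch, not sglang's code: `ToyPerShardScale`, `load_shard`, and `load_fused_input_scales` are hypothetical stand-ins for the per-tensor scale parameter and its shard loader, a plain list replaces the tensor (so the error text differs from PyTorch's), and the scale values are made up.

```python
class ToyPerShardScale:
    """Hypothetical stand-in for a per-tensor scale parameter with N slots."""

    def __init__(self, num_slots: int) -> None:
        self.data = [0.0] * num_slots

    def load_shard(self, shard_id: int, value: float) -> None:
        # Each logical shard of a fused layer writes into its own slot.
        self.data[shard_id] = value


def load_fused_input_scales(param: ToyPerShardScale, shard_scales: list) -> list:
    """Load one scale per logical shard, in checkpoint order."""
    for shard_id, scale in enumerate(shard_scales):
        param.load_shard(shard_id, scale)
    return param.data


# gate_up_proj fuses two logical shards: gate_proj and up_proj.
output_partition_sizes = [4, 4]
checkpoint_scales = [0.02, 0.05]  # illustrative values, not from the checkpoint

# One slot, two shards: the second write goes out of bounds.
try:
    load_fused_input_scales(ToyPerShardScale(1), checkpoint_scales)
    one_slot_ok = True
except IndexError:
    one_slot_ok = False

# One slot per logical shard: both scales load.
per_shard = load_fused_input_scales(
    ToyPerShardScale(len(output_partition_sizes)), checkpoint_scales
)

print(one_slot_ok)  # False
print(per_shard)    # [0.02, 0.05]
```

Sizing the parameter from `len(output_partition_sizes)` ties the slot count to the number of fused shards instead of assuming a single shard.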

Modifications

Allocate the static input_scale and input_zero_point with len(output_partition_sizes) slots, like weight_scale, so every logical shard of a fused layer loads.
For the TENSOR strategy, convert the per-tensor weight scale to channelwise (convert_to_channelwise) so the int8 GEMM kernel receives one scales_b per output channel; otherwise forward fails with RuntimeError: size of scales_b is not matched. The activation input_scale stays scalar.
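The per-tensor to channelwise conversion can be sketched as follows. This is a pure-Python illustration of the idea only, under the assumption that each logical shard carries one scalar weight scale. `to_channelwise` is a hypothetical helper, not the `convert_to_channelwise` function used in the PR, which operates on tensors and may have a different signature.

```python
def to_channelwise(shard_scales: list, logical_widths: list) -> list:
    """Repeat each shard's scalar scale once per output channel of that shard."""
    if len(shard_scales) != len(logical_widths):
        raise ValueError("expected one scale per logical shard")
    channel_scales = []
    for scale, width in zip(shard_scales, logical_widths):
        channel_scales.extend([scale] * width)
    return channel_scales


# Two fused shards with 4 output channels each (illustrative values).
weight_scale = [0.5, 0.25]
logical_widths = [4, 4]

scales_b = to_channelwise(weight_scale, logical_widths)

print(len(scales_b))  # 8, one scale per output channel
print(scales_b)       # [0.5, 0.5, 0.5, 0.5, 0.25, 0.25, 0.25, 0.25]
```

The expanded list has `sum(logical_widths)` entries, which is the shape a kernel expecting one `scales_b` value per output channel would need.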

Testing

Real-engine E2E on a single GPU, loading the public nm-testing/tinyllama-w8a8-compressed checkpoint.

Reproduction (env, script, before/after)

# quant_engine_matrix.py
import sglang as sgl

engine = sgl.Engine(
    model_path="nm-testing/tinyllama-w8a8-compressed",
    dtype="float16",
    mem_fraction_static=0.28,
    disable_cuda_graph=True,
    attention_backend="triton",
    sampling_backend="pytorch",
    skip_server_warmup=True,
)
out = engine.generate("The capital of France is",
                      {"temperature": 0, "max_new_tokens": 8})
print(out[0]["text"])
engine.shutdown()

main:  IndexError: index 1 is out of bounds for dimension 0 with size 1
       (parameter.py _load_into_shard_id, model fails to load)
fix:   status ok, generates " Paris.\n\n### 1"

negative control nm-testing/tinyllama-w4a16-compressed (unaffected path):
       loads and generates on both main and fix

A TestW8A8Int8StaticInputScale case added to test/registered/quant/test_quant_config_parsing.py (CPU): create_weights with output_partition_sizes=[4, 4] must allocate input_scale / input_zero_point of shape (2,). The test fails on main (torch.Size([1]) != torch.Size([2])) and passes with the fix. ruff/black/isort clean.

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.

CI States

Latest PR Test (Base): ❌ Run #28302208073
Latest PR Test (Extra): ❌ Run #28302208028

…ales

CompressedTensorsW8A8Int8 allocated the static input_scale and input_zero_point as one-slot PerTensorScaleParameters. A fused layer (e.g. gate_up_proj) has one activation scale per logical shard, so loading the second shard raised "IndexError: index 1 is out of bounds for dimension 0 with size 1" and the model failed to load.

Allocate one slot per logical shard like weight_scale, and convert the per-tensor weight scale to channelwise so the int8 GEMM kernel gets one scale per output channel.

Signed-off-by: Ting Sun <suntcrick@gmail.com>

gemini-code-assist · 2026-06-27T21:22:28Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Sunt-ing requested review from AniZpZ, BBuf, Edwardf0t1, FlamingoPg, HaiShaw, b8zhong and ch-wan as code owners June 27, 2026 21:22

github-actions Bot added the quant LLM Quantization label Jun 27, 2026

Sunt-ing wants to merge 1 commit into sgl-project:main from Sunt-ing:6
