v1.2 Quantization

Goal

Support at least one quantized model loading path.

Why

Serving cost and GPU memory pressure are central production constraints. Quantized weights take less memory, which lets a model fit on fewer or smaller GPUs.

Scope

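one quantization format as the first option: int8, fp8, AWQ, GPTQ, or another practical choice
memory benchmark against the non-quantized baseline
latency benchmark against the non-quantized baseline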
compatibility notes
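
To set expectations before running the memory benchmark, weight memory can be estimated from parameter count and bits per weight. The sketch below is a back-of-envelope calculation only. The function names and the 7-billion-parameter figure are hypothetical, and the estimate ignores quantization metadata (scales, zero points), activations, and the KV cache, so measured usage will be higher.

```python
# Hypothetical helpers for a rough estimate; not part of any existing codebase.


def estimate_weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Estimate the bytes needed to store model weights alone.

    Ignores quantization metadata, activations, and the KV cache.
    """
    if num_params <= 0 or bits_per_weight <= 0:
        raise ValueError("num_params and bits_per_weight must be positive")
    # Round up to a whole number of bytes.
    return (num_params * bits_per_weight + 7) // 8


def savings_vs_baseline(num_params: int, quant_bits: int, baseline_bits: int = 16) -> float:
    """Fraction of weight memory saved relative to the baseline precision."""
    baseline = estimate_weight_bytes(num_params, baseline_bits)
    quantized = estimate_weight_bytes(num_params, quant_bits)
    return 1.0 - quantized / baseline


if __name__ == "__main__":
    params = 7_000_000_000  # illustrative model size, an assumption
    for label, bits in [("16-bit baseline", 16), ("int8 / fp8", 8), ("4-bit", 4)]:
        gib = estimate_weight_bytes(params, bits) / 2**30
        saved = savings_vs_baseline(params, bits)
        print(f"{label:>16}: {gib:6.1f} GiB weights, {saved:.0%} saved")
```

The estimate is useful as a sanity check: if the measured saving is far below the estimate, part of the model is probably still loaded in the baseline precision.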

Out Of Scope

implementing a quantization algorithm from scratch
supporting every quantization format

Acceptance Criteria

A quantized model can be loaded and served.
Memory use is compared with the non-quantized baseline.
Latency impact is measured against the non-quantized baseline.
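
One way the memory and latency criteria could be checked is with a small harness that times a request callable and reports quantized-to-baseline ratios. This is a minimal sketch under assumptions: `measure_latency`, `compare`, and `LatencyStats` are hypothetical names, the callable stands in for whatever sends one request to the server, and the memory figures are supplied by the caller (on a PyTorch/CUDA stack, for example, they might come from `torch.cuda.max_memory_allocated()`).

```python
import math
import statistics
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class LatencyStats:
    median_ms: float
    p95_ms: float


def measure_latency(run_once: Callable[[], object], warmup: int = 2, iters: int = 20) -> LatencyStats:
    """Time `run_once` over `iters` calls after `warmup` untimed calls."""
    if iters < 1:
        raise ValueError("iters must be at least 1")
    for _ in range(warmup):
        run_once()  # untimed, lets caches and lazy initialization settle
    samples: List[float] = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank 95th percentile, clamped to the last sample.
    p95_index = min(len(samples) - 1, math.ceil(0.95 * len(samples)) - 1)
    return LatencyStats(statistics.median(samples), samples[p95_index])


def compare(baseline_bytes: int, quantized_bytes: int,
            baseline: LatencyStats, quantized: LatencyStats) -> Dict[str, float]:
    """Return quantized/baseline ratios; values below 1.0 mean the quantized path is better."""
    return {
        "memory_ratio": quantized_bytes / baseline_bytes,
        "median_latency_ratio": quantized.median_ms / baseline.median_ms,
        "p95_latency_ratio": quantized.p95_ms / baseline.p95_ms,
    }
```

Both variants should be measured with the same prompts, batch size, and hardware, otherwise the ratios are not comparable.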

Progress