v1.2 Quantization

Goal

Support at least one quantized model loading path.

Why

Serving cost and GPU memory pressure are central production constraints. Quantized weights take less memory per parameter, which can ease both.

Scope

  • int8, fp8, AWQ, GPTQ, or another practical first option
  • memory benchmark
  • latency benchmark
  • compatibility notes
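
As a rough starting point for the memory benchmark, weight storage scales with bits per parameter. The sketch below is only an estimate under stated assumptions: the bit widths are nominal (AWQ and GPTQ are assumed here to be their common 4-bit variants), the 7B parameter count and the function name `estimate_weight_bytes` are hypothetical, and quantization metadata (scales, zero points), activations, and KV cache are ignored. A real benchmark should measure actual device memory instead.

```python
# Nominal bits per weight. These are illustrative assumptions, not measurements.
NOMINAL_BITS = {"fp16": 16, "int8": 8, "fp8": 8, "awq-4bit": 4, "gptq-4bit": 4}


def estimate_weight_bytes(num_params: int, fmt: str) -> int:
    """Approximate weight storage in bytes, ignoring quantization metadata."""
    return num_params * NOMINAL_BITS[fmt] // 8


if __name__ == "__main__":
    params = 7_000_000_000  # hypothetical 7B-parameter model
    baseline = estimate_weight_bytes(params, "fp16")
    for fmt in NOMINAL_BITS:
        size = estimate_weight_bytes(params, fmt)
        print(f"{fmt:>10}: {size / 1e9:5.1f} GB ({size / baseline:.0%} of fp16)")
```

The estimate is useful mainly as a sanity check: if measured memory for a quantized model is far above this figure, the gap is likely activations, cache, or weights that were left unquantized.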

Out Of Scope

  • implementing a quantization algorithm from scratch
  • supporting every quantization format

Acceptance Criteria

  • A quantized model can be loaded and served.
  • Memory use is compared with the non-quantized baseline.
  • Latency impact is measured against the same non-quantized baseline.
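
One way the latency criterion could be checked is with a small timing harness run once against the baseline and once against the quantized model. This is a minimal sketch, not the project's benchmark: `measure_latency` and `compare` are hypothetical names, and the callable passed in stands for whatever single-request function the server exposes.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(fn: Callable[[], object], warmup: int = 3, iters: int = 20) -> Dict[str, float]:
    """Time repeated calls to fn and return latency statistics in milliseconds."""
    # Warm-up calls are discarded so one-time setup cost does not skew results.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank style index, clamped to the last sample.
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "mean_ms": statistics.mean(samples),
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
    }


def compare(baseline: Dict[str, float], quantized: Dict[str, float]) -> Dict[str, float]:
    """Ratio of quantized to baseline per statistic; below 1.0 means faster."""
    return {key: quantized[key] / baseline[key] for key in baseline}


if __name__ == "__main__":
    # Stand-in workloads; replace with real baseline and quantized request functions.
    base = measure_latency(lambda: sum(range(200_000)))
    quant = measure_latency(lambda: sum(range(100_000)))
    print(compare(base, quant))
```

When the callable runs on a GPU, it should block until the work has finished (for example by synchronizing the device) before returning, otherwise the timer mostly captures launch overhead.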

Progress

  • Choose the first quantization path.
  • Add model loading option.
  • Add compatibility checks.
  • Add memory benchmark.
  • Add latency benchmark.
  • Document cost and quality tradeoffs.
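
The loading option and compatibility check items could take a shape like the following. This is a hedged sketch only: `ModelLoadConfig`, `check_compatibility`, the `device_features` list, and the choice of `int8` as the single supported path are all hypothetical placeholders, since the first quantization path has not been chosen here.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical set of supported paths; the real first choice is still open.
SUPPORTED_QUANTIZATION = {"int8"}


@dataclass
class ModelLoadConfig:
    model_path: str
    quantization: Optional[str] = None  # None selects the non-quantized baseline


def check_compatibility(config: ModelLoadConfig, device_features: List[str]) -> List[str]:
    """Return human-readable problems; an empty list means the config is usable."""
    problems: List[str] = []
    quant = config.quantization
    if quant is None:
        return problems  # baseline path needs no quantization checks
    if quant not in SUPPORTED_QUANTIZATION:
        problems.append(
            f"unsupported quantization {quant!r}; supported: {sorted(SUPPORTED_QUANTIZATION)}"
        )
    elif quant not in device_features:
        problems.append(f"device does not report support for {quant!r}")
    return problems


if __name__ == "__main__":
    config = ModelLoadConfig(model_path="/path/to/model", quantization="int8")
    print(check_compatibility(config, device_features=["int8"]))
```

Returning a list of problems rather than raising on the first one lets the same check feed both a startup error and the compatibility notes.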