Problem / Motivation
Currently, dequantized vectors are scaled by the stored original norm. But quantization introduces error, so the reconstructed vector's norm differs from the original. Applying original_norm / ||reconstruction|| as a correction factor restores the correct magnitude.
This correction is effectively free at decode time (the reconstruction norm can be accumulated during dequantization, over values we are already touching) and improves perplexity.
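To see why the norms diverge, here is a toy round-trip through a uniform 4-bit quantizer (an illustration only, not the turboquant codec): rounding each component perturbs the vector, so its L2 norm no longer matches the original.

```rust
// Toy illustration: uniform signed 4-bit quantization changes the
// L2 norm of a vector. This is NOT the turboquant codec, just a
// minimal demonstration of the reconstruction-norm drift.
fn l2_norm(v: &[f32]) -> f32 {
    v.iter().map(|x| x * x).sum::<f32>().sqrt()
}

fn main() {
    let original = [0.93_f32, -0.41, 0.17, -0.66];
    // Map the max absolute value onto the signed 4-bit range [-8, 7].
    let scale = 0.93_f32 / 7.0;

    // Quantize (round to nearest code) then dequantize.
    let reconstruction: Vec<f32> = original
        .iter()
        .map(|x| (x / scale).round().clamp(-8.0, 7.0) * scale)
        .collect();

    let orig_norm = l2_norm(&original);
    let recon_norm = l2_norm(&reconstruction);
    // The two norms differ; original_norm / recon_norm is the
    // correction factor this issue proposes to apply.
    println!("original: {orig_norm:.4}, reconstruction: {recon_norm:.4}");
}
```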
Current code: turboquant/src/cache/quantize_tensor.rs:217-219 — no correction applied
Solution
After dequantization (codebook lookup + inverse WHT + scale):
- Compute ||reconstruction||, the L2 norm of the reconstructed vector
- Apply the correction: result *= original_norm / reconstruction_norm
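The two steps above could be sketched as follows. This is a hedged sketch of the proposed fix, not the actual code at quantize_tensor.rs:217-219; the function and parameter names (apply_norm_correction, original_norm) are hypothetical.

```rust
/// Rescale a reconstructed vector so its L2 norm matches the stored
/// original norm. Sketch of the proposed correction; names are
/// hypothetical, not turboquant's real API.
fn apply_norm_correction(reconstruction: &mut [f32], original_norm: f32) {
    // The reconstruction norm is a cheap reduction over values the
    // dequant loop already touches, hence "zero cost at decode".
    let reconstruction_norm = reconstruction
        .iter()
        .map(|x| x * x)
        .sum::<f32>()
        .sqrt();

    // Guard against a degenerate (all-zero) reconstruction.
    if reconstruction_norm > f32::EPSILON {
        let correction = original_norm / reconstruction_norm;
        for x in reconstruction.iter_mut() {
            *x *= correction;
        }
    }
}
```

In a real implementation the squared-sum accumulation would likely be fused into the existing dequant loop rather than done as a second pass.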
Key files
- turboquant/src/cache/quantize_tensor.rs:217-219 — apply the correction here
- turboquant/src/cache/cuda/kernels/tq_dequant_kernel.cu — update the CUDA dequant kernel to match
Acceptance criteria
- cargo fmt --check clean