Complete per-module reference: flag, problem statement, and benchmark numbers. All benchmarks are CPU/numpy micro-benchmarks on Apple Silicon M-series unless otherwise noted. For end-to-end hardware benchmarks, run:

```
python3 dev/benchmarks/bench_eoe.py
```
| Tier | Waves | Description |
|---|---|---|
| Stable | 1–12 | Core compression, KV cache, speculative decode, AWQ, prefill/radix cache. Validated on hardware. |
| Beta | 13–18 | Advanced KV compression, additional spec-decode variants, attention architectures. Functionally complete, hardware validation in progress. |
| Experimental | 19–26 | Cutting-edge attention, extended quantisation, long-context, production reliability. Proof-of-concept; may change. |
Flags for Experimental modules are labelled [Experimental] in `squish serve --help`.
Enable modules with `squish run --model <name> [flags]`:
| Module | Flag | Effect | Overhead |
|---|---|---|---|
| PM-KVQ | `--pm-kvq` | 4.2× KV cache compression at 4096 tokens | 14 µs/step |
| MixKVQ | `--mix-kvq` | 3.9× KV memory compression · 4.1 avg bits/channel | 72 µs/step |
| CocktailKV | `--cocktail-kv` | ~3× KV memory compression · chunk-similarity routing | 895 µs/512-tok |
| MiLo INT3 | `--milo` | 5.3× weight compression · SNR > 13 dB | one-time convert |
| AgileIO | `--agile-io` | 40–60% I/O latency reduction · 25× warm-cache reads | ≈ 0 |
| SageAttn | `--sage-attention` | 2.1× attention speedup (INT8 QK^T) | ≈ 0 |
| SpargeAttn | `--sparge-attn` | 2.5–5× attention speedup (sparse blocks) | ≈ 0 |
Full stack:

```
squish run qwen3:8b \
  --pm-kvq --mix-kvq --cocktail-kv \
  --agile-io --milo \
  --sage-attention --sparge-attn
```

Benchmark results: docs/benchmark_wave12.md
Raw data: dev/results/wave12_bench.json
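The INT8 QK^T idea behind `--sage-attention` reduces the attention score matmul to an integer product plus one dequantisation step. A minimal numpy sketch of per-tensor symmetric quantisation (illustrative only, not the shipped kernel, which uses finer-grained scales):

```python
import numpy as np

def int8_qkt(q, k):
    """Symmetric per-tensor INT8 quantisation of Q and K, then integer QK^T.

    The int32 product is exact; the two scales dequantise it back to float.
    """
    sq = np.abs(q).max() / 127.0
    sk = np.abs(k).max() / 127.0
    qi = np.clip(np.round(q / sq), -127, 127).astype(np.int8)
    ki = np.clip(np.round(k / sk), -127, 127).astype(np.int8)
    scores = qi.astype(np.int32) @ ki.astype(np.int32).T  # exact int32 matmul
    return scores.astype(np.float32) * (sq * sk)          # dequantise

rng = np.random.default_rng(0)
q = rng.standard_normal((32, 64)).astype(np.float32)
k = rng.standard_normal((32, 64)).astype(np.float32)
ref = q @ k.T
approx = int8_qkt(q, k)
print(f"max relative error: {np.abs(approx - ref).max() / np.abs(ref).max():.4f}")
```

SageAttention proper adds per-block scales and a smoothing step for K; the per-tensor version above only illustrates the integer-matmul structure that makes the kernel bandwidth-friendly.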
v3 (Wave 13) focuses on ultra-long context (128K+ tokens) and adaptive speculative decoding, shipping 9 new modules:
| Module | Flag | Problem Solved | Key Number |
|---|---|---|---|
| DuoAttention | `--duo-attention` | Long-context KV blowup: separates the 30–40% of heads that act as retrieval heads from streaming heads | ~2× KV memory saved at 32K tokens |
| ShadowKV | `--shadow-kv` | 128K+ KV cache → CPU offload with low-rank pre-RoPE key projection | 6–10× KV compression on long contexts |
| PQCache | `--pq-cache` | ANN-based KV retrieval for retrieval heads via product quantisation | 4–8× key memory · sub-ms lookup |
| SpeCache | `--spe-cache` | Multi-turn KV reload stalls: speculatively prefetches prior-turn KV | 40–60% KV reload latency eliminated |
| DuoDecoding | `--duo-decoding` | Fixed draft-sequence count wastes ANE cycles on M3 | 1.5–2.3× decode throughput |
| KnapSpec | `--knapspec` | Choosing which layers to skip for self-speculative decoding is NP-hard | Optimal skip schedule in O(NL) |
| Token Merging | `--token-merging` | Similar tokens waste prefill FLOPs | 1.4–1.8× prefill speedup |
| TokenSwift | `--token-swift` | Long outputs (20K–100K tokens) hit the KV bandwidth ceiling | 2–3× throughput on ultra-long generation |
| C2T | `--c2t` | Uniform draft tree wastes budget at confident positions | +0.8 tokens/step accepted |
Full stack:

```
squish run qwen3:8b \
  --duo-attention --shadow-kv --pq-cache --spe-cache \
  --duo-decoding --knapspec --token-merging \
  --token-swift --c2t
```

v3 (Wave 14) focuses on quantisation methods, vocabulary-adaptive speculative decoding, and expert mixing, shipping 10 new modules:
| Module | Flag | Problem Solved | Key Number |
|---|---|---|---|
| SubSpec | `--sub-spec` | Offloaded models have no fast draft path | Quantised substitute layers as draft → full spec-decode |
| DFloat11 | `--dfloat11` | BF16 weights have poor entropy-coding compressibility | DFloat11 block-float: >30% size reduction vs BF16 |
| rANS Codec | `--rans-codec` | Huffman coding leaves 5–15% entropy on the table | rANS → near-optimal entropy coding for KV/weights |
| QSpec | `--qspec` | Draft and verify share the same quantisation level | W4A8 draft / W4A16 verify → 1.8× throughput |
| QuantSpec | `--quant-spec` | Full-precision draft is slow; quantised draft is inaccurate | Bit-width selection per draft step → 98% accuracy |
| CopySpec | `--copy-spec` | Spec-decode needs a trained draft model | Copy from history buffer — zero extra model |
| SqueezeLLM | `--squeeze-llm` | Uniform quantisation crushes outlier weights | Sparse + dense mixed-precision: 4× smaller, 0.5 ppl loss |
| NF4 Quant | `--nf4-quant` | Uniform quantisation misaligns to weight distributions | Normal Float 4-bit levels → best quality per bit for LLMs |
| SpinQuant | `--spin-quant` | Weight outliers defeat quantisation | Hadamard rotation → 1.5 ppl improvement at INT4 |
| HeadInfer | `--head-infer` | Uniform KV policy wastes memory on non-retrieval heads | Head-type-aware KV store: retrieval vs. streaming |
Full stack:

```
squish run qwen3:8b \
  --squeeze-llm --nf4-quant --spin-quant \
  --copy-spec --sub-spec \
  --qspec --quant-spec \
  --dfloat11 --rans-codec --head-infer
```

Benchmark results: docs/benchmark_wave13_14.md
Raw data: dev/results/wave13_14_bench.json
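The Normal Float idea behind `--nf4-quant` places the 16 levels at equally probable quantiles of a Gaussian, matching the empirical distribution of LLM weights. A numpy sketch — the levels here are estimated from Gaussian samples rather than taken from the published NF4 table, and block size and double quantisation are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

# 16 levels at equally-likely quantiles of N(0, 1), rescaled to [-1, 1].
# (Approximated from samples; the real NF4 table is derived analytically.)
probs = (np.arange(16) + 0.5) / 16
levels = np.quantile(rng.standard_normal(1_000_000), probs)
levels /= np.abs(levels).max()

def nf4_quantize(w):
    """Absmax-scale a weight block, then snap each value to the nearest level."""
    scale = np.abs(w).max()
    idx = np.abs(w[:, None] / scale - levels[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale

def nf4_dequantize(idx, scale):
    return levels[idx] * scale

w = rng.standard_normal(4096).astype(np.float32)
idx, scale = nf4_quantize(w)
w_hat = nf4_dequantize(idx, scale)
print(f"levels: {len(levels)}, rms error: {np.sqrt(np.mean((w - w_hat) ** 2)):.4f}")
```

Because each level is equally likely under a Gaussian, every 4-bit code carries close to its full information content, which is the "best quality per bit" claim in the table.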
v4 (Wave 15) focuses on SLO-aware inference scheduling, confidence-gated verification, and KV architecture evolution, shipping 10 new modules:
| Module | Flag | Problem Solved | Key Number |
|---|---|---|---|
| AdaServe | `--ada-serve` | Fixed gamma wastes draft budget on low-SLO requests | 30% P99 latency ↓ · 1.5–2× throughput |
| ConfSpec | `--conf-spec` | Always running full verification wastes compute | 54% verification cost ↓ via confidence gating |
| SeqPacking | `--seq-packing` | Varying sequence lengths cause barrel-effect padding waste | +1.8× batch throughput |
| MetaReasoner | `--meta-reasoner` | CoT thinking on every token wastes energy on easy prompts | 44–89% CoT energy saved |
| YOCO | `--yoco` | Cross-decoder layers duplicate KV across decoding passes | −50% KV memory via shared cross-decoder KV |
| CLA | `--cla` | Adjacent transformer layers learn nearly identical KV | 10–30% KV reduction via cross-layer sharing schedule |
| KVSharer | `--kv-sharer` | No data-driven way to measure actual KV layer redundancy | ~30% KV ops saved via calibration-based share map |
| DiffKV | `--diffkv` | Uniform K/V precision ignores asymmetric sensitivity | 2.7–5.7× KV compression · 1.9–5.4× throughput |
| ParisKV | `--paris-kv` | Online KV quantisation codebooks drift without correction | 4× KV compression with drift-robust adaptation |
| KVTuner | `--kvtuner` | Naive mixed-precision quant loses 20–35% accuracy | 20–35% accuracy restored vs uniform quant |
Full stack:

```
squish run qwen3:8b \
  --ada-serve --conf-spec --seq-packing --meta-reasoner \
  --yoco --cla --kv-sharer \
  --diffkv --paris-kv --kvtuner
```

v4 (Wave 16) focuses on heterogeneous CPU+GPU execution, pipelined weight offloading, and advanced speculative decoding, shipping 10 new modules:
| Module | Flag | Problem Solved | Key Number |
|---|---|---|---|
| Dovetail | `--dovetail` | CPU idle during GPU draft wastes heterogeneous compute | 2× throughput via concurrent CPU verify + GPU draft |
| PIPO | `--pipo` | Sequential weight offload causes GPU idle stalls | +1.7× throughput via pipelined prefetch overlap |
| MobileMoE | `--mobile-moe` | MoE expert dispatch ignores device balance | +1.4× throughput via balanced layer-expert routing |
| OnlineSD | `--online-sd` | Frozen draft heads degrade after fine-tuning | +5–8 pp acceptance rate via continuous adaptation |
| LookaheadReasoning | `--lookahead-reasoning` | Sequential reasoning steps serialise all verification | +2.1× throughput via parallel step verification |
| SparseSpec | `--sparse-spec` | Static speculation ignores dynamic attention patterns | +2.13× spec throughput via adaptive pillar cache |
| FRSpec | `--fr-spec` | Full-vocab draft head is expensive at inference | −13% draft latency via frequency-ranked subset head |
| LongSpec | `--long-spec` | Draft KV grows with context → memory ceiling for long gen | Zero draft KV overhead via shared-KV draft head |
| ForeLen | `--forelen` | Output length prediction is inaccurate, causing early truncation | −29% MAE vs TRAIL baseline |
| RASD | `--rasd` | Draft models unfamiliar with corpus vocab fail spec decode | 40–60% hit rate via retrieval-augmented draft tree |
Full stack:

```
squish run qwen3:8b \
  --dovetail --pipo \
  --mobile-moe --online-sd \
  --lookahead-reasoning --sparse-spec \
  --fr-spec --long-spec \
  --forelen --rasd
```

Benchmark results: docs/benchmark_wave15_16.md
Raw data: dev/results/wave15_16_bench.json
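The padding waste that `--seq-packing` targets comes from batching sequences of different lengths up to the longest one. A first-fit-decreasing packer illustrates the fix (a common heuristic; whether the module uses exactly this strategy is an assumption):

```python
def pack_sequences(lengths, capacity):
    """First-fit decreasing: place each sequence in the first bin with room.

    Returns a list of bins, each a list of the original sequence indices.
    """
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    bins, loads = [], []
    for i in order:
        for b, load in enumerate(loads):
            if load + lengths[i] <= capacity:
                bins[b].append(i)
                loads[b] += lengths[i]
                break
        else:                       # no existing bin fits: open a new one
            bins.append([i])
            loads.append(lengths[i])
    return bins

lengths = [900, 700, 500, 300, 200, 100]
print(pack_sequences(lengths, capacity=1024))  # → [[0, 5], [1, 3], [2, 4]]
```

For the six lengths above, packing needs 3 bins of 1024 tokens instead of a 6 × 900 padded batch, which is where throughput gains of the quoted magnitude come from.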
v5 (Wave 17) focuses on INT4/INT8 attention kernels, slab-allocated KV storage, joint 2D KV budget management, and context-aware speculative prefetching, shipping 14 new modules:
| Module | Flag | Problem Solved | Key Number |
|---|---|---|---|
| SageAttention2 | `--sage-attn2` | Full-precision attention is bandwidth-bound for long sequences | INT4/INT8 warp-tile quantisation · 672 µs forward (4h/seq32/d64) |
| StreamingSink | `--streaming-sink` | Unbounded KV growth at long contexts | Attention-sink eviction — bounded memory at any context length |
| KVSlab | `--kv-slab` | Per-token malloc/free causes fragmentation under scale | Pre-allocated slab allocator · 0.87 µs alloc+free round-trip |
| SqueezeAttention | `--squeeze-attn` | Independent token/layer KV compression compounds quality loss | Joint 2D Pareto-optimal budget allocation across both axes |
| SmallKV | `--small-kv` | Aggressive KV compression degrades small-model quality | Saliency-compensated recall · 39 µs ingest · 8 µs check-and-recall |
| SpeContext | `--spe-context` | Speculative decode wastes context retrieval at each draft step | Cosine-similarity context cache · 3.3 ms retrieve top-32 |
| SVDq | `--svdq` | Uniform K quantisation ignores per-head SVD structure | Head-wise mixed-precision K search · 62 ms one-time calibration |
| CommVQ | `--comm-vq` | Per-layer VQ codebooks waste memory building near-identical codebooks | Shared communal codebook · 55 µs encode · 68 µs decode |
| ChunkedPrefill | `--chunked-prefill` | Long prefills block decoding requests for the full context length | Interleaved chunked prefill — bounded latency per chunk |
| GemFilter | `--gemfilter` | KV eviction without attention-score feedback drops important tokens | Top-K attention-score selector · 0.90× cR · 50 µs select |
| MInferencePatch | `--minference` | Full O(n²) attention is infeasible for 1M+ token contexts | Dynamic sparse patterns — sub-quadratic attention at ultra-long context |
| PromptCompressor | `--prompt-compress` | Long system prompts and RAG context waste prefill FLOPs | TF-IDF sentence-level compression · 686 µs for 50-sentence input |
| PromptLookup | `--prompt-lookup` | No-draft-model baseline has no spec-decode path | N-gram copy speculation from prompt · 0.8 µs find · 3.3 µs push |
| TRAIL | `--trail` | Output-length prediction is too slow for real-time SRPT scheduling | Linear-probe predictor · 10 µs predict · feeds SRPT priority queue |
Full stack:

```
squish run qwen3:8b \
  --sage-attn2 --streaming-sink --kv-slab \
  --squeeze-attn --small-kv --spe-context \
  --svdq --comm-vq --chunked-prefill \
  --gemfilter --minference \
  --prompt-compress --prompt-lookup --trail
```

v5 (Wave 18) focuses on vector-product quantisation, confidence-gated early exit, online domain adaptation, and energy-aware scheduling, shipping 10 new modules:
| Module | Flag | Problem Solved | Key Number |
|---|---|---|---|
| VPTQ | `--vptq` | Scalar quantisation loses intra-vector correlations | Vector-product tree quant · 15 µs decode · 133 ms one-time compress |
| LayerSkip | `--layer-skip` | All tokens pass through all layers regardless of difficulty | Confidence-gated early exit · 266 µs estimate · exit at threshold=0.85 |
| SWIFT | `--swift` | All FFN layers execute even when weights are functionally redundant | Calibration-based FFN skip · 162 µs calibrate · 34% layers skipped |
| SpecReason | `--spec-reason` | Reasoning chains serialise draft+verify round trips | Pipelined draft+target step · 6.6 µs per orchestrated step |
| MirrorSD | `--mirror-sd` | Single-draft spec-decode misses acceptance bursts | Mirror pipeline (parallel draft branches) · 867 µs step vocab=32k |
| SparseVerify | `--sparse-verify` | Re-verifying identical KV slices across draft iterations wastes compute | Inter-draft KV reuse cache · 0.28 µs query · near-zero overhead |
| RobustScheduler | `--robust-sched` | Priority inversions under bursty load hurt P99 latency | A-balanced SRPT scheduler · 3.7 µs schedule 32 requests |
| SemanticCache | `--semantic-cache` | Repeated semantically-equivalent queries re-run full inference | sqlite-vec semantic cache · short-circuit on cosine similarity hit |
| IPW | `--ipw` | No per-inference energy accounting available on-device | Perf-per-watt tracker · 0.16 µs record · 4.6 ms full summary |
| PowerMonitor | `--power-monitor` | Compute policy ignores battery vs. AC power state | Apple Silicon power advisor · 0.5 µs get recommended mode |
Full stack:

```
squish run qwen3:8b \
  --vptq --layer-skip --swift \
  --spec-reason --mirror-sd --sparse-verify \
  --robust-sched --semantic-cache \
  --ipw --power-monitor
```

Benchmark results: docs/benchmark_wave17_18.md
Raw data: dev/results/wave17_18_bench.json
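`--prompt-lookup` speculates by copying from the prompt itself: find the most recent occurrence of the last few generated tokens inside the prompt and propose what followed. A sketch with integer token IDs (the real module operates on tokenizer output, and `ngram`/`max_draft` are illustrative defaults):

```python
def prompt_lookup_draft(prompt_ids, recent_ids, ngram=2, max_draft=4):
    """Find the most recent occurrence of the last `ngram` generated tokens
    inside the prompt and propose the tokens that followed it as a draft."""
    key = recent_ids[-ngram:]
    for start in range(len(prompt_ids) - ngram, -1, -1):
        if prompt_ids[start:start + ngram] == key:
            follow = prompt_ids[start + ngram:start + ngram + max_draft]
            if follow:
                return follow
    return []  # no match: fall back to ordinary decoding

prompt = [5, 8, 13, 21, 5, 8, 99, 4]
print(prompt_lookup_draft(prompt, recent_ids=[1, 5, 8]))  # → [99, 4]
```

The verifier then accepts or rejects the drafted tokens exactly as in model-drafted speculative decoding; the win is that no draft model exists at all.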
v6 (Wave 19) focuses on FP8/MX microscaling quantisation, paged KV caching, GQA and sliding window attention, RoPE context extension, and multi-head speculative decoding (MEDUSA, EAGLE-3), shipping 14 new modules:
| Module | Flag | Problem Solved | Key Number |
|---|---|---|---|
| FP8Quant | `--fp8-quant` | Weight storage overhead | ~60% storage vs BF16 |
| MXQuant | `--mx-quant` | Quantisation quality at low bits | Better quality than INT4 at same bits |
| FlashDecode | `--flash-decode` | KV read parallelism at decode | O(1) memory overhead per step |
| PagedKV | `--paged-kv` | KV fragmentation across requests | Zero KV fragmentation |
| GQA | `--gqa` | KV memory per head | 4–8× KV reduction vs MHA |
| SlidingWindowAttn | `--sliding-window` | Memory at long context | O(window_size) memory |
| RoPEScaling | `--rope-scaling` | Context extension without fine-tuning | 4–32× context extension |
| ActSparsity | `--act-sparsity` | FFN compute on sparse activations | 30–60% FFN compute saved |
| FusedRMSNorm | `--fused-norm` | LayerNorm bandwidth | Single kernel pass |
| LoRAInference | `--lora-inference` | Adapter switching overhead | Zero-copy, no re-quant |
| MEDUSA | `--medusa` | Decode throughput | 2–3× decode throughput |
| EAGLE3 | `--eagle3` | Draft acceptance rate | 3.5× accept rate vs token-prediction |
| PrefixPool | `--prefix-pool` | KV recomputation on shared prompts | 40–80% KV savings |
| TokenHealer | `--token-healer` | Prefix token boundary artifacts | Eliminates prefix artifacts |
Full stack:

```
squish serve ./model \
  --fp8-quant --mx-quant \
  --flash-decode --paged-kv \
  --gqa --sliding-window \
  --rope-scaling ntk \
  --medusa --eagle3 \
  --prefix-pool --token-healer
```

v6 (Wave 20) focuses on model merging, multi-LoRA composition, continuous batching, constrained decoding, and vision token compression, shipping 13 new modules:
| Module | Flag | Problem Solved | Key Number |
|---|---|---|---|
| ModelMerge | `--model-merge` | Combining domains without retraining | SLERP/DARE/TIES merging |
| LoRACompose | `--lora-compose` | Multi-adapter blending | Learnable composition coefficients |
| ContinuousBatching | `--continuous-batching` | GPU utilization at variable request rate | Max GPU utilization |
| MatryoshkaEmb | `--matryoshka-emb` | Embedding dim flexibility | 1 forward pass, any dimensionality |
| ANEProfiler | `--ane-profiler` | ANE vs GPU op breakdown | Op-level ANE utilization |
| SpecBench | `--spec-bench` | Speculative decode CI | Acceptance rate + throughput |
| PPLTracker | `--ppl-tracker` | Quantisation quality degradation | Real-time PPL monitoring |
| GrammarCache | `--grammar-cache` | Per-token FSM rebuild overhead | Zero rebuild on cached grammars |
| QuantAware | `--quant-aware` | Scale selection for quantisation | Per-channel optimal scales |
| AdaptiveBudget | `--adaptive-budget` | Joint KV + layer skip SLO control | SLO-aware compute budget |
| ToolCache | `--tool-cache` | Tool schema parse overhead | Zero parse overhead on repeats |
| DistilSpec | `--distil-spec` | Draft head acceptance rate | +10–15 pp from calibration |
| BatchEmbed | `--batch-embed` | Embedding pooling strategy | mean/max/cls/weighted in one pass |
Full stack:

```
squish serve ./model \
  --continuous-batching \
  --grammar-cache \
  --adaptive-budget \
  --tool-cache \
  --distil-spec \
  --batch-embed mean
```

Benchmark results: docs/benchmark_wave19_20.md
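SLERP, one of the merge methods listed for `--model-merge`, interpolates along the arc between two flattened weight tensors rather than along the chord, which preserves weight norm better than a plain average. A numpy sketch (illustrative; the module's tensor-by-tensor handling is not documented here):

```python
import numpy as np

def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between flattened weight tensors a and b."""
    a_n = a / (np.linalg.norm(a) + eps)
    b_n = b / (np.linalg.norm(b) + eps)
    omega = np.arccos(np.clip(a_n @ b_n, -1.0, 1.0))  # angle between the two
    if omega < 1e-6:                                   # nearly parallel: lerp
        return (1 - t) * a + t * b
    so = np.sin(omega)
    return (np.sin((1 - t) * omega) / so) * a + (np.sin(t * omega) / so) * b

rng = np.random.default_rng(0)
w_a = rng.standard_normal(1024)
w_b = rng.standard_normal(1024)
merged = slerp(w_a, w_b, 0.5)  # halfway along the arc between the two models
```

At t=0 and t=1 the formula returns the endpoints exactly, so merging degrades gracefully toward either parent model.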
v7 (Wave 21) focuses on tree-parallel speculative verification, online KV compression, mixed-precision per-head KV, pipeline bubble elimination, learned KV codecs, and retention-style recurrent attention, shipping 13 new modules:
| Module | Flag | Problem Solved | Key Number |
|---|---|---|---|
| TreeVerifier | `--tree-verify` | Speculative tree acceptance | Structured multi-token acceptance |
| KVCompress | `--kv-compress` | KV memory growth during generation | Online prune + INT8 quant |
| DynamicNTK | `--dynamic-ntk` | Context extension without retraining | Auto-extends at 80% context fill |
| QuantSpecDecode | `--quant-spec-decode` | Draft memory overhead | 4× draft memory reduction vs FP16 |
| SparseAttnIndex | `--sparse-attn-index` | Attention cost at very long context | Sub-linear KV attention cost |
| MixedPrecisionKV | `--mp-kv` | KV memory at iso-quality | 2–4× KV reduction via per-head precision |
| PipelineBubble | `--pipeline-bubble` | Pipeline stage idle time | 1F1B near-zero bubble fraction |
| LayerwiseDecode | `--layerwise-decode` | Full-depth decode latency | Early-exit at configurable layer |
| CodecKV | `--codec-kv` | KV cache memory | 204× compression via learned codebook |
| DedupeAttn | `--dedupe-attn` | Attention FLOPs on repetitive context | Near-duplicate Q/K output reuse |
| FlashPrefill | `--flash-prefill` | Prefill memory on long sequences | O(seq × chunk) not O(seq²) |
| BudgetSpec | `--budget-spec` | Draft compute near token budget | Ramp-down to 1 draft near limit |
| RetentionAttn | `--retention-attn` | KV cache memory for recurrent inference | O(1) per-step linear recurrence |
Full stack:

```
squish serve ./model \
  --tree-verify --kv-compress \
  --dynamic-ntk \
  --quant-spec-decode \
  --sparse-attn-index --mp-kv \
  --pipeline-bubble --layerwise-decode \
  --codec-kv --dedupe-attn \
  --flash-prefill --budget-spec \
  --retention-attn
```

v7 (Wave 22) focuses on multi-tenant fair scheduling, load-balanced request routing, predictive KV pre-warming, OpenTelemetry-compatible tracing, adaptive quantisation under pressure, and SLA violation detection, shipping 9 new modules:
| Module | Flag | Problem Solved | Key Number |
|---|---|---|---|
| CacheWarmup | `--cache-warmup` | Cold TTFT on hot paths | Predictive KV pre-warming |
| TokenBudgetGate | `--token-budget` | Request cost determinism | Hard budget with graceful truncation |
| RequestCoalesce | `--req-coalesce` | Redundant prefill forward passes | Shared prefill for common prefixes |
| AdaptiveQuantize | `--adaptive-quant` | Memory pressure OOM risk | Auto INT8/INT4 under pressure |
| HealthCheck | `--health-check` | Quality regression detection | p50/p99 latency + error rate |
| FaultTolerance | `--fault-tolerance` | OOM crash risk | Progressive evict→disable→reduce |
| ModelPool | `--model-pool` | Multi-model reload latency | Hot pool with LRU eviction |
| StreamingChunk | `--streaming-chunk` | First-chunk streaming latency | Sub-token chunked streaming |
| ContextCache | `--context-cache` | Cross-session context re-encoding | Persistent TTL cache, 100% hit rate |
Full stack:

```
squish serve ./model \
  --cache-warmup --token-budget \
  --req-coalesce \
  --adaptive-quant --health-check \
  --fault-tolerance --model-pool \
  --streaming-chunk --context-cache
```

Benchmark results: docs/benchmark_wave21_22.md
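`--retention-attn`'s O(1) per-step claim (Wave 21) follows from a linear recurrence: a single (d_k × d_v) state matrix S_t = γ·S_{t−1} + k_tᵀv_t replaces the KV cache, and q_t·S_t reproduces exponentially decayed attention exactly. A numpy check of that equivalence (a sketch of the recurrence, not the shipped kernel):

```python
import numpy as np

def retention_recurrent(qs, ks, vs, gamma=0.9):
    """O(1)-state decode: one (d_k, d_v) matrix replaces the whole KV cache."""
    d_k, d_v = ks.shape[1], vs.shape[1]
    S = np.zeros((d_k, d_v))
    outs = []
    for q, k, v in zip(qs, ks, vs):
        S = gamma * S + np.outer(k, v)   # fold the new token into the state
        outs.append(q @ S)
    return np.stack(outs)

def retention_explicit(qs, ks, vs, gamma=0.9):
    """Reference: o_t = sum_{s<=t} gamma^(t-s) * (q_t . k_s) * v_s."""
    outs = []
    for t, q in enumerate(qs):
        o = sum(gamma ** (t - s) * (q @ ks[s]) * vs[s] for s in range(t + 1))
        outs.append(o)
    return np.stack(outs)

rng = np.random.default_rng(0)
qs, ks, vs = (rng.standard_normal((16, 8)) for _ in range(3))
print(np.allclose(retention_recurrent(qs, ks, vs), retention_explicit(qs, ks, vs)))
```

The recurrent form touches a fixed amount of state per token regardless of context length, which is exactly the "O(1) per-step" entry in the Wave 21 table.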
Wave 23 focuses on long-context efficiency, RAG-aware serving, CoT compression, and hierarchical KV tiering. Multi-modal (vision/video) modules were removed from this wave as out of scope for a text LLM server.
RAGPrefetch · CoTCompress · ContextualRerank · HierarchicalKV · StreamRAG · CrossDocAttn · LongContextChunk
Key numbers: 30–50% CoT token reduction · predictive RAG doc KV prefetch · 1M+ token semantic chunking.
Benchmark results: docs/benchmark_wave23_24.md
Raw data: dev/results/wave23_24_bench.json
Wave 24 focuses on ternary/sub-bit quantisation, structured sparsity, cross-layer weight sharing, and second-order calibration.
TernaryQuant · StructuredPrune · LayerFusion · WeightSharing · QuantCalib · SparseWeight · DeltaCompress · ZeroQuantV2 · GPTQLayer · SparseMoE · AWQv2
Key numbers: 1.58-bit ternary weights · 2:4 structured sparsity · 7.98× SVD delta compression.
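The 1.58-bit figure is log₂ 3: each TernaryQuant weight takes one of three values. Absmean ternarisation in the style of BitNet b1.58 illustrates the idea (a sketch; the module's exact scaling rule is not documented here):

```python
import numpy as np

def ternarize(w, eps=1e-8):
    """Absmean scaling, then round each weight to {-1, 0, +1}."""
    scale = np.abs(w).mean() + eps
    w_t = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return w_t, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
w_t, scale = ternarize(w)
w_hat = w_t.astype(np.float32) * scale   # dequantised reconstruction
print(sorted(np.unique(w_t).tolist()))   # → [-1, 0, 1]
```

With only three symbols per weight, matmuls reduce to additions and subtractions, which is what makes sub-2-bit weights attractive beyond the storage savings.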
Wave 25 focuses on DeepSeek-V2/V3 attention patterns, fused compute kernels, KV defragmentation, long-context attention, and multi-draft speculation.
FlashMLA · NativeSparseAttn · FusedSampler · KVDefrag · DualChunkAttn · ActivationOffload · MorphAttn · HydraSpec · SeqCompact · ParallelSampler · ContextSummarizer · SchemaGen
Key numbers: FlashMLA 4× KV compression · NSA ~87% attention sparsity · HydraSpec multi-draft speculation.
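NativeSparseAttn's ~87% sparsity corresponds to keeping only a few key blocks per query block, chosen by a cheap pooled score. An illustrative numpy selector (not the NSA kernel; the block size and keep count are made-up parameters):

```python
import numpy as np

def select_blocks(q, k, block=16, keep=2):
    """Score key blocks by mean-pooled QK^T and keep the top `keep` per query block."""
    nq, nk = len(q) // block, len(k) // block
    q_pool = q.reshape(nq, block, -1).mean(axis=1)   # one summary vector per block
    k_pool = k.reshape(nk, block, -1).mean(axis=1)
    scores = q_pool @ k_pool.T                        # (nq, nk) block affinities
    kept = np.argsort(scores, axis=1)[:, -keep:]      # top-keep key blocks per row
    mask = np.zeros((nq, nk), dtype=bool)
    np.put_along_axis(mask, kept, True, axis=1)
    return mask

rng = np.random.default_rng(0)
q = rng.standard_normal((256, 64))
k = rng.standard_normal((256, 64))
mask = select_blocks(q, k)
print(f"sparsity: {1 - mask.mean():.3f}")  # 2 of 16 blocks kept → 0.875
```

Only the kept (query block, key block) pairs are attended, so attention cost scales with the kept fraction rather than with the full n² grid.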
Benchmark results: docs/benchmark_wave25_26.md
Raw data: dev/results/wave25_26_bench.json
Wave 26 focuses on production-grade monitoring, adaptive batching, safety classification, semantic response caching, and rate limiting. Distributed multi-node infrastructure modules (tensor/sequence parallelism, KV migration, disaggregated prefill, request preemption, inference gateway, model version swap, audit logging) were removed as out of scope for single-device local inference.
ProductionProfiler · AdaptiveBatcher · SafetyLayer · SemanticResponseCache · RateLimiter · SchemaValidator
Key numbers: sub-200ns APM record · safety classification · semantic response deduplication.
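A RateLimiter in this setting is conventionally a token bucket; a minimal sketch (an assumed design — the module's actual policy and parameters are not documented here):

```python
class TokenBucket:
    """Allow `rate` requests/second with bursts of up to `burst` requests."""

    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = float(burst), 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

tb = TokenBucket(rate=2.0, burst=2)
print([tb.allow(t) for t in (0.0, 0.1, 0.2, 0.8, 1.0)])
# → [True, True, False, True, True]
```

The burst parameter absorbs short spikes while the refill rate bounds sustained throughput, which is the usual trade-off a serving-side limiter has to make.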