Squish — Optimisation Module Reference

Complete per-module flag, problem statement, and benchmark reference. All benchmarks are CPU/numpy micro-benchmarks on Apple Silicon M-series unless otherwise noted. End-to-end hardware benchmarks: run python3 dev/benchmarks/bench_eoe.py.

Feature Stability Tiers

Tier	Waves	Description
Stable	1–12	Core compression, KV cache, speculative decode, AWQ, prefill/radix cache. Validated on hardware.
Beta	13–18	Advanced KV compression, additional spec-decode variants, attention architectures. Functionally complete, hardware validation in progress.
Experimental	19–26	Cutting-edge attention, extended quantisation, long-context, production reliability. Proof-of-concept; may change.

Flags for Experimental modules are labelled [Experimental] in squish serve --help.

v2 — Wave 12 [Stable]: Core KV + Weight Compression

Enable with squish run --model <name> [flags]:

Module	Flag	Effect	Overhead
PM-KVQ	`--pm-kvq`	4.2× KV cache memory at 4096 tokens	14 µs/step
MixKVQ	`--mix-kvq`	3.9× KV memory · 4.1 avg bits/channel	72 µs/step
CocktailKV	`--cocktail-kv`	~3× KV memory · chunk-similarity routing	895 µs/512-tok
MiLo INT3	`--milo`	5.3× weight compression · SNR > 13 dB	one-time convert
AgileIO	`--agile-io`	40–60% I/O latency reduction · 25× warm-cache reads	≈ 0
SageAttn	`--sage-attention`	2.1× attention speedup (INT8 QK^T)	≈ 0
SpargeAttn	`--sparge-attn`	2.5–5× attention speedup (sparse blocks)	≈ 0

Full stack:

squish run qwen3:8b \
  --pm-kvq --mix-kvq --cocktail-kv \
  --agile-io --milo \
  --sage-attention --sparge-attn

Benchmark results: docs/benchmark_wave12.md Raw data: dev/results/wave12_bench.json

v3 — Wave 13 [Beta]: Ultra-Long Context

v3 (Wave 13) focuses on ultra-long context (128K+ tokens) and adaptive speculative decoding, shipping 10 new modules:

Module	Flag	Problem Solved	Key Number
DuoAttention	`--duo-attention`	Long-context KV blowup: separates 30–40% retrieval heads from streaming heads	~2× KV memory saved at 32K tokens
ShadowKV	`--shadow-kv`	128K+ KV cache → CPU offload with low-rank pre-RoPE key projection	6–10× KV compression on long contexts
PQCache	`--pq-cache`	ANN-based KV retrieval for retrieval heads via product quantisation	4–8× key memory · sub-ms lookup
SpeCache	`--spe-cache`	Multi-turn KV reload stalls: speculatively prefetches prior-turn KV	40–60% KV reload latency eliminated
DuoDecoding	`--duo-decoding`	Fixed draft-sequence count wastes ANE cycles on M3	1.5–2.3× decode throughput
KnapSpec	`--knapspec`	Choosing which layers to skip for self-spec-decode is NP-hard	Optimal skip schedule in O(NL)
Token Merging	`--token-merging`	Similar tokens waste prefill FLOPs	1.4–1.8× prefill speedup
TokenSwift	`--token-swift`	Long outputs (20K–100K tokens) hit KV bandwidth ceiling	2–3× throughput on ultra-long gen
C2T	`--c2t`	Uniform draft tree wastes budget at confident positions	+0.8 tokens/step accepted

Full stack:

squish run qwen3:8b \
  --duo-attention --shadow-kv --pq-cache --spe-cache \
  --duo-decoding --knapspec --token-merging \
  --token-swift --c2t

v3 — Wave 14 [Beta]: Quantisation & Spec-Decode

v3 (Wave 14) focuses on quantisation methods, vocabulary-adaptive speculative decoding, and expert mixing, shipping 16 new modules:

Module	Flag	Problem Solved	Key Number
SubSpec	`--sub-spec`	Offloaded models have no fast draft path	Quantised substitute layers as draft → full spec-decode
DFloat11	`--dfloat11`	BF16 weights have poor entropy-coding compressibility	DFloat11 block-float: >30% size reduction vs BF16
rANS Codec	`--rans-codec`	Huffman coding leaves 5–15% entropy on the table	rANS → near-optimal entropy coding for KV/weights
QSpec	`--qspec`	Draft and verify share the same quantisation level	W4A8 draft / W4A16 verify → 1.8× throughput
QuantSpec	`--quant-spec`	Full-precision draft is slow; quantised draft is inaccurate	Bit-width selection per draft step → 98% accuracy
CopySpec	`--copy-spec`	Spec-decode needs a trained draft model	Copy from history buffer — zero extra model
SqueezeLLM	`--squeeze-llm`	Uniform quantisation crushes outlier weights	Sparse + dense mixed-precision: 4× smaller, 0.5 ppl loss
NF4 Quant	`--nf4-quant`	Uniform quantisation misaligns to weight distributions	Normal Float 4-bit levels → best quality per bit for LLMs
SpinQuant	`--spin-quant`	Weight outliers defeat quantisation	Hadamard rotation → 1.5 ppl improvement at INT4
HeadInfer	`--head-infer`	Uniform KV policy wastes memory on non-retrieval heads	Head-type-aware KV store: retrieval vs. streaming

Full stack:

squish run qwen3:8b \
  --squeeze-llm --nf4-quant --spin-quant \
  --copy-spec --sub-spec \
  --qspec --quant-spec \
  --dfloat11 --rans-codec --head-infer

Benchmark results: docs/benchmark_wave13_14.md Raw data: dev/results/wave13_14_bench.json

v4 — Wave 15 [Beta]: Serving Intelligence + KV Architecture

v4 (Wave 15) focuses on SLO-aware inference scheduling, confidence-gated verification, and KV architecture evolution, shipping 10 new modules:

Module	Flag	Problem Solved	Key Number
AdaServe	`--ada-serve`	Fixed gamma wastes draft budget on low-SLO requests	30% P99 latency ↓ · 1.5–2× throughput
ConfSpec	`--conf-spec`	Always running full verification wastes compute	54% verification cost ↓ via confidence gating
SeqPacking	`--seq-packing`	Varying sequence lengths cause barrel effect padding waste	+1.8× batch throughput
MetaReasoner	`--meta-reasoner`	CoT thinking on every token wastes energy on easy prompts	44–89% CoT energy saved
YOCO	`--yoco`	Cross-decoder layers duplicate KV across decoding passes	−50% KV memory via shared cross-decoder KV
CLA	`--cla`	Adjacent transformer layers learn nearly identical KV	10–30% KV reduction via cross-layer sharing schedule
KVSharer	`--kv-sharer`	No data-driven way to measure actual KV layer redundancy	~30% KV ops saved via calibration-based share map
DiffKV	`--diffkv`	Uniform K/V precision ignores asymmetric sensitivity	2.7–5.7× KV compression · 1.9–5.4× throughput
ParisKV	`--paris-kv`	Online KV quantisation codebooks drift without correction	4× KV compression with drift-robust adaptation
KVTuner	`--kvtuner`	Naive mixed-precision quant loses 20–35% accuracy	20–35% accuracy restored vs uniform quant

Full stack:

squish run qwen3:8b \
  --ada-serve --conf-spec --seq-packing --meta-reasoner \
  --yoco --cla --kv-sharer \
  --diffkv --paris-kv --kvtuner

v4 — Wave 16 [Beta]: Heterogeneous Compute + Advanced Spec-Decode

v4 (Wave 16) focuses on heterogeneous CPU+GPU execution, pipelined weight offloading, and advanced speculative decoding, shipping 11 new modules:

Module	Flag	Problem Solved	Key Number
Dovetail	`--dovetail`	CPU idle during GPU draft wastes heterogeneous compute	2× throughput via concurrent CPU verify + GPU draft
PIPO	`--pipo`	Sequential weight offload causes GPU idle stalls	+1.7× throughput via pipelined prefetch overlap
MobileMoE	`--mobile-moe`	MoE expert dispatch ignores device balance	+1.4× throughput via balanced layer-expert routing
OnlineSD	`--online-sd`	Frozen draft heads degrade after fine-tuning	+5–8 pp acceptance rate via continuous adaptation
LookaheadReasoning	`--lookahead-reasoning`	Sequential reasoning steps serialise all verification	+2.1× throughput via parallel step verification
SparseSpec	`--sparse-spec`	Static speculation ignores dynamic attention patterns	+2.13× spec throughput via adaptive pillar cache
FRSpec	`--fr-spec`	Full-vocab draft head is expensive at inference	−13% draft latency via frequency-ranked subset head
LongSpec	`--long-spec`	Draft KV grows with context → memory ceiling for long gen	Zero draft KV overhead via shared-KV draft head
ForeLen	`--forelen`	Output length prediction is inaccurate, causing early truncation	−29% MAE vs TRAIL baseline
RASD	`--rasd`	Draft models unfamiliar with corpus vocab fail spec decode	40–60% hit rate via retrieval-augmented draft tree

Full stack:

squish run qwen3:8b \
  --dovetail --pipo \
  --mobile-moe --online-sd \
  --lookahead-reasoning --sparse-spec \
  --fr-spec --long-spec \
  --forelen --rasd

Benchmark results: docs/benchmark_wave15_16.md Raw data: dev/results/wave15_16_bench.json

v5 — Wave 17 [Beta]: Attention Architecture

v5 (Wave 17) focuses on INT4/INT8 attention kernels, slab-allocated KV storage, joint 2D KV budget management, and context-aware speculative prefetching, shipping 14 new modules:

Module	Flag	Problem Solved	Key Number
SageAttention2	`--sage-attn2`	Full-precision attention is bandwidth-bound for long sequences	INT4/INT8 warp-tile quantisation · 672 µs forward (4h/seq32/d64)
StreamingSink	`--streaming-sink`	Unbounded KV growth at long contexts	Attention-sink eviction — bounded memory at any context length
KVSlab	`--kv-slab`	Per-token malloc/free causes fragmentation under scale	Pre-allocated slab allocator · 0.87 µs alloc+free round-trip
SqueezeAttention	`--squeeze-attn`	Independent token/layer KV compression compounds quality loss	Joint 2D Pareto-optimal budget allocation across both axes
SmallKV	`--small-kv`	Aggressive KV compression degrades small-model quality	Saliency-compensated recall · 39 µs ingest · 8 µs check-and-recall
SpeContext	`--spe-context`	Speculative decode wastes context retrieval at each draft step	Cosine-similarity context cache · 3.3 ms retrieve top-32
SVDq	`--svdq`	Uniform K quantisation ignores per-head SVD structure	Head-wise mixed-precision K search · 62 ms one-time calibration
CommVQ	`--comm-vq`	Per-layer VQ codebooks waste memory building near-identical codebooks	Shared communal codebook · 55 µs encode · 68 µs decode
ChunkedPrefill	`--chunked-prefill`	Long prefills block decoding requests for the full context length	Interleaved chunked prefill — bounded latency per chunk
GemFilter	`--gemfilter`	KV eviction without attention-score feedback drops important tokens	Top-K attention-score selector · 0.90× cR · 50 µs select
MInferencePatch	`--minference`	Full O(n²) attention is infeasible for 1M+ token contexts	Dynamic sparse patterns — sub-quadratic attention at ultra-long context
PromptCompressor	`--prompt-compress`	Long system prompts and RAG context waste prefill FLOPs	TF-IDF sentence-level compression · 686 µs for 50-sentence input
PromptLookup	`--prompt-lookup`	No-draft-model baseline has no spec-decode path	N-gram copy speculation from prompt · 0.8 µs find · 3.3 µs push
TRAIL	`--trail`	Output-length prediction is too slow for real-time SRPT scheduling	Linear-probe predictor · 10 µs predict · feeds SRPT priority queue

Full stack:

squish run qwen3:8b \
  --sage-attn2 --streaming-sink --kv-slab \
  --squeeze-attn --small-kv --spe-context \
  --svdq --comm-vq --chunked-prefill \
  --gemfilter --minference \
  --prompt-compress --prompt-lookup --trail

v5 — Wave 18 [Beta]: Adaptive Compute

v5 (Wave 18) focuses on vector-product quantisation, confidence-gated early exit, online domain adaptation, and energy-aware scheduling, shipping 14 new modules:

Module	Flag	Problem Solved	Key Number
VPTQ	`--vptq`	Scalar quantisation loses intra-vector correlations	Vector-product tree quant · 15 µs decode · 133 ms one-time compress
LayerSkip	`--layer-skip`	All tokens pass through all layers regardless of difficulty	Confidence-gated early exit · 266 µs estimate · exit at threshold=0.85
SWIFT	`--swift`	All FFN layers execute even when weights are functionally redundant	Calibration-based FFN skip · 162 µs calibrate · 34% layers skipped
SpecReason	`--spec-reason`	Reasoning chains serialise draft+verify round trips	Pipelined draft+target step · 6.6 µs per orchestrated step
MirrorSD	`--mirror-sd`	Single-draft spec-decode misses acceptance bursts	Mirror pipeline (parallel draft branches) · 867 µs step vocab=32k
SparseVerify	`--sparse-verify`	Re-verifying identical KV slices across draft iterations wastes compute	Inter-draft KV reuse cache · 0.28 µs query · near-zero overhead
RobustScheduler	`--robust-sched`	Priority inversions under bursty load hurt P99 latency	A-balanced SRPT scheduler · 3.7 µs schedule 32 requests
SemanticCache	`--semantic-cache`	Repeated semantically-equivalent queries re-run full inference	sqlite-vec semantic cache · short-circuit on cosine similarity hit
IPW	`--ipw`	No per-inference energy accounting available on-device	Perf-per-watt tracker · 0.16 µs record · 4.6 ms full summary
PowerMonitor	`--power-monitor`	Compute policy ignores battery vs. AC power state	Apple Silicon power advisor · 0.5 µs get recommended mode

Full stack:

squish run qwen3:8b \
  --vptq --layer-skip --swift \
  --spec-reason --mirror-sd --sparse-verify \
  --robust-sched --semantic-cache \
  --ipw --power-monitor

Benchmark results: docs/benchmark_wave17_18.md Raw data: dev/results/wave17_18_bench.json

v6 — Wave 19 [Experimental]: Next-Gen Attention & Precision

v6 (Wave 19) focuses on FP8/MX microscaling quantisation, paged KV caching, GQA and sliding window attention, RoPE context extension, and multi-head speculative decoding (MEDUSA, EAGLE-3), shipping 14 new modules:

Module	Flag	Problem Solved	Key Number
FP8Quant	`--fp8-quant`	Weight storage overhead	~60% storage vs BF16
MXQuant	`--mx-quant`	Quantisation quality at low bits	Better quality than INT4 at same bits
FlashDecode	`--flash-decode`	KV read parallelism at decode	O(1) memory overhead per step
PagedKV	`--paged-kv`	KV fragmentation across requests	Zero KV fragmentation
GQA	`--gqa`	KV memory per head	4–8× KV reduction vs MHA
SlidingWindowAttn	`--sliding-window`	Memory at long context	O(window_size) memory
RoPEScaling	`--rope-scaling`	Context extension without fine-tuning	4–32× context extension
ActSparsity	`--act-sparsity`	FFN compute on sparse activations	30–60% FFN compute saved
FusedRMSNorm	`--fused-norm`	LayerNorm bandwidth	Single kernel pass
LoRAInference	`--lora-inference`	Adapter switching overhead	Zero-copy, no re-quant
MEDUSA	`--medusa`	Decode throughput	2–3× decode throughput
EAGLE3	`--eagle3`	Draft acceptance rate	3.5× accept rate vs token-prediction
PrefixPool	`--prefix-pool`	KV recomputation on shared prompts	40–80% KV savings
TokenHealer	`--token-healer`	Prefix token boundary artifacts	Eliminates prefix artifacts

Full stack:

squish serve ./model \
  --fp8-quant --mx-quant \
  --flash-decode --paged-kv \
  --gqa --sliding-window \
  --rope-scaling ntk \
  --medusa --eagle3 \
  --prefix-pool --token-healer

v6 — Wave 20 [Experimental]: Serving Infrastructure & Intelligence

v6 (Wave 20) focuses on model merging, multi-LoRA composition, continuous batching, constrained decoding, and vision token compression, shipping 14 new modules:

Module	Flag	Problem Solved	Key Number
ModelMerge	`--model-merge`	Combining domains without retraining	SLERP/DARE/TIES merging
LoRACompose	`--lora-compose`	Multi-adapter blending	Learnable composition coefficients
ContinuousBatching	`--continuous-batching`	GPU utilization at variable request rate	Max GPU utilization
MatryoshkaEmb	`--matryoshka-emb`	Embedding dim flexibility	1 forward pass, any dimensionality
ANEProfiler	`--ane-profiler`	ANE vs GPU op breakdown	Op-level ANE utilization
SpecBench	`--spec-bench`	Speculative decode CI	Acceptance rate + throughput
PPLTracker	`--ppl-tracker`	Quantisation quality degradation	Real-time PPL monitoring
GrammarCache	`--grammar-cache`	Per-token FSM rebuild overhead	Zero rebuild on cached grammars
QuantAware	`--quant-aware`	Scale selection for quantisation	Per-channel optimal scales
AdaptiveBudget	`--adaptive-budget`	Joint KV + layer skip SLO control	SLO-aware compute budget
ToolCache	`--tool-cache`	Tool schema parse overhead	Zero parse overhead on repeats
DistilSpec	`--distil-spec`	Draft head acceptance rate	+10–15 pp from calibration
BatchEmbed	`--batch-embed`	Embedding pooling strategy	mean/max/cls/weighted in one pass

Full stack:

squish serve ./model \
  --continuous-batching \
  --grammar-cache \
  --adaptive-budget \
  --tool-cache \
  --distil-spec \
  --batch-embed mean

Benchmark results: docs/benchmark_wave19_20.md

v7 — Wave 21 [Experimental]: Advanced Memory & Decode

v7 (Wave 21) focuses on tree-parallel speculative verification, online KV compression, mixed-precision per-head KV, pipeline bubble elimination, learned KV codecs, and retention-style recurrent attention, shipping 14 new modules:

Module	Flag	Problem Solved	Key Number
TreeVerifier	`--tree-verify`	Speculative tree acceptance	Structured multi-token acceptance
KVCompress	`--kv-compress`	KV memory growth during generation	Online prune + INT8 quant
DynamicNTK	`--dynamic-ntk`	Context extension without retraining	Auto-extends at 80% context fill
QuantSpecDecode	`--quant-spec-decode`	Draft memory overhead	4× draft memory reduction vs FP16
SparseAttnIndex	`--sparse-attn-index`	Attention cost at very long context	Sub-linear KV attention cost
MixedPrecisionKV	`--mp-kv`	KV memory at iso-quality	2–4× KV reduction via per-head precision
PipelineBubble	`--pipeline-bubble`	Pipeline stage idle time	1F1B near-zero bubble fraction
LayerwiseDecode	`--layerwise-decode`	Full-depth decode latency	Early-exit at configurable layer
CodecKV	`--codec-kv`	KV cache memory	204× compression via learned codebook
DedupeAttn	`--dedupe-attn`	Attention FLOPs on repetitive context	Near-duplicate Q/K output reuse
FlashPrefill	`--flash-prefill`	Prefill memory on long sequences	O(seq × chunk) not O(seq²)
BudgetSpec	`--budget-spec`	Draft compute near token budget	Ramp-down to 1 draft near limit
RetentionAttn	`--retention-attn`	KV cache memory for recurrent inference	O(1) per-step linear recurrence

Full stack:

squish serve ./model \
  --tree-verify --kv-compress \
  --dynamic-ntk \
  --quant-spec-decode \
  --sparse-attn-index --mp-kv \
  --pipeline-bubble --layerwise-decode \
  --codec-kv --dedupe-attn \
  --flash-prefill --budget-spec \
  --retention-attn

v7 — Wave 22 [Experimental]: Production Serving & Observability

v7 (Wave 22) focuses on multi-tenant fair scheduling, load-balanced request routing, predictive KV pre-warming, OpenTelemetry-compatible tracing, adaptive quantisation under pressure, and SLA violation detection, shipping 14 new modules:

Module	Flag	Problem Solved	Key Number
CacheWarmup	`--cache-warmup`	Cold TTFT on hot paths	Predictive KV pre-warming
TokenBudgetGate	`--token-budget`	Request cost determinism	Hard budget with graceful truncation
RequestCoalesce	`--req-coalesce`	Redundant prefill forward passes	Shared prefill for common prefixes
AdaptiveQuantize	`--adaptive-quant`	Memory pressure OOM risk	Auto INT8/INT4 under pressure
HealthCheck	`--health-check`	Quality regression detection	p50/p99 latency + error rate
FaultTolerance	`--fault-tolerance`	OOM crash risk	Progressive evict→disable→reduce
ModelPool	`--model-pool`	Multi-model reload latency	Hot pool with LRU eviction
StreamingChunk	`--streaming-chunk`	First-chunk streaming latency	Sub-token chunked streaming
ContextCache	`--context-cache`	Cross-session context re-encoding	Persistent TTL cache, 100% hit rate

Full stack:

squish serve ./model \
  --cache-warmup --token-budget \
  --req-coalesce \
  --adaptive-quant --health-check \
  --fault-tolerance --model-pool \
  --streaming-chunk --context-cache

Benchmark results: docs/benchmark_wave21_22.md

v8 — Wave 23 [Experimental]: Long Context & RAG Intelligence

Wave 23 focuses on long-context efficiency, RAG-aware serving, CoT compression, and hierarchical KV tiering. Multi-modal (vision/video) modules were removed from this wave as out of scope for a text LLM server.

RAGPrefetch · CoTCompress · ContextualRerank · HierarchicalKV · StreamRAG · CrossDocAttn · LongContextChunk

Key numbers: 30–50% CoT token reduction · predictive RAG doc KV prefetch · 1M+ token semantic chunking.

Benchmark results: docs/benchmark_wave23_24.md Raw data: dev/results/wave23_24_bench.json

v8 — Wave 24 [Experimental]: Quantisation Evolution & Structured Pruning

Wave 24 focuses on ternary/sub-bit quantisation, structured sparsity, cross-layer weight sharing, and second-order calibration.

TernaryQuant · StructuredPrune · LayerFusion · WeightSharing · QuantCalib · SparseWeight · DeltaCompress · ZeroQuantV2 · GPTQLayer · SparseMoE · AWQv2

Key numbers: 1.58-bit ternary weights · 2:4 structured sparsity · 7.98× SVD delta compression.

v9 — Wave 25 [Experimental]: Cutting-Edge Attention Variants & Compute Fusion

Wave 25 focuses on DeepSeek-V2/V3 attention patterns, fused compute kernels, KV defragmentation, long-context attention, and multi-draft speculation.

FlashMLA · NativeSparseAttn · FusedSampler · KVDefrag · DualChunkAttn · ActivationOffload · MorphAttn · HydraSpec · SeqCompact · ParallelSampler · ContextSummarizer · SchemaGen

Key numbers: FlashMLA 4× KV compression · NSA ~87% attention sparsity · HydraSpec multi-draft speculation.

Benchmark results: docs/benchmark_wave25_26.md Raw data: dev/results/wave25_26_bench.json

v9 — Wave 26 [Experimental]: Production Reliability & Safety

Wave 26 focuses on production-grade monitoring, adaptive batching, safety classification, semantic response caching, and rate limiting. Distributed multi-node infrastructure modules (tensor/sequence parallelism, KV migration, disaggregated prefill, request preemption, inference gateway, model version swap, audit logging) were removed as out of scope for single-device local inference.

ProductionProfiler · AdaptiveBatcher · SafetyLayer · SemanticResponseCache · RateLimiter · SchemaValidator

Key numbers: sub-200ns APM record · safety classification · semantic response deduplication.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Squish — Optimisation Module Reference

Feature Stability Tiers

v2 — Wave 12 [Stable]: Core KV + Weight Compression

v3 — Wave 13 [Beta]: Ultra-Long Context

v3 — Wave 14 [Beta]: Quantisation & Spec-Decode

v4 — Wave 15 [Beta]: Serving Intelligence + KV Architecture

v4 — Wave 16 [Beta]: Heterogeneous Compute + Advanced Spec-Decode

v5 — Wave 17 [Beta]: Attention Architecture

v5 — Wave 18 [Beta]: Adaptive Compute

v6 — Wave 19 [Experimental]: Next-Gen Attention & Precision

v6 — Wave 20 [Experimental]: Serving Infrastructure & Intelligence

v7 — Wave 21 [Experimental]: Advanced Memory & Decode

v7 — Wave 22 [Experimental]: Production Serving & Observability

v8 — Wave 23 [Experimental]: Long Context & RAG Intelligence

v8 — Wave 24 [Experimental]: Quantisation Evolution & Structured Pruning

v9 — Wave 25 [Experimental]: Cutting-Edge Attention Variants & Compute Fusion

v9 — Wave 26 [Experimental]: Production Reliability & Safety

FilesExpand file tree

MODULES.md

Latest commit

History

MODULES.md

File metadata and controls

Squish — Optimisation Module Reference

Feature Stability Tiers

v2 — Wave 12 [Stable]: Core KV + Weight Compression

v3 — Wave 13 [Beta]: Ultra-Long Context

v3 — Wave 14 [Beta]: Quantisation & Spec-Decode

v4 — Wave 15 [Beta]: Serving Intelligence + KV Architecture

v4 — Wave 16 [Beta]: Heterogeneous Compute + Advanced Spec-Decode

v5 — Wave 17 [Beta]: Attention Architecture

v5 — Wave 18 [Beta]: Adaptive Compute

v6 — Wave 19 [Experimental]: Next-Gen Attention & Precision

v6 — Wave 20 [Experimental]: Serving Infrastructure & Intelligence

v7 — Wave 21 [Experimental]: Advanced Memory & Decode

v7 — Wave 22 [Experimental]: Production Serving & Observability

v8 — Wave 23 [Experimental]: Long Context & RAG Intelligence

v8 — Wave 24 [Experimental]: Quantisation Evolution & Structured Pruning

v9 — Wave 25 [Experimental]: Cutting-Edge Attention Variants & Compute Fusion

v9 — Wave 26 [Experimental]: Production Reliability & Safety