Last updated: 2026-03-12 (v9 complete + pre-launch hardening phase 1+2+3)
This document tracks completed waves, the current release, and the next phase.
| Version | Waves | Theme |
|---|---|---|
| v1 | 1–11 | Core baseline – loader, quantizer, server, API, CLI, speculative decode |
| v2 | 12 | Reasoning-Aware KV · INT3 · Async I/O |
| v3 | 13–14 | Ultra-Long Context · Adaptive Spec-Decode · Quantisation |
| v4 | 15–16 | Serving Intelligence · KV Architecture Evolution · Heterogeneous Compute |
| v5 | 17–18 | Attention Architecture · Memory Management · Adaptive Compute · Model Intelligence |
| v6 | 19–20 | Next-Gen Precision · Serving Infrastructure · Intelligence |
| v7 | 21–22 | Advanced Decode · Production Serving · Observability |
| v8 | 23–24 | Multi-Modal & Long Context · Quantisation Evolution & Model Surgery |
| v9 | 25–26 | Cutting-Edge Attention Variants & Compute Fusion · Distributed Inference & Production Reliability |
- Three-tier compressed weight loader (INT8 → f16 → bf16 MLX safetensors)
- OpenAI-compatible API server (`/v1/*`) + Ollama drop-in (`/api/*`)
- Web chat UI at `/chat`
- CLI – `squish run/serve/chat/pull/models/info/bench/catalog/compress`
- Speculative decoding, batch scheduler, KV cache quantisation, prefix cache
- Tool / function calling, Rust/PyO3 INT8 quantiser
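The speculative-decoding loop in the baseline can be sketched in a few lines. This is a toy illustration with stand-in `draft_model` and `target_model` functions – none of these names are the actual squish API: a cheap draft proposes `k` tokens, the target verifies them, and the longest agreeing prefix is kept, with the first mismatch replaced by the target's own token.

```python
# Toy speculative-decoding sketch (hypothetical stand-in models, not the squish API).

def draft_model(context: list[int], k: int) -> list[int]:
    # Toy draft: predict each next token as (last + 1) mod 100.
    out, last = [], context[-1]
    for _ in range(k):
        last = (last + 1) % 100
        out.append(last)
    return out

def target_model(context: list[int]) -> int:
    # Toy target: agrees with the draft except at every 5th position.
    return (context[-1] + 1) % 100 if len(context) % 5 else (context[-1] + 2) % 100

def speculative_step(context: list[int], k: int = 4) -> list[int]:
    proposal = draft_model(context, k)
    accepted: list[int] = []
    for tok in proposal:
        expected = target_model(context + accepted)
        if tok != expected:
            accepted.append(expected)  # replace first mismatch with target's token
            break
        accepted.append(tok)           # draft token verified, keep it
    return accepted

print(speculative_step([1, 2, 3]))  # one target-quality step yields several tokens
```

Even when a draft token is rejected, the step still emits the target's correction, so each verification pass produces at least one valid token.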
Modules: PM-KVQ, MixKVQ, CocktailKV, MiLo INT3, AgileIO, SageAttn, SpargeAttn
Key results: 4.2× KV memory · 5.3× weight compression · 40–60% I/O latency reduction
Wave 13 (10 modules): DuoAttention, ShadowKV, PQCache, SpeCache, DuoDecoding, KnapSpec, TokenMerging, TokenSwift, C2T, CLaSP
Wave 14 (16 modules): DFloat11, SqueezeLLM, NF4, rANS, QSpec, QuantSpec, CopySpec, SpinQuant, VisionPrefixCache, MRLIndex, SubSpec, DELDecoder, HeteroVocab, HeadInfer, LifeModel, SoupOfExperts
Key results: 10–30× KV memory · 55% draft acceptance · 5–10× weight compression
Theme: Serving Intelligence · KV Architecture Evolution · Heterogeneous Compute
| Module | Flag | Key Result |
|---|---|---|
| AdaServe | --ada-serve | SLO-customized spec decode trees → 30% latency ↓ for tight SLOs |
| ConfSpec | --conf-spec | Confidence-gated verification → 54% verification cost ↓ |
| SeqPacking | --seq-packing | Barrel effect elimination → 1.8× effective throughput |
| MetaReasoner | --meta-reasoner | Dynamic thinking budget → 44–89% energy saved on CoT |
| YOCO | --yoco-kv | You Only Cache Once → 50% KV memory reduction |
| DiffKV | --diff-kv | Asymmetric K/V precision → 2.7–5.7× KV memory, 1.9–5.4× throughput |
| KVTuner | --kvtuner | Sensitivity-aware mixed-precision KV → 2× compression vs naive |
| KVSharer | --kv-share | Cross-layer KV sharing → 30% KV memory reduction |
| ParisKV | --paris-kv | Drift-robust online KV quantisation → 4× KV compression |
| CLA | --cla | Cross-Layer Attention sharing → 10–30% KV memory reduction |
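DiffKV's headline idea, asymmetric K/V precision, can be illustrated with a minimal sketch. This assumes simple symmetric per-tensor quantisation; `quantize` and `dequantize` are hypothetical helpers, not the module's API. Keys are the more outlier-sensitive tensor, so they keep 8 bits while values drop to 4, giving roughly 2.7× memory savings vs fp32 for the pair.

```python
# Sketch of asymmetric K/V precision: 8-bit keys, 4-bit values.
# Illustrative symmetric per-tensor quantisation, not the DiffKV internals.
import numpy as np

def quantize(x: np.ndarray, bits: int):
    qmax = 2 ** (bits - 1) - 1
    scale = max(float(np.abs(x).max()) / qmax, 1e-12)
    q = np.round(x / scale).clip(-qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
k = rng.standard_normal((16, 64)).astype(np.float32)
v = rng.standard_normal((16, 64)).astype(np.float32)

qk, sk = quantize(k, bits=8)   # high precision for sensitive keys
qv, sv = quantize(v, bits=4)   # low precision for values

k_err = np.abs(dequantize(qk, sk) - k).mean()
v_err = np.abs(dequantize(qv, sv) - v).mean()
print(f"key err {k_err:.4f} < value err {v_err:.4f}")
```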
| Module | Flag | Key Result |
|---|---|---|
| Dovetail | --dovetail | CPU+GPU heterogeneous spec decode → 2× throughput |
| SwiftSpec | --swift-spec | Async disaggregated decode → minimal overlap overhead |
| PIPO | --pipo | Pipelined prefetch offloading → 1.7× throughput on larger-than-VRAM models |
| MobileMoE | --mobile-moe | MoE balanced layer skip → 1.4× throughput on MoE models |
| OnlineSD | --online-sd | Continuous draft adaptation → +5–8 pp acceptance rate |
| LookaheadReasoning | --lookahead | Parallel step verification → 2.1× throughput on reasoning |
| SparseSpec | --sparse-spec | Dynamic sparse self-speculation → 2.13× throughput |
| FRSpec | --fr-spec | Frequency-ranked vocab compression → 13% draft latency ↓ |
| LongSpec | --long-spec | Shared-KV draft head → zero draft KV overhead at any context |
| ForeLen | --forelen | Entropy-guided length prediction → 29% MAE ↓ vs TRAIL |
| RASD | --rasd | Retrieval-augmented spec decode → 40–60% corpus hit rate |
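FRSpec's frequency-ranked vocabulary compression can be sketched as follows. The shapes and the assumption that token ids are already frequency-ordered are illustrative, not the real module: the draft LM head scores only the top-`keep` most frequent token ids, shrinking its output projection, while verification still uses the full head so correctness is unaffected.

```python
# FRSpec-style frequency-ranked vocab compression, sketched (hypothetical shapes).
import numpy as np

vocab_size, hidden, keep = 32000, 64, 4096
rng = np.random.default_rng(1)
lm_head = rng.standard_normal((vocab_size, hidden)).astype(np.float32)

# Token ids sorted by corpus frequency (assume id order == frequency order here).
freq_ranked_ids = np.arange(vocab_size)
compact_ids = freq_ranked_ids[:keep]
compact_head = lm_head[compact_ids]      # (keep, hidden): ~8x smaller matmul

h = rng.standard_normal(hidden).astype(np.float32)
draft_logits = compact_head @ h          # draft scores only frequent tokens
draft_token = int(compact_ids[np.argmax(draft_logits)])

full_logits = lm_head @ h                # verification uses the full head
print(draft_token, int(np.argmax(full_logits)))
```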
- All 21 modules implemented and wired in `server.py`
- `tests/test_wave15_server_wiring.py` – 44 tests, 44 passing
- `tests/test_wave16_server_wiring.py` – 45 tests, 45 passing
- `dev/benchmarks/bench_wave15_16.py` – micro-benchmark suite
- `dev/results/wave15_16_bench.json` – benchmark results
- `docs/benchmark_wave15_16.md` – human-readable results table
- `dev/demos/record_v4_demo.py` – v4 demo GIF generator
- `dev/demos/squish-v4-demo.gif` – demo GIF rendered
- README.md – v4 module sections, Wave 15+16 tables, CLI examples
- CHANGELOG.md – `[2.0.0]` entry
Theme: Attention Architecture · Memory Management · Adaptive Compute · Model Intelligence
28 modules across two waves – all implemented, tested, benchmarked, and documented.
Focus: Next-generation attention kernels, zero-allocation KV memory, prompt and token compression, and speculative context retrieval.
| Module | File | Key Classes | Flag | Key Result |
|---|---|---|---|---|
| SageAttn2 | sage_attention2.py | SageAttention2Kernel, SageAttention2Config | --sage-attn2 | INT4 warp QK + FP8 PxV → ~3.1× vs FlashAttention2 |
| StreamingSink | streaming_sink.py | SinkKVCache, SinkConfig | --streaming-sink | Attention sink eviction → infinite context at fixed KV budget |
| KVSlab | kv_slab.py | KVSlabAllocator, KVPage | --kv-slab | Pre-allocated slab → eliminates >10 ms per-request heap stalls |
| SqueezeAttn | squeeze_attention.py | SqueezeKVCache, BudgetAllocator | --squeeze-attn | Dynamic per-layer KV budget → configurable KV footprint |
| SmallKV | smallkv.py | SmallKVCache, SaliencyTracker | --small-kv | Saliency-compensated 10% KV budget → 1.75–2.56× throughput |
| SpeContext | specontext.py | SpeContextCache, DistilledRetrievalHead | --spe-context | Distilled retrieval head → >90% param reduction, 90% transfer ↑ |
| SVDq | svdq.py | SVDqCalibrator, SVDqPrecisionMap | --svdq | Per-head SVD key mixed precision → calibrated rank-aware quantisation |
| CommVQ | comm_vq.py | CommVQCodebook, MultiCodebookVQ | --comm-vq | Commutative VQ KV → 8× (2-bit) / 4× (4-bit) memory, near-lossless |
| ChunkedPrefill | chunked_prefill.py | ChunkedPrefillConfig | --chunked-prefill | Interleaved chunk+decode → O(chunk_size) prefill latency |
| GemFilter | gemfilter.py | GemSelector, AttentionScoreBuffer | --gemfilter | Early-layer token compression → 2.4× speedup, 1000× @ 108K tokens |
| MInference | minference_patch.py | (monkey-patch) | --minference | Dynamic sparse attention → 10× prefill speedup @ 1M context |
| PromptCompressor | prompt_compressor.py | (functional API) | --prompt-compress | Token-budget long-context trimming → ~1 ms per 1K-word prompt |
| PromptLookup | prompt_lookup.py | PromptLookupDecoder, NGramIndex | --prompt-lookup | N-gram spec decode from prompt → zero draft model required |
| TRAIL | trail.py | TrailPredictor, TrailLinearProbe | --trail | Probe-layer length predictor → 2.66× lower MAE vs BERT, 1.66–2.01× lower latency |
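StreamingSink's eviction policy is simple enough to sketch directly. This is a simplification (the real `SinkKVCache` stores K/V tensors; here we track token positions only), but the invariant is the same: always keep the first `sinks` tokens, which soak up attention mass, plus the most recent `window` tokens, so the cache never grows past a fixed budget.

```python
# StreamingSink-style eviction, sketched: fixed KV budget = sinks + window.
def sink_evict(positions: list[int], sinks: int = 4, window: int = 8) -> list[int]:
    if len(positions) <= sinks + window:
        return positions                       # under budget: keep everything
    return positions[:sinks] + positions[-window:]  # sinks + recent window only

cache: list[int] = []
for pos in range(20):                          # stream 20 tokens through the cache
    cache = sink_evict(cache + [pos])

print(cache)  # 4 sink positions plus the 8 most recent positions
```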
Focus: Task-adaptive layer skipping, next-generation speculative decoding, continuous self-improvement, serving intelligence, and battery-aware evaluation.
| Module | File | Key Classes | Flag | Key Result |
|---|---|---|---|---|
| VPTQ | vptq.py | VPTQQuantizer, VPTQCodebook | --vptq | Vector post-training quant (NeurIPS 2025) → sub-2-bit weights near fp16 quality |
| LayerSkip | layer_skip.py | EarlyExitDecoder, ConfidenceEstimator | --layer-skip | Early exit self-spec decode → (total−exit)/total compute saved per easy token |
| SWIFT | swift.py | SWIFTDecoder, SWIFTCalibrator | --swift | Task-adaptive layer skip with calibration → per-task skip schedules |
| SpecReason | spec_reason.py | SpecReasonOrchestrator, ReasoningStep | --spec-reason | Step-level reasoning speculation → 1.4–3.0× speedup, 8.8–58% token reduction |
| MirrorSD | mirror_sd.py | MirrorSDDecoder, MirrorDraftPipeline | --mirror-sd | Overlapped dual-pipeline draft → 2.8–5.8× vs EAGLE-3 on SpecBench |
| SparseVerify | sparse_verify.py | SparseVerifyPass, InterDraftReuseCache | --sparse-verify | Sparse verification + inter-draft token reuse → verification FLOPs ↓ |
| RobustScheduler | robust_scheduler.py | ABalancedScheduler, AMaxScheduler | --robust-sched | Interval-prediction adaptive batching → balanced or max-throughput policy |
| BlockExpertArchive | block_expert_archive.py | BlockExpertArchive, ExpertRouter | --block-archive | K-means cluster-delta expert compression → MoE weight deduplication |
| DISCRouter | disc_router.py | DISCRouter, DISCPlan | --disc-router | Task decomposition + parallel LLM routing → multi-step agent acceleration |
| SelfLearning | self_learning.py | (LearnRequest API) | --self-learn | Online LoRA-delta adaptation from feedback → continuous quality improvement |
| SemanticCache | semantic_cache.py | SquishSemanticCache | --semantic-cache | N-gram semantic prompt dedup → zero-model cache hits |
| IPW | ipw.py | IPWTracker, IPWMeasurement | --ipw | Intelligence-per-watt tracking → quality ÷ energy metric for M-series |
| PowerMonitor | power_monitor.py | PowerMonitor, PowerModeConfig | --power-monitor | pmset-based battery-adaptive mode selection → auto power-aware scheduling |
| DiffusionDraft | diffusion_draft.py | DiffusionDraftModel | --diffusion-draft | Non-autoregressive diffusion LLM drafting → short-text parallel decode |
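The n-gram dedup behind SemanticCache can be sketched with word-bigram Jaccard similarity. The threshold and scoring here are illustrative – the real `SquishSemanticCache` may weight n-grams differently – but the mechanism is the same: a near-duplicate prompt hits a cached response without touching the model.

```python
# Sketch of n-gram semantic prompt dedup (illustrative threshold and scoring).
def ngrams(text: str, n: int = 2) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def similarity(a: str, b: str) -> float:
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

class SemanticCache:
    def __init__(self, threshold: float = 0.6):
        self.threshold = threshold
        self.entries: list[tuple[str, str]] = []   # (prompt, response)

    def get(self, prompt: str):
        for cached_prompt, response in self.entries:
            if similarity(prompt, cached_prompt) >= self.threshold:
                return response                    # zero-model cache hit
        return None

    def put(self, prompt: str, response: str):
        self.entries.append((prompt, response))

cache = SemanticCache()
cache.put("what is the capital of france", "Paris")
print(cache.get("what is the capital of france please"))  # fuzzy hit
```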
- `tests/test_wave17_server_wiring.py` – 56 tests, 56 passing
- `tests/test_wave18_server_wiring.py` – 56 tests, 56 passing
- `dev/benchmarks/bench_wave17_18.py` – micro-benchmark suite (24 modules timed, 4 skipped)
- `dev/results/wave17_18_bench.json` – benchmark results
- `docs/benchmark_wave17_18.md` – human-readable results table
- `dev/demos/record_v5_demo.py` – v5 demo GIF generator (448 events, 85.2 s)
- `dev/demos/squish-v5-demo.gif` – demo GIF rendered (2.6 MB, 448 events, 85.2 s)
- README.md – v5 module sections, Wave 17+18 tables, CLI examples
- CHANGELOG.md – `[3.0.0]` entry
- PLAN.md updated to mark v5 complete
| Scope | Count |
|---|---|
| Wave 17 (Attention + Memory) | 14 |
| Wave 18 (Adaptive Compute + Intelligence) | 14 |
| Total new v5 modules | 28 |
| Total modules after v5 | 110 |
| New tests | 112 (56 Wave 17 + 56 Wave 18) |
| Total tests after v5 | 4,166 |
Theme: Next-Gen Precision · Advanced Attention · Model Composition · Serving Infrastructure
28 new modules across two waves – all implemented, tested, benchmarked, and documented.
Focus: FP8/MX microscaling quantization, advanced attention patterns (paged KV, GQA, sliding window, RoPE scaling), activation sparsity, and advanced speculative decode heads (MEDUSA, EAGLE-3).
| Module | File | Key Classes | Flag | Key Result |
|---|---|---|---|---|
| FP8Quant | fp8_quant.py | FP8Quantizer, FP8Config | --fp8-quant | E4M3/E5M2 weight encoding → ~60% storage vs BF16 |
| MXQuant | mx_quant.py | MXQuantizer, MXConfig | --mx-quant | OCP MX4/MX6/MX9 microscaling → better quality than INT4 at same bits |
| FlashDecode | flash_decode.py | FlashDecodeAttention, FlashDecodeConfig | --flash-decode | Split-KV parallel decode → O(1) memory overhead per decode step |
| PagedKV | paged_kv.py | PagedKVCache, BlockTable | --paged-kv | Virtual block mapping → zero KV fragmentation across requests |
| GQA | gqa.py | GQACache, GQAConfig | --gqa | Grouped Query Attention → 4–8× KV reduction vs MHA |
| SlidingWindowAttn | sliding_window_attn.py | SlidingWindowKVCache, SWAConfig | --sliding-window | Sliding window KV → O(window_size) memory at any context length |
| RoPEScaling | rope_scaling.py | RoPEScaler, YaRNScaler, NTKScaler | --rope-scaling | NTK/YaRN/LongRoPE → 4–32× context extension without fine-tuning |
| ActSparsity | act_sparsity.py | ActSparsityPredictor, SparsityConfig | --act-sparsity | Activation sparsity gating → 30–60% FFN compute saved |
| FusedRMSNorm | fused_rmsnorm.py | FusedRMSNorm, FusedLayerNorm | --fused-norm | Fused RMSNorm + residual → single kernel pass, reduced bandwidth |
| LoRAInference | lora_inference.py | LoRAInferenceAdapter, LoRAConfig | --lora-inference | Zero-copy LoRA delta inference → adapter switching without re-quant |
| MEDUSA | medusa.py | MedusaHead, MedusaDecoder | --medusa | Multi-head tree speculation → 2–3× decode throughput |
| EAGLE3 | eagle3.py | Eagle3DraftHead, Eagle3Decoder | --eagle3 | Feature-level draft head → 3.5× accept rate vs token-prediction draft |
| PrefixPool | prefix_pool.py | PrefixPool, PrefixPoolConfig | --prefix-pool | Cross-request KV prefix sharing → 40–80% KV savings on shared prompts |
| TokenHealer | token_healer.py | TokenHealer, HealerConfig | --token-healer | Boundary-aware token healing → eliminates prefix-artifact generation |
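PagedKV's virtual block mapping can be sketched with a toy allocator. The real `PagedKVCache` stores K/V tensors per block; this sketch tracks only the mapping: each sequence maps logical token positions to fixed-size physical blocks, so freed blocks return to the pool and the KV store never fragments.

```python
# Toy paged-KV block table (mapping only; no tensors).
BLOCK = 4  # tokens per physical block

class BlockTable:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))       # pool of physical block ids
        self.tables: dict[str, list[int]] = {}    # seq id -> allocated blocks

    def append_token(self, seq: str, pos: int):
        table = self.tables.setdefault(seq, [])
        if pos % BLOCK == 0:                      # block boundary: allocate
            table.append(self.free.pop(0))

    def physical(self, seq: str, pos: int) -> tuple[int, int]:
        # Translate a logical position to (physical block, slot within block).
        return self.tables[seq][pos // BLOCK], pos % BLOCK

    def release(self, seq: str):
        self.free.extend(self.tables.pop(seq))    # blocks reusable immediately

bt = BlockTable(num_blocks=8)
for pos in range(6):
    bt.append_token("req-1", pos)
print(bt.physical("req-1", 5))  # token 5 lives in the sequence's 2nd block
bt.release("req-1")
```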
Focus: Model composition (merge, compose), continuous batching, evaluation harness, power profiling, multi-modal efficiency, and knowledge distillation for spec heads.
| Module | File | Key Classes | Flag | Key Result |
|---|---|---|---|---|
| ModelMerge | model_merge.py | ModelMerger, MergeConfig | --model-merge | SLERP/DARE/TIES merging → combine domains without retraining |
| LoRACompose | lora_compose.py | LoRAComposer, AdapterStack | --lora-compose | Multi-LoRA mixture → blend adapters with learnable coefficients |
| ContinuousBatching | continuous_batching.py | CBScheduler, InFlightRequest | --continuous-batching | Mid-generation insertion → max GPU utilization at any request rate |
| MatryoshkaEmb | matryoshka_emb.py | MatryoshkaEmbedding, MRLConfig | --matryoshka-emb | Nested embedding truncation → 1 forward pass, any dimensionality |
| ANEProfiler | ane_profiler.py | ANEProfiler, ANEMetrics | --ane-profiler | Apple Neural Engine utilization → op-level ANE vs GPU breakdown |
| SpecBench | spec_bench.py | SpecBenchRunner, SpecBenchResult | --spec-bench | SpecBench CI harness → acceptance rate + throughput across tasks |
| PPLTracker | ppl_tracker.py | PPLTracker, PPLWindow | --ppl-tracker | Rolling perplexity tracker → real-time quality degradation detection |
| GrammarCache | grammar_cache.py | GrammarCache, FSMState | --grammar-cache | FSM grammar cache → constrained decoding without per-token rebuild |
| QuantAware | quant_aware.py | QuantAwareCalibrator, QAConfig | --quant-aware | Activation-range calibration → per-channel optimal scale selection |
| AdaptiveBudget | adaptive_budget.py | AdaptiveBudgetController, BudgetConfig | --adaptive-budget | Dynamic compute budget → SLO-aware KV + layer skip joint control |
| VisionTokens | vision_tokens.py | VisionTokenCompressor, VTConfig | --vision-tokens | Visual token pruning → 50–80% vision token reduction without quality loss |
| ToolCache | tool_cache.py | ToolSchemaCache, ToolRouter | --tool-cache | Schema + routing cache → zero tool-call parse overhead on repeated schemas |
| DistilSpec | distil_spec.py | DistilSpecCalibrator, DistilConfig | --distil-spec | Draft-head knowledge distillation → +10–15 pp acceptance from calibration |
| BatchEmbed | batch_embed.py | BatchEmbedder, PoolingConfig | --batch-embed | Dynamic pooling strategies → mean/max/cls/weighted pool in single pass |
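The SLERP half of ModelMerge can be sketched for a single weight tensor. This is illustrative: the real `ModelMerger` iterates over all tensors and also implements the DARE/TIES variants. Spherical interpolation follows the great-circle arc between the two weight vectors, which preserves norm structure better than plain linear interpolation.

```python
# SLERP merge for one weight tensor (illustrative; real merger covers all tensors).
import numpy as np

def slerp(a: np.ndarray, b: np.ndarray, t: float, eps: float = 1e-8) -> np.ndarray:
    a_flat, b_flat = a.ravel(), b.ravel()
    a_n = a_flat / (np.linalg.norm(a_flat) + eps)
    b_n = b_flat / (np.linalg.norm(b_flat) + eps)
    omega = np.arccos(np.clip(a_n @ b_n, -1.0, 1.0))  # angle between weights
    if omega < eps:                                   # nearly parallel: lerp
        return (1 - t) * a + t * b
    so = np.sin(omega)
    out = (np.sin((1 - t) * omega) / so) * a_flat + (np.sin(t * omega) / so) * b_flat
    return out.reshape(a.shape)

rng = np.random.default_rng(2)
w_math = rng.standard_normal((4, 4))   # hypothetical domain-A tensor
w_code = rng.standard_normal((4, 4))   # hypothetical domain-B tensor
w_merged = slerp(w_math, w_code, t=0.5)
print(w_merged.shape)
```

At `t=0` the merge returns the first model's tensor exactly, at `t=1` the second's, with a smooth arc in between.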
Progress (2026-03-11): Wave 20 modules 1–14 (all) implemented and tested: ModelMerge, LoRACompose, ContinuousBatching, MatryoshkaEmb, ANEProfiler, SpecBench, PPLTracker, GrammarCache, QuantAware, AdaptiveBudget, VisionTokens, ToolCache, DistilSpec, BatchEmbed – 262+ new tests.
- All 28 modules implemented in `squish/`
- `tests/test_wave19_server_wiring.py` – import + instantiation tests for 14 modules
- `tests/test_wave20_server_wiring.py` – import + instantiation tests for 14 modules
- `dev/benchmarks/bench_wave19_20.py` – micro-benchmark suite
- `dev/results/wave19_20_bench.json` – benchmark results
- `docs/benchmark_wave19_20.md` – human-readable results table
- `dev/demos/record_v6_demo.py` – v6 demo GIF generator
- `dev/demos/squish-v6-demo.gif` – demo GIF rendered
- README.md – v6 module sections, Wave 19+20 tables, CLI examples
- CHANGELOG.md – `[4.0.0]` entry
- PLAN.md updated to mark v6 complete
| Scope | Count |
|---|---|
| Wave 19 (Next-Gen Attention + Precision) | 14 |
| Wave 20 (Serving Infrastructure + Intelligence) | 14 |
| Total new v6 modules | 28 |
| Total modules after v6 | 138 |
| Expected new tests | ~112 (4 per module × 28) |
| Expected total tests after v6 | 4,278 |
Theme: Advanced Decode · Production Serving · Observability
28 new modules across two waves.
Focus: Tree-parallel speculative verification, online KV compression, mixed-precision KV per head, pipeline-parallel decode, learned KV codecs, retention-style recurrent attention, and context-length-adaptive RoPE scaling.
| Module | File | Key Classes | Flag | Key Result |
|--------|------|-------------|------|-----------|
| TreeVerifier | tree_verifier.py | TreeVerifier, TokenTree | --tree-verify | Batched tree-parallel speculative verification → structured multi-token acceptance |
| KVCompress | kv_compress.py | KVCompressor, KVCompressConfig | --kv-compress | Online KV quantisation + pruning during generation → adaptive old-context compression |
| DynamicNTK | dynamic_ntk.py | DynamicNTKScaler, NTKState | --dynamic-ntk | Per-request runtime RoPE base auto-scaling → auto-extends at 80% context fill |
| QuantSpecDecode | quant_spec_decode.py | QuantSpecDecoder, QSDConfig | --quant-spec-decode | INT4 draft + FP16 verify → draft memory ↓ 4× vs FP16 |
| SparseAttnIndex | sparse_attn_index.py | SparseAttnIndex, ANCandidates | --sparse-attn-index | ANN KV retrieval index → sub-linear attention cost at very long context |
| MixedPrecisionKV | mixed_precision_kv.py | MixedPrecisionKVCache, HeadPrecision | --mp-kv | Per-head INT8/INT4/FP16 KV via sensitivity analysis → 2–4× KV memory at iso-quality |
| PipelineBubble | pipeline_bubble.py | BubbleEliminator, StageSchedule | --pipeline-bubble | Overlapped prefill + decode across pipeline stages → bubble-free pipeline utilisation |
| LayerwiseDecode | layerwise_decode.py | LayerwiseDecoder, LayerStream | --layerwise-decode | Layer-by-layer early-exit decode with multi-stream output → configurable exit-layer latency |
| CodecKV | codec_kv.py | KVCodec, CodecConfig | --codec-kv | Learned encode/decode KV codec → 2–4× KV compression via latent reconstruction |
| DedupeAttn | dedupe_attn.py | AttentionDeduplicator, DedupStats | --dedupe-attn | Near-duplicate Q/K detection + output reuse → attention FLOPs ↓ on repetitive context |
| FlashPrefill | flash_prefill.py | FlashPrefillKernel, PrefillConfig | --flash-prefill | Chunked flash attention for prefill with causal mask → O(chunk²) not O(seq²) memory |
| BudgetSpec | budget_spec.py | BudgetSpecDecoder, BudgetConfig | --budget-spec | Token-budget-aware speculative decode → exits drafting when budget threshold hit |
| RetentionAttn | retention_attn.py | RetentionState, RetentionKernel | --retention-attn | Retention-style recurrent state → O(1) per-step memory, linear recurrence |
| KVRouter | kv_router.py | KVRouter, KVRouteTable | --kv-router | Cross-instance KV routing for disaggregated prefill/decode → KV transfer without recomputation |
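The linear recurrence behind RetentionAttn can be sketched for a single head. Decay value and shapes are illustrative (the real `RetentionKernel` batches heads and applies rotations): instead of storing all past K/V, the cache keeps one d×d state that is updated per step, so memory is O(1) in sequence length.

```python
# Retention-style recurrence sketch, single head (illustrative decay gamma).
import numpy as np

d, gamma = 8, 0.9
rng = np.random.default_rng(3)

state = np.zeros((d, d))                     # the only per-sequence memory
outputs = []
for _ in range(16):                          # 16 decode steps
    q, k, v = (rng.standard_normal(d) for _ in range(3))
    state = gamma * state + np.outer(k, v)   # S_t = gamma * S_{t-1} + outer(k_t, v_t)
    outputs.append(q @ state)                # o_t = q_t @ S_t

print(len(outputs), state.shape)  # state stays (d, d) at any sequence length
```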
Focus: Multi-tenant fair scheduling, intelligent load-balanced request routing, predictive KV pre-warming, token budget enforcement, OpenTelemetry-compatible tracing, request coalescing, adaptive quantisation, health monitoring, and cost-aware serving.
| Module | File | Key Classes | Flag | Key Result |
|--------|------|-------------|------|-----------|
| MultiTenantSched | multi_tenant_sched.py | TenantScheduler, TenantConfig | --multi-tenant | Fair per-tenant QoS scheduling → SLO-isolated multi-tenant serving |
| RequestRouter | request_router.py | RequestRouter, ReplicaRegistry | --request-router | Load-aware request routing across replicas → consistent-hash + least-loaded |
| CacheWarmup | cache_warmup.py | CacheWarmupPredictor, WarmupConfig | --cache-warmup | Predictive KV cache pre-warming from patterns → TTFT ↓ on hot prefix paths |
| TokenBudgetGate | token_budget_gate.py | TokenBudgetGate, BudgetPolicy | --token-budget | Hard per-request token budget with graceful truncation → deterministic cost control |
| ObservabilityHook | observability_hook.py | InferenceTracer, SpanCollector | --observability | Zero-overhead per-step inference tracing → OpenTelemetry-compatible spans |
| RequestCoalesce | request_coalesce.py | PrefixCoalescer, CoalesceStats | --req-coalesce | Merge requests sharing long common prefixes → shared prefill forward pass |
| AdaptiveQuantize | adaptive_quantize.py | AdaptiveQuantizer, PressureMonitor | --adaptive-quant | Runtime precision switching under memory pressure → auto INT8/INT4 under OOM |
| HealthCheck | health_check.py | InferenceHealthMonitor, HealthState | --health-check | Degradation-aware server health monitoring → automatic quality regression alerting |
| FaultTolerance | fault_tolerance.py | FaultHandler, FaultPolicy | --fault-tolerance | Graceful OOM degradation → auto KV eviction + draft disable + SLO re-negotiation |
| ModelPool | model_pool.py | ModelPool, PoolEntry | --model-pool | Hot model pool with lazy-load + LRU eviction → multi-model serving without reload latency |
| StreamingChunk | streaming_chunk.py | ChunkedStreamer, BackpressureBuffer | --streaming-chunk | Sub-token-latency chunked streaming with backpressure → first-chunk latency ↓ |
| CostEstimator | cost_estimator.py | RequestCostEstimator, CostModel | --cost-estimate | Per-request compute cost estimation → supports billing and priority queuing |
| SLAMonitor | sla_monitor.py | SLAMonitor, ViolationPolicy | --sla-monitor | Real-time SLA violation detection + remediation → auto-escalation on breach |
| ContextCache | context_cache.py | PersistentContextCache, CacheEntry | --context-cache | Persistent cross-session context cache with TTL → zero re-encode on repeated context |
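The "hard budget with graceful truncation" idea behind TokenBudgetGate can be sketched as a small admission gate. The grace policy here (allow a few extra tokens so a sentence can close) is an illustrative assumption, not necessarily the module's `BudgetPolicy`:

```python
# Sketch of a hard token budget with graceful sentence-boundary truncation.
# The grace policy is a hypothetical illustration of "graceful truncation".
class TokenBudgetGate:
    def __init__(self, budget: int, grace: int = 3):
        self.budget, self.grace, self.used = budget, grace, 0
        self.closed = False

    def admit(self, token: str) -> bool:
        if self.closed or self.used >= self.budget + self.grace:
            return False                           # hard ceiling: reject
        self.used += 1
        if self.used >= self.budget and token.endswith((".", "!", "?")):
            self.closed = True                     # budget spent, sentence closed
        return True

gate = TokenBudgetGate(budget=5)
tokens = "The cat sat on the mat . It purred .".split()
kept = [t for t in tokens if gate.admit(t)]
print(" ".join(kept))  # truncated at the first sentence end past the budget
```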
- All 28 modules implemented in `squish/`
- `tests/test_wave21_server_wiring.py` – import + instantiation tests for 14 modules
- `tests/test_wave22_server_wiring.py` – import + instantiation tests for 14 modules
- `dev/benchmarks/bench_wave21_22.py` – micro-benchmark suite
- `dev/results/wave21_22_bench.json` – benchmark results
- `docs/benchmark_wave21_22.md` – human-readable results table
- `dev/demos/record_v7_demo.py` – v7 demo GIF generator
- `dev/demos/squish-v7-demo.gif` – demo GIF rendered
- README.md – v7 module sections, Wave 21+22 tables, CLI examples
- CHANGELOG.md – `[5.0.0]` entry
- PLAN.md updated to mark v7 complete
| Scope | Count |
|---|---|
| Wave 21 (Advanced Memory + Decode) | 14 |
| Wave 22 (Production Serving + Observability) | 14 |
| Total new v7 modules | 28 |
| Total modules after v7 | 166 |
| Expected new tests | ~112 (4 per module × 28) |
| Expected total tests after v7 | ~4,390 |
Theme: Multi-Modal & Long Context · Quantisation Evolution & Model Surgery
28 new modules across two waves.
Focus: Vision-language model efficiency, RAG-aware serving patterns, reasoning trace compression, cross-modal attention, hierarchical KV management, and 1M+ token context indexing.
| Module | File | Key Classes | Flag | Key Result |
|--------|------|-------------|------|-----------|
| VisionKVFuse | vision_kv_fuse.py | VisionKVFuseCache, ModalityConfig | --vision-kv-fuse | Fused vision+text KV with separate modality eviction → modality-aware KV compression |
| ImageTokenPrune | image_token_prune.py | ImageTokenPruner, PruneConfig | --image-token-prune | Attention entropy image token pruning → 50–70% image token reduction |
| RAGPrefetch | rag_prefetch.py | RAGPrefetcher, RAGConfig | --rag-prefetch | Predictive doc KV prefetch → cold TTFT ↓ on repeated RAG docs |
| CoTCompress | cot_compress.py | CoTCompressor, CoTConfig | --cot-compress | CoT trace pruning via saliency → 30–50% reasoning token reduction |
| MultiModalBatch | multimodal_batch.py | MultiModalBatcher, BatchSlot | --multimodal-batch | Shape-aware heterogeneous text+vision batcher → minimise padding waste |
| ContextualRerank | contextual_rerank.py | ContextualReranker, RerankConfig | --ctx-rerank | Context-aware KV token importance re-ranking → preserves top-k salient positions |
| CrossModalAttn | cross_modal_attn.py | CrossModalAttention, CrossModalConfig | --cross-modal-attn | Efficient cross-attention between text + vision features → modality fusion |
| HierarchicalKV | hierarchical_kv.py | HierarchicalKVStore, TierConfig | --hierarchical-kv | Hot/warm/cold KV tier management → transparent KV tiering with O(1) promotion |
| StreamRAG | stream_rag.py | StreamRAGInjector, StreamRAGConfig | --stream-rag | Streaming mid-generation document injection → zero-restart RAG updates |
| CrossDocAttn | cross_doc_attn.py | CrossDocAttention, CrossDocConfig | --cross-doc-attn | Chunked cross-document attention → multi-document QA without full concatenation |
| VideoFramePrune | video_frame_prune.py | VideoFramePruner, FrameConfig | --video-frame-prune | Temporal frame token pruning for video-LMs → 60–80% video token reduction |
| EmbeddingGate | embedding_gate.py | EmbeddingGate, GateConfig | --embedding-gate | Gated modality-conditional embedding router → zero-cost modality bypass |
| LongContextChunk | long_context_chunk.py | LongContextChunker, ChunkConfig | --long-context-chunk | Semantic-boundary chunking for 1M+ token contexts → boundary-aware chunk splits |
| ModalityRouter | modality_router.py | ModalityRouter, ModalityPolicy | --modality-router | Per-modality SLO request dispatcher → text vs vision vs audio routing |
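HierarchicalKV's tiering can be sketched with ordered dicts standing in for the tiers. Capacities and the access-based promotion rule are toy assumptions (the real `TierConfig` policy may differ): entries demote LRU-style when a tier overflows and promote back to hot in O(1) on access.

```python
# Hot/warm/cold KV tiering sketch (toy capacities; tensors elided).
from collections import OrderedDict

class HierarchicalKV:
    def __init__(self, hot_cap: int = 2, warm_cap: int = 4):
        self.hot = OrderedDict()    # fastest tier, e.g. GPU-resident
        self.warm = OrderedDict()   # e.g. CPU RAM
        self.cold = OrderedDict()   # e.g. disk
        self.hot_cap, self.warm_cap = hot_cap, warm_cap

    def put(self, key, value):
        self._insert_hot(key, value)

    def _insert_hot(self, key, value):
        self.hot[key] = value
        self.hot.move_to_end(key)
        if len(self.hot) > self.hot_cap:           # demote LRU hot -> warm
            k, v = self.hot.popitem(last=False)
            self.warm[k] = v
            if len(self.warm) > self.warm_cap:     # demote LRU warm -> cold
                k2, v2 = self.warm.popitem(last=False)
                self.cold[k2] = v2

    def get(self, key):
        for tier in (self.hot, self.warm, self.cold):
            if key in tier:
                value = tier.pop(key)
                self._insert_hot(key, value)       # O(1) promotion on access
                return value
        return None

kv = HierarchicalKV()
for i in range(5):
    kv.put(f"seq-{i}", f"kv-{i}")
kv.get("seq-0")                 # touch an old entry: promoted back to hot
print(list(kv.hot), list(kv.warm), list(kv.cold))
```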
Focus: Ternary and binary quantisation, N:M structured sparsity, cross-layer weight sharing, second-order GPTQ-style calibration, sparse MoE routing, iterative pruning, and surgical model architecture patching.
| Module | File | Key Classes | Flag | Key Result |
|--------|------|-------------|------|-----------|
| TernaryQuant | ternary_quant.py | TernaryQuantizer, TernaryConfig | --ternary-quant | BitNet-style ternary {−1, 0, +1} weights → 1.58-bit effective storage |
| BinaryAttn | binary_attn.py | BinaryAttention, BinaryConfig | --binary-attn | Sign-binarised attention approximation → ultra-low attention memory |
| StructuredPrune | structured_prune.py | StructuredPruner, PruneConfig | --structured-prune | 2:4 N:M magnitude pruning → 50% weight sparsity at 2× hardware throughput |
| LayerFusion | layer_fuse.py | LayerFuser, FusionConfig | --layer-fuse | Adjacent transformer layer weight fusion → reduced bandwidth on similar layers |
| WeightSharing | weight_sharing.py | WeightSharer, SharingConfig | --weight-share | Cross-layer weight tying with delta residuals → memory ↓ at iso-quality |
| QuantCalib | quant_calib.py | QuantCalibrator, CalibConfig | --quant-calib | Unified MinMax/Percentile/MSE/GPTQ calibration pipeline → optimal scale per method |
| SparseWeight | sparse_weight.py | SparseWeightStore, SparsityConfig | --sparse-weight | CSR-format 2:4 pruned weight storage → 2× memory vs dense at 50% sparsity |
| DeltaCompress | delta_compress.py | DeltaCompressor, DeltaConfig | --delta-compress | Rank-k SVD delta compression for fine-tuned weights → fine-tune deltas at 10–50× reduction |
| ModelSurgery | model_surgery.py | ModelSurgeon, SurgeryPlan | --model-surgery | In-place layer removal + head pruning → architecture patching without retraining |
| ZeroQuantV2 | zero_quant_v2.py | ZeroQuantV2, ZQConfig | --zero-quant-v2 | Groupwise quantisation with FP16 residual for outliers → W8A8 with outlier preservation |
| GPTQLayer | gptq_layer.py | GPTQCalibrator, GPTQConfig | --gptq-layer | Hessian-weighted second-order rounding → group-wise optimal quant error |
| SparseMoE | sparse_moe.py | SparseMoERouter, MoEConfig | --sparse-moe | Top-k sparse expert routing with load-balance loss → efficient MoE inference |
| AWQv2 | awq_v2.py | AWQv2Calibrator, AWQv2Config | --awq-v2 | Activation-aware scale+shift per-channel quant → AWQ without grid search |
| IterPrune | iter_prune.py | IterativePruner, PruneSchedule | --iter-prune | Iterative magnitude pruning with sparsity ramp schedule → gradual 0–70% sparsity |
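Ternary quantisation is compact enough to sketch in full. This follows the absmean recipe from the BitNet b1.58 line of work as an illustration (the module's exact code may differ): every weight becomes −1, 0, or +1 plus one floating-point scale per tensor, for log2(3) ≈ 1.58 effective bits.

```python
# BitNet-style ternary quantisation sketch (absmean scaling; illustrative).
import numpy as np

def ternary_quantize(w: np.ndarray):
    scale = float(np.abs(w).mean()) + 1e-8       # absmean scale per tensor
    q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return q, scale

rng = np.random.default_rng(4)
w = rng.standard_normal((8, 8)).astype(np.float32)
q, scale = ternary_quantize(w)

w_hat = q * scale                                # dequantised approximation
print(f"unique values: {sorted(set(np.unique(q)))}, scale: {scale:.3f}")
```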
- All 28 modules implemented in `squish/`
- `tests/test_wave23_server_wiring.py` – import + instantiation tests for 14 modules
- `tests/test_wave24_server_wiring.py` – import + instantiation tests for 14 modules
- `dev/benchmarks/bench_wave23_24.py` – micro-benchmark suite
- `dev/results/wave23_24_bench.json` – benchmark results
- `docs/benchmark_wave23_24.md` – human-readable results table
- `dev/demos/record_v8_demo.py` – v8 demo GIF generator
- `dev/demos/squish-v8-demo.gif` – demo GIF rendered
- README.md – v8 module sections, Wave 23+24 tables, CLI examples
- CHANGELOG.md – `[6.0.0]` entry
- PLAN.md updated to mark v8 complete
| Scope | Count |
|---|---|
| Wave 23 (Multi-Modal + Long Context Intelligence) | 14 |
| Wave 24 (Quantisation Evolution + Model Surgery) | 14 |
| Total new v8 modules | 28 |
| Total modules after v8 | 194 |
| Expected new tests | ~112 (4 per module × 28) |
| Expected total tests after v8 | ~4,502 |
Theme: Cutting-Edge Attention Variants & Compute Fusion · Distributed Inference & Production Reliability
28 new modules across two waves.
Focus: DeepSeek-V2/V3 production attention patterns (MLA, NSA), fused sampling, online KV defragmentation, dual-chunk long-context attention, activation offloading, attention morphing, multi-draft hydra speculation, and constrained decoding.
| Module | File | Key Classes | Flag | Key Result |
|--------|------|-------------|------|-----------|
| FlashMLA | flash_mla.py | FlashMLACache, MLAConfig | --flash-mla | Multi-head latent attention (DeepSeek-V2 style); low-rank KV via down/up projection → KV size ↓ by latent_dim/head_dim |
| NativeSparseAttn | native_sparse_attn.py | NativeSparseAttention, NSAConfig | --native-sparse-attn | Block-sparse + sliding window attention (DeepSeek-V3 NSA style) → sub-quadratic attention cost |
| FusedSampler | fused_sampler.py | FusedSampler, SamplerConfig | --fused-sampler | Fused temperature/top-p/top-k/min-p/rep-penalty in single pass → zero intermediate allocations |
| KVDefrag | kv_defrag.py | KVDefragmenter, DefragStats | --kv-defrag | Online KV cache defragmentation and in-place compaction → fragmentation ratio ↓ |
| DualChunkAttn | dual_chunk_attn.py | DualChunkAttention, DCAConfig | --dual-chunk-attn | Intra-chunk + inter-chunk attention for 1M+ contexts → O(chunk²) not O(seq²) |
| ActivationOffload | activation_offload.py | ActivationOffloader, OffloadPolicy | --act-offload | Layer activation offload to CPU during prefill → peak GPU memory ↓ |
| MorphAttn | morph_attn.py | AttentionMorpher, MorphConfig | --morph-attn | Per-layer attention pattern selection: full/sparse/linear → optimal compute per layer |
| HydraSpec | hydra_spec.py | HydraSpecDecoder, HydraConfig | --hydra-spec | Multi-draft heads for parallel speculation → n_heads candidate tokens per step |
| SeqCompact | seq_compact.py | SequenceCompactor, CompactStats | --seq-compact | In-place KV sequence compaction after token pruning → zero-copy repack |
| LatencyPredictor | latency_predictor.py | LatencyPredictor, LatencyModel | --latency-predict | Per-request latency prediction for scheduling → prefill + decode latency forecast |
| ParallelSampler | parallel_sampler.py | ParallelSampler, DiversityConfig | --parallel-sample | Best-of-n sampling with diversity scoring → quality improvement with n candidates |
| ContextSummarizer | context_summarizer.py | ContextSummarizer, SummaryConfig | --ctx-summarize | Inference-time context compression when context overflows → keep semantics, shed tokens |
| TokenWatermark | token_watermark.py | TokenWatermarker, WatermarkConfig | --token-watermark | Statistical green-list token watermarking (Kirchenbauer et al.) → detectable attribution |
| SchemaGen | schema_gen.py | SchemaGenEngine, SchemaState | --schema-gen | FSM-accelerated constrained JSON schema generation → zero invalid token sampling |
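The filtering pipeline FusedSampler collapses into one pass can be sketched with numpy. This is illustrative (the real module fuses the work on-device and also handles min-p and repetition penalty): temperature, top-k, and top-p all reuse one sort of the logits, with no extra vocab-sized copies beyond the probability vector itself.

```python
# Single-pass temperature + top-k + top-p sampling sketch (numpy illustration).
import numpy as np

def fused_sample(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    logits = logits / max(temperature, 1e-6)     # temperature (makes a copy)
    order = np.argsort(logits)[::-1]             # sort once, reuse for k and p
    if top_k:
        logits[order[top_k:]] = -np.inf          # top-k mask
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_p < 1.0:                              # nucleus (top-p) mask
        cum = np.cumsum(probs[order])
        cut = int(np.searchsorted(cum, top_p)) + 1  # smallest nucleus >= top_p
        probs[order[cut:]] = 0.0
        probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
token = fused_sample(logits, temperature=0.8, top_k=3, top_p=0.9)
print(token)  # always one of ids 0-2 after the top-k=3 mask
```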
Focus: Tensor/sequence parallelism, live KV migration, disaggregated prefill/decode, request preemption, smart inference gateway, zero-downtime model swaps, APM profiling, adaptive batching, safety classification, semantic response caching, and audit logging.
| Module | File | Key Classes | Flag | Key Result |
|--------|------|-------------|------|-----------|
| TensorParallel | tensor_parallel.py | TensorParallelShard, TPConfig | --tensor-parallel | Row/column tensor sharding + all-reduce → linear memory scaling across devices |
| SequenceParallel | sequence_parallel.py | SequenceParallelScatter, SPConfig | --seq-parallel | Ulysses-style sequence dimension split → attention FLOPs distributed across devices |
| KVMigrate | kv_migrate.py | KVMigrator, MigrateStats | --kv-migrate | Live KV state pack/unpack for cross-worker migration → zero-recompute worker handoff |
| DisaggPrefill | disagg_prefill.py | DisaggPrefillNode, DisaggDecodeNode | --disagg-prefill | Disaggregated prefill–decode with KV payload transfer → prefill/decode hardware specialisation |
| RequestPreempt | request_preempt.py | PreemptScheduler, PreemptState | --req-preempt | Preemptive SRPT scheduling with KV save/restore → priority inversion elimination |
| InferGateway | infer_gateway.py | InferenceGateway, WorkerRegistry | --infer-gateway | Smart front-door gateway: routing + health + load balancing → single ingress, N workers |
| ModelVersionSwap | model_version_swap.py | ModelVersionManager, SwapPolicy | --model-swap | Zero-downtime hot model version swap → canary → promote → rollback in-flight |
| ProductionProfiler | production_profiler.py | ProductionProfiler, ProfilerWindow | --prod-profiler | Continuous APM-style per-op latency tracking → p50/p99/p999 per operation |
| AdaptiveBatcher | adaptive_batcher.py | AdaptiveBatchController, BatchObjective | --adaptive-batch | Throughput/latency-objective dynamic batching → SLO-aware batch size control |
| SafetyLayer | safety_layer.py | SafetyClassifier, SafetyConfig | --safety-layer | Inline token-level safety classification → zero extra forward pass overhead |
| SemanticResponseCache | semantic_response_cache.py | SemanticResponseCache, CacheConfig | --semantic-resp-cache | Embedding-similarity response deduplication → exact + fuzzy response cache hits |
| RateLimiter | rate_limiter.py | TokenBucketRateLimiter, RateLimitConfig | --rate-limit | Token-bucket per-tenant rate limiting with burst → hard request ceiling per tenant |
| SchemaValidator | schema_validator.py | SchemaValidator, ValidationResult | --schema-validate | JSON schema validation for structured generation → 100% schema-compliant outputs |
| AuditLogger | audit_logger.py | AuditLogger, AuditEntry | --audit-log | SHA-256 chained inference audit log β tamper-evident request provenance |
- All 28 modules implemented in `squish/`
- `tests/test_wave25_server_wiring.py` β import + instantiation tests for 14 modules
- `tests/test_wave26_server_wiring.py` β import + instantiation tests for 14 modules
- `dev/benchmarks/bench_wave25_26.py` β micro-benchmark suite
- `dev/results/wave25_26_bench.json` β benchmark results
- `dev/demos/record_v9_demo.py` β v9 demo GIF generator
- `dev/demos/squish-v9-demo.gif` β demo GIF rendered
- README.md β v9 module sections, Wave 25+26 tables, CLI examples
- CHANGELOG.md β `[7.0.0]` entry
- PLAN.md updated to mark v9 complete
| Scope | Count |
|---|---|
| Wave 25 (Cutting-Edge Attention + Compute Fusion) | 14 |
| Wave 26 (Distributed Inference + Production Reliability) | 14 |
| Total new v9 modules | 28 |
| Total modules after v9 | 222 |
| Expected new tests | ~112 (4 per module Γ 28) |
| Expected total tests after v9 | ~4 876 |
Theme: Credibility, correctness, and real-hardware accountability
| Task | Status | File(s) changed |
|---|---|---|
| Quarantine MLC backend stub | β done | squish/server.py β removed mlc from advertised CLI choices |
| `squish compress` primary alias | β done | squish/cli.py β aliases=["it"] on argparse parser |
| Fix "Projected" language in 8 docs | β done | docs/benchmark_wave12β21_22.md, docs/RESULTS.md |
| Hardware integration test harness | β done | tests/test_hardware_integration.py, tests/conftest.py, pyproject.toml |
| End-to-end benchmark script (Squish vs Ollama) | β done | dev/benchmarks/bench_eoe.py |
| Remove `raise NotImplementedError` coverage exclusion | β done | pyproject.toml |
| README: move wave tables to MODULES.md | β done | README.md, MODULES.md (new) |
- All 7 benchmark docs now use "Reference: Paper-Reported Technique Improvements" headings with explicit caveat notes pointing to `bench_eoe.py` for real validation.
- `bench_eoe.py` measures TTFT, tokens/sec, and load time against a live server; run it after `squish serve` for real hardware numbers.
- Hardware tests skip automatically unless `--run-hardware` is passed; safe in CI.
- MLC backend is now only reachable via direct Python import (not advertised via CLI).
Theme: Complete documentation, HuggingFace distribution, and arXiv paper
| Task | Status | File(s) changed |
|---|---|---|
| Wave 23+24 benchmark docs | β done | docs/benchmark_wave23_24.md |
| Wave 25+26 benchmark docs | β done | docs/benchmark_wave25_26.md |
| HuggingFace upload script | β done | dev/publish_hf.py |
| arXiv paper draft | β done | docs/paper.md |
Theme: GitHub release, community templates, benchmark refresh, bench_eoe hardening
| Task | Status | File(s) changed |
|---|---|---|
| GitHub release v9.0.0 | β done | CHANGELOG.md [9.0.0], git tag v9.0.0, release notes |
| Community outreach templates | β done | dev/community_posts.md, PHASE_3_4_COMPLETION_GUIDE.md, LAUNCH_STATUS_v9.md |
| CHANGELOG β [9.0.0] | β done | CHANGELOG.md |
| pyproject.toml β 9.0.0 | β done | pyproject.toml |
| Refresh wave13+14 benchmark JSON + docs | β done | dev/results/wave13_14_bench.json, docs/benchmark_wave13_14.md |
| Refresh wave15+16 benchmark JSON + docs | β done | dev/results/wave15_16_bench.json, docs/benchmark_wave15_16.md |
| Doc update script | β done | dev/_update_bench_docs.py (syncs any bench JSON β markdown table) |
| bench_eoe.py hardening | β done | Bearer auth header, 30s health-check timeout, Metal JIT warmup, --squish-key flag |
- Run `bench_eoe.py` on real hardware; fill actual TTFT/tok-s into README + paper β requires live `squish serve`
- Run MMLU on Squish INT8 (n=14042); add to RESULTS.md + paper Section 4.2 β requires lm-eval + running server
- Push pre-squished weights to HF Hub via `dev/publish_hf.py` β requires HF_TOKEN + model files
- Community posts: Hacker News, r/LocalLLaMA, Twitter/X β templates in `dev/community_posts.md`
- arXiv submission β refine `docs/paper.md` into LaTeX, fill real numbers from Phase 4, submit
Last updated: 2026-03-12
These must be resolved before Phase 4 hardware measurements are done and before any public post goes out.
Evidence: dev/results/eoe_bench.json note field states "server currently sends tokens in trailing chunks (ttft_ms~=total_sΓ1000)". Measured TTFT is 48,064 ms = the total generation time for 201 tokens. The server buffers all tokens and flushes them as one trailing SSE chunk.
Impact: Every user of squish serve sees a frozen cursor until generation is complete. The Squish-vs-Ollama TTFT comparison is invalid until this is fixed because Ollama genuinely streams. The bench_eoe.py TTFT measurement is currently measuring total response time, not first-token latency.
Fix: Audit `server.py` `_generate_tokens()` and the SSE streaming path. Ensure each token is yielded to the FastAPI StreamingResponse immediately after the MLX `mx.eval()` call, not after the generation loop completes. Verify with `curl -N` that chunks arrive incrementally.
Files: squish/server.py β _stream_chat_response(), _generate_tokens(), and the StreamingResponse wrapper.
- Fix token streaming so each token is yielded immediately after generation (`await asyncio.sleep(0)` after each yield in `server.py` and `ollama_compat.py`)
- Verify with `curl -N http://localhost:11434/v1/chat/completions -d '...'` that chunks arrive one-by-one
- Re-run `bench_eoe.py` and confirm `ttft_ms << total_s` in the JSON output
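A minimal sketch of the intended streaming shape, assuming a FastAPI-style async generator feeding a `StreamingResponse`. The function name, chunk layout, and `generate_tokens` callable are illustrative stand-ins, not the actual `server.py` code:

```python
import asyncio
import json

async def stream_tokens(generate_tokens, request_id="r1"):
    """Yield one SSE chunk per generated token, releasing control to the
    event loop after each yield so the response writer can flush it now."""
    async for token in generate_tokens():
        chunk = {"id": request_id, "choices": [{"delta": {"content": token}}]}
        yield f"data: {json.dumps(chunk)}\n\n"
        await asyncio.sleep(0)  # hand control back so this chunk hits the socket
    yield "data: [DONE]\n\n"
```

Wrapped as `StreamingResponse(stream_tokens(gen), media_type="text/event-stream")`, each chunk leaves the server as soon as the token exists; the explicit `sleep(0)` keeps the control-yield visible even if a later refactor adds buffering around the loop.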
Evidence: Compressed Qwen2.5-1.5B shows ARC-Challenge +14.1pp, HellaSwag +15.2pp, Winogrande +12.6pp vs reference. INT8 quantization of a model cannot produce accuracy above the base model. This is a measurement artifact β most likely different n-shot settings, a wrong reference model path, or mismatched task splits between the two eval runs.
Impact: Publishing these numbers invites immediate dismissal from anyone who knows lm-eval. The RESULTS.md claim of "β€2% accuracy delta" is defensible; the +14% delta is not.
Fix: Re-run lm-eval with both the reference and compressed model using identical harness flags (--num_fewshot, --tasks, --limit). Record the commands used in eval_output/eval_meta.json. If the numbers remain anomalous, investigate whether the "reference" run was using a different model checkpoint.
- Re-run lm-eval reference evaluation with documented flags in `eval_output/eval_meta.json`
- Re-run lm-eval compressed evaluation with identical flags
- Update `eval_output/eval_report.md` and `docs/RESULTS.md` with corrected numbers
- Confirm delta is β€ Β±3pp across all tasks (suspicious if compressed beats reference)
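A hedged sketch of what "identical harness flags" means in practice β model paths and the task list here are placeholders; the point is that only the model argument differs between the two runs:

```shell
COMMON_FLAGS="--tasks arc_challenge,hellaswag,winogrande --num_fewshot 0"
lm_eval --model hf --model_args pretrained=Qwen/Qwen2.5-1.5B \
    $COMMON_FLAGS --output_path eval_output/reference
lm_eval --model hf --model_args pretrained=./qwen2.5-1.5b-int8 \
    $COMMON_FLAGS --output_path eval_output/compressed
```

Recording `COMMON_FLAGS` verbatim in `eval_output/eval_meta.json` makes the run reproducible and the anomaly diagnosable.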
Evidence:
- Line 729: `__version__ = "1.0.0"` β should be `"9.0.0"` to match `pyproject.toml`
- 19 modules are imported at least twice: `dfloat11` (lines 39, 140), `pipo` (86, 211), `shadow_kv` (104, 235), `seq_packing` (228, 441, 711), `streaming_sink` (277, 720), `sub_spec` (481, 325), `long_spec` (193, 404), `mirror_sd` (202, 412), `qspec` (220, 422), `token_swift` (291, 497), `trail` (300, 506), `specontext` (260, 465), `sparse_spec` (243, 448), `sparse_verify` (252, 457), `dovetail` (150, 334), `duo_decoding` (158, 342), `hetero_vocab_sd` (175, 369), `ipw` (185, 378), `forelen` (168, 353)
Impact: Inflated import time; `squish.__version__` reports the wrong version to any tool that reads it (`pip show`, `importlib.metadata`).
Fix: Remove all duplicate import blocks, keeping only the last occurrence of each (the try/except guarded versions are the correct pattern). Update __version__ to "9.0.0". Add a CI test: assert squish.__version__ == importlib.metadata.version("squish").
- Deduplicate all repeat imports in `squish/__init__.py` (replaced with `__getattr__`-based lazy registry)
- Fix `__version__` to `"9.0.0"` (aligned with `pyproject.toml`)
- Add version consistency test in `tests/test_version.py`
import squish currently eagerly imports 100+ modules including TensorParallel, VisionKVFuse, VideoFramePrune, etc. A user running squish --help or squish doctor pays this cost. Python importlib lazy loading (via __getattr__ on the module) would make the CLI feel instant while preserving the same public API.
- Replace direct wave-module imports in `__init__.py` with `__getattr__`-based lazy loading (202 names across 57 modules)
- Measure `python -c "import squish"` time before and after: 627 ms β 148 ms (4.25Γ); target < 50 ms achieved on pure-Python startup path
- Ensure existing tests still pass (4 360 passed, 26 skipped)
dev/benchmarks/bench_eoe.py performs a Metal JIT warmup call (dummy generate) before measuring TTFT. This warm-up is only present in the benchmark helper, not in squish serve. Every real user therefore experiences Metal JIT compilation on their first request.
- Add `--no-warmup` flag to `squish serve` (warmup on by default, opt-out via `--no-warmup`)
- On model load, run a single short generation through the model with `max_tokens=1` to trigger Metal kernel compilation
- Log "Metal kernels warmed ({elapsed:.2f}s). Ready for requests." after warmup completes
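A sketch of the warmup hook, assuming a `generate(prompt, max_tokens)` callable on the loaded model β the names are illustrative; squish's real generation entry point may differ:

```python
import time

def warm_kernels(generate, log=print):
    """Run one throwaway 1-token generation so Metal kernel JIT compilation
    happens at model load instead of on the first user request."""
    t0 = time.perf_counter()
    generate("warmup", max_tokens=1)  # output discarded; compilation is the point
    elapsed = time.perf_counter() - t0
    log(f"Metal kernels warmed ({elapsed:.2f}s). Ready for requests.")
    return elapsed
```

Called once after model load (and skipped when `--no-warmup` is set), this moves the one-time JIT cost off the first request's critical path.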
The npy-dir loader in compressed_loader.py opens each .npy file individually in the tensor loop β O(n_tensors) sequential syscalls. For a 7B model (~500 tensors), this adds 10β50 ms of pure filesystem overhead on cold load.
- Pre-read `manifest.json`, sort tensors by anticipated load order (attention weights first, then MLP, then embeddings) via `_tensor_load_key()` sort function
- Use `os.scandir` via `_collect_tensor_keys()` to collect all filenames in one syscall (replaces two `glob()` calls)
- Measure load time improvement on a real 7B model
The squish_quant_rs crate has a simd-neon feature flag but no explicit RUSTFLAGS forcing the compiler to use all available Apple Silicon NEON instructions. Without target-cpu=apple-m3 (or native) the compiler may target generic AArch64 and miss AMX or SVE2 opportunities on M3/M4.
- Add `.cargo/config.toml` with `rustflags = ["-C", "target-cpu=native"]` under `[target.aarch64-apple-darwin]` (`squish_quant_rs/.cargo/config.toml`)
- Re-benchmark `squish_quant.quantize_int8_f32` on a 4096Γ4096 matrix before and after
- Verify the `simd-neon` feature is explicitly listed in the maturin build matrix in `pyproject.toml` (added `"simd-neon"` to `[tool.maturin]` features)
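For reference, a sketch of the config file. Note that on stable cargo, `rustflags` belongs under `[build]` or a `[target.*]` table; putting it under `[profile.release]` requires the unstable `profile-rustflags` feature:

```toml
# squish_quant_rs/.cargo/config.toml (sketch)
[target.aarch64-apple-darwin]
rustflags = ["-C", "target-cpu=native"]
```

Scoping the flags to the `aarch64-apple-darwin` target keeps cross-compiled or CI builds for other triples unaffected.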
INT4 quantization stores float32 scale arrays alongside nibble-packed weights. These scales are calibration values, not model weights requiring full fp32 precision. Converting them to bfloat16 at save time and restoring to fp32 at load time would reduce total disk usage 3β5% for INT4 models with no accuracy impact.
- Modify `squish_quant_rs/src/lib.rs` `quantize_int4_grouped` to output `bfloat16` scales (or add a separate path)
- Modify `convert.py` to use bf16 scales when `--int4` is active
- Update `compressed_loader.py` to upcast bf16 scales to fp32 before dequantization
- Add unit tests and verify round-trip dequantization error is unchanged
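The round-trip is just a 16-bit truncation of the fp32 bit pattern. A stdlib sketch of the conversion and its error bound (the Rust path would do the same bit manipulation per scale):

```python
import struct

def f32_to_bf16(x: float) -> int:
    """Keep the top 16 bits of fp32 (sign, exponent, 7 mantissa bits)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16

def bf16_to_f32(bits: int) -> float:
    """Upcast at load time: shift back and reinterpret; no arithmetic, no drift."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

scale = 0.0123456789
restored = bf16_to_f32(f32_to_bf16(scale))
# Truncation drops 16 of 23 mantissa bits, so relative error stays below
# 2**-7 (~0.8%) -- well inside the noise floor of INT4 group quantisation.
assert abs(restored - scale) / scale < 2**-7
```

Since the upcast is exact (a pure bit shift), the only error is introduced once at save time, which is why round-trip dequantization error should be essentially unchanged.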
entropy.py uses zstd level 3 by default. For local use where compression time during `squish compress` matters more than ratio, level 1 achieves ~80% of level 3's compression at roughly 3Γ faster compression (zstd decompression speed is largely independent of level). For archival/HF upload, level 15 compresses 15% more. Exposing --compress-level gives users control.
- Add `--compress-level INT` flag to `squish compress` CLI β satisfied by existing `--zstd-level` flag (default: 0=skip, range: 1β22, level 3 recommended)
- Pass level through to `compress_npy_dir()` in `entropy.py` (already implemented via `zstd_level` arg)
- Document fast-decompression recommendation in `squish compress --help` (present in `--zstd-level` help text)
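Illustrative usage of the existing flag (`./model-dir` is a placeholder path; the level values are the ones discussed above):

```shell
# Local NVMe use: light compression, fastest `squish compress` run
squish compress ./model-dir --zstd-level 1

# Archival / HF upload: maximum ratio, slower compress, similar decompress speed
squish compress ./model-dir --zstd-level 15
```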
--squeeze-attn (SqueezeKVCache) and --small-kv (SmallKVCache) both allocate KV budgets independently. With both flags active on a memory-constrained request, they can over-evict (double-counting their own reservations) or conflict on which tokens to drop. A shared KVBudgetBroker that arbitrates total available KV memory between all active eviction systems would prevent this.
- Audit which KV cache classes register against a global budget tracker β none previously existed
- Identify all budget-allocating modules: `SqueezeKVCache`, `SmallKVCache`, `YOCO`, `DiffKV`, `KVTuner`, `KVSharer`, `AdaptiveBudget`
- Design a `KVBudgetBroker` singleton in `kv_cache.py` with fair-share proportional allocation
- Write unit tests covering 7 simultaneous systems, constrained + unconstrained, register/unregister, proportional scale (`tests/test_kv_budget_broker.py`)
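A sketch of the broker design β the class and method names follow the plan above, but the arbitration rule shown is one plausible fair-share policy, not the shipped implementation:

```python
class KVBudgetBroker:
    """Process-wide arbiter for KV cache memory. Eviction systems register
    their demand; grants shrink proportionally when demand exceeds budget."""

    _instance = None

    def __init__(self, total_bytes: int):
        self.total_bytes = total_bytes
        self._demands: dict[str, int] = {}

    @classmethod
    def get(cls, total_bytes: int = 0) -> "KVBudgetBroker":
        if cls._instance is None:
            cls._instance = cls(total_bytes)
        return cls._instance

    def register(self, name: str, requested_bytes: int) -> None:
        self._demands[name] = requested_bytes

    def unregister(self, name: str) -> None:
        self._demands.pop(name, None)

    def grant(self, name: str) -> int:
        """Bytes `name` may actually use after arbitration."""
        total_demand = sum(self._demands.values())
        ask = self._demands[name]
        if total_demand <= self.total_bytes:
            return ask  # unconstrained: everyone gets the full ask
        # constrained: fair-share proportional scale-down
        return ask * self.total_bytes // total_demand
```

Under this scheme `--squeeze-attn` and `--small-kv` each register before sizing their budgets, so neither can double-count the other's reservation: with two 400-byte demands against a 1000-byte budget both get their full ask, and a third 400-byte registrant scales everyone down proportionally.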
These are the original Phase 4 items from the plan. They require real hardware and should only be run after the streaming fix and eval re-run are confirmed clean.
| Task | Prerequisite | Notes |
|---|---|---|
| Run bench_eoe.py (Squish vs Ollama, 3 models, 5 runs each) | Bug 1 fixed | Measure TTFT, tps, RAM; save raw JSON; ollama must be running |
| Run MMLU (n=14042) on Squish INT8 for Qwen2.5-1.5B and Qwen3-8B | Bug 2 resolved | Use identical harness flags for reference vs compressed |
| Update README + paper with real measured numbers | Both benchmarks done | Replace all placeholder values in paper Section 4.2 |
| Push pre-squished weights to HF Hub | Models quantized on real hardware | python dev/publish_hf.py --model-dir ... --repo squish-community/... |
| Community post (one at a time, starting with HN) | All above done | Templates in dev/community_posts.md |
| arXiv submission | Paper updated with real numbers | Convert docs/paper.md to LaTeX; use researcher friend for endorsement |
- Fix streaming (Bug 1) and verify
- Re-run lm-eval (Bug 2) and verify
- Fix `__init__.py` (Bug 3)
- Run bench_eoe.py with Ollama running; export raw JSON
- Run MMLU evaluation
- Update README + paper numbers
- Push HF weights
- Post to Hacker News first (quietest audience, most technical)
- Post to r/LocalLLaMA after HN feedback is addressed
- arXiv submit
Started: 2026-03-12
Remove all modules that don't materially improve load time, inference speed, memory, or context length for a single-device Apple Silicon user. The goal is a codebase where every shipped module is defensible.
The following 38 modules were removed because they fell into one or more disqualifying categories: multi-modal vision/video (no benefit for text LLM), multi-tenant cloud infrastructure (not relevant to local single-device use), research-only stubs (no practical inference benefit), or training-time operations.
| Category | Removed modules |
|---|---|
| Multi-modal / vision | vision_cache, vision_kv_fuse, vision_tokens, image_token_prune, multimodal_batch, cross_modal_attn, video_frame_prune, embedding_gate, modality_router |
| Multi-tenant cloud infra | multi_tenant_sched, request_router, kv_router, kv_migrate, disagg_prefill, request_preempt, infer_gateway, model_version_swap, observability_hook, cost_estimator, sla_monitor, sequence_parallel, tensor_parallel, audit_logger |
| Research / academic stubs | clasp, del_decoder, hetero_vocab_sd, life_model, soup_experts, vector_index, disc_router, block_expert_archive, self_learning, diffusion_draft |
| Training-time operations | iter_prune, model_surgery, binary_attn |
| Non-performance utility | token_watermark, latency_predictor |
- Delete 38 module files from `squish/`
- Delete 11 dedicated test files (`test_clasp_unit.py`, `test_del_decoder_unit.py`, etc.)
- Edit 10 wave wiring test files to remove test classes for deleted modules
- Edit `server.py` to remove globals + flag wiring for all 38 modules
- Edit `squish/__init__.py` β removed deleted imports, fixed `__version__` to `"9.0.0"`, fully lazy-loaded via `__getattr__`
- Edit `cli.py` β removed `predict` subcommand (used deleted `life_model`)
- Update `README.md` β remove duplicate bash block, remove Files table, add Advanced Features stability section
- Update `MODULES.md` β remove deleted module entries, add stability tier table
Last updated: 2026-03-12
Addresses scope-creep risk, ecosystem blockers, CI correctness, and documentation quality.
The v1 public launch should market core stability, not the full 222-module catalogue. Users who encounter a crash in --eagle3 or --tensor-parallel will blame the core tool even if the basic serve path is flawless. Feature tiers must be communicated explicitly.
Proposed tiers:
| Tier | Waves | Flags | Label in docs |
|---|---|---|---|
| Stable | 1β12 | No flag or widely-used flags (--int8, --int4, --kv-cache) | (no label) |
| Beta | 13β18 | Speculative decode, advanced KV compression | [Beta] |
| Experimental | 19β26 | Tensor parallel, disaggregated prefill, binary attention, ternary quant, multi-modal | [Experimental] |
- Audit every CLI flag in `cli.py` and `server.py` and assign a tier to each
- Add `[Beta]`/`[Experimental]` annotations to flag `--help` text and `MODULES.md`
- Add a `# Experimental` warning block at the top of each v19βv26 module file (do not hide the code, just label it)
- Update README Quick-Start to show only Stable flags; link to `MODULES.md` for the full list
- Add stability tiers note in `squish serve --help` epilog: Stable (v1-12), [Beta] (v13-18), [Experimental] (v19+)
The threshold for widespread adoption is a zero-friction first run: `pip install squish` β `squish run qwen3-8b` β running in under a second. That requires pre-squished weights published to HF before any community post goes out. If users have to compress their own models on first run, the 54Γ faster load-time story is obscured by a one-time 30-minute compression step.
Minimum model matrix for launch (all INT4, Qwen2.5-1.5B also INT8):
| Model | Base size | Squish size (INT4) | Priority |
|---|---|---|---|
| Qwen2.5-1.5B | ~3 GB | ~0.9 GB | P0 β used in all existing benchmarks |
| Qwen3-8B | ~16 GB | ~5 GB | P0 β most popular current model |
| Llama-3.2-3B | ~6 GB | ~2 GB | P0 β referenced in original plan |
| Qwen2.5-7B | ~14 GB | ~4.5 GB | P1 |
| Phi-4 (14B) | ~28 GB | ~9 GB | P1 |
| Mistral-Nemo-12B | ~24 GB | ~7.5 GB | P1 |
| Llama-3.1-8B | ~16 GB | ~5 GB | P1 |
| DeepSeek-R1-Distill-7B | ~14 GB | ~4.5 GB | P2 |
| Gemma-3-4B | ~8 GB | ~2.5 GB | P2 |
| SmolLM2-1.7B | ~3.4 GB | ~1 GB | P2 β fits 8 GB Macs |
Each model card must include: hardware used, squish compress command, measured load time (M3), measured RAM, lm-eval accuracy (compressed vs base, identical flags).
- Create `squish-community` organization on HuggingFace
- Compress and upload P0 models (3 models) with full model cards
- Compress and upload P1 models (4 models) after P0 is verified
- Compress and upload P2 models (3 models) before soft launch
- Verify each uploaded model with `squish run <model>` β coherent output on clean install
- Add `--hf-model-card` flag to `dev/publish_hf.py` that auto-generates the model card from eval JSON
GitHub Actions macos-14 runners are Apple M1. MLX runs on them. However, the current CI excludes test_int4_loader.py and test_git_integration.py without explanation in ci.yml. The hardware integration tests are also skipped (--run-hardware not passed). This means every CI run is validating Python logic with mocks, not actual MLX tensor operations.
Gaps:
- `test_int4_loader.py` is excluded from CI β why? If it requires model files, a small synthetic weight file (random fp32 values) should be generated at test time to validate the INT4 loading path end-to-end without needing a real model download.
- The `test_hardware_integration.py` harness exists but is never run in CI. A synthetic model (2-layer transformer, 128 hidden dim) would allow the integration test to run without downloading a 3 GB model.
- `mypy` check uses `|| true` (non-blocking) in the `lint-only` job β type errors are silently ignored.
- Investigate why `test_int4_loader.py` is excluded; fix or create a synthetic weight fixture so it runs in CI
- Create a `tests/fixtures/synthetic_model/` directory with a minimal 2-layer model in safetensors format (generate with a script checked into the repo)
- Add a CI job that runs `test_hardware_integration.py` with `--run-hardware` using the synthetic model
- Make mypy blocking (remove `|| true`) after fixing existing type errors
- Add a CI step that imports `squish` and checks `squish.__version__ == importlib.metadata.version("squish")`
The current README covers three separate audiences (practitioners, researchers, and contributors) simultaneously. The benchmark table is the strongest claim and is currently below several sections of feature descriptions.
Target README structure:
1. Problem statement (2 sentences)
2. The proof β load-time comparison table (Squish vs Ollama, three models)
3. Install (one-liner)
4. Quickstart (one command)
5. Core features (5 bullets max β fast load, OpenAI compatible, Web UI, INT4/INT8, Apple Silicon)
6. Links β full docs, MODULES.md, paper, HuggingFace models
Everything else (wave tables, per-module details, accuracy benchmarks, developer docs) lives in the MkDocs site or MODULES.md.
- Restructure README to match the 6-section outline above
- Benchmark comparison table must be above the fold (before any feature description)
- Remove all wave tables from README body (already partially done; verify none remain)
- Deploy MkDocs to GitHub Pages (`docs.yml` workflow exists; confirm it is live)
- Add a "Troubleshooting / FAQ" page to the MkDocs site covering: 8 GB Mac OOM, tokenizer errors, MLX version mismatches, Ollama port conflicts
- Add `SECURITY.md` documenting responsible disclosure process
- Ensure `CONTRIBUTING.md` has a step-by-step local dev setup that works on a blank Mac (Xcode CLT, Rust/maturin, uv)
- Test `pip install squish` from a clean virtualenv with no dev tools pre-installed to catch missing wheel/compiler issues
Execute after Phase 5 bugs are fixed and Phase 6 ecosystem items are done. Do not compress all three stages into one week.
Before any public post, validate with a small audience who will give honest technical feedback and whose issues you can resolve quickly.
- Identify 5β10 people currently running local LLMs on Apple Silicon (MLX Discord, people who have filed MLX issues on GitHub) and send direct invitations
- Set up a GitHub Discussion category "Beta Feedback" for structured input
- Pay attention to OOM reports on 8 GB and 16 GB Macs β `--fault-tolerance` and `--adaptive-quant` exist but need real-hardware validation on memory-constrained devices
- Address all beta feedback before hard launch; do not proceed to 7B if any P0 crash bugs are open
HN is the right first public venue: technical audience, good faith engagement, time-boxed attention window (front-page day, then archived). Get it right here before the higher-noise Reddit blast.
Post structure:
- Title: `Show HN: Squish β Sub-second model loads on Apple Silicon (54Γ faster than Ollama cold-start)`
- First comment (post immediately after submitting): 3 short paragraphs. (1) The problem: Ollama cold-start on M3 is 8β25 seconds. (2) The solution: INT8/INT4 compression + mmap + Metal kernel pre-warm. (3) The honest caveats: M-series only, MLX backend, experimental features labeled as such.
- Be present for the first 2 hours. Answer every question directly and technically.
- If the benchmark numbers are challenged, link to the raw JSON in `dev/results/eoe_bench.json` and the lm-eval output in `eval_output/`. Having raw data available is the difference between "this looks credible" and "this looks like marketing."
- Draft HN Show post text in `dev/community_posts.md` (template exists β refine with real numbers)
- Confirm raw benchmark JSON is publicly accessible in the repo before posting
- Confirm MkDocs site is live and the paper is linked
- Do not submit on a Friday or Saturday (low traffic)
- Respond to every comment within 4 hours on day one
Only proceed here after HN feedback has been reviewed and any correction to claims has been made.
r/LocalLLaMA post:
- Post type: "I built X" (not "What do you think of X?")
- Lead with the side-by-side GIF demo, then the number
- Keep body under 300 words; link to README and HN thread for depth
- Post from an account with karma β if your account is new, post a few helpful comments in the subreddit first
Twitter/X thread:
- Tag Awni Hannun (MLX creator), not as a promotional move but because the work directly builds on MLX and he has flagged Apple Silicon inference optimization as a priority area
- Thread structure: tweet 1 = the claim with GIF, tweets 2β5 = how it works (mmap, INT4 nibble pack, KV compression, streaming fix), tweet 6 = benchmark methodology, tweet 7 = "try it" CTA with install command
- Post to r/LocalLLaMA after HN settles (48 hours post-HN)
- Post Twitter/X thread same day as r/LocalLLaMA
- Monitor both for 72 hours; update README FAQ with any common questions that emerge
- arXiv submit in the same week as the public launch β establishes timestamp and gives researchers something to cite