
Squish — Development Plan

Last updated: 2026-03-12 (v9 complete + pre-launch hardening phases 1–3)

This document tracks completed waves, the current release, and the next phase.


Versioning Convention

| Version | Waves | Theme |
|---------|-------|-------|
| v1 | 1–11 | Core baseline — loader, quantizer, server, API, CLI, speculative decode |
| v2 | 12 | Reasoning-Aware KV · INT3 · Async I/O |
| v3 | 13–14 | Ultra-Long Context · Adaptive Spec-Decode · Quantisation |
| v4 | 15–16 | Serving Intelligence · KV Architecture Evolution · Heterogeneous Compute |
| v5 | 17–18 | Attention Architecture · Memory Management · Adaptive Compute · Model Intelligence |
| v6 | 19–20 | Next-Gen Precision · Serving Infrastructure · Intelligence |
| v7 | 21–22 | Advanced Decode · Production Serving · Observability |
| v8 | 23–24 | Multi-Modal & Long Context · Quantisation Evolution & Model Surgery |
| v9 | 25–26 | Cutting-Edge Attention Variants & Compute Fusion · Distributed Inference & Production Reliability |

✅ v1 — Core Baseline (Released 2026-03-03)

  • Three-tier compressed weight loader (INT8 → f16 → bf16 MLX safetensors)
  • OpenAI-compatible API server (/v1/*) + Ollama drop-in (/api/*)
  • Web chat UI at /chat
  • CLI — squish run/serve/chat/pull/models/info/bench/catalog/compress
  • Speculative decoding, batch scheduler, KV cache quantisation, prefix cache
  • Tool / function calling, Rust/PyO3 INT8 quantiser
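The speculative-decoding loop from the v1 baseline can be sketched in a few lines. This is a toy greedy version with plain next-token functions standing in for the real draft and target models; none of these names are the squish API:

```python
def speculative_decode(target_next, draft_next, prompt, n_tokens, k=4):
    """Greedy speculative decoding sketch: the draft model proposes up to k
    tokens; the target keeps the longest agreeing prefix and emits one
    corrected token at the first disagreement."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # Draft phase: propose k tokens autoregressively with the cheap model.
        ctx = list(out)
        drafted = []
        for _ in range(k):
            token = draft_next(ctx)
            drafted.append(token)
            ctx.append(token)
        # Verify phase: the target scores every drafted position (a single
        # parallel pass in a real implementation); accept matches, correct
        # the first mismatch, then start a new round.
        ctx = list(out)
        for token in drafted:
            want = target_next(ctx)
            out.append(want)
            ctx.append(want)
            if want != token:
                break  # disagreement ends the round after the correction
    return out[len(prompt):len(prompt) + n_tokens]
```

With a perfect draft each round emits k tokens for one verification pass; with a useless draft the loop degrades gracefully to one target token per round, never worse than plain decoding in token count.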

✅ v2 — Wave 12 (Released 2026-03-04)

Modules: PM-KVQ, MixKVQ, CocktailKV, MiLo INT3, AgileIO, SageAttn, SpargeAttn

Key results: 4.2× KV memory · 5.3× weight compression · 40–60% I/O latency reduction


✅ v3 — Waves 13+14 (Released 2026-03-11)

Wave 13 (10 modules): DuoAttention, ShadowKV, PQCache, SpeCache, DuoDecoding, KnapSpec, TokenMerging, TokenSwift, C2T, CLaSP

Wave 14 (16 modules): DFloat11, SqueezeLLM, NF4, rANS, QSpec, QuantSpec, CopySpec, SpinQuant, VisionPrefixCache, MRLIndex, SubSpec, DELDecoder, HeteroVocab, HeadInfer, LifeModel, SoupOfExperts

Key results: 10–30× KV memory · 55% draft acceptance · 5–10× weight compression


✅ v4 — Waves 15+16 (Released 2026-03-12)

Theme: Serving Intelligence · KV Architecture Evolution · Heterogeneous Compute

Wave 15 — Serving Intelligence + KV Architecture Evolution (10 modules)

| Module | Flag | Key Result |
|--------|------|------------|
| AdaServe | --ada-serve | SLO-customized spec decode trees → 30% latency ↓ for tight SLOs |
| ConfSpec | --conf-spec | Confidence-gated verification → 54% verification cost ↓ |
| SeqPacking | --seq-packing | Barrel-effect elimination → 1.8× effective throughput |
| MetaReasoner | --meta-reasoner | Dynamic thinking budget → 44–89% energy saved on CoT |
| YOCO | --yoco-kv | You Only Cache Once → 50% KV memory reduction |
| DiffKV | --diff-kv | Asymmetric K/V precision → 2.7–5.7× KV memory, 1.9–5.4× throughput |
| KVTuner | --kvtuner | Sensitivity-aware mixed-precision KV → 2× compression vs naive |
| KVSharer | --kv-share | Cross-layer KV sharing → 30% KV memory reduction |
| ParisKV | --paris-kv | Drift-robust online KV quantisation → 4× KV compression |
| CLA | --cla | Cross-Layer Attention sharing → 10–30% KV memory reduction |
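The idea behind DiffKV's asymmetric precision (attention logits amplify key error, so keys deserve more bits than values) can be illustrated with a scalar uniform quantiser. This is a toy sketch with hypothetical function names, not the module's actual code:

```python
def quant_roundtrip(vals, bits):
    """Symmetric uniform quantise/dequantise at a given bit width."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(v) for v in vals)
    scale = amax / qmax if amax else 1.0
    # Round to the integer grid, clip, and map back to floats.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in vals]

def compress_kv(keys, values, k_bits=8, v_bits=4):
    """DiffKV-style asymmetric K/V precision sketch: keys keep more bits,
    values are squeezed harder."""
    return quant_roundtrip(keys, k_bits), quant_roundtrip(values, v_bits)
```

At these widths the worst-case key error is roughly scale/2 at 1/127 granularity versus 1/7 for values, which is why the asymmetric split saves memory with little quality impact.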

Wave 16 — Heterogeneous Compute + Advanced Spec-Decode (11 modules)

| Module | Flag | Key Result |
|--------|------|------------|
| Dovetail | --dovetail | CPU+GPU heterogeneous spec decode → 2× throughput |
| SwiftSpec | --swift-spec | Async disaggregated decode → minimal overlap overhead |
| PIPO | --pipo | Pipelined prefetch offloading → 1.7× throughput on larger-than-VRAM models |
| MobileMoE | --mobile-moe | MoE balanced layer skip → 1.4× throughput on MoE models |
| OnlineSD | --online-sd | Continuous draft adaptation → +5–8 pp acceptance rate |
| LookaheadReasoning | --lookahead | Parallel step verification → 2.1× throughput on reasoning |
| SparseSpec | --sparse-spec | Dynamic sparse self-speculation → 2.13× throughput |
| FRSpec | --fr-spec | Frequency-ranked vocab compression → 13% draft latency ↓ |
| LongSpec | --long-spec | Shared-KV draft head → zero draft KV overhead at any context |
| ForeLen | --forelen | Entropy-guided length prediction → 29% MAE ↓ vs TRAIL |
| RASD | --rasd | Retrieval-augmented spec decode → 40–60% corpus hit rate |
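OnlineSD's continuous draft adaptation reduces, at its simplest, to a small feedback controller: track an exponential moving average of the acceptance rate and grow or shrink the draft length accordingly. The class name and thresholds below are hypothetical, chosen only to illustrate the control loop:

```python
class AdaptiveDraftLength:
    """OnlineSD-style controller sketch: adapt the speculative draft length k
    from an EMA of the observed acceptance rate."""

    def __init__(self, k=4, k_min=1, k_max=8, alpha=0.1):
        self.k, self.k_min, self.k_max, self.alpha = k, k_min, k_max, alpha
        self.ema = 0.5  # neutral prior on the acceptance rate

    def update(self, accepted, drafted):
        rate = accepted / drafted
        self.ema = (1 - self.alpha) * self.ema + self.alpha * rate
        # High sustained acceptance: draft more per round; low: draft less.
        if self.ema > 0.8 and self.k < self.k_max:
            self.k += 1
        elif self.ema < 0.4 and self.k > self.k_min:
            self.k -= 1
        return self.k
```

The EMA smooths out per-round noise, so a single unlucky verification does not collapse the draft length.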

Deliverables checklist

  • All 21 modules implemented and wired in server.py
  • tests/test_wave15_server_wiring.py — 44 tests, 44 passing
  • tests/test_wave16_server_wiring.py — 45 tests, 45 passing
  • dev/benchmarks/bench_wave15_16.py — micro-benchmark suite
  • dev/results/wave15_16_bench.json — benchmark results
  • docs/benchmark_wave15_16.md — human-readable results table
  • dev/demos/record_v4_demo.py — v4 demo GIF generator
  • dev/demos/squish-v4-demo.gif — demo GIF rendered
  • README.md — v4 module sections, Wave 15+16 tables, CLI examples
  • CHANGELOG.md — [2.0.0] entry

✅ v5 — Waves 17+18 (Released 2026-03-11)

Theme: Attention Architecture · Memory Management · Adaptive Compute · Model Intelligence

28 modules across two waves — all implemented, tested, benchmarked, and documented.


Wave 17 — Attention Architecture + Memory Management (14 modules)

Focus: Next-generation attention kernels, zero-allocation KV memory, prompt and token compression, and speculative context retrieval.

| Module | File | Key Classes | Flag | Key Result |
|--------|------|-------------|------|------------|
| SageAttn2 | sage_attention2.py | SageAttention2Kernel, SageAttention2Config | --sage-attn2 | INT4 warp QK + FP8 PxV → ~3.1× vs FlashAttention2 |
| StreamingSink | streaming_sink.py | SinkKVCache, SinkConfig | --streaming-sink | Attention-sink eviction → infinite context at a fixed KV budget |
| KVSlab | kv_slab.py | KVSlabAllocator, KVPage | --kv-slab | Pre-allocated slab → eliminates >10 ms per-request heap stalls |
| SqueezeAttn | squeeze_attention.py | SqueezeKVCache, BudgetAllocator | --squeeze-attn | Dynamic per-layer KV budget → configurable KV footprint |
| SmallKV | smallkv.py | SmallKVCache, SaliencyTracker | --small-kv | Saliency-compensated 10% KV budget → 1.75–2.56× throughput |
| SpeContext | specontext.py | SpeContextCache, DistilledRetrievalHead | --spe-context | Distilled retrieval head → >90% param reduction, 90% transfer ↓ |
| SVDq | svdq.py | SVDqCalibrator, SVDqPrecisionMap | --svdq | Per-head SVD key mixed precision → calibrated rank-aware quantisation |
| CommVQ | comm_vq.py | CommVQCodebook, MultiCodebookVQ | --comm-vq | Commutative VQ KV → 8× (2-bit) / 4× (4-bit) memory, near-lossless |
| ChunkedPrefill | chunked_prefill.py | ChunkedPrefillConfig | --chunked-prefill | Interleaved chunk+decode → O(chunk_size) prefill latency |
| GemFilter | gemfilter.py | GemSelector, AttentionScoreBuffer | --gemfilter | Early-layer token compression → 2.4× speedup, 1000× @ 108K tokens |
| MInference | minference_patch.py | (monkey-patch) | --minference | Dynamic sparse attention → 10× prefill speedup @ 1M context |
| PromptCompressor | prompt_compressor.py | (functional API) | --prompt-compress | Token-budget long-context trimming → ~1 ms per 1K-word prompt |
| PromptLookup | prompt_lookup.py | PromptLookupDecoder, NGramIndex | --prompt-lookup | N-gram spec decode from prompt → zero draft model required |
| TRAIL | trail.py | TrailPredictor, TrailLinearProbe | --trail | Probe-layer length predictor → 2.66× lower MAE vs BERT, 1.66–2.01× lower latency |
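Prompt-lookup decoding is the simplest speculative scheme in the table: no draft model at all. Match the trailing n-gram of the generated context against an earlier occurrence and propose whatever followed it. A minimal sketch (function name and defaults are illustrative, not the PromptLookupDecoder API):

```python
def prompt_lookup_draft(tokens, ngram=3, k=5):
    """PromptLookup-style drafting sketch: find the most recent earlier
    occurrence of the trailing n-gram and propose its continuation."""
    if len(tokens) < ngram:
        return []
    tail = tokens[-ngram:]
    # Search right-to-left so the most recent earlier match wins.
    for i in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[i:i + ngram] == tail:
            return tokens[i + ngram:i + ngram + k]
    return []  # no earlier match: fall back to normal decoding
```

This pays off on repetitive contexts (code, retrieval-augmented prompts, summaries quoting the source), where the continuation after a repeated n-gram is often verbatim.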

Wave 18 — Adaptive Compute + Model Intelligence + Evaluation (14 modules)

Focus: Task-adaptive layer skipping, next-generation speculative decoding, continuous self-improvement, serving intelligence, and battery-aware evaluation.

| Module | File | Key Classes | Flag | Key Result |
|--------|------|-------------|------|------------|
| VPTQ | vptq.py | VPTQQuantizer, VPTQCodebook | --vptq | Vector post-training quant (NeurIPS 2025) → sub-2-bit weights near fp16 quality |
| LayerSkip | layer_skip.py | EarlyExitDecoder, ConfidenceEstimator | --layer-skip | Early-exit self-spec decode → (total−exit)/total compute saved per easy token |
| SWIFT | swift.py | SWIFTDecoder, SWIFTCalibrator | --swift | Task-adaptive layer skip with calibration → per-task skip schedules |
| SpecReason | spec_reason.py | SpecReasonOrchestrator, ReasoningStep | --spec-reason | Step-level reasoning speculation → 1.4–3.0× speedup, 8.8–58% token reduction |
| MirrorSD | mirror_sd.py | MirrorSDDecoder, MirrorDraftPipeline | --mirror-sd | Overlapped dual-pipeline draft → 2.8–5.8× vs EAGLE-3 on SpecBench |
| SparseVerify | sparse_verify.py | SparseVerifyPass, InterDraftReuseCache | --sparse-verify | Sparse verification + inter-draft token reuse → verification FLOPs ↓ |
| RobustScheduler | robust_scheduler.py | ABalancedScheduler, AMaxScheduler | --robust-sched | Interval-prediction adaptive batching → balanced or max-throughput policy |
| BlockExpertArchive | block_expert_archive.py | BlockExpertArchive, ExpertRouter | --block-archive | K-means cluster-delta expert compression → MoE weight deduplication |
| DISCRouter | disc_router.py | DISCRouter, DISCPlan | --disc-router | Task decomposition + parallel LLM routing → multi-step agent acceleration |
| SelfLearning | self_learning.py | (LearnRequest API) | --self-learn | Online LoRA-delta adaptation from feedback → continuous quality improvement |
| SemanticCache | semantic_cache.py | SquishSemanticCache | --semantic-cache | N-gram semantic prompt dedup → zero-model cache hits |
| IPW | ipw.py | IPWTracker, IPWMeasurement | --ipw | Intelligence-per-watt tracking → quality ÷ energy metric for M-series |
| PowerMonitor | power_monitor.py | PowerMonitor, PowerModeConfig | --power-monitor | pmset-based battery-adaptive mode selection → auto power-aware scheduling |
| DiffusionDraft | diffusion_draft.py | DiffusionDraftModel | --diffusion-draft | Non-autoregressive diffusion LLM drafting → short-text parallel decode |
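An n-gram semantic cache of the kind SemanticCache describes needs no model at all: represent each prompt as a set of token n-grams and serve a stored response when Jaccard similarity crosses a threshold. The class below is a hypothetical sketch, not SquishSemanticCache itself:

```python
def ngram_set(text, n=3):
    """Tokenise on whitespace and collect word n-grams as a set."""
    toks = text.lower().split()
    grams = {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    return grams or {tuple(toks)}  # short prompts fall back to one tuple

class SemanticCacheSketch:
    """Hypothetical n-gram semantic cache: a new prompt hits when its
    n-gram Jaccard similarity to a cached prompt reaches the threshold."""

    def __init__(self, threshold=0.6):
        self.entries = []
        self.threshold = threshold

    def put(self, prompt, response):
        self.entries.append((ngram_set(prompt), response))

    def get(self, prompt):
        grams = ngram_set(prompt)
        for cached_grams, response in self.entries:
            union = len(grams | cached_grams)
            if union and len(grams & cached_grams) / union >= self.threshold:
                return response
        return None
```

A linear scan is fine at this scale; a production cache would index the n-gram sets for sub-linear lookup.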

v5 Deliverables checklist

  • tests/test_wave17_server_wiring.py — 56 tests, 56 passing
  • tests/test_wave18_server_wiring.py — 56 tests, 56 passing
  • dev/benchmarks/bench_wave17_18.py — micro-benchmark suite (24 modules timed, 4 skipped)
  • dev/results/wave17_18_bench.json — benchmark results
  • docs/benchmark_wave17_18.md — human-readable results table
  • dev/demos/record_v5_demo.py — v5 demo GIF generator (448 events, 85.2s)
  • dev/demos/squish-v5-demo.gif — demo GIF rendered (2.6 MB, 448 events, 85.2s)
  • README.md — v5 module sections, Wave 17+18 tables, CLI examples
  • CHANGELOG.md — [3.0.0] entry
  • PLAN.md updated to mark v5 complete

v5 Module Count Summary

| Scope | Count |
|-------|-------|
| Wave 17 (Attention + Memory) | 14 |
| Wave 18 (Adaptive Compute + Intelligence) | 14 |
| Total new v5 modules | 28 |
| Total modules after v5 | 110 |
| New tests | 112 (56 Wave 17 + 56 Wave 18) |
| Total tests after v5 | 4,166 |

✅ v6 — Waves 19+20 (Released 2026-03-11)

Theme: Next-Gen Precision · Advanced Attention · Model Composition · Serving Infrastructure

28 new modules across two waves — all implemented, tested, benchmarked, and documented.


Wave 19 — Next-Gen Attention & Precision (14 modules)

Focus: FP8/MX microscaling quantization, advanced attention patterns (paged KV, GQA, sliding window, RoPE scaling), activation sparsity, and advanced speculative decode heads (MEDUSA, EAGLE-3).

| Module | File | Key Classes | Flag | Key Result |
|--------|------|-------------|------|------------|
| FP8Quant | fp8_quant.py | FP8Quantizer, FP8Config | --fp8-quant | E4M3/E5M2 weight encoding → ~60% storage vs BF16 |
| MXQuant | mx_quant.py | MXQuantizer, MXConfig | --mx-quant | OCP MX4/MX6/MX9 microscaling → better quality than INT4 at same bits |
| FlashDecode | flash_decode.py | FlashDecodeAttention, FlashDecodeConfig | --flash-decode | Split-KV parallel decode → O(1) memory overhead per decode step |
| PagedKV | paged_kv.py | PagedKVCache, BlockTable | --paged-kv | Virtual block mapping → zero KV fragmentation across requests |
| GQA | gqa.py | GQACache, GQAConfig | --gqa | Grouped Query Attention → 4–8× KV reduction vs MHA |
| SlidingWindowAttn | sliding_window_attn.py | SlidingWindowKVCache, SWAConfig | --sliding-window | Sliding-window KV → O(window_size) memory at any context length |
| RoPEScaling | rope_scaling.py | RoPEScaler, YaRNScaler, NTKScaler | --rope-scaling | NTK/YaRN/LongRoPE → 4–32× context extension without fine-tuning |
| ActSparsity | act_sparsity.py | ActSparsityPredictor, SparsityConfig | --act-sparsity | Activation sparsity gating → 30–60% FFN compute saved |
| FusedRMSNorm | fused_rmsnorm.py | FusedRMSNorm, FusedLayerNorm | --fused-norm | Fused RMSNorm + residual → single kernel pass, reduced bandwidth |
| LoRAInference | lora_inference.py | LoRAInferenceAdapter, LoRAConfig | --lora-inference | Zero-copy LoRA delta inference → adapter switching without re-quant |
| MEDUSA | medusa.py | MedusaHead, MedusaDecoder | --medusa | Multi-head tree speculation → 2–3× decode throughput |
| EAGLE3 | eagle3.py | Eagle3DraftHead, Eagle3Decoder | --eagle3 | Feature-level draft head → 3.5× accept rate vs token-prediction draft |
| PrefixPool | prefix_pool.py | PrefixPool, PrefixPoolConfig | --prefix-pool | Cross-request KV prefix sharing → 40–80% KV savings on shared prompts |
| TokenHealer | token_healer.py | TokenHealer, HealerConfig | --token-healer | Boundary-aware token healing → eliminates prefix-artifact generation |
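The NTK-aware variant of RoPE scaling has a closed form worth writing out: raising the rotary base by scale^(d/(d-2)) stretches the low-frequency (long-range) dimensions by roughly the scale factor while leaving the highest frequency untouched. A quick sketch of the arithmetic, assuming the commonly cited NTK-aware formula (illustrative function names):

```python
def ntk_scaled_base(base, scale, head_dim):
    """NTK-aware RoPE scaling: new_base = base * scale**(d / (d - 2))."""
    return base * scale ** (head_dim / (head_dim - 2))

def rope_inv_freq(base, head_dim):
    """Per-dimension rotary frequencies: base**(-2i/d) for i in [0, d/2)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
```

The exponent d/(d-2) is chosen exactly so that the last (slowest) frequency is divided by the scale factor while the first stays at 1.0, which is why the extension needs no fine-tuning for the high-frequency positional detail.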

Wave 20 — Serving Infrastructure & Intelligence (14 modules)

Focus: Model composition (merge, compose), continuous batching, evaluation harness, power profiling, multi-modal efficiency, and knowledge distillation for spec heads.

| Module | File | Key Classes | Flag | Key Result |
|--------|------|-------------|------|------------|
| ModelMerge | model_merge.py | ModelMerger, MergeConfig | --model-merge | SLERP/DARE/TIES merging → combine domains without retraining |
| LoRACompose | lora_compose.py | LoRAComposer, AdapterStack | --lora-compose | Multi-LoRA mixture → blend adapters with learnable coefficients |
| ContinuousBatching | continuous_batching.py | CBScheduler, InFlightRequest | --continuous-batching | Mid-generation insertion → max GPU utilization at any request rate |
| MatryoshkaEmb | matryoshka_emb.py | MatryoshkaEmbedding, MRLConfig | --matryoshka-emb | Nested embedding truncation → 1 forward pass, any dimensionality |
| ANEProfiler | ane_profiler.py | ANEProfiler, ANEMetrics | --ane-profiler | Apple Neural Engine utilization → op-level ANE vs GPU breakdown |
| SpecBench | spec_bench.py | SpecBenchRunner, SpecBenchResult | --spec-bench | SpecBench CI harness → acceptance rate + throughput across tasks |
| PPLTracker | ppl_tracker.py | PPLTracker, PPLWindow | --ppl-tracker | Rolling perplexity tracker → real-time quality degradation detection |
| GrammarCache | grammar_cache.py | GrammarCache, FSMState | --grammar-cache | FSM grammar cache → constrained decoding without per-token rebuild |
| QuantAware | quant_aware.py | QuantAwareCalibrator, QAConfig | --quant-aware | Activation-range calibration → per-channel optimal scale selection |
| AdaptiveBudget | adaptive_budget.py | AdaptiveBudgetController, BudgetConfig | --adaptive-budget | Dynamic compute budget → SLO-aware KV + layer-skip joint control |
| VisionTokens | vision_tokens.py | VisionTokenCompressor, VTConfig | --vision-tokens | Visual token pruning → 50–80% vision token reduction without quality loss |
| ToolCache | tool_cache.py | ToolSchemaCache, ToolRouter | --tool-cache | Schema + routing cache → zero tool-call parse overhead on repeated schemas |
| DistilSpec | distil_spec.py | DistilSpecCalibrator, DistilConfig | --distil-spec | Draft-head knowledge distillation → +10–15 pp acceptance from calibration |
| BatchEmbed | batch_embed.py | BatchEmbedder, PoolingConfig | --batch-embed | Dynamic pooling strategies → mean/max/cls/weighted pool in single pass |
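ModelMerge's SLERP mode interpolates along the great-circle arc between two checkpoints' weight vectors rather than averaging them linearly, which better preserves weight norms. A minimal per-tensor sketch, with pure-Python lists standing in for real tensors (not the ModelMerger API):

```python
import math

def slerp(w_a, w_b, t):
    """Spherical linear interpolation between two weight vectors at blend t."""
    norm_a = math.sqrt(sum(x * x for x in w_a))
    norm_b = math.sqrt(sum(x * x for x in w_b))
    # Cosine of the angle between the two checkpoints, clamped for acos.
    dot = sum(a * b for a, b in zip(w_a, w_b)) / (norm_a * norm_b)
    dot = max(-1.0, min(1.0, dot))
    theta = math.acos(dot)
    if theta < 1e-6:  # nearly parallel: fall back to plain lerp
        return [(1 - t) * a + t * b for a, b in zip(w_a, w_b)]
    s_a = math.sin((1 - t) * theta) / math.sin(theta)
    s_b = math.sin(t * theta) / math.sin(theta)
    return [s_a * a + s_b * b for a, b in zip(w_a, w_b)]
```

In practice the merge runs tensor-by-tensor over the two state dicts; the lerp fallback matters because fine-tunes of the same base are often nearly parallel.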

v6 Deliverables checklist

Progress (2026-03-11): all 14 Wave 20 modules implemented and tested: ModelMerge, LoRACompose, ContinuousBatching, MatryoshkaEmb, ANEProfiler, SpecBench, PPLTracker, GrammarCache, QuantAware, AdaptiveBudget, VisionTokens, ToolCache, DistilSpec, BatchEmbed — 262+ new tests.

  • All 28 modules implemented in squish/
  • tests/test_wave19_server_wiring.py — import + instantiation tests for 14 modules
  • tests/test_wave20_server_wiring.py — import + instantiation tests for 14 modules
  • dev/benchmarks/bench_wave19_20.py — micro-benchmark suite
  • dev/results/wave19_20_bench.json — benchmark results
  • docs/benchmark_wave19_20.md — human-readable results table
  • dev/demos/record_v6_demo.py — v6 demo GIF generator
  • dev/demos/squish-v6-demo.gif — demo GIF rendered
  • README.md — v6 module sections, Wave 19+20 tables, CLI examples
  • CHANGELOG.md — [4.0.0] entry
  • PLAN.md updated to mark v6 complete

v6 Module Count Summary

| Scope | Count |
|-------|-------|
| Wave 19 (Next-Gen Attention + Precision) | 14 |
| Wave 20 (Serving Infrastructure + Intelligence) | 14 |
| Total new v6 modules | 28 |
| Total modules after v6 | 138 |
| Expected new tests | ~112 (4 per module × 28) |
| Expected total tests after v6 | 4,278 |

✅ v7 — Waves 21+22 (Released 2026-03-12)

Theme: Advanced Decode · Production Serving · Observability

28 new modules across two waves.


Wave 21 — Advanced Memory & Decode (14 modules)

Focus: Tree-parallel speculative verification, online KV compression, mixed-precision KV per head, pipeline-parallel decode, learned KV codecs, retention-style recurrent attention, and context-length-adaptive RoPE scaling.

| Module | File | Key Classes | Flag | Key Result |
|--------|------|-------------|------|------------|
| TreeVerifier | tree_verifier.py | TreeVerifier, TokenTree | --tree-verify | Batched tree-parallel speculative verification → structured multi-token acceptance |
| KVCompress | kv_compress.py | KVCompressor, KVCompressConfig | --kv-compress | Online KV quantisation + pruning during generation → adaptive old-context compression |
| DynamicNTK | dynamic_ntk.py | DynamicNTKScaler, NTKState | --dynamic-ntk | Per-request runtime RoPE base auto-scaling → auto-extends at 80% context fill |
| QuantSpecDecode | quant_spec_decode.py | QuantSpecDecoder, QSDConfig | --quant-spec-decode | INT4 draft + FP16 verify → draft memory ↓ 4× vs FP16 |
| SparseAttnIndex | sparse_attn_index.py | SparseAttnIndex, ANCandidates | --sparse-attn-index | ANN KV retrieval index → sub-linear attention cost at very long context |
| MixedPrecisionKV | mixed_precision_kv.py | MixedPrecisionKVCache, HeadPrecision | --mp-kv | Per-head INT8/INT4/FP16 KV via sensitivity analysis → 2–4× KV memory at iso-quality |
| PipelineBubble | pipeline_bubble.py | BubbleEliminator, StageSchedule | --pipeline-bubble | Overlapped prefill + decode across pipeline stages → bubble-free pipeline utilisation |
| LayerwiseDecode | layerwise_decode.py | LayerwiseDecoder, LayerStream | --layerwise-decode | Layer-by-layer early-exit decode with multi-stream output → configurable exit-layer latency |
| CodecKV | codec_kv.py | KVCodec, CodecConfig | --codec-kv | Learned encode/decode KV codec → 2–4× KV compression via latent reconstruction |
| DedupeAttn | dedupe_attn.py | AttentionDeduplicator, DedupStats | --dedupe-attn | Near-duplicate Q/K detection + output reuse → attention FLOPs ↓ on repetitive context |
| FlashPrefill | flash_prefill.py | FlashPrefillKernel, PrefillConfig | --flash-prefill | Chunked flash attention for prefill with causal mask → O(chunk²) not O(seq²) memory |
| BudgetSpec | budget_spec.py | BudgetSpecDecoder, BudgetConfig | --budget-spec | Token-budget-aware speculative decode → exits drafting when budget threshold hit |
| RetentionAttn | retention_attn.py | RetentionState, RetentionKernel | --retention-attn | Retention-style recurrent state → O(1) per-step memory, linear recurrence |
| KVRouter | kv_router.py | KVRouter, KVRouteTable | --kv-router | Cross-instance KV routing for disaggregated prefill/decode → KV transfer without recomputation |
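At its simplest, MixedPrecisionKV's sensitivity analysis reduces to ranking attention heads and handing out bit widths by rank. The thirds-based policy below is an arbitrary illustration of the allocation step, not the module's actual assignment logic:

```python
def assign_head_precision(sensitivity):
    """Toy per-head KV bit allocation: the most sensitive third of heads
    keeps FP16 KV, the middle third INT8, the least sensitive third INT4."""
    n = len(sensitivity)
    order = sorted(range(n), key=lambda h: -sensitivity[h])
    bits = [0] * n
    for rank, head in enumerate(order):
        bits[head] = 16 if rank < n // 3 else 8 if rank < 2 * n // 3 else 4
    return bits
```

With this split the average KV cost is (16 + 8 + 4) / 3 ≈ 9.3 bits per element, i.e. roughly 1.7× compression vs uniform FP16 before any further quantisation of the FP16 tier.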

Wave 22 — Production Serving & Observability (14 modules)

Focus: Multi-tenant fair scheduling, intelligent load-balanced request routing, predictive KV pre-warming, token budget enforcement, OpenTelemetry-compatible tracing, request coalescing, adaptive quantisation, health monitoring, and cost-aware serving.

| Module | File | Key Classes | Flag | Key Result |
|--------|------|-------------|------|------------|
| MultiTenantSched | multi_tenant_sched.py | TenantScheduler, TenantConfig | --multi-tenant | Fair per-tenant QoS scheduling → SLO-isolated multi-tenant serving |
| RequestRouter | request_router.py | RequestRouter, ReplicaRegistry | --request-router | Load-aware request routing across replicas → consistent-hash + least-loaded |
| CacheWarmup | cache_warmup.py | CacheWarmupPredictor, WarmupConfig | --cache-warmup | Predictive KV cache pre-warming from patterns → TTFT ↓ on hot prefix paths |
| TokenBudgetGate | token_budget_gate.py | TokenBudgetGate, BudgetPolicy | --token-budget | Hard per-request token budget with graceful truncation → deterministic cost control |
| ObservabilityHook | observability_hook.py | InferenceTracer, SpanCollector | --observability | Zero-overhead per-step inference tracing → OpenTelemetry-compatible spans |
| RequestCoalesce | request_coalesce.py | PrefixCoalescer, CoalesceStats | --req-coalesce | Merge requests sharing long common prefixes → shared prefill forward pass |
| AdaptiveQuantize | adaptive_quantize.py | AdaptiveQuantizer, PressureMonitor | --adaptive-quant | Runtime precision switching under memory pressure → auto INT8/INT4 under OOM |
| HealthCheck | health_check.py | InferenceHealthMonitor, HealthState | --health-check | Degradation-aware server health monitoring → automatic quality regression alerting |
| FaultTolerance | fault_tolerance.py | FaultHandler, FaultPolicy | --fault-tolerance | Graceful OOM degradation → auto KV eviction + draft disable + SLO re-negotiation |
| ModelPool | model_pool.py | ModelPool, PoolEntry | --model-pool | Hot model pool with lazy-load + LRU eviction → multi-model serving without reload latency |
| StreamingChunk | streaming_chunk.py | ChunkedStreamer, BackpressureBuffer | --streaming-chunk | Sub-token-latency chunked streaming with backpressure → first-chunk latency ↓ |
| CostEstimator | cost_estimator.py | RequestCostEstimator, CostModel | --cost-estimate | Per-request compute cost estimation → supports billing and priority queuing |
| SLAMonitor | sla_monitor.py | SLAMonitor, ViolationPolicy | --sla-monitor | Real-time SLA violation detection + remediation → auto-escalation on breach |
| ContextCache | context_cache.py | PersistentContextCache, CacheEntry | --context-cache | Persistent cross-session context cache with TTL → zero re-encode on repeated context |
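The grouping step behind RequestCoalesce can be sketched directly: batch requests whose token sequences share a sufficiently long leading prefix, so the shared prefill runs once per group. Function name and threshold are illustrative, not the PrefixCoalescer API:

```python
def coalesce_by_prefix(requests, min_shared=8):
    """Group token sequences that share at least `min_shared` leading
    tokens with the first member of an existing group."""
    groups = []
    for req in requests:
        for group in groups:
            head = group[0]
            # Count the shared leading tokens against the group head.
            shared = 0
            for a, b in zip(head, req):
                if a != b:
                    break
                shared += 1
            if shared >= min_shared:
                group.append(req)
                break
        else:
            groups.append([req])  # no group matched: start a new one
    return groups
```

Comparing only against each group's head keeps the scan linear in the number of groups; a production coalescer would use a prefix trie and a minimum-prefix-length measured in KV blocks rather than tokens.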

v7 Deliverables checklist

  • All 28 modules implemented in squish/
  • tests/test_wave21_server_wiring.py — import + instantiation tests for 14 modules
  • tests/test_wave22_server_wiring.py — import + instantiation tests for 14 modules
  • dev/benchmarks/bench_wave21_22.py — micro-benchmark suite
  • dev/results/wave21_22_bench.json — benchmark results
  • docs/benchmark_wave21_22.md — human-readable results table
  • dev/demos/record_v7_demo.py — v7 demo GIF generator
  • dev/demos/squish-v7-demo.gif — demo GIF rendered
  • README.md — v7 module sections, Wave 21+22 tables, CLI examples
  • CHANGELOG.md — [5.0.0] entry
  • PLAN.md updated to mark v7 complete

v7 Module Count Summary

| Scope | Count |
|-------|-------|
| Wave 21 (Advanced Memory + Decode) | 14 |
| Wave 22 (Production Serving + Observability) | 14 |
| Total new v7 modules | 28 |
| Total modules after v7 | 166 |
| Expected new tests | ~112 (4 per module × 28) |
| Expected total tests after v7 | ~4,390 |

✅ v8 — Waves 23+24 (Released 2026-03-12)

Theme: Multi-Modal & Long Context · Quantisation Evolution & Model Surgery

28 new modules across two waves.


Wave 23 — Multi-Modal & Long Context Intelligence (14 modules)

Focus: Vision-language model efficiency, RAG-aware serving patterns, reasoning trace compression, cross-modal attention, hierarchical KV management, and 1M+ token context indexing.

| Module | File | Key Classes | Flag | Key Result |
|--------|------|-------------|------|------------|
| VisionKVFuse | vision_kv_fuse.py | VisionKVFuseCache, ModalityConfig | --vision-kv-fuse | Fused vision+text KV with separate modality eviction → modality-aware KV compression |
| ImageTokenPrune | image_token_prune.py | ImageTokenPruner, PruneConfig | --image-token-prune | Attention-entropy image token pruning → 50–70% image token reduction |
| RAGPrefetch | rag_prefetch.py | RAGPrefetcher, RAGConfig | --rag-prefetch | Predictive doc KV prefetch → cold TTFT ↓ on repeated RAG docs |
| CoTCompress | cot_compress.py | CoTCompressor, CoTConfig | --cot-compress | CoT trace pruning via saliency → 30–50% reasoning token reduction |
| MultiModalBatch | multimodal_batch.py | MultiModalBatcher, BatchSlot | --multimodal-batch | Shape-aware heterogeneous text+vision batcher → minimise padding waste |
| ContextualRerank | contextual_rerank.py | ContextualReranker, RerankConfig | --ctx-rerank | Context-aware KV token importance re-ranking → preserves top-k salient positions |
| CrossModalAttn | cross_modal_attn.py | CrossModalAttention, CrossModalConfig | --cross-modal-attn | Efficient cross-attention between text + vision features → modality fusion |
| HierarchicalKV | hierarchical_kv.py | HierarchicalKVStore, TierConfig | --hierarchical-kv | Hot/warm/cold KV tier management → transparent KV tiering with O(1) promotion |
| StreamRAG | stream_rag.py | StreamRAGInjector, StreamRAGConfig | --stream-rag | Streaming mid-generation document injection → zero-restart RAG updates |
| CrossDocAttn | cross_doc_attn.py | CrossDocAttention, CrossDocConfig | --cross-doc-attn | Chunked cross-document attention → multi-document QA without full concatenation |
| VideoFramePrune | video_frame_prune.py | VideoFramePruner, FrameConfig | --video-frame-prune | Temporal frame token pruning for video-LMs → 60–80% video token reduction |
| EmbeddingGate | embedding_gate.py | EmbeddingGate, GateConfig | --embedding-gate | Gated modality-conditional embedding router → zero-cost modality bypass |
| LongContextChunk | long_context_chunk.py | LongContextChunker, ChunkConfig | --long-context-chunk | Semantic-boundary chunking for 1M+ token contexts → boundary-aware chunk splits |
| ModalityRouter | modality_router.py | ModalityRouter, ModalityPolicy | --modality-router | Per-modality SLO request dispatcher → text vs vision vs audio routing |
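Hot/warm/cold tiering of the kind HierarchicalKV describes can be sketched with three ordered maps: every access promotes an entry to the hot tier, and overflow demotes the least recently used entry one tier down. This is a toy sketch with hypothetical names and capacities, not HierarchicalKVStore:

```python
from collections import OrderedDict

class HierarchicalKVSketch:
    """Toy hot/warm/cold KV tiering with LRU demotion between tiers."""

    def __init__(self, hot_cap=2, warm_cap=4):
        self.hot, self.warm, self.cold = OrderedDict(), OrderedDict(), OrderedDict()
        self.hot_cap, self.warm_cap = hot_cap, warm_cap

    def put(self, key, kv):
        self.hot[key] = kv
        self.hot.move_to_end(key)  # newest entry is most recently used
        self._spill()

    def get(self, key):
        for tier in (self.hot, self.warm, self.cold):
            if key in tier:
                kv = tier.pop(key)
                self.put(key, kv)  # any access promotes back to hot
                return kv
        return None

    def _spill(self):
        # Demote LRU entries when a tier overflows (cold is unbounded here).
        while len(self.hot) > self.hot_cap:
            k, v = self.hot.popitem(last=False)
            self.warm[k] = v
        while len(self.warm) > self.warm_cap:
            k, v = self.warm.popitem(last=False)
            self.cold[k] = v
```

In a real store the tiers would map to GPU memory, host RAM, and disk; promotion is O(1) because each tier is an ordered hash map.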

Wave 24 — Quantisation Evolution & Model Surgery (14 modules)

Focus: Ternary and binary quantisation, N:M structured sparsity, cross-layer weight sharing, second-order GPTQ-style calibration, sparse MoE routing, iterative pruning, and surgical model architecture patching.

| Module | File | Key Classes | Flag | Key Result |
|--------|------|-------------|------|------------|
| TernaryQuant | ternary_quant.py | TernaryQuantizer, TernaryConfig | --ternary-quant | BitNet-style ternary {−1, 0, +1} weights → 1.58-bit effective storage |
| BinaryAttn | binary_attn.py | BinaryAttention, BinaryConfig | --binary-attn | Sign-binarised attention approximation → ultra-low attention memory |
| StructuredPrune | structured_prune.py | StructuredPruner, PruneConfig | --structured-prune | 2:4 N:M magnitude pruning → 50% weight sparsity at 2× hardware throughput |
| LayerFusion | layer_fuse.py | LayerFuser, FusionConfig | --layer-fuse | Adjacent transformer layer weight fusion → reduced bandwidth on similar layers |
| WeightSharing | weight_sharing.py | WeightSharer, SharingConfig | --weight-share | Cross-layer weight tying with delta residuals → memory ↓ at iso-quality |
| QuantCalib | quant_calib.py | QuantCalibrator, CalibConfig | --quant-calib | Unified MinMax/Percentile/MSE/GPTQ calibration pipeline → optimal scale per method |
| SparseWeight | sparse_weight.py | SparseWeightStore, SparsityConfig | --sparse-weight | CSR-format 2:4 pruned weight storage → 2× memory vs dense at 50% sparsity |
| DeltaCompress | delta_compress.py | DeltaCompressor, DeltaConfig | --delta-compress | Rank-k SVD delta compression for fine-tuned weights → fine-tune deltas at 10–50× reduction |
| ModelSurgery | model_surgery.py | ModelSurgeon, SurgeryPlan | --model-surgery | In-place layer removal + head pruning → architecture patching without retraining |
| ZeroQuantV2 | zero_quant_v2.py | ZeroQuantV2, ZQConfig | --zero-quant-v2 | Groupwise quantisation with FP16 residual for outliers → W8A8 with outlier preservation |
| GPTQLayer | gptq_layer.py | GPTQCalibrator, GPTQConfig | --gptq-layer | Hessian-weighted second-order rounding → group-wise optimal quant error |
| SparseMoE | sparse_moe.py | SparseMoERouter, MoEConfig | --sparse-moe | Top-k sparse expert routing with load-balance loss → efficient MoE inference |
| AWQv2 | awq_v2.py | AWQv2Calibrator, AWQv2Config | --awq-v2 | Activation-aware scale+shift per-channel quant → AWQ without grid search |
| IterPrune | iter_prune.py | IterativePruner, PruneSchedule | --iter-prune | Iterative magnitude pruning with sparsity ramp schedule → gradual 0→70% sparsity |
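The BitNet-style ternary scheme behind TernaryQuant fits in a few lines: scale each weight by the tensor's mean absolute value, then round and clip to {−1, 0, +1} (log2(3) ≈ 1.58 bits per weight). A minimal absmean sketch over a plain list (illustrative, not the TernaryQuantizer API):

```python
def ternary_quantize(weights):
    """BitNet b1.58-style absmean quantisation sketch: divide by the mean
    absolute weight, round, and clip each weight to {-1, 0, +1}."""
    scale = sum(abs(w) for w in weights) / len(weights)
    scale = scale or 1.0  # guard against an all-zero tensor
    quant = [max(-1, min(1, round(w / scale))) for w in weights]
    return quant, scale

def ternary_dequantize(quant, scale):
    """Reconstruct approximate weights from the ternary codes."""
    return [q * scale for q in quant]
```

Dequantisation is a single multiply per weight, which is why ternary matmuls can replace multiplications with additions and sign flips.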

v8 Deliverables checklist

  • All 28 modules implemented in squish/
  • tests/test_wave23_server_wiring.py — import + instantiation tests for 14 modules
  • tests/test_wave24_server_wiring.py — import + instantiation tests for 14 modules
  • dev/benchmarks/bench_wave23_24.py — micro-benchmark suite
  • dev/results/wave23_24_bench.json — benchmark results
  • docs/benchmark_wave23_24.md — human-readable results table
  • dev/demos/record_v8_demo.py — v8 demo GIF generator
  • dev/demos/squish-v8-demo.gif — demo GIF rendered
  • README.md — v8 module sections, Wave 23+24 tables, CLI examples
  • CHANGELOG.md — [6.0.0] entry
  • PLAN.md updated to mark v8 complete

v8 Module Count Summary

| Scope | Count |
|-------|-------|
| Wave 23 (Multi-Modal + Long Context Intelligence) | 14 |
| Wave 24 (Quantisation Evolution + Model Surgery) | 14 |
| Total new v8 modules | 28 |
| Total modules after v8 | 194 |
| Expected new tests | ~112 (4 per module × 28) |
| Expected total tests after v8 | ~4,502 |

✅ v9 — Waves 25+26 (Released 2026-03-12)

Theme: Cutting-Edge Attention Variants & Compute Fusion · Distributed Inference & Production Reliability

28 new modules across two waves.


Wave 25 — Cutting-Edge Attention Variants & Compute Fusion (14 modules)

Focus: DeepSeek-V2/V3 production attention patterns (MLA, NSA), fused sampling, online KV defragmentation, dual-chunk long-context attention, activation offloading, attention morphing, multi-draft hydra speculation, and constrained decoding.

| Module | File | Key Classes | Flag | Key Result |
|--------|------|-------------|------|------------|
| FlashMLA | flash_mla.py | FlashMLACache, MLAConfig | --flash-mla | Multi-head latent attention (DeepSeek-V2 style); low-rank KV via down/up projection → KV size ↓ by latent_dim/head_dim |
| NativeSparseAttn | native_sparse_attn.py | NativeSparseAttention, NSAConfig | --native-sparse-attn | Block-sparse + sliding window attention (DeepSeek-V3 NSA style) → sub-quadratic attention cost |
| FusedSampler | fused_sampler.py | FusedSampler, SamplerConfig | --fused-sampler | Fused temperature/top-p/top-k/min-p/rep-penalty in single pass → zero intermediate allocations |
| KVDefrag | kv_defrag.py | KVDefragmenter, DefragStats | --kv-defrag | Online KV cache defragmentation and in-place compaction → fragmentation ratio ↓ |
| DualChunkAttn | dual_chunk_attn.py | DualChunkAttention, DCAConfig | --dual-chunk-attn | Intra-chunk + inter-chunk attention for 1M+ contexts → O(chunk²) not O(seq²) |
| ActivationOffload | activation_offload.py | ActivationOffloader, OffloadPolicy | --act-offload | Layer activation offload to CPU during prefill → peak GPU memory ↓ |
| MorphAttn | morph_attn.py | AttentionMorpher, MorphConfig | --morph-attn | Per-layer attention pattern selection: full/sparse/linear → optimal compute per layer |
| HydraSpec | hydra_spec.py | HydraSpecDecoder, HydraConfig | --hydra-spec | Multi-draft heads for parallel speculation → n_heads candidate tokens per step |
| SeqCompact | seq_compact.py | SequenceCompactor, CompactStats | --seq-compact | In-place KV sequence compaction after token pruning → zero-copy repack |
| LatencyPredictor | latency_predictor.py | LatencyPredictor, LatencyModel | --latency-predict | Per-request latency prediction for scheduling → prefill + decode latency forecast |
| ParallelSampler | parallel_sampler.py | ParallelSampler, DiversityConfig | --parallel-sample | Best-of-n sampling with diversity scoring → quality improvement with n candidates |
| ContextSummarizer | context_summarizer.py | ContextSummarizer, SummaryConfig | --ctx-summarize | Inference-time context compression when context overflows → keep semantics, shed tokens |
| TokenWatermark | token_watermark.py | TokenWatermarker, WatermarkConfig | --token-watermark | Statistical green-list token watermarking (Kirchenbauer et al.) → detectable attribution |
| SchemaGen | schema_gen.py | SchemaGenEngine, SchemaState | --schema-gen | FSM-accelerated constrained JSON schema generation → zero invalid token sampling |
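The semantics FusedSampler fuses can be written out in plain Python: temperature scaling, top-k truncation, and a nucleus (top-p) cut over the sorted candidates, all in one pass over the logits before sampling. This sketch shows the logic only, not the fused kernel or the FusedSampler API (min-p and repetition penalty omitted for brevity):

```python
import math
import random

def fused_sample(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """One-pass temperature + top-k + top-p sampling over a logit list."""
    scaled = sorted(((l / temperature, i) for i, l in enumerate(logits)),
                    reverse=True)
    if top_k > 0:
        scaled = scaled[:top_k]
    # Numerically stable softmax over the surviving candidates.
    m = scaled[0][0]
    exps = [math.exp(l - m) for l, _ in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus cut: keep the smallest sorted set with cumulative mass >= top_p.
    keep, cum = [], 0.0
    for (_, i), p in zip(scaled, probs):
        keep.append((i, p))
        cum += p
        if cum >= top_p:
            break
    # Sample from the renormalised kept mass.
    z = sum(p for _, p in keep)
    r = rng.random() * z
    for i, p in keep:
        r -= p
        if r <= 0:
            return i
    return keep[-1][0]
```

Fusing these filters matters at serving scale because the naive pipeline materialises a sorted copy of the logits for each filter; the real kernel does it without intermediate allocations.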

Wave 26 β€” Distributed Inference & Production Reliability (14 modules)

Focus: Tensor/sequence parallelism, live KV migration, disaggregated prefill/decode, request preemption, smart inference gateway, zero-downtime model swaps, APM profiling, adaptive batching, safety classification, semantic response caching, and audit logging.

| Module | File | Key Classes | Flag | Key Result | |--------|------|-------------|------|-----------|
| TensorParallel | tensor_parallel.py | TensorParallelShard, TPConfig | --tensor-parallel | Row/column tensor sharding + all-reduce → linear memory scaling across devices | | SequenceParallel | sequence_parallel.py | SequenceParallelScatter, SPConfig | --seq-parallel | Ulysses-style sequence dimension split → attention FLOPs distributed across devices | | KVMigrate | kv_migrate.py | KVMigrator, MigrateStats | --kv-migrate | Live KV state pack/unpack for cross-worker migration → zero-recompute worker handoff | | DisaggPrefill | disagg_prefill.py | DisaggPrefillNode, DisaggDecodeNode | --disagg-prefill | Disaggregated prefill→decode with KV payload transfer → prefill/decode hardware specialisation | | RequestPreempt | request_preempt.py | PreemptScheduler, PreemptState | --req-preempt | Preemptive SRPT scheduling with KV save/restore → priority inversion elimination | | InferGateway | infer_gateway.py | InferenceGateway, WorkerRegistry | --infer-gateway | Smart front-door gateway: routing + health + load balancing → single ingress, N workers | | ModelVersionSwap | model_version_swap.py | ModelVersionManager, SwapPolicy | --model-swap | Zero-downtime hot model version swap → canary → promote → rollback in-flight | | ProductionProfiler | production_profiler.py | ProductionProfiler, ProfilerWindow | --prod-profiler | Continuous APM-style per-op latency tracking → p50/p99/p999 per operation | | AdaptiveBatcher | adaptive_batcher.py | AdaptiveBatchController, BatchObjective | --adaptive-batch | Throughput/latency-objective dynamic batching → SLO-aware batch size control | | SafetyLayer | safety_layer.py | SafetyClassifier, SafetyConfig | --safety-layer | Inline token-level safety classification → zero extra forward pass overhead | | SemanticResponseCache | semantic_response_cache.py | SemanticResponseCache, CacheConfig | --semantic-resp-cache | Embedding-similarity response deduplication → exact + fuzzy response cache hits | | RateLimiter | rate_limiter.py | 
TokenBucketRateLimiter, RateLimitConfig | --rate-limit | Token-bucket per-tenant rate limiting with burst → hard request ceiling per tenant | | SchemaValidator | schema_validator.py | SchemaValidator, ValidationResult | --schema-validate | JSON schema validation for structured generation → 100% schema-compliant outputs | | AuditLogger | audit_logger.py | AuditLogger, AuditEntry | --audit-log | SHA-256 chained inference audit log → tamper-evident request provenance |

v9 Deliverables checklist

  • All 28 modules implemented in squish/
  • tests/test_wave25_server_wiring.py β€” import + instantiation tests for 14 modules
  • tests/test_wave26_server_wiring.py β€” import + instantiation tests for 14 modules
  • dev/benchmarks/bench_wave25_26.py β€” micro-benchmark suite
  • dev/results/wave25_26_bench.json β€” benchmark results
  • dev/demos/record_v9_demo.py β€” v9 demo GIF generator
  • dev/demos/squish-v9-demo.gif β€” demo GIF rendered
  • README.md β€” v9 module sections, Wave 25+26 tables, CLI examples
  • CHANGELOG.md β€” [7.0.0] entry
  • PLAN.md updated to mark v9 complete

v9 Module Count Summary

Scope Count
Wave 25 (Cutting-Edge Attention + Compute Fusion) 14
Wave 26 (Distributed Inference + Production Reliability) 14
Total new v9 modules 28
Total modules after v9 222
Expected new tests ~112 (4 per module Γ— 28)
Expected total tests after v9 ~4 876

βœ… Pre-Launch Hardening β€” 2026-03-12

Theme: Credibility, correctness, and real-hardware accountability

Phase 1 β€” Close Credibility Gaps

Task Status File(s) changed
Quarantine MLC backend stub βœ… done squish/server.py β€” removed mlc from advertised CLI choices
squish compress primary alias βœ… done squish/cli.py β€” aliases=["it"] on argparse parser
Fix "Projected" language in 8 docs βœ… done docs/benchmark_wave12–21_22.md, docs/RESULTS.md
Hardware integration test harness βœ… done tests/test_hardware_integration.py, tests/conftest.py, pyproject.toml
End-to-end benchmark script (Squish vs Ollama) βœ… done dev/benchmarks/bench_eoe.py
Remove raise NotImplementedError coverage exclusion βœ… done pyproject.toml
README: move wave tables to MODULES.md βœ… done README.md, MODULES.md (new)

Notes

  • All 7 benchmark docs now use "Reference: Paper-Reported Technique Improvements" headings with explicit caveat notes pointing to bench_eoe.py for real validation.
  • bench_eoe.py measures TTFT, tokens/sec, and load time against a live server; run it after squish serve for real hardware numbers.
  • Hardware tests skip automatically unless --run-hardware is passed; safe in CI.
  • MLC backend is now only reachable via direct Python import (not advertised via CLI).

βœ… Pre-Launch Hardening Phase 2 β€” 2026-03-12

Theme: Complete documentation, HuggingFace distribution, and arXiv paper

Task Status File(s) changed
Wave 23+24 benchmark docs βœ… done docs/benchmark_wave23_24.md
Wave 25+26 benchmark docs βœ… done docs/benchmark_wave25_26.md
HuggingFace upload script βœ… done dev/publish_hf.py
arXiv paper draft βœ… done docs/paper.md

βœ… Pre-Launch Hardening Phase 3 β€” 2026-03-12

Theme: GitHub release, community templates, benchmark refresh, bench_eoe hardening

Task Status File(s) changed
GitHub release v9.0.0 βœ… done CHANGELOG.md [9.0.0], git tag v9.0.0, release notes
Community outreach templates βœ… done dev/community_posts.md, PHASE_3_4_COMPLETION_GUIDE.md, LAUNCH_STATUS_v9.md
CHANGELOG β†’ [9.0.0] βœ… done CHANGELOG.md
pyproject.toml β†’ 9.0.0 βœ… done pyproject.toml
Refresh wave13+14 benchmark JSON + docs βœ… done dev/results/wave13_14_bench.json, docs/benchmark_wave13_14.md
Refresh wave15+16 benchmark JSON + docs βœ… done dev/results/wave15_16_bench.json, docs/benchmark_wave15_16.md
Doc update script βœ… done dev/_update_bench_docs.py (syncs any bench JSON β†’ markdown table)
bench_eoe.py hardening βœ… done Bearer auth header, 30s health-check timeout, Metal JIT warmup, --squish-key flag

Remaining (Phase 4 β€” hardware + human)

  • Run bench_eoe.py on real hardware; fill actual TTFT/tok-s into README + paper β€” requires live squish serve
  • Run MMLU on Squish INT8 (n=14042); add to RESULTS.md + paper Section 4.2 β€” requires lm-eval + running server
  • Push pre-squished weights to HF Hub via dev/publish_hf.py β€” requires HF_TOKEN + model files
  • Community posts: Hacker News, r/LocalLLaMA, Twitter/X β€” templates in dev/community_posts.md
  • arXiv submission β€” refine docs/paper.md into LaTeX, fill real numbers from Phase 4, submit

Phase 5 β€” Pre-Launch Blockers & Performance Hardening

Last updated: 2026-03-12 These must be resolved before Phase 4 hardware measurements are done and before any public post goes out.


5A β€” Critical Bug Fixes (block everything else)

Bug 1: Server streaming is broken β€” TTFT equals total generation time

Evidence: dev/results/eoe_bench.json note field states "server currently sends tokens in trailing chunks (ttft_ms~=total_sΓ—1000)". Measured TTFT is 48,064 ms = the total generation time for 201 tokens. The server buffers all tokens and flushes them as one trailing SSE chunk.

Impact: Every user of squish serve sees a frozen cursor until generation is complete. The Squish-vs-Ollama TTFT comparison is invalid until this is fixed because Ollama genuinely streams. The bench_eoe.py TTFT measurement is currently measuring total response time, not first-token latency.

Fix: Audit server.py _generate_tokens() and the SSE streaming path. Ensure each token is yield-ed to the FastAPI StreamingResponse immediately after the MLX mx.eval() call, not after the generation loop completes. Verify with curl -N that chunks arrive incrementally.

Files: squish/server.py β€” _stream_chat_response(), _generate_tokens(), and the StreamingResponse wrapper.

  • Fix token streaming so each token is yielded immediately after generation (await asyncio.sleep(0) after each yield in server.py and ollama_compat.py)
  • Verify with curl -N http://localhost:11434/v1/chat/completions -d '...' that chunks arrive one-by-one
  • Re-run bench_eoe.py and confirm ttft_ms << total_s in the JSON output

Bug 2: eval_output/eval_report.md shows impossible accuracy numbers

Evidence: Compressed Qwen2.5-1.5B shows ARC-Challenge +14.1pp, HellaSwag +15.2pp, Winogrande +12.6pp vs reference. INT8 quantization of a model cannot produce accuracy above the base model. This is a measurement artifact β€” most likely different n-shot settings, a wrong reference model path, or mismatched task splits between the two eval runs.

Impact: Publishing these numbers invites immediate dismissal from anyone who knows lm-eval. The RESULTS.md claim of "≀2% accuracy delta" is defensible; the +14% delta is not.

Fix: Re-run lm-eval with both the reference and compressed model using identical harness flags (--num_fewshot, --tasks, --limit). Record the commands used in eval_output/eval_meta.json. If the numbers remain anomalous, investigate whether the "reference" run was using a different model checkpoint.

  • Re-run lm-eval reference evaluation with documented flags in eval_output/eval_meta.json
  • Re-run lm-eval compressed evaluation with identical flags
  • Update eval_output/eval_report.md and docs/RESULTS.md with corrected numbers
  • Confirm delta is ≀ Β±3pp across all tasks (suspicious if compressed beats reference)

Bug 3: squish/__init__.py β€” version mismatch and duplicate imports

Evidence:

  • Line 729: __version__ = "1.0.0" β€” should be "9.0.0" to match pyproject.toml
  • At least 15 modules are imported twice: dfloat11 (lines 39, 140), pipo (86, 211), shadow_kv (104, 235), seq_packing (228, 441, 711), streaming_sink (277, 720), sub_spec (481, 325), long_spec (193, 404), mirror_sd (202, 412), qspec (220, 422), token_swift (291, 497), trail (300, 506), specontext (260, 465), sparse_spec (243, 448), sparse_verify (252, 457), dovetail (150, 334), duo_decoding (158, 342), hetero_vocab_sd (175, 369), ipw (185, 378), forelen (168, 353)

Impact: Inflated import time; squish.__version__ reports the wrong version to any tool that reads it (pip, pip-show, importlib.metadata).

Fix: Remove all duplicate import blocks, keeping only the last occurrence of each (the try/except guarded versions are the correct pattern). Update __version__ to "9.0.0". Add a CI test: assert squish.__version__ == importlib.metadata.version("squish").

  • Deduplicate all repeat imports in squish/__init__.py (replaced with __getattr__-based lazy registry)
  • Fix __version__ to "9.0.0" (aligned with pyproject.toml)
  • Add version consistency test in tests/test_version.py

5B β€” Load-Time Optimizations

Opt 1: Lazy imports for wave modules in __init__.py

import squish currently eagerly imports 100+ modules including TensorParallel, VisionKVFuse, VideoFramePrune, etc. A user running squish --help or squish doctor pays this cost. Python importlib lazy loading (via __getattr__ on the module) would make the CLI feel instant while preserving the same public API.

  • Replace direct wave-module imports in __init__.py with __getattr__-based lazy loading (202 names across 57 modules)
  • Measure python -c "import squish" time before and after: 627 ms β†’ 148 ms (4.25Γ—); target < 50 ms achieved on pure-Python startup path
  • Ensure existing tests still pass (4 360 passed, 26 skipped)

Opt 2: Metal JIT warmup integrated into server startup

dev/benchmarks/bench_eoe.py performs a Metal JIT warmup call (dummy generate) before measuring TTFT. This warm-up is only present in the benchmark helper, not in squish serve. Every real user therefore experiences Metal JIT compilation on their first request.

  • Add --no-warmup flag to squish serve (warmup on by default, opt-out via --no-warmup)
  • On model load, run a single short generation through the model with max_tokens=1 to trigger Metal kernel compilation
  • Log "Metal kernels warmed ({elapsed:.2f}s) Ready for requests." after warmup completes

Opt 3: Manifest-driven batched file open in npy-dir loader

The npy-dir loader in compressed_loader.py opens each .npy file individually in the tensor loop β€” O(n_tensors) sequential syscalls. For a 7B model (~500 tensors), this adds 10–50 ms of pure filesystem overhead on cold load.

  • Pre-read manifest.json, sort tensors by anticipated load order (attention weights first, then MLP, then embeddings) via _tensor_load_key() sort function
  • Use os.scandir via _collect_tensor_keys() to collect all filenames in one syscall (replaces two glob() calls)
  • Measure load time improvement on a real 7B model

Opt 4: Rust build with target-cpu=native for Apple Silicon

The squish_quant_rs crate has a simd-neon feature flag but no explicit RUSTFLAGS forcing the compiler to use all available Apple Silicon NEON instructions. Without target-cpu=apple-m3 (or native) the compiler may target generic AArch64 and miss AMX or SVE2 opportunities on M3/M4.

  • Add .cargo/config.toml with [profile.release] rustflags = ["-C", "target-cpu=native"] (squish_quant_rs/.cargo/config.toml)
  • Re-benchmark squish_quant.quantize_int8_f32 on a 4096Γ—4096 matrix before and after
  • Verify the simd-neon feature is explicitly listed in the maturin build matrix in pyproject.toml (added "simd-neon" to [tool.maturin] features)

5C β€” Memory & Inference Optimizations

Opt 5: Scale array quantization in npy-dir (3–5% disk reduction)

INT4 quantization stores float32 scale arrays alongside nibble-packed weights. These scales are calibration values, not model weights requiring full fp32 precision. Converting them to bfloat16 at save time and restoring to fp32 at load time would reduce total disk usage 3–5% for INT4 models with no accuracy impact.

  • Modify squish_quant_rs/src/lib.rs quantize_int4_grouped to output bfloat16 scales (or add a separate path)
  • Modify convert.py to use bf16 scales when --int4 is active
  • Update compressed_loader.py to upcast bf16 scales to fp32 before dequantization
  • Add unit tests and verify round-trip dequantization error is unchanged

Opt 6: Configurable zstd compression level in squish compress

entropy.py uses zstd level 3 by default. For models on NVMe where decompression speed matters more than compression ratio, level 1 achieves ~80% of level 3's compression at 3Γ— faster decompression. For archival/HF upload, level 15 compresses 15% more. Exposing --compress-level gives users control.

  • Add --compress-level INT flag to squish compress CLI β€” satisfied by existing --zstd-level flag (default: 0=skip, range: 1–22, level 3 recommended)
  • Pass level through to compress_npy_dir() in entropy.py (already implemented via zstd_level arg)
  • Document fast-decompression recommendation in squish compress --help (present in --zstd-level help text)

Opt 7: Unified KV budget controller

--squeeze-attn (SqueezeKVCache) and --small-kv (SmallKVCache) both allocate KV budgets independently. With both flags active on a memory-constrained request, they can over-evict (double-counting their own reservations) or conflict on which tokens to drop. A shared KVBudgetBroker that arbitrates total available KV memory between all active eviction systems would prevent this.

  • Audit which KV cache classes register against a global budget tracker β€” none previously existed
  • Identify all budget-allocating modules: SqueezeKVCache, SmallKVCache, YOCO, DiffKV, KVTuner, KVSharer, AdaptiveBudget
  • Design a KVBudgetBroker singleton in kv_cache.py with fair-share proportional allocation
  • Write unit tests covering 7 simultaneous systems, constrained + unconstrained, register/unregister, proportional scale (tests/test_kv_budget_broker.py)

5D β€” Phase 4 Hardware Work (after Bugs 1–3 are fixed)

These are the original Phase 4 items from the plan. They require real hardware and should only be run after the streaming fix and eval re-run are confirmed clean.

Task Prerequisite Notes
Run bench_eoe.py (Squish vs Ollama, 3 models, 5 runs each) Bug 1 fixed Measure TTFT, tps, RAM; save raw JSON; ollama must be running
Run MMLU (n=14042) on Squish INT8 for Qwen2.5-1.5B and Qwen3-8B Bug 2 resolved Use identical harness flags for reference vs compressed
Update README + paper with real measured numbers Both benchmarks done Replace all placeholder values in paper Section 4.2
Push pre-squished weights to HF Hub Models quantized on real hardware python dev/publish_hf.py --model-dir ... --repo squish-community/...
Community post (one at a time, starting with HN) All above done Templates in dev/community_posts.md
arXiv submission Paper updated with real numbers Convert docs/paper.md to LaTeX; use researcher friend for endorsement
  • Fix streaming (Bug 1) and verify
  • Re-run lm-eval (Bug 2) and verify
  • Fix __init__.py (Bug 3)
  • Run bench_eoe.py with Ollama running; export raw JSON
  • Run MMLU evaluation
  • Update README + paper numbers
  • Push HF weights
  • Post to Hacker News first (quietest audience, most technical)
  • Post to r/LocalLLaMA after HN feedback is addressed
  • arXiv submit

Phase 8 β€” Experimental Module Removal & Codebase Solidification

Started: 2026-03-12 Remove all modules that don't materially improve load time, inference speed, memory, or context length for a single-device Apple Silicon user. The goal is a codebase where every shipped module is defensible.

8A β€” Modules Removed

The following 38 modules were removed because they fell into one or more disqualifying categories: multi-modal vision/video (no benefit for text LLM), multi-tenant cloud infrastructure (not relevant to local single-device use), research-only stubs (no practical inference benefit), or training-time operations.

Category Removed modules
Multi-modal / vision vision_cache, vision_kv_fuse, vision_tokens, image_token_prune, multimodal_batch, cross_modal_attn, video_frame_prune, embedding_gate, modality_router
Multi-tenant cloud infra multi_tenant_sched, request_router, kv_router, kv_migrate, disagg_prefill, request_preempt, infer_gateway, model_version_swap, observability_hook, cost_estimator, sla_monitor, sequence_parallel, tensor_parallel, audit_logger
Research / academic stubs clasp, del_decoder, hetero_vocab_sd, life_model, soup_experts, vector_index, disc_router, block_expert_archive, self_learning, diffusion_draft
Training-time operations iter_prune, model_surgery, binary_attn
Non-performance utility token_watermark, latency_predictor

8B β€” Changes Made

  • Delete 38 module files from squish/
  • Delete 11 dedicated test files (test_clasp_unit.py, test_del_decoder_unit.py, etc.)
  • Edit 10 wave wiring test files to remove test classes for deleted modules
  • Edit server.py to remove globals + flag wiring for all 38 modules
  • Edit squish/__init__.py β€” removed deleted imports, fixed __version__ to "9.0.0", fully lazy-loaded via __getattr__
  • Edit cli.py β€” removed predict subcommand (used deleted life_model)
  • Update README.md β€” remove duplicate bash block, remove Files table, add Advanced Features stability section
  • Update MODULES.md β€” remove deleted module entries, add stability tier table

Last updated: 2026-03-12 Addresses scope-creep risk, ecosystem blockers, CI correctness, and documentation quality.


6A β€” Feature Gating: Core vs Experimental

The v1 public launch should market core stability, not the full 222-module catalogue. Users who encounter a crash in --eagle3 or --tensor-parallel will blame the core tool even if the basic serve path is flawless. Feature tiers must be communicated explicitly.

Proposed tiers:

Tier Waves Flags Label in docs
Stable 1–12 No flag or widely-used flags (--int8, --int4, --kv-cache) (no label)
Beta 13–18 Speculative decode, advanced KV compression [Beta]
Experimental 19–26 Tensor parallel, disaggregated prefill, binary attention, ternary quant, multi-modal [Experimental]
  • Audit every CLI flag in cli.py and server.py and assign a tier to each
  • Add [Beta] / [Experimental] annotations to flag --help text and MODULES.md
  • Add a # Experimental warning block at the top of each v19–v26 module file (do not hide the code, just label it)
  • Update README Quick-Start to show only Stable flags; link to MODULES.md for the full list
  • Add stability tiers note in squish serve --help epilog: Stable (v1-12), [Beta] (v13-18), [Experimental] (v19+)

6B β€” HuggingFace Model Ecosystem

The threshold for widespread adoption is a zero-friction first run: pip install squish β†’ squish run qwen3-8b β†’ running in under a second. That requires pre-squished weights published to HF before any community post goes out. If users have to compress their own models on first run, the 54Γ— faster load-time story is obscured by a one-time 30-minute compression step.

Minimum model matrix for launch (all INT4, Qwen2.5-1.5B also INT8):

Model Base size Squish size (INT4) Priority
Qwen2.5-1.5B ~3 GB ~0.9 GB P0 β€” used in all existing benchmarks
Qwen3-8B ~16 GB ~5 GB P0 β€” most popular current model
Llama-3.2-3B ~6 GB ~2 GB P0 β€” referenced in original plan
Qwen2.5-7B ~14 GB ~4.5 GB P1
Phi-4 (14B) ~28 GB ~9 GB P1
Mistral-Nemo-12B ~24 GB ~7.5 GB P1
Llama-3.1-8B ~16 GB ~5 GB P1
DeepSeek-R1-Distill-7B ~14 GB ~4.5 GB P2
Gemma-3-4B ~8 GB ~2.5 GB P2
SmolLM2-1.7B ~3.4 GB ~1 GB P2 β€” fits 8 GB Macs

Each model card must include: hardware used, squish compress command, measured load time (M3), measured RAM, lm-eval accuracy (compressed vs base, identical flags).

  • Create squish-community organization on HuggingFace
  • Compress and upload P0 models (3 models) with full model cards
  • Compress and upload P1 models (4 models) after P0 is verified
  • Compress and upload P2 models (3 models) before soft launch
  • Verify each uploaded model with squish run <model> β†’ coherent output on clean install
  • Add --hf-model-card flag to dev/publish_hf.py that auto-generates the model card from eval JSON

6C β€” CI/CD: Apple Silicon Test Coverage

GitHub Actions macos-14 runners are Apple M1. MLX runs on them. However, the current CI excludes test_int4_loader.py and test_git_integration.py without explanation in ci.yml. The hardware integration tests are also skipped (--run-hardware not passed). This means every CI run is validating Python logic with mocks, not actual MLX tensor operations.

Gaps:

  1. test_int4_loader.py is excluded from CI β€” why? If it requires model files, a small synthetic weight file (random fp32 values) should be generated at test time to validate the INT4 loading path end-to-end without needing a real model download.
  2. The test_hardware_integration.py harness exists but is never run in CI. A synthetic model (2-layer transformer, 128 hidden dim) would allow the integration test to run without downloading a 3 GB model.
  3. mypy check uses || true (non-blocking) in the lint-only job β€” type errors are silently ignored.
  • Investigate why test_int4_loader.py is excluded; fix or create a synthetic weight fixture so it runs in CI
  • Create a tests/fixtures/synthetic_model/ directory with a minimal 2-layer model in safetensors format (generate with a script checked into the repo)
  • Add a CI job that runs test_hardware_integration.py with --run-hardware using the synthetic model
  • Make mypy blocking (remove || true) after fixing existing type errors
  • Add a CI step that imports squish and checks squish.__version__ == importlib.metadata.version("squish")

6D β€” Documentation: README Focus

The current README covers three separate audiences (practitioners, researchers, and contributors) simultaneously. The benchmark table is the strongest claim and is currently below several sections of feature descriptions.

Target README structure:

1. Problem statement (2 sentences)
2. The proof β€” load-time comparison table (Squish vs Ollama, three models)
3. Install (one-liner)
4. Quickstart (one command)
5. Core features (5 bullets max β€” fast load, OpenAI compatible, Web UI, INT4/INT8, Apple Silicon)
6. Links β†’ full docs, MODULES.md, paper, HuggingFace models

Everything else (wave tables, per-module details, accuracy benchmarks, developer docs) lives in the MkDocs site or MODULES.md.

  • Restructure README to match the 6-section outline above
  • Benchmark comparison table must be above the fold (before any feature description)
  • Remove all wave tables from README body (already partially done; verify none remain)
  • Deploy MkDocs to GitHub Pages (docs.yml workflow exists; confirm it is live)
  • Add a "Troubleshooting / FAQ" page to the MkDocs site covering: 8 GB Mac OOM, tokenizer errors, MLX version mismatches, Ollama port conflicts
  • Add SECURITY.md documenting responsible disclosure process
  • Ensure CONTRIBUTING.md has a step-by-step local dev setup that works on a blank Mac (Xcode CLT, Rust/maturin, uv)
  • Test pip install squish from a clean virtualenv with no dev tools pre-installed to catch missing wheel/compiler issues

Phase 7 β€” Staged Public Launch

Execute after Phase 5 bugs are fixed and Phase 6 ecosystem items are done. Do not compress all three stages into one week.


7A β€” Soft Launch (Beta Cohort)

Before any public post, validate with a small audience who will give honest technical feedback and whose issues you can resolve quickly.

  • Identify 5–10 people currently running local LLMs on Apple Silicon (MLX Discord, people who have filed MLX issues on GitHub) and send direct invitations
  • Set up a GitHub Discussion category "Beta Feedback" for structured input
  • Pay attention to OOM reports on 8 GB and 16 GB Macs β€” --fault-tolerance and --adaptive-quant exist but need real-hardware validation on memory-constrained devices
  • Produce a 60-second screen recording: cold start Squish vs Ollama side-by-side for Qwen3-8B. No narration needed β€” the numbers speak. Post to the GitHub Release as an asset.
  • Address all beta feedback before hard launch; do not proceed to 7B if any P0 crash bugs are open

7B β€” Hacker News (Show HN)

HN is the right first public venue: technical audience, good faith engagement, time-boxed attention window (front-page day, then archived). Get it right here before the higher-noise Reddit blast.

Post structure:

  • Title: Show HN: Squish – Sub-second model loads on Apple Silicon (54Γ— faster than Ollama cold-start)

  • First comment (post immediately after submitting): 3 short paragraphs. (1) The problem: Ollama cold-start on M3 is 8–25 seconds. (2) The solution: INT8/INT4 compression + mmap + Metal kernel pre-warm. (3) The honest caveats: M-series only, MLX backend, experimental features labeled as such.

  • Be present for the first 2 hours. Answer every question directly and technically.

  • If the benchmark numbers are challenged, link to the raw JSON in dev/results/eoe_bench.json and the lm-eval output in eval_output/. Having raw data available is the difference between "this looks credible" and "this looks like marketing."

  • Draft HN Show post text in dev/community_posts.md (template exists β€” refine with real numbers)

  • Confirm raw benchmark JSON is publicly accessible in the repo before posting

  • Confirm MkDocs site is live and the paper is linked

  • Do not submit on a Friday or Saturday (low traffic)

  • Respond to every comment within 4 hours on day one


7C β€” r/LocalLLaMA and Twitter/X

Only proceed here after HN feedback has been reviewed and any correction to claims has been made.

r/LocalLLaMA post:

  • Post type: "I built X" (not "What do you think of X?")
  • Lead with the side-by-side GIF demo, then the number
  • Keep body under 300 words; link to README and HN thread for depth
  • Post from an account with karma β€” if your account is new, post a few helpful comments in the subreddit first

Twitter/X thread:

  • Tag Awni Hannun (MLX creator), not as a promotional move but because the work directly builds on MLX and he has flagged Apple Silicon inference optimization as a priority area

  • Thread structure: tweet 1 = the claim with GIF, tweets 2–5 = how it works (mmap, INT4 nibble pack, KV compression, streaming fix), tweet 6 = benchmark methodology, tweet 7 = "try it" CTA with install command

  • Post to r/LocalLLaMA after HN settles (48 hours post-HN)

  • Post Twitter/X thread same day as r/LocalLLaMA

  • Monitor both for 72 hours; update README FAQ with any common questions that emerge

  • arXiv submit in the same week as the public launch β€” establishes timestamp and gives researchers something to cite