Releases: Andyyyy64/whichllm
v0.5.2
Hardening release: every Round 3 fix now has a regression test verified
to fail when reverted, the CI lint pipeline is green again (it was red
for the entire 0.5.1 release), and two correctness bugs found by
stress-testing previously unexercised axes are fixed.
Fixed
--profile vision generation inversion
Text leaderboards don't score VLMs, so the only model with a direct
benchmark hit was a two-generations-old Qwen2-VL-7B, which outranked
the current Qwen3-VL-32B even on an 80 GB H100. A curated
multimodal capability source (MMMU-Pro / MMBench, 2026-05) now scores
the Qwen3-VL / Qwen2.5-VL / Qwen2-VL / Llama-Vision / Phi-vision /
Gemma-3 / Pixtral / InternVL3 lines. Qwen3-VL-32B now leads vision at
73-76; the legacy 7B correctly drops to the low 30s.
Apple Silicon partial-offload speed (~3x under-estimate)
The flat 0.45x partial-offload penalty modelled a discrete GPU
spilling to CPU RAM across PCIe. Apple Silicon shares one unified-memory
pool, so spilled weights stay at full bandwidth. DeepSeek-R1-class
models on M2/M3 Ultra were estimated at ~1.7 t/s when real-world
throughput is 4-15 t/s. The penalty is now 0.85x for unified memory;
0.45x is kept for discrete GPUs.
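As a rough illustration of the fix, the sketch below picks the partial-offload penalty by memory architecture. Only the two factors (0.85 and 0.45) come from these notes; the function name, its parameters, and the idea of scaling a single full-speed estimate are assumptions for illustration, not whichllm's actual code.

```python
# Factors quoted in the release notes; everything else here is hypothetical.
UNIFIED_MEMORY_FACTOR = 0.85  # Apple Silicon: spilled weights stay in the same pool
DISCRETE_GPU_FACTOR = 0.45    # discrete GPU: spilled weights cross PCIe to CPU RAM


def partial_offload_speed(full_speed_tps: float,
                          unified_memory: bool,
                          fits_in_vram: bool) -> float:
    """Scale a full-speed estimate (tokens/s) when the model spills out of VRAM."""
    if fits_in_vram:
        # No spill, so no partial-offload penalty applies.
        return full_speed_tps
    factor = UNIFIED_MEMORY_FACTOR if unified_memory else DISCRETE_GPU_FACTOR
    return full_speed_tps * factor
```

With the same full-speed estimate, a spilled model on unified memory now keeps 85% of its speed, while a discrete GPU still keeps 45%.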
CI lint was red for all of 0.5.1
Qwen/Qwen3-Coder-30B-A3B-Instruct was a duplicate key in the
LiveBench fallback (silently scored 62 instead of 58) and 12 files were
unformatted — both broke the Lint job. Fixed; Lint + Tests are now
green on this release commit in actual GitHub CI.
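The duplicate-key failure mode is easy to miss because Python accepts it without complaint: in a dict literal, the later entry silently replaces the earlier one. The sketch below assumes the fallback is a plain dict literal and uses placeholder model IDs. The scores 58 and 62 mirror the note above, but their order in the real file is an assumption. The helper `duplicate_dict_keys` is hypothetical and only shows one way such a key could be detected with the standard `ast` module.

```python
import ast

# Placeholder table; IDs and ordering are illustrative, not the real fallback.
SOURCE = '''
FALLBACK = {
    "org/model-a": 58,
    "org/model-b": 71,
    "org/model-a": 62,
}
'''


def duplicate_dict_keys(source: str) -> list[str]:
    """Return constant keys that appear more than once in any dict literal."""
    duplicates = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Dict):
            seen = set()
            for key in node.keys:
                # Skip non-constant keys and ** unpacking (key is None).
                if isinstance(key, ast.Constant):
                    if key.value in seen:
                        duplicates.append(key.value)
                    seen.add(key.value)
    return duplicates


namespace = {}
exec(SOURCE, namespace)
print(namespace["FALLBACK"]["org/model-a"])  # 62: the later entry wins
print(duplicate_dict_keys(SOURCE))           # ['org/model-a']
```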
Added
- Round 3 regression suite (tests/test_r3_regressions.py, 20 tests).
  Every test was verified to go red when its fix is reverted; they
  pin real bugs, not the current implementation.
- Benchmark snapshot date shown under every ranking, so a stale
  recommendation is self-evident instead of silently trusted.
CI
- GitHub Actions runners updated to Node 24 (checkout@v5,
  setup-python@v6); Node 20 actions are deprecated from 2026-06.
Full changelog: CHANGELOG.md
v0.5.1
What's New
whichllm upgrade — Compare GPU upgrades side-by-side
whichllm upgrade --target "RTX 4090"
Shows the current machine and a target GPU together with delta scores
and a verdict (worth it / meaningful / marginal / flat / downgrade).
Apple Silicon support in --gpu
whichllm --gpu "M3 Max" --vram 64
whichllm --gpu "M2 Ultra" --vram 192
Simulator now understands every M1-M4 chip (base / Pro / Max / Ultra),
so Mac users can stress-test rankings without owning the hardware. No
more spurious "ROCm requires Linux" warnings on simulated Apple boxes.
Frontier-model coverage refresh
2026-Q2 releases that did not previously surface are now included:
Kimi-K2, MiMo, DeepSeek-V4, GLM-5, Qwen3.6 / Qwen3-Next, gpt-oss,
Llama-4, Mistral Small/Large, Devstral, Codestral, MiniMax,
Granite 3.3/4.0, Olmo-3, Nemotron-3, plus the reasoning lines
QwQ-32B, Qwen3-4B-Thinking, DeepSeek-R1 and the R1-Distill family.
Smarter VRAM / speed estimates
- KV cache scaling tuned to match real 128K-context runs.
- MoE models split correctly: total params drive VRAM and knowledge,
  active params drive speed.
- Per-backend speed multipliers (CUDA / Apple / AMD / Intel) and
  per-quant efficiency factors so Apple Silicon and partial-offload
  numbers stop overshooting.
- Lineage-aware demotion stops 2024-era leaderboards (OLLB v2, Arena
  ELO) from over-rewarding older generations against their newer
  siblings.
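To make the MoE split in the list above concrete, here is a back-of-the-envelope sketch. It assumes a simple bandwidth-bound model of decoding, which is a common approximation and not whichllm's actual formula. Both function names and all parameters are hypothetical, and KV cache, runtime overhead, and the per-backend multipliers are deliberately left out.

```python
def estimate_weights_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Weight memory in GB: every expert must be resident, so use TOTAL params.

    total_params_b is in billions of parameters.
    """
    return total_params_b * bits_per_weight / 8


def estimate_decode_tps(active_params_b: float,
                        bits_per_weight: float,
                        bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed: each token reads only the ACTIVE params."""
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# A 30B-total / 3B-active model at 4 bits per weight:
print(estimate_weights_gb(30, 4))  # 15.0 GB of weights to hold
print(estimate_decode_tps(3, 4, bandwidth_gb_s=1000))  # bound set by 1.5 GB/token
```

Under these assumptions a dense 30B model would need the same 15 GB but would read all of it per token, which is why the memory and speed estimates have to use different parameter counts.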
Bug fixes
- Family inheritance no longer treats a 6.6B "imatrix-aligned" /
  MTP-head fork as the same model as its 158B base.
- Family grouping prefers the upstream model as the base, not whichever
  fork has the most downloads.
- httpx now sets follow_redirects=True, so case-mismatch HuggingFace
  URLs (307) no longer drop frontier IDs silently.
- Quality floor (≥ 20) and speed floor (≥ 1.5 t/s) drop junk Q1_0 /
  Bonsai-class candidates that previously slipped into low-VRAM
  recommendations.
- Removed 11 non-existent HF IDs from curated benchmark fallbacks.
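The quality and speed floors from the list above amount to a simple filter over candidates. The sketch below uses the two thresholds from these notes; the `Candidate` record, its field names, and the sample model IDs are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass

QUALITY_FLOOR = 20.0   # from the release notes: quality >= 20
SPEED_FLOOR_TPS = 1.5  # from the release notes: speed >= 1.5 t/s


@dataclass
class Candidate:  # hypothetical record, not whichllm's real type
    model_id: str
    quality: float
    speed_tps: float


def apply_floors(candidates: list[Candidate]) -> list[Candidate]:
    """Keep only candidates that clear both the quality and the speed floor."""
    return [
        c for c in candidates
        if c.quality >= QUALITY_FLOOR and c.speed_tps >= SPEED_FLOOR_TPS
    ]


pool = [
    Candidate("example/solid-q4", quality=55.0, speed_tps=12.0),
    Candidate("example/junk-q1", quality=11.0, speed_tps=30.0),    # fails quality
    Candidate("example/too-slow", quality=60.0, speed_tps=0.8),    # fails speed
]
print([c.model_id for c in apply_floors(pool)])  # ['example/solid-q4']
```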
Full changelog: CHANGELOG.md
v0.5.0
What's New
whichllm run — One-command chat
Download and chat with any model instantly. Auto-creates an isolated environment, installs dependencies, and starts an interactive session — zero manual setup.
whichllm run "qwen 2.5 1.5b gguf"
whichllm run # auto-picks the best model for your hardware
Supports all formats: GGUF, AWQ, GPTQ, FP16/BF16.
whichllm snippet — Ready-to-run Python code
Print a copy-paste Python script for any model.
whichllm snippet "qwen 7b"
Improvements
- Smarter model search: auto-picks top match by downloads instead of erroring on ambiguous queries
- Shared helpers for model loading and search across commands
- Refactored plan command to use shared search logic
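The "auto-picks top match by downloads" behaviour in the first improvement can be pictured as follows. This is a minimal sketch under stated assumptions: the dict shape, the substring matching on query terms, and the function name `pick_top_match` are all hypothetical, and the real search logic may rank or match differently.

```python
from typing import Optional


def pick_top_match(query: str, models: list[dict]) -> Optional[dict]:
    """Return the most-downloaded model whose ID contains every query term."""
    terms = query.lower().split()
    matches = [
        m for m in models
        if all(term in m["id"].lower() for term in terms)
    ]
    if not matches:
        return None
    # An ambiguous query resolves to the most-downloaded match instead of an error.
    return max(matches, key=lambda m: m["downloads"])


catalog = [  # illustrative entries, not real download counts
    {"id": "example-org/Qwen-7B-Instruct", "downloads": 500},
    {"id": "someone/qwen-7b-fork", "downloads": 20},
    {"id": "example-org/Llama-8B", "downloads": 900},
]
print(pick_top_match("qwen 7b", catalog)["id"])  # example-org/Qwen-7B-Instruct
```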