Releases: Andyyyy64/whichllm

v0.5.2

15 May 08:18

Hardening release: every Round 3 fix now has a regression test verified
to fail when reverted, the CI lint pipeline is green again (it was red
for the entire 0.5.1 release), and two correctness bugs found by
stress-testing previously unexercised axes are fixed.

Fixed

--profile vision generation inversion

Text leaderboards don't score VLMs, so the only model with a direct
benchmark hit was a two-generations-old Qwen2-VL-7B, which outranked
the current Qwen3-VL-32B even on an 80 GB H100. A curated
multimodal capability source (MMMU-Pro / MMBench, 2026-05) now scores
the Qwen3-VL / Qwen2.5-VL / Qwen2-VL / Llama-Vision / Phi-vision /
Gemma-3 / Pixtral / InternVL3 lines. Qwen3-VL-32B now leads vision at
73-76; the legacy 7B correctly drops to the low 30s.

Apple Silicon partial-offload speed (~3x under-estimate)

The flat 0.45x partial-offload penalty modelled a discrete GPU
spilling to CPU RAM across PCIe. Apple Silicon shares one unified-memory
pool, so spilled weights stay at full bandwidth. DeepSeek-R1-class
models on M2/M3 Ultra were estimated at ~1.7 t/s when real-world
throughput is 4-15 t/s. The penalty is now 0.85x for unified memory;
0.45x is kept for discrete GPUs.
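
A minimal sketch of the split described above. Only the 0.85x and 0.45x
factors come from these notes; the function names and signatures are
hypothetical and are not whichllm's actual code.

```python
# Illustrative only: these names are hypothetical, not whichllm's API.
UNIFIED_MEMORY_PENALTY = 0.85  # Apple Silicon: spilled weights stay in the same pool
DISCRETE_GPU_PENALTY = 0.45    # discrete GPU: spilled weights cross PCIe to CPU RAM


def partial_offload_multiplier(unified_memory: bool, fully_on_gpu: bool) -> float:
    """Return the multiplier applied to a full-offload speed estimate."""
    if fully_on_gpu:
        return 1.0  # nothing spills, so no penalty applies
    return UNIFIED_MEMORY_PENALTY if unified_memory else DISCRETE_GPU_PENALTY


def estimated_speed(full_speed_tps: float, unified_memory: bool, fully_on_gpu: bool) -> float:
    """Scale a full-offload tokens/second estimate by the offload penalty."""
    return full_speed_tps * partial_offload_multiplier(unified_memory, fully_on_gpu)
```

Keeping the two factors as named constants makes the hardware assumption
behind each one visible at the call site.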

CI lint was red for all of 0.5.1

Qwen/Qwen3-Coder-30B-A3B-Instruct was a duplicate key in the
LiveBench fallback (silently scored 62 instead of 58), and 12 files were
unformatted. Both problems broke the Lint job. Both are fixed; Lint and
Tests are now green on this release commit in actual GitHub CI.
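
To illustrate why a duplicate key is easy to miss: in a Python dict
literal the later entry silently replaces the earlier one. The sketch
below assumes the fallback is a plain dict literal and that the 62 entry
came second; both are assumptions, and `duplicate_dict_keys` and `SAMPLE`
are hypothetical names, not part of whichllm.

```python
import ast


def duplicate_dict_keys(source: str) -> list:
    """Return constant keys that appear more than once in any dict literal."""
    duplicates = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Dict):
            seen = set()
            for key in node.keys:
                # key is None for `**mapping` unpacking; non-constant keys are skipped.
                if isinstance(key, ast.Constant):
                    if key.value in seen:
                        duplicates.append(key.value)
                    seen.add(key.value)
    return duplicates


# Hypothetical table shaped like a benchmark fallback. The two scores are the
# values mentioned in the release note; their order here is assumed.
SAMPLE = '''
FALLBACK = {
    "Qwen/Qwen3-Coder-30B-A3B-Instruct": 58,
    "Qwen/Qwen3-Coder-30B-A3B-Instruct": 62,
}
'''

print(duplicate_dict_keys(SAMPLE))
```

A check like this can run in CI alongside the linter so that a repeated
model ID fails loudly instead of changing a score.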

Added

  • Round 3 regression suite (tests/test_r3_regressions.py, 20 tests).
    Every test was verified to go red when its fix is reverted — they
    pin real bugs, not the current implementation.
  • Benchmark snapshot date shown under every ranking, so a stale
    recommendation is self-evident instead of silently trusted.

CI

  • GitHub Actions runners updated to Node 24 (checkout@v5,
    setup-python@v6); Node 20 actions are deprecated from 2026-06.

Full changelog: CHANGELOG.md

v0.5.1

13 May 22:28

What's New

whichllm upgrade — Compare GPU upgrades side-by-side

whichllm upgrade --target "RTX 4090"

Shows the current machine and a target GPU together with delta scores
and a verdict (worth it / meaningful / marginal / flat / downgrade).

Apple Silicon support in --gpu

whichllm --gpu "M3 Max" --vram 64
whichllm --gpu "M2 Ultra" --vram 192

Simulator now understands every M1-M4 chip (base / Pro / Max / Ultra),
so Mac users can stress-test rankings without owning the hardware. No
more spurious "ROCm requires Linux" warnings on simulated Apple boxes.

Frontier-model coverage refresh

2026-Q2 releases that did not previously surface are now included:
Kimi-K2, MiMo, DeepSeek-V4, GLM-5, Qwen3.6 / Qwen3-Next, gpt-oss,
Llama-4, Mistral Small/Large, Devstral, Codestral, MiniMax,
Granite 3.3/4.0, Olmo-3, Nemotron-3, plus the reasoning lines
QwQ-32B, Qwen3-4B-Thinking, DeepSeek-R1 and the R1-Distill family.

Smarter VRAM / speed estimates

  • KV cache scaling tuned to match real 128K-context runs.
  • MoE models split correctly: total params drive VRAM and knowledge,
    active params drive speed.
  • Per-backend speed multipliers (CUDA / Apple / AMD / Intel) and
    per-quant efficiency factors so Apple Silicon and partial-offload
    numbers stop overshooting.
  • Lineage-aware demotion stops 2024-era leaderboards (OLLB v2, Arena
    ELO) from over-rewarding older generations against their newer
    siblings.
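
The MoE split above can be sketched as follows. This is a rough,
bandwidth-bound approximation that ignores KV cache and runtime overhead;
`ModelSpec`, `weight_vram_gb` and `decode_speed_tps` are hypothetical names
and the formulas are an assumption, not whichllm's actual estimator.

```python
from dataclasses import dataclass


@dataclass
class ModelSpec:
    total_params_b: float    # all parameters, in billions
    active_params_b: float   # parameters used per token, in billions
    bytes_per_weight: float  # roughly 0.5 for a 4-bit quant


def weight_vram_gb(spec: ModelSpec) -> float:
    """Every expert must be resident, so total params set the weight footprint."""
    return spec.total_params_b * spec.bytes_per_weight


def decode_speed_tps(spec: ModelSpec, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed: each token reads only the active weights."""
    return bandwidth_gb_s / (spec.active_params_b * spec.bytes_per_weight)


# Hypothetical 30B-total / 3B-active MoE at about 4 bits per weight.
moe = ModelSpec(total_params_b=30.0, active_params_b=3.0, bytes_per_weight=0.5)
# A dense model of the same total size reads all 30B weights per token.
dense = ModelSpec(total_params_b=30.0, active_params_b=30.0, bytes_per_weight=0.5)
```

Under these assumptions both models need 15 GB for weights, but at the
same memory bandwidth the MoE's speed ceiling is ten times that of the
dense model.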

Bug fixes

  • Family inheritance no longer treats a 6.6B "imatrix-aligned" /
    MTP-head fork as the same model as its 158B base.
  • Family grouping prefers the upstream model as the base, not whichever
    fork has the most downloads.
  • httpx now uses follow_redirects=True, so case-mismatched HuggingFace
    URLs (307 redirects) no longer drop frontier IDs silently.
  • Quality floor (≥ 20) and speed floor (≥ 1.5 t/s) drop junk Q1_0 /
    Bonsai-class candidates that previously slipped into low-VRAM
    recommendations.
  • Removed 11 non-existent HF IDs from curated benchmark fallbacks.
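
A minimal sketch of the quality and speed floors from the list above. The
two thresholds come from these notes; `Candidate`, `apply_floors` and the
sample entries are hypothetical and only illustrate the filter.

```python
from dataclasses import dataclass

QUALITY_FLOOR = 20.0   # minimum quality score, from the release note
SPEED_FLOOR_TPS = 1.5  # minimum estimated tokens per second, from the release note


@dataclass
class Candidate:
    name: str
    quality: float
    speed_tps: float


def apply_floors(candidates: list) -> list:
    """Keep only candidates that meet both floors (bounds are inclusive)."""
    return [
        c for c in candidates
        if c.quality >= QUALITY_FLOOR and c.speed_tps >= SPEED_FLOOR_TPS
    ]


# Made-up entries for illustration; none of these are real benchmark results.
pool = [
    Candidate("usable-q4", quality=41.0, speed_tps=12.0),
    Candidate("junk-q1", quality=12.0, speed_tps=30.0),   # fails the quality floor
    Candidate("too-slow", quality=55.0, speed_tps=0.8),   # fails the speed floor
    Candidate("borderline", quality=20.0, speed_tps=1.5), # exactly on both floors
]
kept = [c.name for c in apply_floors(pool)]
```

Applying the floors before ranking means a low-VRAM machine gets fewer
recommendations rather than unusable ones.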

Full changelog: CHANGELOG.md

v0.5.0

09 Mar 15:17

What's New

whichllm run — One-command chat

Download and chat with any model instantly. Auto-creates an isolated environment, installs dependencies, and starts an interactive session — zero manual setup.

whichllm run "qwen 2.5 1.5b gguf"
whichllm run  # auto-picks the best model for your hardware

Supports all formats: GGUF, AWQ, GPTQ, FP16/BF16.

whichllm snippet — Ready-to-run Python code

Print a copy-paste Python script for any model.

whichllm snippet "qwen 7b"

Improvements

  • Smarter model search: auto-picks top match by downloads instead of erroring on ambiguous queries
  • Shared helpers for model loading and search across commands
  • Refactored plan command to use shared search logic