Releases: Andyyyy64/whichllm

v0.5.12

18 Jun 12:58

Added

  • Markdown ranking output with --markdown / -m for pasteable GitHub issues, READMEs, Slack, and Discord.
  • Runtime-first ranking tables now show memory, estimated speed, fit type, and published date by default.
  • --speed any|usable|fast and the shorter --fit gpu alias for full-GPU recommendations.
  • --vram-headroom and --ram-budget for safer fit planning when runtimes or background processes need memory.

Changed

  • Speed colors now reflect practical generation speed: red under 4 tok/s, yellow from 4-10, green from 10-30, and bright green at 30+ tok/s.
  • --details restores the download-focused metadata table when needed.
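The speed color thresholds above can be sketched as a small bucketing function. This is an illustration only, not the project's code: the function name and color labels are hypothetical, and treating each upper bound as exclusive (so exactly 10 tok/s is green) is an assumption, since the note does not say how boundaries are handled.

```python
def speed_color(tok_per_s: float) -> str:
    """Map an estimated generation speed (tok/s) to a color bucket.

    Assumption: upper bounds are exclusive, lower bounds inclusive.
    """
    if tok_per_s < 4:
        return "red"
    if tok_per_s < 10:
        return "yellow"
    if tok_per_s < 30:
        return "green"
    return "bright_green"
```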

Fixed

  • Invalid --vram-headroom values are now rejected even in CPU-only runs.

v0.5.11

18 Jun 06:01
8801ddc

Added

  • Multi-GPU simulation for repeated --gpu flags, comma-separated GPU specs, and count shorthand like 2x RTX 4090.
  • python -m whichllm now runs the CLI.
  • --gpu-only and --fit full-gpu filter recommendations to models that fit fully in GPU VRAM.
  • T5 lineage support for version-aware benchmark handling.
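As a rough sketch of how the three multi-GPU input forms could be normalized into one flat list of cards, consider the following. The function name, the regular expression, and the exact accepted syntax are assumptions made for illustration; the real parser may differ.

```python
import re

# Hypothetical pattern for the count shorthand, e.g. "2x RTX 4090".
_COUNT_PREFIX = re.compile(r"^\s*(\d+)\s*[xX]\s+(.+)$")


def expand_gpu_specs(values: list[str]) -> list[str]:
    """Flatten repeated flag values, comma lists, and count shorthand."""
    gpus: list[str] = []
    for value in values:              # one entry per repeated --gpu flag
        for part in value.split(","):  # comma-separated specs
            part = part.strip()
            if not part:
                continue
            match = _COUNT_PREFIX.match(part)
            if match:
                # "2x RTX 4090" becomes two "RTX 4090" entries.
                gpus.extend([match.group(2).strip()] * int(match.group(1)))
            else:
                gpus.append(part)
    return gpus
```

For example, `expand_gpu_specs(["2x RTX 4090", "RTX 3090, RTX 3060"])` yields four entries, two of them `"RTX 4090"`.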

Fixed

  • Cached model and benchmark data are read as UTF-8.
  • GTX 1650 simulation distinguishes GDDR5 and GDDR6 variants by memory clock.
  • RAM reserve logic now uses a bounded reserve formula instead of a fixed 80% usable-RAM cap.

v0.5.10

11 Jun 07:52
9864db4

Fixed

  • Strong partial-offload candidates no longer get buried under weaker full-GPU models because the final sort no longer counts GPU fit twice.
  • Light partial offload is penalized less aggressively, while heavy dense offload still gets a strong discount.
  • MoE partial-offload scoring now gives a milder penalty when the active working set can plausibly stay on GPU.

v0.5.9

10 Jun 05:49

Highlights

  • GPU bandwidth detection now falls back to the bundled TechPowerUp database (2,824 GPUs) when a card is missing from the curated catalog. Uncatalogued cards no longer show BW: N/A with 0.0 tok/s estimates and oversized recommendations, and a laptop card can never inherit its desktop sibling's bandwidth. (#74, #98)
  • Fixed AMD discrete GPU detection on Linux, including RX 6750 XT and the compound lspci name path. (#61)
  • Artificial Analysis Intelligence Index is fetched live again after the site's App Router migration. Live scores overlay the curated snapshot, so coverage can only grow. (#87)
  • Added MXFP4 and NVFP4 quantization support. These repos were previously labeled FP16, overestimating VRAM by about 3.5x. (#27)
  • Added Apple M5-family simulation entries and Kepler-era Quadro catalog coverage.
  • Community GGUF repos without base_model metadata now match official benchmark scores by name.
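The size of the FP16 mislabeling error can be checked with back-of-the-envelope arithmetic. The sketch below assumes the commonly described block layouts (MXFP4: 4-bit elements with one 8-bit scale per 32 weights; NVFP4: 4-bit elements with one 8-bit scale per 16 weights) and counts weights only. It ignores per-tensor scales, layers left unquantized, and runtime overhead, so it only approximates the figure quoted above.

```python
def bits_per_weight(element_bits: int, scale_bits: int, block_size: int) -> float:
    """Effective storage per weight when each block shares one scale."""
    return element_bits + scale_bits / block_size


FP16 = 16.0
MXFP4 = bits_per_weight(4, 8, 32)  # 4.25 bits per weight
NVFP4 = bits_per_weight(4, 8, 16)  # 4.5 bits per weight

# How much larger a weight-only estimate gets if the repo is treated as FP16.
mxfp4_overestimate = FP16 / MXFP4
nvfp4_overestimate = FP16 / NVFP4
```

Under these assumptions both ratios land between 3.5x and 3.8x, in line with the rough overestimate described in the note.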

QA

  • CI lint: passed
  • CI tests: Python 3.11, 3.12, and 3.13 passed
  • Local: 329 tests passed; sdist and wheel built successfully
  • Real hardware smoke test on Apple M2

v0.5.8

05 Jun 06:45
026ab36

Highlights

  • Fixed the A3000 Laptop 6GB ranking regression.
  • Added retry/backoff for transient Hugging Face and benchmark fetch failures.
  • Improved the "Error fetching models" output when network exceptions have no message.
  • Added GPU catalog coverage for A3000 Laptop, RTX 3050, RTX 5060, RTX 5070 Ti, RX 9070, and RX 9070 XT.
  • Added context-length shorthand and benchmark source/confidence metadata in JSON output.
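A generic retry-with-backoff wrapper for transient fetch failures might look like the sketch below. It is not the project's implementation: the function name, the default attempt count and delays, and the choice of which exceptions count as transient are all assumptions.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fetch: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
) -> T:
    """Call fetch(), retrying transient errors with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # Delay doubles each attempt, capped at max_delay, plus jitter.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the wrapper easy to test without real waiting.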

QA

  • CI lint: passed
  • CI tests: Python 3.11, 3.12, and 3.13 passed
  • Local build: sdist and wheel built successfully

v0.5.7

19 May 19:52

What's Changed

  • Detect DGX Spark / NVIDIA GB10 as a shared-memory NVIDIA GPU when NVIDIA reports memory.total as unavailable.
  • Fix whichllm run crashes for large Transformers models by providing an offload_folder.
  • Respect XDG_CACHE_HOME for cache paths, while ignoring relative values per the XDG spec.
  • Treat Apple Silicon as shared memory in fit detection.
  • Inline LiveBench fallback data and speed up benchmark score fetching.
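The XDG_CACHE_HOME behavior can be sketched as follows. The XDG Base Directory Specification requires absolute paths and says relative values should be ignored, with $HOME/.cache as the default. The function name and the "whichllm" subdirectory are assumptions for illustration.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def cache_dir(app_name: str = "whichllm",
              env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve a cache directory, honoring XDG_CACHE_HOME when valid."""
    env = os.environ if env is None else env
    xdg = env.get("XDG_CACHE_HOME", "")
    if xdg and os.path.isabs(xdg):
        base = Path(xdg)
    else:
        # Unset, empty, or relative: fall back to the spec's default.
        base = Path.home() / ".cache"
    return base / app_name
```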

Validation

  • ruff format --check .
  • ruff check .
  • pytest -q -s
  • python -m build
  • twine check dist/*

v0.5.6

17 May 18:54
ee0e666

What's Changed

  • Add speed estimate confidence metadata and estimated tok/s ranges.
  • Improve MoE speed estimates using active parameters and bandwidth-scaled read floors.
  • Add Windows AMD/Intel GPU detection fallback through Win32_VideoController and registry memory reads.
  • Treat Ryzen AI / Radeon 890M-class Windows iGPUs as shared-memory AMD GPUs.
  • Avoid summing dedicated GPU VRAM with shared-memory iGPU system RAM as one full-GPU target.
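To show why active parameters matter for MoE speed estimates, here is a simplified bandwidth-bound model of decoding: each generated token requires reading roughly the active weights once, so speed scales with memory bandwidth divided by bytes read per token. The function name and the efficiency factor are assumptions, and the read floors mentioned above are not modeled.

```python
def estimate_tok_per_s(
    bandwidth_gb_s: float,
    active_params_b: float,
    bits_per_weight: float,
    efficiency: float = 0.6,  # assumed fraction of peak bandwidth achieved
) -> float:
    """Rough decode speed for a memory-bandwidth-bound model."""
    # Billions of parameters times bytes per parameter gives GB per token.
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return efficiency * bandwidth_gb_s / gb_read_per_token


# A dense 30B model reads all 30B weights per token, while a 30B MoE
# with 3B active parameters reads only about a tenth of that.
dense = estimate_tok_per_s(1000, active_params_b=30, bits_per_weight=4.5)
moe = estimate_tok_per_s(1000, active_params_b=3, bits_per_weight=4.5)
```

In this simplified model the MoE estimate is ten times the dense one at the same total size, which is why using total parameters would badly underestimate MoE speed.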

Validation

  • ruff format --check .
  • ruff check .
  • pytest -q -s
  • python -m build

v0.5.5

16 May 22:15

  • Fixed whichllm run resolving auto-picked GGUF recommendations to the official Transformers repository instead of a real GGUF repo/file.
  • This fixes the accidental Transformers launch path for models such as Qwen/Qwen3.6-27B.

v0.5.4

16 May 17:37
2d4e642

Fixed

  • Fix Strix Halo / Ryzen AI MAX shared-memory APU handling.
  • Detect and model STRXLGEN, Radeon 8050S, Radeon 8060S, and related names with a 256 GB/s bandwidth estimate.
  • Use the shared system-memory pool for fit checks to avoid false CPU-only, 99%-offload, and 0 tok/s recommendations on these systems.

Verification

  • CI green: lint, test (3.11), test (3.12), test (3.13).
  • Local verification: ruff check, ruff format --check, pytest, and whichllm --version.

v0.5.3

16 May 16:23

What's Changed

Added

  • Linux Intel integrated GPU detection via /sys/class/drm, so Intel iGPU systems are no longer treated as CPU-only by default.
  • NVIDIA nvidia-smi fallback detection when pynvml is missing, NVML init fails, or NVML reports no devices.
  • Apple-prefixed Apple Silicon simulator aliases, so --gpu "Apple M3 Max" works like --gpu "M3 Max".
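A fallback through nvidia-smi can be sketched with the tool's CSV query mode. The helper names and the returned dictionary shape are hypothetical, and treating a non-numeric memory field (such as "[N/A]") as unknown is an assumption about how unavailable values appear.

```python
import csv
import shutil
import subprocess
from typing import Optional

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_nvidia_smi(output: str) -> list[dict]:
    """Parse 'name, memory_mib' rows from nvidia-smi CSV output."""
    gpus = []
    for row in csv.reader(output.splitlines(), skipinitialspace=True):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        name, mem = row[0].strip(), row[1].strip()
        vram_mib: Optional[int]
        try:
            vram_mib = int(mem)
        except ValueError:
            vram_mib = None  # memory not reported as a number
        gpus.append({"name": name, "vram_mib": vram_mib})
    return gpus


def detect_nvidia_gpus() -> list[dict]:
    """Return GPUs seen by nvidia-smi, or an empty list if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            QUERY, capture_output=True, text=True, timeout=10, check=True
        )
    except (subprocess.SubprocessError, OSError):
        return []
    return parse_nvidia_smi(result.stdout)
```

Keeping the parser separate from the subprocess call means it can be exercised on machines without an NVIDIA driver.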

Fixed

  • Fixed the whichllm run transformers chat path by passing tokenizer mappings into model.generate(**inputs), avoiding the KeyError: 'shape' crash.
  • RTX 5060 Ti bandwidth lookup now reports 448 GB/s instead of N/A.

Docs and maintenance

  • Updated install guidance toward uvx / uv tool install.
  • Removed the old marketing note and added sponsor metadata.

Verification

  • uv run pytest — 138 passed
  • uv run --with ruff ruff check . — passed
  • uv run --with ruff ruff format --check . — passed
  • uv run whichllm --version — 0.5.3
  • uv run --with build python -m build — built wheel and sdist