Releases: Andyyyy64/whichllm
v0.5.12
Added
- Markdown ranking output with `--markdown`/`-m` for pasteable GitHub issues, READMEs, Slack, and Discord.
- Runtime-first ranking tables now show memory, estimated speed, fit type, and published date by default.
- `--speed any|usable|fast` and the shorter `--fit gpu` alias for full-GPU recommendations.
- `--vram-headroom` and `--ram-budget` for safer fit planning when runtimes or background processes need memory.
Changed
- Speed colors now reflect practical generation speed: red under 4 tok/s, yellow from 4-10, green from 10-30, and bright green at 30+ tok/s.
- `--details` restores the download-focused metadata table when needed.
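
The speed color thresholds above can be read as a simple bucketing function. This is a minimal sketch, not whichllm's code: the function name `speed_color` is hypothetical, and placing exactly 10 tok/s in the green bucket is an assumption, since the note only gives the ranges.

```python
def speed_color(tok_per_s: float) -> str:
    """Map an estimated generation speed (tok/s) to a color name.

    Hypothetical helper. The note says "red under 4" and "bright green at
    30+", so 4 is yellow and 30 is bright green. Putting exactly 10 in the
    green bucket is an assumption.
    """
    if tok_per_s < 4:
        return "red"
    if tok_per_s < 10:
        return "yellow"
    if tok_per_s < 30:
        return "green"
    return "bright green"
```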
Fixed
- Invalid `--vram-headroom` values are now rejected even in CPU-only runs.
v0.5.11
Added
- Multi-GPU simulation for repeated `--gpu` flags, comma-separated GPU specs, and count shorthand like `2x RTX 4090`.
- `python -m whichllm` now runs the CLI.
- `--gpu-only` and `--fit full-gpu` filter recommendations to models that fit fully in GPU VRAM.
- T5 lineage support for version-aware benchmark handling.
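
To illustrate how the three multi-GPU input styles could collapse into one list of cards, here is a rough sketch. The helper name `expand_gpu_specs` and the exact shorthand grammar (a count, an `x`, whitespace, then the name) are assumptions for illustration; the actual parser may accept other forms.

```python
import re

# Assumed shorthand grammar: "<count>x <name>", e.g. "2x RTX 4090".
_COUNT_PREFIX = re.compile(r"^\s*(\d+)\s*[xX]\s+(.+)$")


def expand_gpu_specs(values: list[str]) -> list[str]:
    """Expand repeated flag values into a flat list of GPU names.

    Hypothetical helper: each value may be a single name, a comma-separated
    list of names, or a count shorthand.
    """
    gpus: list[str] = []
    for value in values:
        for spec in value.split(","):
            spec = spec.strip()
            if not spec:
                continue  # tolerate stray commas
            match = _COUNT_PREFIX.match(spec)
            if match:
                count, name = int(match.group(1)), match.group(2).strip()
                gpus.extend([name] * count)
            else:
                gpus.append(spec)
    return gpus
```

Under these assumptions, `expand_gpu_specs(["2x RTX 4090", "RTX 3090"])` yields two RTX 4090 entries followed by one RTX 3090.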
Fixed
- Cached model and benchmark data are read as UTF-8.
- GTX 1650 simulation distinguishes GDDR5 and GDDR6 variants by memory clock.
- RAM reserve logic now uses a bounded reserve formula instead of a fixed 80% usable-RAM cap.
v0.5.10
Fixed
- Strong partial-offload candidates no longer get buried under weaker full-GPU models because the final sort no longer counts GPU fit twice.
- Light partial offload is penalized less aggressively, while heavy dense offload still gets a strong discount.
- MoE partial-offload scoring now gives a milder penalty when the active working set can plausibly stay on GPU.
v0.5.9
Highlights
- GPU bandwidth detection now falls back to the bundled TechPowerUp database (2,824 GPUs) when a card is missing from the curated catalog. Uncatalogued cards no longer show `BW: N/A` with `0.0 tok/s` estimates and oversized recommendations, and a laptop card can never inherit its desktop sibling's bandwidth. (#74, #98)
- Fixed AMD discrete GPU detection on Linux, including RX 6750 XT and the compound lspci name path. (#61)
- Artificial Analysis Intelligence Index is fetched live again after the site's App Router migration. Live scores overlay the curated snapshot, so coverage can only grow. (#87)
- Added MXFP4 and NVFP4 quantization support. These repos were previously labeled FP16, overestimating VRAM by about 3.5x. (#27)
- Added Apple M5-family simulation entries and Kepler-era Quadro catalog coverage.
- Community GGUF repos without `base_model` metadata now match official benchmark scores by name.
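
The "about 3.5x" figure for the MXFP4/NVFP4 fix can be sanity-checked with a little arithmetic. The sketch below assumes the commonly published block layouts (MXFP4: 4-bit elements with one 8-bit scale per 32 elements; NVFP4: 4-bit elements with one 8-bit scale per 16 elements) and ignores tensors that stay in higher precision, which is why real repos land somewhat below these ratios. The function names are hypothetical.

```python
def bits_per_weight(element_bits: int, scale_bits: int, block_size: int) -> float:
    """Effective bits per weight for a block-scaled format."""
    return element_bits + scale_bits / block_size


FP16_BITS = 16.0
# Assumed block layouts; see the lead-in for where these numbers come from.
MXFP4_BITS = bits_per_weight(element_bits=4, scale_bits=8, block_size=32)
NVFP4_BITS = bits_per_weight(element_bits=4, scale_bits=8, block_size=16)


def overestimate_factor(actual_bits: float, assumed_bits: float = FP16_BITS) -> float:
    """How much larger a size estimate is when the wrong precision is assumed."""
    return assumed_bits / actual_bits


if __name__ == "__main__":
    for name, bits in (("MXFP4", MXFP4_BITS), ("NVFP4", NVFP4_BITS)):
        print(f"{name}: {bits} bits/weight, FP16 label overestimates by "
              f"{overestimate_factor(bits):.2f}x")
```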
QA
- CI lint: passed
- CI tests: Python 3.11, 3.12, and 3.13 passed
- Local: 329 tests passed; sdist and wheel built successfully
- Real hardware smoke test on Apple M2
v0.5.8
Highlights
- Fixed the A3000 Laptop 6GB ranking regression.
- Added retry/backoff for transient Hugging Face and benchmark fetch failures.
- Improved `Error fetching models` output when network exceptions have no message.
- Added GPU catalog coverage for A3000 Laptop, RTX 3050, RTX 5060, RTX 5070 Ti, RX 9070, and RX 9070 XT.
- Added context-length shorthand and benchmark source/confidence metadata in JSON output.
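
For readers unfamiliar with the retry/backoff pattern mentioned above, here is a generic sketch of exponential backoff with jitter, plus a fallback for exceptions whose message is empty. It is not whichllm's implementation: `fetch_with_retry`, `describe_error`, the default delays, and the choice of retryable exception types are all assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def describe_error(exc: BaseException) -> str:
    """Return the exception message, or its class name if the message is empty."""
    return str(exc) or type(exc).__name__


def fetch_with_retry(
    fetch: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    retry_on: tuple[type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying transient failures with exponential backoff.

    Hypothetical helper; all defaults are illustrative.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries from concurrent clients.
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter is injectable so the retry schedule can be exercised in tests without real waiting.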
QA
- CI lint: passed
- CI tests: Python 3.11, 3.12, and 3.13 passed
- Local build: sdist and wheel built successfully
v0.5.7
What's Changed
- Detect DGX Spark / NVIDIA GB10 as a shared-memory NVIDIA GPU when NVIDIA reports `memory.total` as unavailable.
- Fix `whichllm run` crashes for large Transformers models by providing an `offload_folder`.
- Respect `XDG_CACHE_HOME` for cache paths, while ignoring relative values per the XDG spec.
- Treat Apple Silicon as shared memory in fit detection.
- Inline LiveBench fallback data and speed up benchmark score fetching.
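
The `XDG_CACHE_HOME` rule above follows the XDG Base Directory Specification: use the variable when it holds an absolute path, and otherwise fall back to `$HOME/.cache`. A minimal sketch, assuming a hypothetical `cache_dir` helper and a `whichllm` subdirectory name (neither is confirmed by the release notes):

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def cache_dir(app: str = "whichllm", env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve a per-app cache directory following the XDG rules.

    Hypothetical helper; the subdirectory name is an assumption.
    """
    env = os.environ if env is None else env
    raw = env.get("XDG_CACHE_HOME", "")
    # Unset, empty, and relative values are all treated as "not set".
    if raw and os.path.isabs(raw):
        return Path(raw) / app
    return Path.home() / ".cache" / app
```

Passing `env` explicitly keeps the function easy to test without touching the real environment.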
Validation
- `ruff format --check .`
- `ruff check .`
- `pytest -q -s`
- `python -m build`
- `twine check dist/*`
v0.5.6
What's Changed
- Add speed estimate confidence metadata and estimated tok/s ranges.
- Improve MoE speed estimates using active parameters and bandwidth-scaled read floors.
- Add Windows AMD/Intel GPU detection fallback through `Win32_VideoController` and registry memory reads.
- Treat Ryzen AI / Radeon 890M-class Windows iGPUs as shared-memory AMD GPUs.
- Avoid summing dedicated GPU VRAM with shared-memory iGPU system RAM as one full-GPU target.
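
Why active parameters matter for MoE speed can be seen from the usual memory-bound approximation for decoding: each generated token requires reading roughly the active weights once, so throughput scales with bandwidth divided by bytes read per token. The sketch below is only that textbook approximation. The function name and the `efficiency` factor are hypothetical, and it does not model the bandwidth-scaled read floors mentioned above, which the notes do not specify.

```python
def estimate_tok_per_s(
    bandwidth_gb_s: float,
    active_params_b: float,
    bits_per_weight: float,
    efficiency: float = 0.6,
) -> float:
    """Rough memory-bound decode speed estimate.

    bandwidth_gb_s:  memory bandwidth in GB/s
    active_params_b: parameters read per token, in billions
                     (all parameters for dense models, active ones for MoE)
    bits_per_weight: effective quantized width
    efficiency:      assumed fraction of peak bandwidth actually achieved
    """
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return efficiency * bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    # Illustrative numbers only: the same bandwidth and quantization,
    # dense 70B versus an MoE with 12B active parameters.
    print(estimate_tok_per_s(1000, 70, 4))
    print(estimate_tok_per_s(1000, 12, 4))
```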
Validation
- `ruff format --check .`
- `ruff check .`
- `pytest -q -s`
- `python -m build`
v0.5.5
v0.5.4
Fixed
- Fix Strix Halo / Ryzen AI MAX shared-memory APU handling.
- Detect and model STRXLGEN, Radeon 8050S, Radeon 8060S, and related names with a 256 GB/s bandwidth estimate.
- Use the shared system-memory pool for fit checks to avoid false CPU-only, 99%-offload, and 0 tok/s recommendations on these systems.
Verification
- CI green: lint, test (3.11), test (3.12), test (3.13).
- Local verification: `ruff check`, `ruff format --check`, `pytest`, and `whichllm --version`.
v0.5.3
What's Changed
Added
- Linux Intel integrated GPU detection via `/sys/class/drm`, so Intel iGPU systems are no longer treated as CPU-only by default.
- NVIDIA `nvidia-smi` fallback detection when pynvml is missing, NVML init fails, or NVML reports no devices.
- Apple-prefixed Apple Silicon simulator aliases, so `--gpu "Apple M3 Max"` works like `--gpu "M3 Max"`.
Fixed
- Fixed the `whichllm run` transformers chat path by passing tokenizer mappings into `model.generate(**inputs)`, avoiding the `KeyError: 'shape'` crash.
- RTX 5060 Ti bandwidth lookup now reports 448 GB/s instead of `N/A`.
Docs and maintenance
- Updated install guidance toward `uvx` / `uv tool install`.
- Removed the old marketing note and added sponsor metadata.
Verification
- `uv run pytest`: 138 passed
- `uv run --with ruff ruff check .`: passed
- `uv run --with ruff ruff format --check .`: passed
- `uv run whichllm --version`: 0.5.3
- `uv run --with build python -m build`: built wheel and sdist