@@ -56,6 +56,13 @@ whichllm
 # Pretend you have a specific GPU
 whichllm --gpu "RTX 4090"

+# Only show models that fit fully in GPU VRAM
+whichllm --gpu-only
+whichllm --fit full-gpu --status
+
+# Simulate a multi-GPU workstation
+whichllm --gpu "2x RTX 4090"
+
 # Compare upgrade candidates
 whichllm upgrade "RTX 4090" "RTX 5090" "H100"

@@ -104,6 +111,9 @@ data, this is not a static list):
 | CPU only | — | `gpt-oss-20b` (MoE) · Q4_K_M · score 45.2 | ~6 t/s |

 `whichllm --gpu "<your card>"` simulates any of these before you buy.
+By default, rankings include full-GPU, partial-offload, and CPU-only
+candidates when they are usable. Use `--gpu-only` or `--fit full-gpu` when
+you only want models that fit entirely in GPU VRAM.

 ## Why whichllm?

@@ -136,14 +146,16 @@ whichllm is built to get right.

 ## Features

-- **Auto-detect hardware** — NVIDIA, AMD, Apple Silicon, CPU-only
+- **Auto-detect hardware** — NVIDIA, AMD, Intel, Apple Silicon, CPU-only
 - **Smart ranking** — Scores models by VRAM fit, speed, and benchmark quality
 - **One-command chat** — `whichllm run` downloads and starts a chat session instantly
 - **Code snippets** — `whichllm snippet` prints ready-to-run Python for any model
 - **Live data** — Fetches models directly from HuggingFace (cached for performance)
 - **Benchmark-aware** — Integrates real eval scores with confidence-based dampening
 - **Task profiles** — Filter by general, coding, vision, or math use cases
 - **GPU simulation** — Test with any GPU: `whichllm --gpu "RTX 4090"`
+- **Multi-GPU simulation** — Repeat `--gpu`, use commas, or write `2x RTX 4090`
+- **Full-GPU filter** — `--gpu-only` / `--fit full-gpu` hides offload candidates
 - **Hardware planning** — Reverse lookup: `whichllm plan "llama 3 70b"`
 - **Upgrade planning** — Compare your current machine with candidate GPUs
 - **JSON output** — Pipe-friendly: `whichllm --json`
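
The three multi-GPU spellings in the feature list (repeated `--gpu`, comma-separated names, and an `Nx` prefix) all describe the same thing: a flat list of cards. The sketch below shows one way such input could be normalized. `expand_gpu_specs` and its regular expression are hypothetical names written for illustration; they are not taken from whichllm's source, and the real parser may behave differently.

```python
import re

# Hypothetical helper for illustration only; not whichllm's actual parser.
# Matches a leading count such as "2x RTX 4090" or "2 x RTX 4090".
_COUNT_PREFIX = re.compile(r"^(\d+)\s*x\s+(.+)$", re.IGNORECASE)


def expand_gpu_specs(specs):
    """Flatten repeated, comma-separated, and 'Nx'-prefixed GPU specs."""
    gpus = []
    for spec in specs:                 # one entry per --gpu occurrence
        for part in spec.split(","):   # "RTX 4090, RTX 3090"
            name = part.strip()
            if not name:
                continue
            match = _COUNT_PREFIX.match(name)
            if match:
                count, name = int(match.group(1)), match.group(2).strip()
            else:
                count = 1
            gpus.extend([name] * count)
    return gpus
```

Under this sketch, `expand_gpu_specs(["2x RTX 4090"])` and `expand_gpu_specs(["RTX 4090", "RTX 4090"])` produce the same two-card list, so downstream code only has to handle one shape.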
@@ -206,12 +218,23 @@ whichllm --gpu "RTX 4090"
 whichllm --gpu "RTX 5090"
 # Specify variant
 whichllm --gpu "RTX 5060 16"
+# Simulate multiple GPUs
+whichllm --gpu "2x RTX 4090"
+whichllm --gpu "RTX 4090" --gpu "RTX 3090"
+whichllm --gpu "RTX 4090, RTX 3090"
+
+# Only show models that fit entirely in GPU VRAM
+whichllm --gpu-only
+whichllm --fit full-gpu --status

 # CPU-only mode
 whichllm --cpu-only

 # More results / filters
 whichllm --top 20
+whichllm --status
+whichllm --profile coding
+whichllm --context-length 64k
 whichllm --quant Q4_K_M
 whichllm --min-speed 30
 whichllm --evidence base  # allow id/base-model matches
@@ -245,9 +268,11 @@ whichllm snippet "qwen 7b"
 whichllm snippet "llama 3 8b gguf" --quant Q5_K_M
 ```

-JSON model rows include `estimated_tok_per_sec`, `speed_confidence`,
-`speed_range_tok_per_sec`, and `speed_notes`. The speed range is a planning
-range, not a live benchmark.
+JSON model rows include `fit_type`, `vram_required_bytes`,
+`vram_available_bytes`, `uses_multi_gpu`, `multi_gpu_effective_vram_bytes`,
+`estimated_tok_per_sec`, `speed_confidence`, `speed_range_tok_per_sec`,
+`speed_notes`, `benchmark_source`, and `benchmark_confidence`. The speed range
+is a planning range, not a live benchmark.

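
The new VRAM fields make it possible to post-process JSON rows in a script. The sketch below uses only the field names listed above. The helper names are hypothetical, and treating `vram_available_bytes - vram_required_bytes` as "headroom" is an assumption about how the two fields relate. The set of valid `fit_type` strings is not documented here, so the value is passed in by the caller rather than hard-coded.

```python
def vram_headroom_bytes(row):
    """Bytes of VRAM left after loading the model (negative if it does not fit).

    Assumes both fields are integers measured against the same device(s).
    """
    return row["vram_available_bytes"] - row["vram_required_bytes"]


def rows_with_fit(rows, fit_type):
    """Keep rows whose fit_type equals the given value, largest headroom first."""
    selected = [row for row in rows if row.get("fit_type") == fit_type]
    return sorted(selected, key=vram_headroom_bytes, reverse=True)
```

Loading the rows is left out on purpose, since the top-level shape of the `--json` payload is not described here.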
 ## Integrations

@@ -334,13 +359,14 @@ Speed markers in `--status`:
    Inheritance is rejected when a model's params diverge more than 2× from
    its family's dominant member, catching draft / MTP / abliterated forks
    that share a `family_id` with a much larger base.
-4. **Cache** — `~/.cache/whichllm/`:
+4. **Cache** — normally `~/.cache/whichllm/`, or `$XDG_CACHE_HOME/whichllm/`
+   when `XDG_CACHE_HOME` is set to an absolute path:
    - `models.json` — 6h TTL
    - `benchmark.json` — 24h TTL

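
The cache rule on the added lines (use `$XDG_CACHE_HOME/whichllm/` only when the variable holds an absolute path, otherwise fall back to `~/.cache/whichllm/`) together with the two TTLs can be sketched as follows. The function names and the `TTL_HOURS` table are hypothetical; this illustrates the described behavior and is not the project's implementation.

```python
import os
from pathlib import Path

# TTLs as stated in the cache list; the dict itself is illustrative.
TTL_HOURS = {"models.json": 6, "benchmark.json": 24}


def cache_dir(environ=None):
    """Resolve the cache directory from an environment mapping."""
    env = os.environ if environ is None else environ
    xdg = env.get("XDG_CACHE_HOME", "")
    # Relative or empty values are ignored, as the XDG spec recommends.
    if xdg and os.path.isabs(xdg):
        return Path(xdg) / "whichllm"
    return Path.home() / ".cache" / "whichllm"


def is_fresh(filename, age_seconds):
    """True while a cached file is younger than its TTL."""
    return age_seconds < TTL_HOURS[filename] * 3600
```

Passing the environment as a parameter keeps the function easy to test without modifying `os.environ`.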
 ### Ranking engine

-1. **Hardware detection** — NVIDIA (nvidia-ml-py), AMD (dbgpu/ROCm), Apple Silicon (Metal), CPU cores, RAM, disk
+1. **Hardware detection** — NVIDIA (nvidia-ml-py), AMD (ROCm/dbgpu), Intel, Apple Silicon (Metal), CPU cores, RAM, disk
 2. **VRAM estimation** — Weights + KV cache + activation + framework overhead (~500MB)
 3. **Compatibility** — Full GPU / Partial Offload / CPU-only; compute capability and OS checks
 4. **Speed** — tok/s from GPU memory bandwidth, quantization, backend, fit type, and MoE active parameters
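
The VRAM estimation step sums four terms: weights, KV cache, activations, and a framework overhead of roughly 500MB. The sketch below is a rough back-of-the-envelope version of that sum, not whichllm's formula. It assumes the common transformer KV-cache size (keys plus values per layer, per KV head, per position), reads "~500MB" as 500 MiB, and leaves the activation term to the caller because the source does not say how it is computed. All names are hypothetical.

```python
# Assumption: "~500MB" is read as 500 MiB here.
FRAMEWORK_OVERHEAD_BYTES = 500 * 1024**2


def estimate_vram_bytes(n_params, bits_per_weight, n_layers, n_kv_heads,
                        head_dim, context_len, kv_bytes_per_elem=2,
                        activation_bytes=0):
    """Weights + KV cache + activation + framework overhead, in bytes."""
    weights = n_params * bits_per_weight / 8
    # Factor 2 covers keys and values; one entry per layer, KV head, position.
    kv_cache = (2 * n_layers * n_kv_heads * head_dim
                * context_len * kv_bytes_per_elem)
    return int(weights + kv_cache + activation_bytes + FRAMEWORK_OVERHEAD_BYTES)
```

Real quantization formats carry extra metadata per block, so an effective `bits_per_weight` slightly above the nominal value is usually more realistic.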
@@ -352,7 +378,8 @@ Speed markers in `--status`:
 ```
 src/whichllm/
 ├── cli.py               # Typer CLI: main, plan, run, snippet, hardware
-├── constants.py         # GPU bandwidth, quantization bytes, compute capability
+├── constants.py         # Backward-compatible exports for registry data
+├── data/                # GPU, quantization, framework, and lineage registries
 ├── hardware/
 │   ├── detector.py      # Orchestrates GPU/CPU/RAM detection
 │   ├── nvidia.py        # NVIDIA GPU via nvidia-ml-py
@@ -376,7 +403,11 @@ src/whichllm/
 │   ├── ranker.py        # Scoring, evidence filter, profile/match
 │   └── types.py         # CompatibilityResult
 └── output/
-    └── display.py       # Rich table, JSON output, hardware/plan displays
+    ├── ranking.py       # Rich hardware and recommendation tables
+    ├── json_output.py   # Ranking, plan, and upgrade JSON
+    ├── plan.py          # plan command display
+    ├── upgrade.py       # upgrade comparison display
+    └── display.py       # Compatibility re-export shim
 ```

 ## Development