Commit 329ca8a

docs: sync CLI and hardware documentation
1 parent 8801ddc commit 329ca8a

6 files changed

Lines changed: 127 additions & 26 deletions

README.md

Lines changed: 39 additions & 8 deletions
@@ -56,6 +56,13 @@ whichllm
 # Pretend you have a specific GPU
 whichllm --gpu "RTX 4090"
 
+# Only show models that fit fully in GPU VRAM
+whichllm --gpu-only
+whichllm --fit full-gpu --status
+
+# Simulate a multi-GPU workstation
+whichllm --gpu "2x RTX 4090"
+
 # Compare upgrade candidates
 whichllm upgrade "RTX 4090" "RTX 5090" "H100"
 
@@ -104,6 +111,9 @@ data, this is not a static list):
 | CPU only || `gpt-oss-20b` (MoE) · Q4_K_M · score 45.2 | ~6 t/s |
 
 `whichllm --gpu "<your card>"` simulates any of these before you buy.
+By default, rankings include full-GPU, partial-offload, and CPU-only
+candidates when they are usable. Use `--gpu-only` or `--fit full-gpu` when
+you only want models that fit entirely in GPU VRAM.
 
 ## Why whichllm?
 
@@ -136,14 +146,16 @@ whichllm is built to get right.
 
 ## Features
 
-- **Auto-detect hardware** — NVIDIA, AMD, Apple Silicon, CPU-only
+- **Auto-detect hardware** — NVIDIA, AMD, Intel, Apple Silicon, CPU-only
 - **Smart ranking** — Scores models by VRAM fit, speed, and benchmark quality
 - **One-command chat** — `whichllm run` downloads and starts a chat session instantly
 - **Code snippets** — `whichllm snippet` prints ready-to-run Python for any model
 - **Live data** — Fetches models directly from HuggingFace (cached for performance)
 - **Benchmark-aware** — Integrates real eval scores with confidence-based dampening
 - **Task profiles** — Filter by general, coding, vision, or math use cases
 - **GPU simulation** — Test with any GPU: `whichllm --gpu "RTX 4090"`
+- **Multi-GPU simulation** — Repeat `--gpu`, use commas, or write `2x RTX 4090`
+- **Full-GPU filter** — `--gpu-only` / `--fit full-gpu` hides offload candidates
 - **Hardware planning** — Reverse lookup: `whichllm plan "llama 3 70b"`
 - **Upgrade planning** — Compare your current machine with candidate GPUs
 - **JSON output** — Pipe-friendly: `whichllm --json`
@@ -206,12 +218,23 @@ whichllm --gpu "RTX 4090"
 whichllm --gpu "RTX 5090"
 # Specify variant
 whichllm --gpu "RTX 5060 16"
+# Simulate multiple GPUs
+whichllm --gpu "2x RTX 4090"
+whichllm --gpu "RTX 4090" --gpu "RTX 3090"
+whichllm --gpu "RTX 4090, RTX 3090"
+
+# Only show models that fit entirely in GPU VRAM
+whichllm --gpu-only
+whichllm --fit full-gpu --status
 
 # CPU-only mode
 whichllm --cpu-only
 
 # More results / filters
 whichllm --top 20
+whichllm --status
+whichllm --profile coding
+whichllm --context-length 64k
 whichllm --quant Q4_K_M
 whichllm --min-speed 30
 whichllm --evidence base # allow id/base-model matches
@@ -245,9 +268,11 @@ whichllm snippet "qwen 7b"
 whichllm snippet "llama 3 8b gguf" --quant Q5_K_M
 ```
 
-JSON model rows include `estimated_tok_per_sec`, `speed_confidence`,
-`speed_range_tok_per_sec`, and `speed_notes`. The speed range is a planning
-range, not a live benchmark.
+JSON model rows include `fit_type`, `vram_required_bytes`,
+`vram_available_bytes`, `uses_multi_gpu`, `multi_gpu_effective_vram_bytes`,
+`estimated_tok_per_sec`, `speed_confidence`, `speed_range_tok_per_sec`,
+`speed_notes`, `benchmark_source`, and `benchmark_confidence`. The speed range
+is a planning range, not a live benchmark.
 
 ## Integrations
 
@@ -334,13 +359,14 @@ Speed markers in `--status`:
 Inheritance is rejected when a model's params diverge more than 2× from
 its family's dominant member, catching draft / MTP / abliterated forks
 that share a `family_id` with a much larger base.
-4. **Cache** — `~/.cache/whichllm/`:
+4. **Cache** — normally `~/.cache/whichllm/`, or `$XDG_CACHE_HOME/whichllm/`
+when `XDG_CACHE_HOME` is set to an absolute path:
 - `models.json` — 6h TTL
 - `benchmark.json` — 24h TTL
 
 ### Ranking engine
 
-1. **Hardware detection** — NVIDIA (nvidia-ml-py), AMD (dbgpu/ROCm), Apple Silicon (Metal), CPU cores, RAM, disk
+1. **Hardware detection** — NVIDIA (nvidia-ml-py), AMD (ROCm/dbgpu), Intel, Apple Silicon (Metal), CPU cores, RAM, disk
 2. **VRAM estimation** — Weights + KV cache + activation + framework overhead (~500MB)
 3. **Compatibility** — Full GPU / Partial Offload / CPU-only; compute capability and OS checks
 4. **Speed** — tok/s from GPU memory bandwidth, quantization, backend, fit type, and MoE active parameters
@@ -352,7 +378,8 @@ Speed markers in `--status`:
 ```
 src/whichllm/
 ├── cli.py # Typer CLI: main, plan, run, snippet, hardware
-├── constants.py # GPU bandwidth, quantization bytes, compute capability
+├── constants.py # Backward-compatible exports for registry data
+├── data/ # GPU, quantization, framework, and lineage registries
 ├── hardware/
 │   ├── detector.py # Orchestrates GPU/CPU/RAM detection
 │   ├── nvidia.py # NVIDIA GPU via nvidia-ml-py
@@ -376,7 +403,11 @@ src/whichllm/
 │   ├── ranker.py # Scoring, evidence filter, profile/match
 │   └── types.py # CompatibilityResult
 └── output/
-    └── display.py # Rich table, JSON output, hardware/plan displays
+    ├── ranking.py # Rich hardware and recommendation tables
+    ├── json_output.py # Ranking, plan, and upgrade JSON
+    ├── plan.py # plan command display
+    ├── upgrade.py # upgrade comparison display
+    └── display.py # Compatibility re-export shim
 ```
 
 ## Development

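The multi-GPU forms documented in the README.md changes above (repeated `--gpu`, comma-separated names, and a `2x` count prefix) can all be normalised to one name per simulated device. The following is a minimal Python sketch of that normalisation. `expand_gpu_specs` is a hypothetical helper, not taken from the whichllm source; the real parser may accept other spellings or validate names against its GPU registry.

```python
import re

# Hypothetical pattern for a leading count such as "2x RTX 4090".
# A space is required after the "x" so that ordinary names are left alone.
_COUNT_PREFIX = re.compile(r"^\s*(\d+)\s*[xX]\s+(.+)$")


def expand_gpu_specs(values):
    """Expand raw --gpu values into one GPU name per simulated device.

    `values` is the list collected from repeated --gpu options. Each value
    may itself hold several comma-separated names, and each name may carry
    a count prefix.
    """
    names = []
    for value in values:
        for part in value.split(","):
            part = part.strip()
            if not part:
                continue  # tolerate stray commas
            match = _COUNT_PREFIX.match(part)
            if match:
                count = int(match.group(1))
                names.extend([match.group(2).strip()] * count)
            else:
                names.append(part)
    return names
```

Under these assumptions, `expand_gpu_specs(["2x RTX 4090"])` and `expand_gpu_specs(["RTX 4090", "RTX 4090"])` describe the same two-card machine.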
docs/README.ja.md

Lines changed: 23 additions & 4 deletions
@@ -60,15 +60,26 @@ whichllm
 whichllm --gpu "RTX 4090"
 whichllm --gpu "Apple M3 Max"
 
+# Simulate multiple GPUs
+whichllm --gpu "2x RTX 4090"
+whichllm --gpu "RTX 4090" --gpu "RTX 3090"
+
+# Show only candidates that fit entirely in GPU VRAM
+whichllm --gpu-only
+whichllm --fit full-gpu --status
+
 # Evaluate as CPU-only
 whichllm --cpu-only
 
 # Output as JSON
 whichllm --json
 ```
 
-In addition to `estimated_tok_per_sec`, each model in the JSON includes
-`speed_confidence`, `speed_range_tok_per_sec`, and `speed_notes`.
+In addition to `estimated_tok_per_sec`, each model in the JSON includes `fit_type`,
+`vram_required_bytes`, `vram_available_bytes`, `uses_multi_gpu`,
+`multi_gpu_effective_vram_bytes`, `speed_confidence`,
+`speed_range_tok_per_sec`, `speed_notes`, `benchmark_source`, and
+`benchmark_confidence`.
 Speeds are not measured values; they are estimates derived from hardware and model information.
 
 ## Main commands
@@ -79,7 +90,9 @@ whichllm --top 20
 whichllm --quant Q4_K_M
 whichllm --min-speed 30
 whichllm --profile coding
+whichllm --context-length 64k
 whichllm --status
+whichllm --gpu-only
 
 # Strictness of benchmark evidence
 whichllm --evidence strict
@@ -144,7 +157,12 @@ whichllm hardware
 5. For each candidate, it computes VRAM, compatibility, speed, speed-estimate confidence, and score.
 6. It keeps the best candidate in each family and displays it.
 
-Caches are stored in `~/.cache/whichllm/`.
+By default, full GPU, partial offload, and CPU-only candidates are shown together. To see
+only models that fit entirely in GPU VRAM, use `--gpu-only` or
+`--fit full-gpu`.
+
+Caches are normally stored in `~/.cache/whichllm/`. If `XDG_CACHE_HOME` is
+set to an absolute path, the `whichllm/` directory under it is used instead.
 
 - `models.json`: 6 hours
 - `benchmark.json`: 24 hours
@@ -154,7 +172,8 @@ whichllm hardware
 ```text
 src/whichllm/
 ├── cli.py # Typer CLI: main, plan, upgrade, run, snippet, hardware
-├── constants.py # GPU bandwidth, quantization, generation correction, compute capability
+├── constants.py # Registry re-exports kept for compatibility
+├── data/ # GPU, quantization, framework, and lineage registries
 ├── hardware/ # Hardware detection and GPU simulation
 ├── models/ # HuggingFace fetching, benchmarks, caching, grouping
 ├── engine/ # VRAM, compatibility, speed, ranking

docs/cli.md

Lines changed: 11 additions & 1 deletion
@@ -34,6 +34,10 @@ Common options:
 | `--vram` | Override simulated GPU VRAM in GB. Requires `--gpu` |
 | `--version` | Print the installed package version |
 
+`--fit any` is the default. It can include full-GPU, partial-offload, and
+CPU-only candidates when they are runnable. `--fit full-gpu` and `--gpu-only`
+keep only rows whose `fit_type` is `full_gpu`.
+
 Examples:
 
 ```bash
@@ -56,11 +60,17 @@ Ranking JSON model rows include:
 
 | Field | Meaning |
 | --- | --- |
+| `fit_type` | Runtime fit classification: `full_gpu`, `partial_offload`, or `cpu_only` |
+| `vram_required_bytes` | Estimated runtime memory requirement for the candidate |
+| `vram_available_bytes` | GPU memory budget used for the fit check |
+| `uses_multi_gpu` | Whether the fit check used more than one GPU |
+| `multi_gpu_effective_vram_bytes` | Conservative effective VRAM budget for multi-GPU fits, when applicable |
 | `estimated_tok_per_sec` | Point estimate used by ranking |
 | `speed_confidence` | `high`, `medium`, or `low` |
 | `speed_range_tok_per_sec` | Estimated lower/upper tok/s range, when available |
 | `speed_notes` | Short reasons for the confidence level |
-| `benchmark_source` | How the speed estimate was derived: `direct`, `variant`, `base_model`, `line_interp`, `self_reported`, or `none` |
+| `benchmark_status` | Display marker category for benchmark evidence |
+| `benchmark_source` | How benchmark evidence was matched: `direct`, `variant`, `base_model`, `line_interp`, `self_reported`, or `none` |
 | `benchmark_confidence` | Confidence in the benchmark match, `0.0` to `1.0` |
 
 ## `hardware`

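The fit fields added to the ranking JSON table in docs/cli.md can be consumed downstream. The Python sketch below reproduces the documented `--fit full-gpu` rule (keep rows whose `fit_type` is `full_gpu`) and derives a VRAM headroom figure. The field names come from the table; the top-level layout of the `--json` output is not shown in this diff, so both helpers are hypothetical and simply take a list of row dictionaries.

```python
def full_gpu_rows(rows):
    """Keep only rows classified as fitting entirely in GPU VRAM.

    Mirrors the documented filter: fit_type must equal "full_gpu".
    """
    return [row for row in rows if row.get("fit_type") == "full_gpu"]


def vram_headroom_bytes(row):
    """Return available minus required VRAM, or None if either is missing.

    Both inputs are estimates, so the result is a planning figure only.
    """
    required = row.get("vram_required_bytes")
    available = row.get("vram_available_bytes")
    if required is None or available is None:
        return None
    return available - required
```

A negative headroom on a `partial_offload` row gives a rough idea of how much of the model would spill out of VRAM.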
docs/hardware.md

Lines changed: 15 additions & 5 deletions
@@ -3,8 +3,9 @@
 whichllm detects the current machine and can also simulate hardware for
 purchase planning.
 
-The source of truth is the `hardware/` package plus GPU constants in
-`constants.py`.
+The source of truth is the `hardware/` package plus curated registry data in
+`data/gpu.py`. `constants.py` remains as a compatibility export layer for older
+imports.
 
 ## Detected data
 
@@ -34,14 +35,22 @@ NVIDIA detection tries `nvidia-ml-py` first. If NVML is unavailable, fails to
 initialize, or returns no devices, whichllm falls back to:
 
 ```bash
-nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits
+nvidia-smi --query-gpu=name,memory.total,clocks.max.memory --format=csv,noheader,nounits
 ```
 
-For known cards, `constants.py` provides:
+If a driver rejects `clocks.max.memory`, whichllm retries the older
+`name,memory.total` query.
+
+For known cards, curated data and strict `dbgpu` lookups provide:
 
 - memory bandwidth
 - compute capability
 
+The max memory clock is used when a marketing name covers multiple memory
+types. For example, GTX 1650 GDDR5 and GDDR6 cards share the same broad driver
+name, so whichllm uses the reported memory clock when available and falls back
+to the conservative bandwidth when it is not.
+
 DGX Spark / NVIDIA GB10 uses unified system memory. When the driver reports
 `memory.total` as unavailable, whichllm treats GB10 as shared memory and uses
 system RAM for fit checks.
@@ -104,7 +113,8 @@ CPU detection reads:
 
 - `/proc/cpuinfo` on Linux
 - `sysctl` on macOS
-- `wmic` on Windows
+- `wmic` on Windows, then PowerShell / CIM when `wmic` is unavailable or only
+  returns a header
 
 Physical core count comes from `psutil`, with a Linux `/proc/cpuinfo` fallback.

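The `nvidia-smi` fallback described in docs/hardware.md (try the query with `clocks.max.memory`, then retry the older `name,memory.total` query) can be sketched as follows. This is an illustration under stated assumptions, not the code in `hardware/nvidia.py`: the helper names are hypothetical, and it assumes a rejected field shows up as a non-zero exit status from `nvidia-smi`.

```python
import subprocess

FULL_QUERY = "name,memory.total,clocks.max.memory"
BASIC_QUERY = "name,memory.total"


def run_nvidia_smi(fields):
    """Run nvidia-smi for the given comma-separated fields and return stdout."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={fields}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,  # non-zero exit raises CalledProcessError
        timeout=10,
    )
    return result.stdout


def parse_rows(output, fields):
    """Turn CSV output into one dict per GPU, keyed by field name."""
    keys = fields.split(",")
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue
        values = [value.strip() for value in line.split(",")]
        if len(values) != len(keys):
            continue  # skip lines that do not match the requested columns
        rows.append(dict(zip(keys, values)))
    return rows


def query_gpus(run=run_nvidia_smi):
    """Try the richer query first, then fall back to the older one."""
    for fields in (FULL_QUERY, BASIC_QUERY):
        try:
            output = run(fields)
        except (subprocess.CalledProcessError, FileNotFoundError,
                subprocess.TimeoutExpired):
            continue
        rows = parse_rows(output, fields)
        if rows:
            return rows
    return []
```

Passing the runner as a parameter keeps the retry logic testable on machines without an NVIDIA driver.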
docs/how-it-works.md

Lines changed: 10 additions & 8 deletions
@@ -9,6 +9,7 @@ The implementation is intentionally split into small packages:
 src/whichllm/
 ├── cli.py
 ├── constants.py
+├── data/
 ├── hardware/
 ├── models/
 ├── engine/
@@ -39,7 +40,7 @@ returns an empty result on failure.
 
 | Module | Role |
 | --- | --- |
-| `hardware/nvidia.py` | Uses `nvidia-ml-py`; falls back to `nvidia-smi` |
+| `hardware/nvidia.py` | Uses `nvidia-ml-py`; falls back to `nvidia-smi`, including optional memory-clock data |
 | `hardware/amd.py` | Uses `rocm-smi`; falls back to `lspci` and `/sys/class/drm` |
 | `hardware/intel.py` | Detects Linux Intel iGPUs through `lspci` or sysfs |
 | `hardware/windows.py` | Detects Windows AMD and Intel fallback GPUs through WMI and registry memory fields |
@@ -82,7 +83,8 @@ missing metadata.
 
 ## Caches
 
-Both caches live under `~/.cache/whichllm/`.
+Both caches normally live under `~/.cache/whichllm/`. If `XDG_CACHE_HOME` is
+set to an absolute path, whichllm uses `$XDG_CACHE_HOME/whichllm/` instead.
 
 | File | TTL | Contents |
 | --- | --- | --- |
@@ -177,13 +179,13 @@ See [Scoring](scoring.md) for the score details.
 
 ## Output
 
-`output/display.py` renders:
+Output is split by surface:
 
-- hardware panels
-- recommendation tables
-- JSON output
-- `plan` tables and JSON
-- `upgrade` comparison tables and JSON
+- `output/ranking.py` renders hardware panels and recommendation tables.
+- `output/json_output.py` renders ranking, `plan`, and `upgrade` JSON.
+- `output/plan.py` renders `plan` tables.
+- `output/upgrade.py` renders upgrade comparison tables.
+- `output/display.py` re-exports those functions for older imports.
 
 Normal ranking tables show published date and downloads. With `--status`, the
 table instead shows memory required, estimated speed, and fit type. Speed cells

docs/troubleshooting.md

Lines changed: 29 additions & 0 deletions
@@ -41,11 +41,16 @@ If detection is unavailable or you are planning a purchase, use `--gpu`:
 whichllm --gpu "RTX 4090"
 whichllm hardware --gpu "Apple M3 Max"
 whichllm --gpu "RTX 5060 Ti" --vram 16
+whichllm --gpu "2x RTX 4090"
+whichllm --gpu "RTX 4090" --gpu "RTX 3090"
 ```
 
 Use `--vram` when the GPU name has multiple memory variants or is not in the
 database.
 
+`--vram` only applies to a single simulated GPU. For multi-GPU simulation, use
+known GPU names and omit `--vram`.
+
 ## `--cpu-only` conflicts with `--gpu`
 
 These flags are mutually exclusive:
@@ -85,6 +90,7 @@ whichllm --refresh
 Common causes:
 
 - the selected `--quant` is too restrictive
+- `--gpu-only` or `--fit full-gpu` filters out partial-offload and CPU-only candidates
 - `--min-speed` is too high
 - `--evidence strict` filters out all candidates
 - the requested context length is too large
@@ -97,6 +103,23 @@ For very small machines, remove optional filters first:
 whichllm --top 20
 ```
 
+## Recommendations use RAM or CPU offload, but I only want VRAM
+
+By default, whichllm includes any runnable candidate: full-GPU, partial-offload,
+and CPU-only. This is useful for finding what can run at all, but it can be too
+loose when you want only models that fit entirely in GPU VRAM.
+
+Use:
+
+```bash
+whichllm --gpu-only
+whichllm --fit full-gpu --status
+```
+
+If no rows are shown, this machine has no ranked candidates that fit fully in
+GPU memory under the current filters. Remove `--gpu-only`, lower the context
+length, or try a smaller quantization.
+
 ## Results look stale
 
 whichllm caches model data for 6 hours and benchmark data for 24 hours.
@@ -114,6 +137,12 @@ The caches live under:
 ~/.cache/whichllm/
 ```
 
+If `XDG_CACHE_HOME` is set to an absolute path, the caches live under:
+
+```text
+$XDG_CACHE_HOME/whichllm/
+```
+
 ## `uvx` fails with `realpath: command not found`
 
 Some older macOS versions do not include a `realpath` command. If the `uvx`

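The cache location rule added to docs/troubleshooting.md (use `$XDG_CACHE_HOME/whichllm/` only when the variable holds an absolute path, otherwise `~/.cache/whichllm/`) can be expressed in a few lines of Python. `resolve_cache_dir` is a hypothetical name for illustration; the environment and home directory are passed in so the rule can be checked without touching the real system.

```python
import os
from pathlib import Path


def resolve_cache_dir(env=None, home=None):
    """Return the whichllm cache directory implied by the documented rule.

    A relative or empty XDG_CACHE_HOME is ignored, matching the docs'
    "set to an absolute path" condition.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    xdg = env.get("XDG_CACHE_HOME", "")
    if xdg and os.path.isabs(xdg):
        return Path(xdg) / "whichllm"
    return home / ".cache" / "whichllm"
```

When results look stale, `resolve_cache_dir()` points at the directory that holds `models.json` and `benchmark.json`.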