docs: sync CLI and hardware documentation

Andyyyy64 · Andyyyy64 · commit 329ca8a01b45 · 2026-06-18T20:25:15.000+09:00
diff --git a/README.md b/README.md
@@ -56,6 +56,13 @@ whichllm
 # Pretend you have a specific GPU
 whichllm --gpu "RTX 4090"
 
+# Only show models that fit fully in GPU VRAM
+whichllm --gpu-only
+whichllm --fit full-gpu --status
+
+# Simulate a multi-GPU workstation
+whichllm --gpu "2x RTX 4090"
+
 # Compare upgrade candidates
 whichllm upgrade "RTX 4090" "RTX 5090" "H100"
 
@@ -104,6 +111,9 @@ data, this is not a static list):
 | CPU only | — | `gpt-oss-20b` (MoE) · Q4_K_M · score 45.2 | ~6 t/s |
 
 `whichllm --gpu "<your card>"` simulates any of these before you buy.
+By default, rankings include full-GPU, partial-offload, and CPU-only
+candidates when they are usable. Use `--gpu-only` or `--fit full-gpu` when
+you only want models that fit entirely in GPU VRAM.
 
 ## Why whichllm?
 
@@ -136,14 +146,16 @@ whichllm is built to get right.
 
 ## Features
 
-- **Auto-detect hardware** — NVIDIA, AMD, Apple Silicon, CPU-only
+- **Auto-detect hardware** — NVIDIA, AMD, Intel, Apple Silicon, CPU-only
 - **Smart ranking** — Scores models by VRAM fit, speed, and benchmark quality
 - **One-command chat** — `whichllm run` downloads and starts a chat session instantly
 - **Code snippets** — `whichllm snippet` prints ready-to-run Python for any model
 - **Live data** — Fetches models directly from HuggingFace (cached for performance)
 - **Benchmark-aware** — Integrates real eval scores with confidence-based dampening
 - **Task profiles** — Filter by general, coding, vision, or math use cases
 - **GPU simulation** — Test with any GPU: `whichllm --gpu "RTX 4090"`
+- **Multi-GPU simulation** — Repeat `--gpu`, use commas, or write `2x RTX 4090`
+- **Full-GPU filter** — `--gpu-only` / `--fit full-gpu` hides offload candidates
 - **Hardware planning** — Reverse lookup: `whichllm plan "llama 3 70b"`
 - **Upgrade planning** — Compare your current machine with candidate GPUs
 - **JSON output** — Pipe-friendly: `whichllm --json`
@@ -206,12 +218,23 @@ whichllm --gpu "RTX 4090"
 whichllm --gpu "RTX 5090"
 # Specify variant
 whichllm --gpu "RTX 5060 16"
+# Simulate multiple GPUs
+whichllm --gpu "2x RTX 4090"
+whichllm --gpu "RTX 4090" --gpu "RTX 3090"
+whichllm --gpu "RTX 4090, RTX 3090"
+
+# Only show models that fit entirely in GPU VRAM
+whichllm --gpu-only
+whichllm --fit full-gpu --status
 
 # CPU-only mode
 whichllm --cpu-only
 
 # More results / filters
 whichllm --top 20
+whichllm --status
+whichllm --profile coding
+whichllm --context-length 64k
 whichllm --quant Q4_K_M
 whichllm --min-speed 30
 whichllm --evidence base   # allow id/base-model matches
@@ -245,9 +268,11 @@ whichllm snippet "qwen 7b"
 whichllm snippet "llama 3 8b gguf" --quant Q5_K_M
 ```
 
-JSON model rows include `estimated_tok_per_sec`, `speed_confidence`,
-`speed_range_tok_per_sec`, and `speed_notes`. The speed range is a planning
-range, not a live benchmark.
+JSON model rows include `fit_type`, `vram_required_bytes`,
+`vram_available_bytes`, `uses_multi_gpu`, `multi_gpu_effective_vram_bytes`,
+`estimated_tok_per_sec`, `speed_confidence`, `speed_range_tok_per_sec`,
+`speed_notes`, `benchmark_source`, and `benchmark_confidence`. The speed range
+is a planning range, not a live benchmark.
 
 ## Integrations
 
@@ -334,13 +359,14 @@ Speed markers in `--status`:
    Inheritance is rejected when a model's params diverge more than 2× from
    its family's dominant member, catching draft / MTP / abliterated forks
    that share a `family_id` with a much larger base.
-4. **Cache** — `~/.cache/whichllm/`:
+4. **Cache** — normally `~/.cache/whichllm/`, or `$XDG_CACHE_HOME/whichllm/`
+   when `XDG_CACHE_HOME` is set to an absolute path:
    - `models.json` — 6h TTL
    - `benchmark.json` — 24h TTL
 
 ### Ranking engine
 
-1. **Hardware detection** — NVIDIA (nvidia-ml-py), AMD (dbgpu/ROCm), Apple Silicon (Metal), CPU cores, RAM, disk
+1. **Hardware detection** — NVIDIA (nvidia-ml-py), AMD (ROCm/dbgpu), Intel, Apple Silicon (Metal), CPU cores, RAM, disk
 2. **VRAM estimation** — Weights + KV cache + activation + framework overhead (~500MB)
 3. **Compatibility** — Full GPU / Partial Offload / CPU-only; compute capability and OS checks
 4. **Speed** — tok/s from GPU memory bandwidth, quantization, backend, fit type, and MoE active parameters
@@ -352,7 +378,8 @@ Speed markers in `--status`:
 ```
 src/whichllm/
 ├── cli.py              # Typer CLI: main, plan, run, snippet, hardware
-├── constants.py        # GPU bandwidth, quantization bytes, compute capability
+├── constants.py        # Backward-compatible exports for registry data
+├── data/               # GPU, quantization, framework, and lineage registries
 ├── hardware/
 │   ├── detector.py     # Orchestrates GPU/CPU/RAM detection
 │   ├── nvidia.py       # NVIDIA GPU via nvidia-ml-py
@@ -376,7 +403,11 @@ src/whichllm/
 │   ├── ranker.py       # Scoring, evidence filter, profile/match
 │   └── types.py        # CompatibilityResult
 └── output/
-    └── display.py      # Rich table, JSON output, hardware/plan displays
+    ├── ranking.py      # Rich hardware and recommendation tables
+    ├── json_output.py  # Ranking, plan, and upgrade JSON
+    ├── plan.py         # plan command display
+    ├── upgrade.py      # upgrade comparison display
+    └── display.py      # Compatibility re-export shim
 ```
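
The cache-location rule documented in the README hunk above can be sketched in Python as follows. This is a minimal illustration, not whichllm's actual code: `resolve_cache_dir` and its `env` parameter are hypothetical names introduced here. Only the rule itself comes from the docs: use `$XDG_CACHE_HOME/whichllm/` when the variable holds an absolute path, otherwise `~/.cache/whichllm/`.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the cache directory implied by the documented rule.

    Hypothetical helper: the name and signature are assumptions for this sketch.
    """
    env = os.environ if env is None else env
    xdg = env.get("XDG_CACHE_HOME", "")
    # Only an absolute XDG_CACHE_HOME is honored; empty or relative values are ignored.
    if xdg and os.path.isabs(xdg):
        return Path(xdg) / "whichllm"
    return Path.home() / ".cache" / "whichllm"


if __name__ == "__main__":
    print(resolve_cache_dir())
```

Passing the environment as a mapping keeps the sketch easy to check without touching the real process environment.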
 
 ## Development
diff --git a/docs/README.ja.md b/docs/README.ja.md
@@ -60,15 +60,26 @@ whichllm
 whichllm --gpu "RTX 4090"
 whichllm --gpu "Apple M3 Max"
 
+# Simulate multiple GPUs
+whichllm --gpu "2x RTX 4090"
+whichllm --gpu "RTX 4090" --gpu "RTX 3090"
+
+# Only show candidates that fit entirely in GPU VRAM
+whichllm --gpu-only
+whichllm --fit full-gpu --status
+
 # Evaluate as CPU-only
 whichllm --cpu-only
 
 # Output as JSON
 whichllm --json
 ```
 
-Each JSON model row includes `estimated_tok_per_sec` as well as
-`speed_confidence`, `speed_range_tok_per_sec`, and `speed_notes`.
+Each JSON model row includes `estimated_tok_per_sec` as well as `fit_type`,
+`vram_required_bytes`, `vram_available_bytes`, `uses_multi_gpu`,
+`multi_gpu_effective_vram_bytes`, `speed_confidence`,
+`speed_range_tok_per_sec`, `speed_notes`, `benchmark_source`, and
+`benchmark_confidence`.
 Speeds are not measured values; they are estimates derived from hardware and model information.
 
 ## Main commands
@@ -79,7 +90,9 @@ whichllm --top 20
 whichllm --quant Q4_K_M
 whichllm --min-speed 30
 whichllm --profile coding
+whichllm --context-length 64k
 whichllm --status
+whichllm --gpu-only
 
 # Benchmark evidence strictness
 whichllm --evidence strict
@@ -144,7 +157,12 @@ whichllm hardware
 5. For each candidate, compute VRAM, compatibility, speed, speed-estimate confidence, and score.
 6. Keep and display the best candidate for each family.
 
-Caches are stored in `~/.cache/whichllm/`.
+By default, full GPU, partial offload, and CPU-only candidates are shown
+together. To see only models that fit entirely in GPU VRAM, use `--gpu-only`
+or `--fit full-gpu`.
+
+Caches are normally stored in `~/.cache/whichllm/`. If `XDG_CACHE_HOME` is set
+to an absolute path, `whichllm/` under that directory is used instead.
 
 - `models.json`: 6 hours
 - `benchmark.json`: 24 hours
@@ -154,7 +172,8 @@ whichllm hardware
 ```text
 src/whichllm/
 ├── cli.py              # Typer CLI: main, plan, upgrade, run, snippet, hardware
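-├── constants.py        # GPU bandwidth, quantization, generation adjustments, compute capability
+├── constants.py        # Registry re-exports for compatibility
+├── data/               # GPU, quantization, framework, and lineage registries
 ├── hardware/           # Hardware detection and GPU simulation
 ├── models/             # HuggingFace fetching, benchmarks, caching, grouping
 ├── engine/             # VRAM, compatibility, speed, ranking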
diff --git a/docs/cli.md b/docs/cli.md
@@ -34,6 +34,10 @@ Common options:
 | `--vram` | Override simulated GPU VRAM in GB. Requires `--gpu` |
 | `--version` | Print the installed package version |
 
+`--fit any` is the default. It can include full-GPU, partial-offload, and
+CPU-only candidates when they are runnable. `--fit full-gpu` and `--gpu-only`
+keep only rows whose `fit_type` is `full_gpu`.
+
 Examples:
 
 ```bash
@@ -56,11 +60,17 @@ Ranking JSON model rows include:
 
 | Field | Meaning |
 | --- | --- |
+| `fit_type` | Runtime fit classification: `full_gpu`, `partial_offload`, or `cpu_only` |
+| `vram_required_bytes` | Estimated runtime memory requirement for the candidate |
+| `vram_available_bytes` | GPU memory budget used for the fit check |
+| `uses_multi_gpu` | Whether the fit check used more than one GPU |
+| `multi_gpu_effective_vram_bytes` | Conservative effective VRAM budget for multi-GPU fits, when applicable |
 | `estimated_tok_per_sec` | Point estimate used by ranking |
 | `speed_confidence` | `high`, `medium`, or `low` |
 | `speed_range_tok_per_sec` | Estimated lower/upper tok/s range, when available |
 | `speed_notes` | Short reasons for the confidence level |
-| `benchmark_source` | How the speed estimate was derived: `direct`, `variant`, `base_model`, `line_interp`, `self_reported`, or `none` |
+| `benchmark_status` | Display marker category for benchmark evidence |
+| `benchmark_source` | How benchmark evidence was matched: `direct`, `variant`, `base_model`, `line_interp`, `self_reported`, or `none` |
 | `benchmark_confidence` | Confidence in the benchmark match, `0.0`–`1.0` |
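
To show how the fields in the table above might be consumed, here is a small Python sketch that mimics `--fit full-gpu` on already-parsed rows and computes the gap between the memory budget and the requirement. It is an illustration under assumptions: the top-level layout of `whichllm --json` is not described here, so the functions take a plain list of row dictionaries; the helper names, the `id` key, and every sample value are hypothetical.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def full_gpu_rows(rows: List[Row]) -> List[Row]:
    """Keep only rows whose fit_type is full_gpu."""
    return [row for row in rows if row.get("fit_type") == "full_gpu"]


def vram_headroom_bytes(row: Row) -> int:
    """Budget minus estimated requirement; a negative value suggests the budget is exceeded."""
    return int(row["vram_available_bytes"]) - int(row["vram_required_bytes"])


# Hypothetical sample rows: the ids and byte counts are invented for illustration.
sample_rows: List[Row] = [
    {"id": "example/model-a", "fit_type": "full_gpu",
     "vram_required_bytes": 6 * 1024**3, "vram_available_bytes": 24 * 1024**3},
    {"id": "example/model-b", "fit_type": "partial_offload",
     "vram_required_bytes": 40 * 1024**3, "vram_available_bytes": 24 * 1024**3},
]

if __name__ == "__main__":
    for row in full_gpu_rows(sample_rows):
        print(row["id"], vram_headroom_bytes(row))
```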
 
 ## `hardware`
diff --git a/docs/hardware.md b/docs/hardware.md
@@ -3,8 +3,9 @@
 whichllm detects the current machine and can also simulate hardware for
 purchase planning.
 
-The source of truth is the `hardware/` package plus GPU constants in
-`constants.py`.
+The source of truth is the `hardware/` package plus curated registry data in
+`data/gpu.py`. `constants.py` remains as a compatibility export layer for older
+imports.
 
 ## Detected data
 
@@ -34,14 +35,22 @@ NVIDIA detection tries `nvidia-ml-py` first. If NVML is unavailable, fails to
 initialize, or returns no devices, whichllm falls back to:
 
 ```bash
-nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits
+nvidia-smi --query-gpu=name,memory.total,clocks.max.memory --format=csv,noheader,nounits
 ```
 
-For known cards, `constants.py` provides:
+If a driver rejects `clocks.max.memory`, whichllm retries the older
+`name,memory.total` query.
+
+For known cards, curated data and strict `dbgpu` lookups provide:
 
 - memory bandwidth
 - compute capability
 
+The max memory clock is used when a marketing name covers multiple memory
+types. For example, GTX 1650 GDDR5 and GDDR6 cards share the same broad driver
+name, so whichllm uses the reported memory clock when available and falls back
+to the conservative bandwidth when it is not.
+
 DGX Spark / NVIDIA GB10 uses unified system memory. When the driver reports
 `memory.total` as unavailable, whichllm treats GB10 as shared memory and uses
 system RAM for fit checks.
@@ -104,7 +113,8 @@ CPU detection reads:
 
 - `/proc/cpuinfo` on Linux
 - `sysctl` on macOS
-- `wmic` on Windows
+- `wmic` on Windows, then PowerShell / CIM when `wmic` is unavailable or only
+  returns a header
 
 Physical core count comes from `psutil`, with a Linux `/proc/cpuinfo` fallback.
 
diff --git a/docs/how-it-works.md b/docs/how-it-works.md
@@ -9,6 +9,7 @@ The implementation is intentionally split into small packages:
 src/whichllm/
 ├── cli.py
 ├── constants.py
+├── data/
 ├── hardware/
 ├── models/
 ├── engine/
@@ -39,7 +40,7 @@ returns an empty result on failure.
 
 | Module | Role |
 | --- | --- |
-| `hardware/nvidia.py` | Uses `nvidia-ml-py`; falls back to `nvidia-smi` |
+| `hardware/nvidia.py` | Uses `nvidia-ml-py`; falls back to `nvidia-smi`, including optional memory-clock data |
 | `hardware/amd.py` | Uses `rocm-smi`; falls back to `lspci` and `/sys/class/drm` |
 | `hardware/intel.py` | Detects Linux Intel iGPUs through `lspci` or sysfs |
 | `hardware/windows.py` | Detects Windows AMD and Intel fallback GPUs through WMI and registry memory fields |
@@ -82,7 +83,8 @@ missing metadata.
 
 ## Caches
 
-Both caches live under `~/.cache/whichllm/`.
+Both caches normally live under `~/.cache/whichllm/`. If `XDG_CACHE_HOME` is
+set to an absolute path, whichllm uses `$XDG_CACHE_HOME/whichllm/` instead.
 
 | File | TTL | Contents |
 | --- | --- | --- |
@@ -177,13 +179,13 @@ See [Scoring](scoring.md) for the score details.
 
 ## Output
 
-`output/display.py` renders:
+Output is split by surface:
 
-- hardware panels
-- recommendation tables
-- JSON output
-- `plan` tables and JSON
-- `upgrade` comparison tables and JSON
+- `output/ranking.py` renders hardware panels and recommendation tables.
+- `output/json_output.py` renders ranking, `plan`, and `upgrade` JSON.
+- `output/plan.py` renders `plan` tables.
+- `output/upgrade.py` renders upgrade comparison tables.
+- `output/display.py` re-exports those functions for older imports.
 
 Normal ranking tables show published date and downloads. With `--status`, the
 table instead shows memory required, estimated speed, and fit type. Speed cells
diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
@@ -41,11 +41,16 @@ If detection is unavailable or you are planning a purchase, use `--gpu`:
 whichllm --gpu "RTX 4090"
 whichllm hardware --gpu "Apple M3 Max"
 whichllm --gpu "RTX 5060 Ti" --vram 16
+whichllm --gpu "2x RTX 4090"
+whichllm --gpu "RTX 4090" --gpu "RTX 3090"
 ```
 
 Use `--vram` when the GPU name has multiple memory variants or is not in the
 database.
 
+`--vram` only applies to a single simulated GPU. For multi-GPU simulation, use
+known GPU names and omit `--vram`.
+
 ## `--cpu-only` conflicts with `--gpu`
 
 These flags are mutually exclusive:
@@ -85,6 +90,7 @@ whichllm --refresh
 Common causes:
 
 - the selected `--quant` is too restrictive
+- `--gpu-only` or `--fit full-gpu` filters out partial-offload and CPU-only candidates
 - `--min-speed` is too high
 - `--evidence strict` filters out all candidates
 - the requested context length is too large
@@ -97,6 +103,23 @@ For very small machines, remove optional filters first:
 whichllm --top 20
 ```
 
+## Recommendations use RAM or CPU offload, but I only want VRAM
+
+By default, whichllm includes any runnable candidate: full-GPU, partial-offload,
+and CPU-only. This is useful for finding what can run at all, but it can be too
+loose when you want only models that fit entirely in GPU VRAM.
+
+Use:
+
+```bash
+whichllm --gpu-only
+whichllm --fit full-gpu --status
+```
+
+If no rows are shown, this machine has no ranked candidates that fit fully in
+GPU memory under the current filters. Remove `--gpu-only`, lower the context
+length, or try a smaller quantization.
+
 ## Results look stale
 
 whichllm caches model data for 6 hours and benchmark data for 24 hours.
@@ -114,6 +137,12 @@ The caches live under:
 ~/.cache/whichllm/
 ```
 
+If `XDG_CACHE_HOME` is set to an absolute path, the caches live under:
+
+```text
+$XDG_CACHE_HOME/whichllm/
+```
+
 ## `uvx` fails with `realpath: command not found`
 
 Some older macOS versions do not include a `realpath` command. If the `uvx`