Commit 5778034

feat: improve ranking output controls
1 parent 329ca8a commit 5778034

23 files changed

Lines changed: 1104 additions & 80 deletions

CHANGELOG.md

Lines changed: 22 additions & 0 deletions
@@ -4,6 +4,28 @@ All notable changes to this project will be documented in this file.

 The format is based on [Keep a Changelog](https://keepachangelog.com/), and this project adheres to [Semantic Versioning](https://semver.org/).

+## [Unreleased]
+
+## [0.5.12] - 2026-06-18
+
+### Added
+
+- Default ranking tables now show memory required, estimated generation speed,
+fit type, and published date. Download counts are still available with
+`--details`.
+- Added `--speed any|usable|fast` as named generation-speed filters while
+keeping `--min-speed` for exact tok/s thresholds.
+- Added `--fit gpu` as a natural alias for full-GPU-only recommendations.
+- Added `--markdown` / `-m` for pasteable GitHub-Flavored Markdown ranking
+tables. (#111)
+- Added `--vram-headroom` and `--ram-budget` so users can avoid edge VRAM fits
+and cap partial-offload planning to available or fixed system RAM.
+
+### Changed
+
+- Speed color now reflects practical generation speed. `~` and `?` remain
+estimate-confidence markers instead of being the primary speed color.
+
 ## [0.5.11] - 2026-06-18

 ### Added

README.md

Lines changed: 41 additions & 5 deletions
@@ -58,11 +58,18 @@ whichllm --gpu "RTX 4090"

 # Only show models that fit fully in GPU VRAM
 whichllm --gpu-only
-whichllm --fit full-gpu --status
+whichllm --fit gpu

 # Simulate a multi-GPU workstation
 whichllm --gpu "2x RTX 4090"

+# Hide models that are technically runnable but too slow
+whichllm --speed usable
+whichllm --speed fast
+
+# Pasteable GitHub / Slack / Discord output
+whichllm --markdown
+
 # Compare upgrade candidates
 whichllm upgrade "RTX 4090" "RTX 5090" "H100"

@@ -114,6 +121,10 @@ data, this is not a static list):
 By default, rankings include full-GPU, partial-offload, and CPU-only
 candidates when they are usable. Use `--gpu-only` or `--fit full-gpu` when
 you only want models that fit entirely in GPU VRAM.
+The default table shows memory, estimated generation speed, fit type, and
+published date. Speed is colored by practical usability: under 4 tok/s is red,
+4-10 is yellow, 10-30 is green, and 30+ is bright green. `~` / `?` still mark
+estimate confidence.

 ## Why whichllm?

@@ -156,6 +167,9 @@ whichllm is built to get right.
 - **GPU simulation** — Test with any GPU: `whichllm --gpu "RTX 4090"`
 - **Multi-GPU simulation** — Repeat `--gpu`, use commas, or write `2x RTX 4090`
 - **Full-GPU filter** — `--gpu-only` / `--fit full-gpu` hides offload candidates
+- **Speed-aware filtering** — `--speed usable|fast` hides slow rows by threshold
+- **Markdown output** — `--markdown` / `-m` prints pasteable GFM tables
+- **Runtime memory budgets** — `--vram-headroom` and `--ram-budget` avoid edge fits
 - **Hardware planning** — Reverse lookup: `whichllm plan "llama 3 70b"`
 - **Upgrade planning** — Compare your current machine with candidate GPUs
 - **JSON output** — Pipe-friendly: `whichllm --json`
@@ -225,18 +239,28 @@ whichllm --gpu "RTX 4090, RTX 3090"

 # Only show models that fit entirely in GPU VRAM
 whichllm --gpu-only
-whichllm --fit full-gpu --status
+whichllm --fit gpu
+whichllm --fit full-gpu
+
+# Avoid edge fits and background-RAM surprises
+whichllm --vram-headroom 1.5GB
+whichllm --ram-budget available
+whichllm --ram-budget 8GB

 # CPU-only mode
 whichllm --cpu-only

 # More results / filters
 whichllm --top 20
-whichllm --status
+whichllm --details # show Downloads metadata instead of runtime columns
+whichllm --speed usable # minimum 10 tok/s
+whichllm --speed fast # minimum 30 tok/s
+whichllm --min-speed 4 # exact tok/s floor
+whichllm --markdown # pasteable GitHub-Flavored Markdown table
 whichllm --profile coding
 whichllm --context-length 64k
 whichllm --quant Q4_K_M
-whichllm --min-speed 30
+whichllm --min-speed 30 # exact tok/s floor
 whichllm --evidence base # allow id/base-model matches
 whichllm --evidence strict # id-exact only (same as --direct)
 whichllm --direct
@@ -268,6 +292,14 @@ whichllm snippet "qwen 7b"
 whichllm snippet "llama 3 8b gguf" --quant Q5_K_M
 ```

+Markdown output is intended for GitHub issues, READMEs, Slack, Discord, and
+blog posts:
+
+```bash
+whichllm --markdown
+whichllm -m --top 5 --gpu "RTX 4090"
+```
+
 JSON model rows include `fit_type`, `vram_required_bytes`,
 `vram_available_bytes`, `uses_multi_gpu`, `multi_gpu_effective_vram_bytes`,
 `estimated_tok_per_sec`, `speed_confidence`, `speed_range_tok_per_sec`,
@@ -323,7 +355,11 @@ Score markers:
 - **`!sr`** (bright yellow) — Uploader-reported benchmark only, not independently verified
 - **`?`** (red) — No benchmark data available

-Speed markers in `--status`:
+Speed display:
+- **red** — Slow generation speed (`<4 tok/s`)
+- **yellow** — Marginal generation speed (`4-10 tok/s`)
+- **green** — Usable generation speed (`10-30 tok/s`)
+- **bright green** — Fast local generation speed (`>=30 tok/s`)
 - **`~`** (yellow) — Estimated tok/s range is available
 - **`?`** (red) — Low-confidence speed estimate; backend/runtime sensitivity is high

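The speed bands added in this hunk amount to a threshold lookup. A minimal Python sketch follows, assuming that each boundary value (4, 10, 30 tok/s) belongs to the faster band (the README states this only for `>=30`) and that the confidence marker is printed as a prefix. `speed_color` and `format_speed` are hypothetical names for illustration, not whichllm's internals.

```python
def speed_color(tok_per_sec: float) -> str:
    """Return the color band for an estimated generation speed.

    Thresholds come from the README text; treating each boundary as
    part of the faster band is an assumption of this sketch.
    """
    if tok_per_sec < 4:
        return "red"
    if tok_per_sec < 10:
        return "yellow"
    if tok_per_sec < 30:
        return "green"
    return "bright green"


def format_speed(tok_per_sec: float, confidence: str) -> str:
    """Render a speed cell; the marker reflects confidence, not speed.

    Prefix placement of the marker is an assumption.
    """
    marker = {"medium": "~", "low": "?"}.get(confidence, "")
    return f"{marker}{tok_per_sec:.1f} tok/s"


if __name__ == "__main__":
    for speed, confidence in [(2.5, "low"), (7.0, "medium"), (45.0, "medium")]:
        print(speed_color(speed), format_speed(speed, confidence))
```

Keeping the two functions separate mirrors the change in this commit: color answers whether the speed is practical, while `~` / `?` answer how far the number can be trusted.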
docs/README.ja.md

Lines changed: 30 additions & 4 deletions
@@ -66,7 +66,19 @@ whichllm --gpu "RTX 4090" --gpu "RTX 3090"

 # Show only candidates that fit entirely in GPU VRAM
 whichllm --gpu-only
-whichllm --fit full-gpu --status
+whichllm --fit gpu
+
+# Set a minimum speed floor
+whichllm --speed usable
+whichllm --speed fast
+whichllm --min-speed 4
+
+# Print a Markdown table that pastes cleanly into GitHub or Slack
+whichllm --markdown
+
+# Set runtime memory headroom and RAM usage
+whichllm --vram-headroom 1.5GB
+whichllm --ram-budget available

 # Evaluate as CPU-only
 whichllm --cpu-only
@@ -81,6 +93,10 @@ Each JSON model row contains `estimated_tok_per_sec` as well as `fit_type`,
 `speed_range_tok_per_sec`, `speed_notes`, `benchmark_source`,
 and `benchmark_confidence`.
 Speed is not a measured value; it is an estimate derived from hardware and model information.
+The default table shows memory required, estimated generation speed, fit type, and Published.
+Use `--details` when you also want to see Downloads.
+To paste into a GitHub issue, README, Slack, or Discord, use `--markdown` / `-m`
+to print the output as a Markdown table.

 ## Main commands

@@ -89,9 +105,12 @@ Each JSON model row contains `estimated_tok_per_sec` as well as `fit_type`,
 whichllm --top 20
 whichllm --quant Q4_K_M
 whichllm --min-speed 30
+whichllm --speed usable
+whichllm --speed fast
+whichllm --markdown
 whichllm --profile coding
 whichllm --context-length 64k
-whichllm --status
+whichllm --details
 whichllm --gpu-only

 # Strictness of benchmark evidence
@@ -140,8 +159,12 @@ whichllm hardware
 - `!sr`: Score based only on uploader-reported evaluation values
 - `?`: Score with no usable benchmark evidence

-Markers in the `--status` speed column:
+Speed display:

+- Red: slow generation speed, under `4 tok/s`
+- Yellow: barely usable generation speed, `4-10 tok/s`
+- Green: practical generation speed, `10-30 tok/s`
+- Bright green: fast local generation speed, `30 tok/s` or higher
 - `~`: Normal estimate that comes with a speed range
 - `?`: Low-confidence estimate that depends heavily on the backend or runtime

@@ -159,7 +182,10 @@ whichllm hardware

 By default, full GPU, partial offload, and CPU-only candidates are shown together. To see only
 models that fit entirely in GPU VRAM, use `--gpu-only`
-or `--fit full-gpu`.
+or `--fit gpu`. To exclude slow candidates from the start,
+use `--speed usable` or `--speed fast`.
+Their floors are `10 tok/s` and `30 tok/s`, respectively.
+To set a lower floor, pass a number, as in `--min-speed 4`.

 The cache is normally stored in `~/.cache/whichllm/`. If `XDG_CACHE_HOME`
 is set to an absolute path, `whichllm/` under that path is used instead.
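The named floors described in this file (`usable` at 10 tok/s, `fast` at 30 tok/s, with `--min-speed` for anything else) can be pictured as a small lookup. The Python sketch below is illustrative only: `resolve_min_speed` and `keep_row` are hypothetical names, and two behaviors are assumptions not stated in the diff, namely that the stricter floor wins when both flags are given and that a row exactly at the floor is kept.

```python
from typing import Optional

# Named floors in tok/s, as documented for --speed in this commit.
SPEED_FLOORS = {"any": None, "usable": 10.0, "fast": 30.0}


def resolve_min_speed(speed: str = "any",
                      min_speed: Optional[float] = None) -> Optional[float]:
    """Combine a named --speed value with an exact --min-speed value.

    Assumption: when both are given, the stricter (higher) floor applies.
    """
    if speed not in SPEED_FLOORS:
        raise ValueError(f"unknown speed name: {speed!r}")
    floors = [f for f in (SPEED_FLOORS[speed], min_speed) if f is not None]
    return max(floors) if floors else None


def keep_row(estimated_tok_per_sec: float, floor: Optional[float]) -> bool:
    """Keep a ranking row if it meets the floor (inclusive, by assumption)."""
    return floor is None or estimated_tok_per_sec >= floor


if __name__ == "__main__":
    floor = resolve_min_speed("usable")
    print([s for s in (3.0, 9.9, 10.0, 42.0) if keep_row(s, floor)])
```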

docs/cli.md

Lines changed: 35 additions & 7 deletions
@@ -19,24 +19,40 @@ Common options:
 | `--top`, `-n` | Number of ranked models to show. Default: `10` |
 | `--context-length`, `-c` | Context length used for KV cache estimation. Accepts integers or `k` shorthand such as `64k`. Default: `4096` |
 | `--quant`, `-q` | Keep only a quantization type such as `Q4_K_M` |
-| `--min-speed` | Keep only models above a tok/s estimate |
-| `--fit` | Runtime fit filter: `any` or `full-gpu` |
+| `--min-speed` | Keep only models above an exact tok/s estimate |
+| `--speed` | Named speed floor: `any`, `usable` (`10 tok/s`), or `fast` (`30 tok/s`) |
+| `--fit` | Runtime fit filter: `any`, `gpu`, or `full-gpu` |
 | `--gpu-only` | Alias for `--fit full-gpu`; excludes partial offload and CPU-only candidates |
 | `--profile` | Ranking profile: `general`, `coding`, `vision`, `math`, `any` |
 | `--evidence` | Benchmark evidence filter: `strict`, `base`, `any` |
 | `--direct` | Alias for `--evidence strict` |
-| `--status` | Show VRAM/RAM, speed, and fit columns instead of published/download columns. Speed may include `~` for estimated range or `?` for low confidence |
+| `--status` | Compatibility option. Runtime columns are now shown by default |
+| `--details` | Show download metadata instead of runtime columns |
 | `--min-params` | Minimum model knowledge capacity in billions of parameters |
 | `--json` | Print machine-readable JSON |
+| `--markdown`, `-m` | Print a pasteable GitHub-Flavored Markdown table |
 | `--refresh` | Ignore caches and fetch models/benchmarks again |
 | `--cpu-only` | Ignore GPUs and rank for CPU-only use |
 | `--gpu` | Simulate GPU(s) by name. Accepts repeated flags, comma-separated values, and count shorthand |
 | `--vram` | Override simulated GPU VRAM in GB. Requires `--gpu` |
+| `--vram-headroom` | Reserve per-GPU memory for runtime overhead. Default: `auto`. Accepts `none`, byte values like `1.5GB`, or percentages like `10%` |
+| `--ram-budget` | Cap RAM available for partial offload. Accepts `available`, byte values like `8GB`, or percentages like `50%` |
 | `--version` | Print the installed package version |

 `--fit any` is the default. It can include full-GPU, partial-offload, and
-CPU-only candidates when they are runnable. `--fit full-gpu` and `--gpu-only`
-keep only rows whose `fit_type` is `full_gpu`.
+CPU-only candidates when they are runnable. `--fit gpu`, `--fit full-gpu`, and
+`--gpu-only` keep only rows whose `fit_type` is `full_gpu`.
+
+The default table shows memory required, estimated generation speed, fit type,
+and published date. Use `--details` when you want download counts instead.
+Speed colors are absolute usability hints: red is under `4 tok/s`, yellow is
+`4-10 tok/s`, green is `10-30 tok/s`, and bright green is `30+ tok/s`. The `~`
+and `?` markers still refer to estimate confidence, not speed quality.
+
+`--vram-headroom auto` subtracts a small budget from each GPU before fit
+checks, so near-edge recommendations are less likely to overflow in tools such
+as LM Studio. Use `--vram-headroom none` to restore the raw detected VRAM.
+`--ram-budget available` caps offload planning to current available RAM.

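The value forms documented for `--vram-headroom` and `--ram-budget` (`none`, byte values such as `1.5GB`, percentages such as `10%`) suggest a small parser. The Python sketch below is a guess at the shape of such a parser, not whichllm's code: `parse_amount` and `usable_vram_bytes` are hypothetical names, decimal units (1 GB = 10^9 bytes) are an assumption, and percentages are assumed to apply to the total passed in. The `auto` and `available` keywords depend on runtime detection and are deliberately left out.

```python
import re

# Assumption: decimal units. The diff does not say whether GB means 10**9 or 2**30.
_UNIT_BYTES = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
_AMOUNT = re.compile(r"^(\d+(?:\.\d+)?)\s*(KB|MB|GB|TB|%)$", re.IGNORECASE)


def parse_amount(value: str, total_bytes: int) -> int:
    """Convert 'none', '1.5GB', or '10%' into a byte count.

    Percentages are taken relative to total_bytes (an assumption).
    Keywords such as 'auto' or 'available' are not handled here.
    """
    text = value.strip()
    if text.lower() == "none":
        return 0
    match = _AMOUNT.match(text)
    if match is None:
        raise ValueError(f"unsupported amount: {value!r}")
    number, unit = float(match.group(1)), match.group(2).upper()
    if unit == "%":
        return int(total_bytes * number / 100)
    return int(number * _UNIT_BYTES[unit])


def usable_vram_bytes(total_vram_bytes: int, headroom: str) -> int:
    """Subtract the reserved headroom from one GPU's VRAM, never below zero."""
    return max(total_vram_bytes - parse_amount(headroom, total_vram_bytes), 0)


if __name__ == "__main__":
    total = 24 * 10**9  # a hypothetical 24 GB GPU
    for headroom in ("none", "1.5GB", "10%"):
        print(headroom, usable_vram_bytes(total, headroom))
```

A fit check would then compare a model's required bytes against the value returned by `usable_vram_bytes` rather than the raw detected VRAM, which is the effect the paragraph above describes.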
 Examples:

@@ -50,12 +66,21 @@ whichllm --gpu "RTX 4090, RTX 3090"
 whichllm --profile coding --top 5
 whichllm --context-length 64k
 whichllm --gpu-only
-whichllm --fit full-gpu --status
+whichllm --fit gpu
+whichllm --speed usable
+whichllm --speed fast
+whichllm --min-speed 4
+whichllm --markdown
+whichllm --vram-headroom 1.5GB
+whichllm --ram-budget available
+whichllm --details
 whichllm --evidence strict
-whichllm --status
 whichllm --json | jq '.models[0]'
 ```

+`--markdown` is mutually exclusive with `--json`. It prints a plain Markdown
+table without the Rich hardware panel, colors, or box-drawing characters.
+
 Ranking JSON model rows include:

 | Field | Meaning |
@@ -73,6 +98,9 @@ Ranking JSON model rows include:
 | `benchmark_source` | How benchmark evidence was matched: `direct`, `variant`, `base_model`, `line_interp`, `self_reported`, or `none` |
 | `benchmark_confidence` | Confidence in the benchmark match, `0.0`-`1.0` |

+The top-level `hardware` object also includes `usable_vram_bytes` per GPU,
+`ram_budget_bytes`, and `budget_notes` when memory budgets are active.
+
 ## `hardware`

 ```bash

docs/hardware.md

Lines changed: 16 additions & 2 deletions
@@ -24,6 +24,7 @@ Each GPU is represented as `GPUInfo`:
 - name
 - vendor
 - VRAM bytes
+- usable VRAM bytes, when a runtime headroom is active
 - NVIDIA compute capability, when known
 - CUDA or ROCm version, when known
 - memory bandwidth estimate
@@ -144,6 +145,16 @@ whichllm hardware --gpu "Unknown GPU" --vram 24

 `--vram` requires `--gpu`.

+By default, whichllm applies a small automatic VRAM headroom before fit checks.
+This avoids recommending models that only fit on paper but overflow in runtimes
+that need extra graph buffers or loader overhead. Tune it with:
+
+```bash
+whichllm --vram-headroom 1.5GB
+whichllm --vram-headroom 10%
+whichllm --vram-headroom none
+```
+
 Multi-GPU simulation accepts repeated flags, comma-separated values, and count
 shorthand:

@@ -171,6 +182,9 @@ If neither GPU memory nor usable RAM can hold the model, the candidate is not
 ranked.

 whichllm keeps a bounded system-RAM reserve for the OS and other processes.
+Use `--ram-budget available` to cap partial-offload planning to the current
+available RAM reported by the OS, or pass a fixed budget such as
+`--ram-budget 8GB`.

 ## Multiple GPUs

@@ -199,8 +213,8 @@ disk space. If the model cannot be downloaded, it is marked unrunnable.
 ## Known limitations

 - GPU bandwidth is a lookup or database estimate, not a live benchmark.
-- Speed estimates are planning numbers. Use `--status` or JSON fields such as
-`speed_confidence` and `speed_range_tok_per_sec` to see uncertainty.
+- Speed estimates are planning numbers. The default table and JSON fields such
+as `speed_confidence` and `speed_range_tok_per_sec` show uncertainty.
 - Driver, runtime, batch size, prompt length, and thermal limits can change real
 performance.
 - Multi-GPU runtime behavior depends on the inference backend and is only

docs/how-it-works.md

Lines changed: 4 additions & 4 deletions
@@ -187,7 +187,7 @@ Output is split by surface:
 - `output/upgrade.py` renders upgrade comparison tables.
 - `output/display.py` re-exports those functions for older imports.

-Normal ranking tables show published date and downloads. With `--status`, the
-table instead shows memory required, estimated speed, and fit type. Speed cells
-use `~` for normal estimates with a range and `?` for low-confidence,
-backend-sensitive estimates.
+Normal ranking tables show memory required, estimated generation speed, fit
+type, and published date. `--details` switches to download-oriented metadata.
+Speed color is based on absolute usability, while `~` marks estimates with a
+range and `?` marks low-confidence, backend-sensitive estimates.
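For the `--markdown` surface added elsewhere in this commit, the default columns named in this hunk could be emitted as a GitHub-Flavored Markdown table along the following lines. This Python sketch is a standalone illustration, not the project's `output/` code: the `to_markdown_table` function, the column labels, and the sample row are all made up for the example.

```python
from typing import Dict, List

# Hypothetical column labels based on the default columns listed in the docs.
COLUMNS = ["Model", "Memory", "Speed", "Fit", "Published"]


def to_markdown_table(rows: List[Dict[str, str]]) -> str:
    """Render rows as a plain GFM table: header, delimiter row, then data.

    Pipe characters inside cells are escaped so they cannot split a column.
    """
    lines = [
        "| " + " | ".join(COLUMNS) + " |",
        "| " + " | ".join("---" for _ in COLUMNS) + " |",
    ]
    for row in rows:
        cells = [str(row.get(col, "")).replace("|", "\\|") for col in COLUMNS]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder row; the values are invented for demonstration only.
    sample = [{
        "Model": "example/model-7b",
        "Memory": "5.1 GB",
        "Speed": "~42.0 tok/s",
        "Fit": "full_gpu",
        "Published": "2026-01-01",
    }]
    print(to_markdown_table(sample))
```

Plain strings with no color codes or box-drawing characters are what make the output safe to paste into issues and chat tools.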

docs/scoring.md

Lines changed: 5 additions & 3 deletions
@@ -139,9 +139,11 @@ exposes speed confidence:
 | `low` | `0.35x`-`2.00x` | CPU-only, partial offload, unknown bandwidth, Apple Silicon MoE |
 | `high` | `0.85x`-`1.20x` | Reserved for future measured-speed data |

-With `--status`, speed cells use `~` for medium-confidence estimates and `?`
-for low-confidence estimates. JSON exposes the same data as
-`speed_confidence`, `speed_range_tok_per_sec`, and `speed_notes`.
+Speed cells are colored by absolute usability: red is under `4 tok/s`, yellow
+is `4-10 tok/s`, green is `10-30 tok/s`, and bright green is `30+ tok/s`. `~`
+marks medium-confidence estimates with a range, and `?` marks low-confidence
+estimates. JSON exposes the same uncertainty data as `speed_confidence`,
+`speed_range_tok_per_sec`, and `speed_notes`.

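One way to read the confidence table in this hunk is that each row gives multipliers applied to the point estimate to produce `speed_range_tok_per_sec`. The Python sketch below works under that assumption, which the visible diff does not confirm. Only the `low` and `high` rows appear in this hunk, so the `medium` row is omitted, and `speed_range` is a hypothetical helper name.

```python
from typing import Tuple

# Multipliers copied from the two table rows visible in this hunk.
# Treating them as factors on the point estimate is an assumption.
RANGE_MULTIPLIERS = {
    "low": (0.35, 2.00),
    "high": (0.85, 1.20),
}


def speed_range(estimated_tok_per_sec: float, confidence: str) -> Tuple[float, float]:
    """Return (lower, upper) tok/s bounds for a point estimate."""
    if confidence not in RANGE_MULTIPLIERS:
        raise ValueError(f"no multipliers known for confidence {confidence!r}")
    lower, upper = RANGE_MULTIPLIERS[confidence]
    # Round to two decimals to hide floating-point noise in the products.
    return (round(estimated_tok_per_sec * lower, 2),
            round(estimated_tok_per_sec * upper, 2))


if __name__ == "__main__":
    print(speed_range(20.0, "low"))
    print(speed_range(20.0, "high"))
```

Under this reading, a low-confidence 20 tok/s estimate spans several color bands, which is why the commit keeps the `?` marker separate from the color.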
 ## Source trust
