Andyyyy64
diff --git a/‎CHANGELOG.md‎
Lines changed: 22 additions & 0 deletions b/‎CHANGELOG.md‎
Lines changed: 22 additions & 0 deletions
diff --git a/‎README.md‎
Lines changed: 41 additions & 5 deletions b/‎README.md‎
Lines changed: 41 additions & 5 deletions
diff --git a/‎docs/README.ja.md‎
Lines changed: 30 additions & 4 deletions b/‎docs/README.ja.md‎
Lines changed: 30 additions & 4 deletions
diff --git a/‎docs/cli.md‎
Lines changed: 35 additions & 7 deletions b/‎docs/cli.md‎
Lines changed: 35 additions & 7 deletions
diff --git a/‎docs/hardware.md‎
Lines changed: 16 additions & 2 deletions b/‎docs/hardware.md‎
Lines changed: 16 additions & 2 deletions
diff --git a/‎docs/how-it-works.md‎
Lines changed: 4 additions & 4 deletions b/‎docs/how-it-works.md‎
Lines changed: 4 additions & 4 deletions
diff --git a/‎docs/scoring.md‎
Lines changed: 5 additions & 3 deletions b/‎docs/scoring.md‎
Lines changed: 5 additions & 3 deletions
@@ -4,6 +4,28 @@ All notable changes to this project will be documented in this file.
 
 The format is based on [Keep a Changelog](https://keepachangelog.com/), and this project adheres to [Semantic Versioning](https://semver.org/).
 
+## [Unreleased]
+
+## [0.5.12] - 2026-06-18
+
+### Added
+
+- Default ranking tables now show memory required, estimated generation speed,
+  fit type, and published date. Download counts are still available with
+  `--details`.
+- Added `--speed any|usable|fast` as named generation-speed filters while
+  keeping `--min-speed` for exact tok/s thresholds.
+- Added `--fit gpu` as a natural alias for full-GPU-only recommendations.
+- Added `--markdown` / `-m` for pasteable GitHub-Flavored Markdown ranking
+  tables. (#111)
+- Added `--vram-headroom` and `--ram-budget` so users can avoid edge VRAM fits
+  and cap partial-offload planning to available or fixed system RAM.
+
+### Changed
+
+- Speed color now reflects practical generation speed. `~` and `?` remain
+  estimate-confidence markers instead of being the primary speed color.
+
 ## [0.5.11] - 2026-06-18
 
 ### Added
 
@@ -58,11 +58,18 @@ whichllm --gpu "RTX 4090"
 
 # Only show models that fit fully in GPU VRAM
 whichllm --gpu-only
-whichllm --fit full-gpu --status
+whichllm --fit gpu
 
 # Simulate a multi-GPU workstation
 whichllm --gpu "2x RTX 4090"
 
+# Hide models that are technically runnable but too slow
+whichllm --speed usable
+whichllm --speed fast
+
+# Pasteable GitHub / Slack / Discord output
+whichllm --markdown
+
 # Compare upgrade candidates
 whichllm upgrade "RTX 4090" "RTX 5090" "H100"
 
@@ -114,6 +121,10 @@ data, this is not a static list):
 By default, rankings include full-GPU, partial-offload, and CPU-only
 candidates when they are usable. Use `--gpu-only` or `--fit full-gpu` when
 you only want models that fit entirely in GPU VRAM.
+The default table shows memory, estimated generation speed, fit type, and
+published date. Speed is colored by practical usability: under 4 tok/s is red,
+4-10 is yellow, 10-30 is green, and 30+ is bright green. `~` / `?` still mark
+estimate confidence.
 
 ## Why whichllm?
 
@@ -156,6 +167,9 @@ whichllm is built to get right.
 - **GPU simulation** — Test with any GPU: `whichllm --gpu "RTX 4090"`
 - **Multi-GPU simulation** — Repeat `--gpu`, use commas, or write `2x RTX 4090`
 - **Full-GPU filter** — `--gpu-only` / `--fit full-gpu` hides offload candidates
+- **Speed-aware filtering** — `--speed usable|fast` hides slow rows by threshold
+- **Markdown output** — `--markdown` / `-m` prints pasteable GFM tables
+- **Runtime memory budgets** — `--vram-headroom` and `--ram-budget` avoid edge fits
 - **Hardware planning** — Reverse lookup: `whichllm plan "llama 3 70b"`
 - **Upgrade planning** — Compare your current machine with candidate GPUs
 - **JSON output** — Pipe-friendly: `whichllm --json`
@@ -225,18 +239,28 @@ whichllm --gpu "RTX 4090, RTX 3090"
 
 # Only show models that fit entirely in GPU VRAM
 whichllm --gpu-only
-whichllm --fit full-gpu --status
+whichllm --fit gpu
+whichllm --fit full-gpu
+
+# Avoid edge fits and background-RAM surprises
+whichllm --vram-headroom 1.5GB
+whichllm --ram-budget available
+whichllm --ram-budget 8GB
 
 # CPU-only mode
 whichllm --cpu-only
 
 # More results / filters
 whichllm --top 20
-whichllm --status
+whichllm --details          # show Downloads metadata instead of runtime columns
+whichllm --speed usable     # minimum 10 tok/s
+whichllm --speed fast       # minimum 30 tok/s
+whichllm --min-speed 4      # exact tok/s floor
+whichllm --markdown         # pasteable GitHub-Flavored Markdown table
 whichllm --profile coding
 whichllm --context-length 64k
 whichllm --quant Q4_K_M
-whichllm --min-speed 30
+whichllm --min-speed 30     # exact tok/s floor
 whichllm --evidence base   # allow id/base-model matches
 whichllm --evidence strict # id-exact only (same as --direct)
 whichllm --direct
@@ -268,6 +292,14 @@ whichllm snippet "qwen 7b"
 whichllm snippet "llama 3 8b gguf" --quant Q5_K_M
 ```
 
+Markdown output is intended for GitHub issues, READMEs, Slack, Discord, and
+blog posts:
+
+```bash
+whichllm --markdown
+whichllm -m --top 5 --gpu "RTX 4090"
+```
+
 JSON model rows include `fit_type`, `vram_required_bytes`,
 `vram_available_bytes`, `uses_multi_gpu`, `multi_gpu_effective_vram_bytes`,
 `estimated_tok_per_sec`, `speed_confidence`, `speed_range_tok_per_sec`,
@@ -323,7 +355,11 @@ Score markers:
 - **`!sr`** (bright yellow) — Uploader-reported benchmark only, not independently verified
 - **`?`** (red) — No benchmark data available
 
-Speed markers in `--status`:
+Speed display:
+- **red** — Slow generation speed (`<4 tok/s`)
+- **yellow** — Marginal generation speed (`4-10 tok/s`)
+- **green** — Usable generation speed (`10-30 tok/s`)
+- **bright green** — Fast local generation speed (`>=30 tok/s`)
 - **`~`** (yellow) — Estimated tok/s range is available
 - **`?`** (red) — Low-confidence speed estimate; backend/runtime sensitivity is high
 
 
@@ -66,7 +66,19 @@ whichllm --gpu "RTX 4090" --gpu "RTX 3090"
 
 # GPUのVRAMに全部載る候補だけを見る
 whichllm --gpu-only
-whichllm --fit full-gpu --status
+whichllm --fit gpu
+
+# 速度の最低ラインを指定する
+whichllm --speed usable
+whichllm --speed fast
+whichllm --min-speed 4
+
+# GitHubやSlackに貼りやすいMarkdown表で出力する
+whichllm --markdown
+
+# 実行時のメモリ余白やRAM使用量を指定する
+whichllm --vram-headroom 1.5GB
+whichllm --ram-budget available
 
 # CPUのみとして評価する
 whichllm --cpu-only
@@ -81,6 +93,10 @@ JSONの各モデルには `estimated_tok_per_sec` に加えて、`fit_type`、
 `speed_range_tok_per_sec`、`speed_notes`、`benchmark_source`、
 `benchmark_confidence` が入ります。
 速度は実測値ではなく、ハードウェア情報とモデル情報からの推定です。
+通常の表には必要メモリ、推定生成速度、Fit種別、Published が表示されます。
+Downloads まで見たい場合は `--details` を使います。
+GitHub issue、README、Slack、Discord へ貼る場合は `--markdown` / `-m`
+でMarkdown表として出力できます。
 
 ## 主なコマンド
 
@@ -89,9 +105,12 @@ JSONの各モデルには `estimated_tok_per_sec` に加えて、`fit_type`、
 whichllm --top 20
 whichllm --quant Q4_K_M
 whichllm --min-speed 30
+whichllm --speed usable
+whichllm --speed fast
+whichllm --markdown
 whichllm --profile coding
 whichllm --context-length 64k
-whichllm --status
+whichllm --details
 whichllm --gpu-only
 
 # ベンチ根拠の厳しさ
@@ -140,8 +159,12 @@ whichllm hardware
 - `!sr`: アップローダー自己申告の評価値だけに基づくスコア
 - `?`: 利用できるベンチマーク根拠がないスコア
 
-`--status` の速度欄のマーカー:
+速度表示:
 
+- 赤: `4 tok/s` 未満の遅い生成速度
+- 黄: `4-10 tok/s` のぎりぎり使える生成速度
+- 緑: `10-30 tok/s` の実用的な生成速度
+- 明るい緑: `30 tok/s` 以上の高速なローカル生成速度
 - `~`: 速度推定の幅がある通常の推定値
 - `?`: backend や runtime の影響が大きい低信頼の推定値
 
@@ -159,7 +182,10 @@ whichllm hardware
 
 通常は full GPU、partial offload、CPU-only の候補をまとめて見ます。GPUの
 VRAMに全部載るモデルだけを見たい場合は `--gpu-only` か
-`--fit full-gpu` を使います。
+`--fit gpu` を使います。遅い候補を最初から除外したい場合は
+`--speed usable`、`--speed fast` を使います。
+それぞれ `10 tok/s`、`30 tok/s` が最低ラインです。
+もっと低いラインを指定したい場合は `--min-speed 4` のように数値で指定します。
 
 キャッシュは通常 `~/.cache/whichllm/` に保存されます。`XDG_CACHE_HOME` が
 絶対パスで設定されている場合は、その配下の `whichllm/` を使います。
 
@@ -19,24 +19,40 @@ Common options:
 | `--top`, `-n` | Number of ranked models to show. Default: `10` |
 | `--context-length`, `-c` | Context length used for KV cache estimation. Accepts integers or `k` shorthand such as `64k`. Default: `4096` |
 | `--quant`, `-q` | Keep only a quantization type such as `Q4_K_M` |
-| `--min-speed` | Keep only models above a tok/s estimate |
-| `--fit` | Runtime fit filter: `any` or `full-gpu` |
+| `--min-speed` | Keep only models above an exact tok/s estimate |
+| `--speed` | Named speed floor: `any`, `usable` (`10 tok/s`), or `fast` (`30 tok/s`) |
+| `--fit` | Runtime fit filter: `any`, `gpu`, or `full-gpu` |
 | `--gpu-only` | Alias for `--fit full-gpu`; excludes partial offload and CPU-only candidates |
 | `--profile` | Ranking profile: `general`, `coding`, `vision`, `math`, `any` |
 | `--evidence` | Benchmark evidence filter: `strict`, `base`, `any` |
 | `--direct` | Alias for `--evidence strict` |
-| `--status` | Show VRAM/RAM, speed, and fit columns instead of published/download columns. Speed may include `~` for estimated range or `?` for low confidence |
+| `--status` | Compatibility option. Runtime columns are now shown by default |
+| `--details` | Show download metadata instead of runtime columns |
 | `--min-params` | Minimum model knowledge capacity in billions of parameters |
 | `--json` | Print machine-readable JSON |
+| `--markdown`, `-m` | Print a pasteable GitHub-Flavored Markdown table |
 | `--refresh` | Ignore caches and fetch models/benchmarks again |
 | `--cpu-only` | Ignore GPUs and rank for CPU-only use |
 | `--gpu` | Simulate GPU(s) by name. Accepts repeated flags, comma-separated values, and count shorthand |
 | `--vram` | Override simulated GPU VRAM in GB. Requires `--gpu` |
+| `--vram-headroom` | Reserve per-GPU memory for runtime overhead. Default: `auto`. Accepts `none`, byte values like `1.5GB`, or percentages like `10%` |
+| `--ram-budget` | Cap RAM available for partial offload. Accepts `available`, byte values like `8GB`, or percentages like `50%` |
 | `--version` | Print the installed package version |
 
 `--fit any` is the default. It can include full-GPU, partial-offload, and
-CPU-only candidates when they are runnable. `--fit full-gpu` and `--gpu-only`
-keep only rows whose `fit_type` is `full_gpu`.
+CPU-only candidates when they are runnable. `--fit gpu`, `--fit full-gpu`, and
+`--gpu-only` keep only rows whose `fit_type` is `full_gpu`.
+
+The default table shows memory required, estimated generation speed, fit type,
+and published date. Use `--details` when you want download counts instead.
+Speed colors are absolute usability hints: red is under `4 tok/s`, yellow is
+`4-10 tok/s`, green is `10-30 tok/s`, and bright green is `30+ tok/s`. The `~`
+and `?` markers still refer to estimate confidence, not speed quality.
+
+`--vram-headroom auto` subtracts a small budget from each GPU before fit
+checks, so near-edge recommendations are less likely to overflow in tools such
+as LM Studio. Use `--vram-headroom none` to restore the raw detected VRAM.
+`--ram-budget available` caps offload planning to current available RAM.
 
 Examples:
 
@@ -50,12 +66,21 @@ whichllm --gpu "RTX 4090, RTX 3090"
 whichllm --profile coding --top 5
 whichllm --context-length 64k
 whichllm --gpu-only
-whichllm --fit full-gpu --status
+whichllm --fit gpu
+whichllm --speed usable
+whichllm --speed fast
+whichllm --min-speed 4
+whichllm --markdown
+whichllm --vram-headroom 1.5GB
+whichllm --ram-budget available
+whichllm --details
 whichllm --evidence strict
-whichllm --status
 whichllm --json | jq '.models[0]'
 ```
 
+`--markdown` is mutually exclusive with `--json`. It prints a plain Markdown
+table without the Rich hardware panel, colors, or box-drawing characters.
+
 Ranking JSON model rows include:
 
 | Field | Meaning |
@@ -73,6 +98,9 @@ Ranking JSON model rows include:
 | `benchmark_source` | How benchmark evidence was matched: `direct`, `variant`, `base_model`, `line_interp`, `self_reported`, or `none` |
 | `benchmark_confidence` | Confidence in the benchmark match, `0.0`–`1.0` |
 
+The top-level `hardware` object also includes `usable_vram_bytes` per GPU,
+`ram_budget_bytes`, and `budget_notes` when memory budgets are active.
+
 ## `hardware`
 
 ```bash
 
@@ -24,6 +24,7 @@ Each GPU is represented as `GPUInfo`:
 - name
 - vendor
 - VRAM bytes
+- usable VRAM bytes, when a runtime headroom is active
 - NVIDIA compute capability, when known
 - CUDA or ROCm version, when known
 - memory bandwidth estimate
@@ -144,6 +145,16 @@ whichllm hardware --gpu "Unknown GPU" --vram 24
 
 `--vram` requires `--gpu`.
 
+By default, whichllm applies a small automatic VRAM headroom before fit checks.
+This avoids recommending models that only fit on paper but overflow in runtimes
+that need extra graph buffers or loader overhead. Tune it with:
+
+```bash
+whichllm --vram-headroom 1.5GB
+whichllm --vram-headroom 10%
+whichllm --vram-headroom none
+```
+
 Multi-GPU simulation accepts repeated flags, comma-separated values, and count
 shorthand:
 
@@ -171,6 +182,9 @@ If neither GPU memory nor usable RAM can hold the model, the candidate is not
 ranked.
 
 whichllm keeps a bounded system-RAM reserve for the OS and other processes.
+Use `--ram-budget available` to cap partial-offload planning to the current
+available RAM reported by the OS, or pass a fixed budget such as
+`--ram-budget 8GB`.
 
 ## Multiple GPUs
 
@@ -199,8 +213,8 @@ disk space. If the model cannot be downloaded, it is marked unrunnable.
 ## Known limitations
 
 - GPU bandwidth is a lookup or database estimate, not a live benchmark.
-- Speed estimates are planning numbers. Use `--status` or JSON fields such as
-  `speed_confidence` and `speed_range_tok_per_sec` to see uncertainty.
+- Speed estimates are planning numbers. The default table and JSON fields such
+  as `speed_confidence` and `speed_range_tok_per_sec` show uncertainty.
 - Driver, runtime, batch size, prompt length, and thermal limits can change real
   performance.
 - Multi-GPU runtime behavior depends on the inference backend and is only
 
@@ -187,7 +187,7 @@ Output is split by surface:
 - `output/upgrade.py` renders upgrade comparison tables.
 - `output/display.py` re-exports those functions for older imports.
 
-Normal ranking tables show published date and downloads. With `--status`, the
-table instead shows memory required, estimated speed, and fit type. Speed cells
-use `~` for normal estimates with a range and `?` for low-confidence,
-backend-sensitive estimates.
+Normal ranking tables show memory required, estimated generation speed, fit
+type, and published date. `--details` switches to download-oriented metadata.
+Speed color is based on absolute usability, while `~` marks estimates with a
+range and `?` marks low-confidence, backend-sensitive estimates.
@@ -139,9 +139,11 @@ exposes speed confidence:
 | `low` | `0.35x`-`2.00x` | CPU-only, partial offload, unknown bandwidth, Apple Silicon MoE |
 | `high` | `0.85x`-`1.20x` | Reserved for future measured-speed data |
 
-With `--status`, speed cells use `~` for medium-confidence estimates and `?`
-for low-confidence estimates. JSON exposes the same data as
-`speed_confidence`, `speed_range_tok_per_sec`, and `speed_notes`.
+Speed cells are colored by absolute usability: red is under `4 tok/s`, yellow
+is `4-10 tok/s`, green is `10-30 tok/s`, and bright green is `30+ tok/s`. `~`
+marks medium-confidence estimates with a range, and `?` marks low-confidence
+estimates. JSON exposes the same uncertainty data as `speed_confidence`,
+`speed_range_tok_per_sec`, and `speed_notes`.
 
 ## Source trust