This document is for agent/client builders integrating with llmfit serve.
llmfit serve exposes node-local model fit analysis (same core data used by TUI/CLI) over HTTP and serves a local web dashboard.
Primary use case:
- Query each node in a cluster for top runnable models.
- Aggregate externally (scheduler/controller/UI) for placement decisions.
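On the scheduler side, merging per-node responses reduces to a flatten-and-sort; a minimal Python sketch, assuming each node's reply follows the /api/v1/models/top envelope described below (the function name is illustrative):

```python
def rank_candidates(node_reports):
    """Merge per-node /api/v1/models/top envelopes into one
    cluster-wide candidate list, best score first.

    node_reports: dict mapping node name -> parsed JSON envelope.
    Returns a list of (node, model_name, score) tuples.
    """
    candidates = []
    for node, report in node_reports.items():
        for row in report.get("models", []):
            candidates.append((node, row["name"], row["score"]))
    # Highest score first; ties broken by node/model name for stability.
    candidates.sort(key=lambda c: (-c[2], c[0], c[1]))
    return candidates
```

The placement policy itself (spread vs. bin-pack, anti-affinity, etc.) stays with the caller; this only produces a ranked candidate list.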
```shell
llmfit serve --port 8787
```

Global flags still apply:

```shell
llmfit --memory 24G --ram 64G --cpu-cores 16 --max-context 8192 serve --port 8787
```

Hardware overrides (--memory, --ram, --cpu-cores) are reflected in API responses, making the server report the overridden values instead of the detected hardware.
Default local base URL:
http://127.0.0.1:8787
To expose outside localhost, pass --host 0.0.0.0.
If you are building from source and want the dashboard embedded in llmfit, build web assets first:
```shell
cd llmfit-web && npm ci && npm run build
```

### GET /

Web dashboard entrypoint (same-origin UI for fit exploration).
### GET /health

Liveness probe.
Example response:
```json
{
  "status": "ok",
  "node": {
    "name": "worker-1",
    "os": "linux"
  }
}
```

### GET /api/v1/system

Returns node identity + detected hardware.
Example response shape:
```json
{
  "node": {
    "name": "worker-1",
    "os": "linux"
  },
  "system": {
    "total_ram_gb": 62.23,
    "available_ram_gb": 41.08,
    "cpu_cores": 14,
    "cpu_name": "Intel(R) Core(TM) Ultra 7 165U",
    "has_gpu": false,
    "gpu_vram_gb": null,
    "gpu_name": null,
    "gpu_count": 0,
    "unified_memory": false,
    "backend": "CPU (x86)",
    "gpus": []
  }
}
```

### GET /api/v1/models

Returns filtered/sorted model-fit rows for this node.
Envelope shape:
```json
{
  "node": { "name": "worker-1", "os": "linux" },
  "system": { "...": "..." },
  "total_models": 23,
  "returned_models": 10,
  "filters": { "...": "echo of query state" },
  "models": [
    {
      "name": "Qwen/Qwen2.5-Coder-7B-Instruct",
      "provider": "Qwen",
      "parameter_count": "7B",
      "params_b": 7.0,
      "context_length": 32768,
      "use_case": "Coding",
      "category": "Coding",
      "release_date": "2025-03-14",
      "is_moe": false,
      "fit_level": "good",
      "fit_label": "Good",
      "run_mode": "gpu",
      "run_mode_label": "GPU",
      "score": 86.5,
      "score_components": {
        "quality": 87.0,
        "speed": 81.2,
        "fit": 90.1,
        "context": 88.0
      },
      "estimated_tps": 42.5,
      "runtime": "llamacpp",
      "runtime_label": "llama.cpp",
      "best_quant": "Q5_K_M",
      "memory_required_gb": 5.8,
      "memory_available_gb": 12.0,
      "utilization_pct": 48.3,
      "notes": [],
      "gguf_sources": []
    }
  ]
}
```

### GET /api/v1/models/top

Key scheduling endpoint. Same schema as `/api/v1/models`, but defaults to the top 5 runnable entries.
Important behavior:
- Defaults to `limit=5`.
- Excludes `too_tight` rows unless explicitly overridden (the top endpoint still keeps runnable semantics).
### GET /api/v1/models/{name}

Path-constrained search. Equivalent to a text search scoped by `{name}`.
Useful for:
- Client-side drilldown after selecting a model family.
### Query parameters

Supported on `/api/v1/models` and `/api/v1/models/top` (also `/api/v1/models/{name}`):

- `limit` (or alias `n`): max rows returned.
- `perfect`: `true|false` (when `true`, only perfect fits).
- `min_fit`: `perfect|good|marginal|too_tight`.
- `runtime`: `any|mlx|llamacpp`.
- `use_case`: `general|coding|reasoning|chat|multimodal|embedding`.
- `provider`: provider substring filter.
- `search`: free-text filter (name/provider/params/use-case/category).
- `sort`: `score|tps|params|mem|ctx|date|use_case`.
- `include_too_tight`: include unrunnable rows (defaults to `true` for `/models`, `false` for `/models/top`).
- `max_context`: per-request context cap used by memory estimation.
- `force_runtime`: `mlx|llamacpp|vllm`; overrides automatic runtime selection during analysis (e.g. get llama.cpp recommendations on Apple Silicon instead of MLX).
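Since these filters are ordinary query-string parameters, a client can assemble them with the standard library; a small helper sketch (the function name and defaults are illustrative):

```python
from urllib.parse import urlencode

def models_url(base_url, path="/api/v1/models", **filters):
    """Build a filtered model-fit query URL.

    Keyword arguments map 1:1 to the query parameters above,
    e.g. min_fit="good", runtime="llamacpp", sort="score".
    """
    query = urlencode(filters)
    return f"{base_url}{path}?{query}" if query else f"{base_url}{path}"
```

Example: `models_url("http://127.0.0.1:8787", min_fit="good", use_case="coding", limit=5)`.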
Invalid filter values return HTTP 400:
```json
{
  "error": "invalid min_fit value: use perfect|good|marginal|too_tight"
}
```

Server errors return HTTP 500 with `{"error": "..."}`.
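A client can fold both error shapes into one exception path; a minimal sketch, assuming you already have the status code and raw body from whatever HTTP client you use (the names here are illustrative):

```python
import json

class LlmfitApiError(Exception):
    """Raised for 400/500 responses carrying {"error": "..."}."""

def decode_response(status, body):
    """Turn a (status, body-bytes) pair into parsed JSON or an exception.

    400 signals an invalid filter value; 500 is a server-side failure.
    Both carry a JSON body of the form {"error": "..."}.
    """
    payload = json.loads(body)
    if status >= 400:
        raise LlmfitApiError(payload.get("error", f"HTTP {status}"))
    return payload
```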
For each node agent:
- Call `/health`.
- Call `/api/v1/system`.
- Call `/api/v1/models/top?limit=K&min_fit=good`.
- Attach node metadata and forward to your central scheduler.
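The steps above can be sketched as a small agent function; the fetcher is injected so the assembly logic stays testable without a live server (names are illustrative, only the stdlib is used):

```python
import json
import urllib.request

def fetch_json(url, timeout=5):
    """GET one llmfit serve endpoint and parse the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)

def node_report(base_url, k=5, get=fetch_json):
    """Run the per-node agent sequence and bundle the results.

    Returns None for unhealthy nodes so the caller can skip them;
    the /api/v1/system payload carries the node metadata to forward.
    """
    health = get(f"{base_url}/health")
    if health.get("status") != "ok":
        return None
    return {
        "system": get(f"{base_url}/api/v1/system"),
        "top": get(f"{base_url}/api/v1/models/top?limit={k}&min_fit=good"),
    }
```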
For production placement, prefer:
```
min_fit=good
include_too_tight=false
sort=score
limit=5..20
```
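Captured as a reusable client-side dict (the keys mirror the query parameters above; the limit value is one pick from the recommended 5..20 range):

```python
# Recommended placement query defaults, ready to pass to urlencode().
PLACEMENT_DEFAULTS = {
    "min_fit": "good",
    "include_too_tight": "false",
    "sort": "score",
    "limit": "10",  # anywhere in the 5..20 range is reasonable
}
```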
Examples:
- Coding workloads: `use_case=coding`
- Embedding workloads: `use_case=embedding`
- Runtime constrained to llama.cpp fleet: `runtime=llamacpp`
Treat unknown fields as forward-compatible additions:
- Parse required fields you depend on.
- Ignore unknown fields.
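A tolerant row parser, for example, pulls only the fields it depends on and silently ignores the rest (field names come from the model row shown earlier; the helper name is illustrative):

```python
def parse_model_row(row):
    """Extract only the fields this client depends on.

    Unknown keys in `row` are ignored, so new server-side
    fields never break parsing.
    """
    return {
        "name": row["name"],
        "score": row["score"],
        "fit_level": row["fit_level"],
        "memory_required_gb": row["memory_required_gb"],
    }
```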
```shell
curl http://127.0.0.1:8787/health
curl http://127.0.0.1:8787/api/v1/system
curl "http://127.0.0.1:8787/api/v1/models?limit=20&min_fit=marginal&sort=score"
curl "http://127.0.0.1:8787/api/v1/models/top?limit=5&min_fit=good&use_case=coding"
curl "http://127.0.0.1:8787/api/v1/models/Mistral?runtime=any"
```

Current API prefix is v1.
If you build long-lived clients, pin to `/api/v1/...` and validate behavior with the local test script in `scripts/test_api.py`.