A terminal UI for managing multiple llama.cpp server instances across several GPUs simultaneously.
Running several llama-server instances across multiple GPUs is tedious to manage from the command line — each one needs a different port, the right CUDA_VISIBLE_DEVICES, log redirection, and manual process tracking. Switching models means killing a process, retyping a long command, and hoping you remembered the right flags.
llama-tui wraps all of that in a single terminal dashboard. You can see all three servers at a glance, start or stop any of them, swap models, tune per-GPU flags, and download new GGUF models from HuggingFace — without leaving the terminal or remembering a single command-line argument.
Quitting the TUI leaves every server running. The servers are not children of the TUI process and will keep serving requests until you explicitly stop them.
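This works because each server is started in its own session rather than as a child of the TUI process. A minimal sketch of that launch pattern, assuming the standard `subprocess` approach (illustrative only, not the exact code in `app.py`):

```python
import subprocess

def launch_detached(cmd, env, log_path):
    """Illustrative sketch: start a server whose lifetime is independent of the TUI."""
    with open(log_path, "ab") as log:
        subprocess.Popen(
            cmd,                      # e.g. ["/path/to/llama-server", "-m", model, "--port", "8080"]
            env=env,                  # includes CUDA_VISIBLE_DEVICES for the panel's GPU
            stdout=log,               # output goes to the per-GPU log file, not the TUI
            stderr=subprocess.STDOUT,
            start_new_session=True,   # new session: quitting the TUI does not kill the server
        )
```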
- Python 3.9+
- NVIDIA GPUs with CUDA (the app expects 3 GPUs by default; edit `NUM_GPUS` in `app.py` to change this)
- A `llama-server` binary (pre-built releases). This repo defaults to `./llama-cuda/llama-server`.
```bash
git clone https://github.com/bryanjonas/llama-tui
cd llama-tui
./run.sh   # creates .venv, installs deps, launches the app
```

`run.sh` automatically creates a Python virtualenv on first run and installs the two dependencies (`textual`, `requests`). After that it just launches the app.
You can also manage the environment manually:
```bash
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
.venv/bin/python app.py
```

- Set the llama-server path: press `s` to open Settings and enter the full path to your `llama-server` binary (default: `./llama-cuda/llama-server`, resolved to an absolute path).
- Set the models directory: also in Settings. Defaults to `~/models/`. Any `.gguf` files found recursively under this directory will appear in the model picker (see the sketch after this list).
- Select a model per GPU: press `⊞ Change Model` on a panel to pick a `.gguf` file, then press `▶ Start` to launch the server.
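The model picker mentioned above is populated by a plain recursive scan of the models directory. A minimal sketch of that lookup (illustrative; the real scan lives in `app.py`):

```python
from pathlib import Path

def find_gguf_models(models_dir="~/models/"):
    """Recursively collect every .gguf file under the configured models directory."""
    return sorted(Path(models_dir).expanduser().rglob("*.gguf"))

# e.g. find_gguf_models() -> [PosixPath('/home/user/models/mistral-7b.Q4_K_M.gguf'), ...]
```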
The main screen shows one panel per GPU side by side:
```
┌─ GPU 0 ──────────┐ ┌─ GPU 1 ──────────┐ ┌─ GPU 2 ──────────┐
│ RTX 2080 Ti :8080│ │ RTX 2070    :8081│ │ RTX 2080 Ti :8082│
│                  │ │                  │ │                  │
│ ● RUNNING (1234) │ │ ○ STOPPED        │ │ ● RUNNING (5678) │
│ mistral-7b.gguf  │ │ —                │ │ llama-3.gguf     │
│                  │ │                  │ │                  │
│ [■ Stop        ] │ │ [▶ Start       ] │ │ [■ Stop        ] │
│ [⊞ Change Model] │ │ [⊞ Change Model] │ │ [⊞ Change Model] │
│ [⚙ Flags       ] │ │ [⚙ Flags       ] │ │ [⚙ Flags       ] │
│ [≡ View Logs   ] │ │ [≡ View Logs   ] │ │ [≡ View Logs   ] │
└──────────────────┘ └──────────────────┘ └──────────────────┘
```
Each server listens on 0.0.0.0. By default, each panel uses its own GPU index (0, 1, 2) for CUDA_VISIBLE_DEVICES, but you can override this per panel (including multi-GPU values like 0,2) in ⚙ Flags.
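Because every server binds to `0.0.0.0` on its own port, you can check them from anywhere on the network. A quick sketch using the `requests` dependency and llama-server's `/health` endpoint, assuming the default `base_port` of 8080 and three panels:

```python
import requests

# Ports 8080-8082 assume the default base_port of 8080 and three GPUs.
for gpu, port in enumerate(range(8080, 8083)):
    try:
        r = requests.get(f"http://localhost:{port}/health", timeout=2)
        print(f"GPU {gpu} (:{port}): {r.json().get('status', 'unknown')}")
    except requests.ConnectionError:
        print(f"GPU {gpu} (:{port}): not responding")
```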
| Key | Action |
|---|---|
| `d` | Open the HuggingFace download screen |
| `s` | Open Settings |
| `r` | Refresh all panels |
| `q` | Quit the TUI (servers keep running) |
| `Q` | Quit and stop all servers |
Press ⚙ Flags on any panel to open the flags editor for that GPU:
| Flag | llama-server argument | Default |
|---|---|---|
| Context size | `-c` | `4096` |
| GPU layers | `-ngl` | `99` |
| Threads | `--threads` | `8` |
| Parallel slots | `--parallel` | `1` |
| Flash Attention | `--flash-attn true` | off |
| mlock | `--mlock` | off |
| no-mmap | `--no-mmap` | off |
| CUDA devices | env `CUDA_VISIBLE_DEVICES` | panel GPU index |
| Extra args | passed through verbatim | — |
Flags are saved to ~/.llama-tui/config.json and applied the next time a server is started. Changing flags does not restart a running server automatically.
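For reference, here is a rough sketch of how one `services` entry in `config.json` maps onto a llama-server command line, following the table above. It is illustrative only; the exact assembly in `app.py` may differ in detail.

```python
import json, os
from pathlib import Path

cfg = json.loads(Path("~/.llama-tui/config.json").expanduser().read_text())
svc = cfg["services"][0]
flags = svc["flags"]

cmd = [
    cfg["llama_server_path"],
    "-m", svc["model"],
    "--host", "0.0.0.0",
    "--port", str(svc["port"]),
    "-c", str(flags["ctx_size"]),
    "-ngl", str(flags["gpu_layers"]),
    "--threads", str(flags["threads"]),
    "--parallel", str(flags["parallel"]),
]
if flags["flash_attn"]:
    cmd += ["--flash-attn", "true"]
if flags["mlock"]:
    cmd += ["--mlock"]
if flags["no_mmap"]:
    cmd += ["--no-mmap"]
if flags["extra_args"]:
    cmd += flags["extra_args"].split()

# CUDA_VISIBLE_DEVICES is set in the environment, not on the command line.
env = {**os.environ, "CUDA_VISIBLE_DEVICES": flags["cuda_visible_devices"]}
print(" ".join(cmd))
```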
Press d to open the download screen.
- List available files: enter a HuggingFace repo (e.g. `TheBloke/Mistral-7B-v0.1-GGUF`) and leave the filename blank, then click Download / List. The app queries the HF API and lists all `.gguf` files in the repo (see the sketch below).
- Download a file: enter the repo and the filename, then click Download / List. The download runs as a fully detached background process (`downloader.py`) that survives closing the TUI. Progress is streamed into the log view.
- Direct URL: paste a full `https://huggingface.co/…` URL instead of a repo slug.
Downloaded files land in the configured models directory. The HuggingFace token is read from ~/.cache/huggingface/token or the $HF_TOKEN environment variable and is never exposed on the command line.
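Listing a repo's `.gguf` files only needs the public HF model-info API. A hedged sketch of that query and of the token lookup described above (`downloader.py`'s actual implementation may differ):

```python
import os
from pathlib import Path
import requests

def hf_token():
    """Token lookup as described above: cached token file first, then $HF_TOKEN."""
    cached = Path("~/.cache/huggingface/token").expanduser()
    if cached.exists():
        return cached.read_text().strip()
    return os.environ.get("HF_TOKEN")

def list_gguf_files(repo):
    """Return the .gguf filenames in a HuggingFace repo, e.g. 'TheBloke/Mistral-7B-v0.1-GGUF'."""
    token = hf_token()
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    r = requests.get(f"https://huggingface.co/api/models/{repo}", headers=headers, timeout=10)
    r.raise_for_status()
    return [s["rfilename"] for s in r.json().get("siblings", []) if s["rfilename"].endswith(".gguf")]
```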
Press ≡ View Logs on any panel to tail the live log for that GPU's server. Logs are stored at:
```
~/.llama-tui/logs/gpu-0.log
~/.llama-tui/logs/gpu-1.log
~/.llama-tui/logs/gpu-2.log
```
Press c inside the log viewer to clear the log file. Download progress logs are at ~/.llama-tui/downloads/<filename>.log.
~/.llama-tui/config.json is created automatically on first run. You can edit it by hand if needed:
```json
{
  "llama_server_path": "/path/to/llama-server",
  "models_dir": "/home/user/models",
  "base_port": 8080,
  "services": [
    {
      "gpu": 0,
      "port": 8080,
      "model": "/home/user/models/mistral-7b.Q4_K_M.gguf",
      "flags": {
        "ctx_size": 4096,
        "gpu_layers": 99,
        "flash_attn": false,
        "threads": 8,
        "parallel": 1,
        "mlock": false,
        "no_mmap": false,
        "cuda_visible_devices": "0",
        "extra_args": ""
      }
    }
  ]
}
```

If llama-server processes are already running when the TUI starts, it scans `/proc` and automatically attaches to any instance whose port matches a configured service. The panel will show the PID and model name (if readable from the process command line). These pre-existing servers are not stopped when you press `q`.
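The attach logic boils down to reading command lines out of `/proc`. A simplified sketch of the idea (the real scan in `app.py` may match on more than just the port):

```python
from pathlib import Path

def find_running_llama_servers():
    """Illustrative sketch: map --port values to PIDs of llama-server processes found in /proc."""
    found = {}
    for proc in Path("/proc").iterdir():
        if not proc.name.isdigit():
            continue
        try:
            argv = (proc / "cmdline").read_bytes().split(b"\0")
            if argv and argv[0].endswith(b"llama-server") and b"--port" in argv:
                port = int(argv[argv.index(b"--port") + 1])
                found[port] = int(proc.name)
        except (OSError, ValueError, IndexError):
            continue
    return found

# A panel whose configured port shows up here is displayed as RUNNING with that PID.
```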