Trinity is a local deliberation system that runs three LLMs side by side (one analytical, one diplomatic, one practical). Each model answers the same question independently, then they deliberate. If there's disagreement, the third model checks whether the answers are logically equivalent even when the reasoning differs. Finally, Trinity synthesizes a single group answer.
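In code terms, the flow is roughly the sketch below. This is a compressed illustration of the protocol, not Trinity's actual API; the prompts and function shapes are made up:

```python
from typing import Callable, List

Model = Callable[[str], str]  # a "model" here is just prompt in, answer out

def deliberate(question: str, models: List[Model]) -> str:
    # Round 1: every model answers independently.
    answers = [m(question) for m in models]

    # Deliberation: each model sees the others' answers and may revise.
    revised = []
    for i, m in enumerate(models):
        others = "\n---\n".join(a for j, a in enumerate(answers) if j != i)
        revised.append(m(f"Question: {question}\nOther answers:\n{others}\n"
                         "Revise your answer if they convince you."))

    # Judge step: the third model checks whether the answers are
    # logically equivalent despite different reasoning.
    joined = "\n---\n".join(revised)
    verdict = models[2](f"Are these answers logically equivalent?\n{joined}")

    # Synthesis: merge everything into a single group answer.
    return models[2](f"Verdict: {verdict}\n"
                     f"Merge these answers into one definitive response:\n{joined}")
```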
You'll need:
- Python 3.10+
- A llama.cpp build with the Vulkan backend (AMD GPUs need this)
- Gradio (for the UI)
- sentence-transformers (for semantic checks)
- python-dotenv
Install dependencies:

```bash
pip install -r requirements.txt
```

Put your `.gguf` models under `./models`. The defaults in `.env` expect:

```
MODEL1 = "../models/llama.gguf"
MODEL2 = "../models/mistral.gguf"
MODEL3 = "../models/zeph.gguf"
```
Edit .env if you want to swap models.
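Since `python-dotenv` is in the requirements, `init.py` presumably reads these paths along the lines of the following sketch (the key names match the `.env` entries above; the fallback defaults are assumptions):

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the environment

MODEL1 = os.getenv("MODEL1", "../models/llama.gguf")
MODEL2 = os.getenv("MODEL2", "../models/mistral.gguf")
MODEL3 = os.getenv("MODEL3", "../models/zeph.gguf")
```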
To run on AMD cards with Vulkan:
- Build llama.cpp with Vulkan:
```bash
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release
```

- Make sure these env vars are set (already in `init.py`):

```python
os.environ['LLAMA_CPP_VULKAN'] = '1'
os.environ['GGML_VULKAN_DEVICE'] = '0'
```

`GGML_VULKAN_DEVICE` picks which GPU to use if you have more than one.
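Since the native backend reads these variables when it loads, `init.py` presumably sets them before `llama_cpp` is imported; a minimal sketch of that ordering (the import-order requirement is an assumption, not documented here):

```python
import os

# Set the Vulkan variables before importing llama_cpp so the native
# backend sees them when it initializes (assumed ordering).
os.environ['LLAMA_CPP_VULKAN'] = '1'
os.environ['GGML_VULKAN_DEVICE'] = '0'

from llama_cpp import Llama  # imported only after the env is set
```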
- Adjust `n_gpu_layers` in `init.py` to push as many layers to the GPU as your VRAM can handle. Example:

```python
llm1 = Llama(model_path=MODEL1, n_gpu_layers=28, n_ctx=2048)
```

Fastest way to test:
```bash
python main.py
```

It will prompt you for a question in the terminal.
Gradio UI for easier interaction:
```bash
python app.py
```

This opens a browser window at http://127.0.0.1:7860.
The UI shows:
- Each model’s clipped answer (first 4 lines)
- Any “condition” text if an outlier needed convincing
- The current status (`total_same`, `partial_same`, `disagreement`)
- The synthesized final decision (wiring sketched below)
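A minimal sketch of how those panels could be wired up in Gradio (the handler and labels are illustrative, not `app.py`'s actual code):

```python
import gradio as gr

def run_trinity(question: str):
    # Placeholder handler: the real app runs the three models, the
    # deliberation rounds, and the judge before filling these panels.
    answers = "model 1: ...\nmodel 2: ...\nmodel 3: ..."
    condition = ""                # filled only if an outlier set a condition
    status = "total_same"         # or partial_same / disagreement
    final = "..."
    return answers, condition, status, final

demo = gr.Interface(
    fn=run_trinity,
    inputs=gr.Textbox(label="Question"),
    outputs=[
        gr.Textbox(label="Clipped answers (first 4 lines each)"),
        gr.Textbox(label="Condition"),
        gr.Textbox(label="Status"),
        gr.Textbox(label="Final decision"),
    ],
)
demo.launch()  # serves on http://127.0.0.1:7860 by default
```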
Under the hood, a run proceeds in these steps:
- Round 1: All three models answer independently.
- Deliberation: Each sees the others’ answers and revises.
- Judge step: Model 3 (or whichever model is configured as judge) checks whether the top-line choices are logically the same, even when the reasoning differs (one plausible mechanism is sketched after this list).
- Conditions: If one model still disagrees, it can output a condition, i.e. the evidence that would convince it. When that happens, Trinity defers the decision or shows the condition.
- Final decision: If two or three agree, Trinity merges the answers into one definitive response.
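How the "logically the same" check is implemented isn't spelled out above, but since sentence-transformers is listed for semantic checks, an embedding-similarity test like this one is a plausible building block (the embedding model name and threshold are assumptions, not taken from the repo):

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

def answers_equivalent(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two answers as 'the same' if their embeddings are close enough."""
    emb = embedder.encode([a, b], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold
```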
A few practical notes:
- Boot time is slower for `app.py` because Gradio spins up a local server and frontend; the CLI (`main.py`) is quicker.
- All prompts (system, deliberation, condition, final) are defined in `.env`, so it's easy to swap personalities or tighten formats (a sample `.env` is sketched after these notes).
- If you run out of VRAM with bigger models, lower `n_ctx` or `n_gpu_layers` in `init.py`.
- By default, Vulkan runs on device 0. Change `GGML_VULKAN_DEVICE` if you have more than one AMD card.
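For reference, a `.env` might look roughly like this. The `MODEL*` keys match the defaults shown earlier; the prompt keys are invented for illustration, so check the shipped `.env` for the real names:

```
MODEL1 = "../models/llama.gguf"
MODEL2 = "../models/mistral.gguf"
MODEL3 = "../models/zeph.gguf"

# The prompt keys below are hypothetical -- check the shipped .env
# for the actual names.
SYSTEM_PROMPT = "You are one of three deliberating models..."
DELIBERATION_PROMPT = "Here are the other answers. Revise yours if convinced."
CONDITION_PROMPT = "State what evidence would change your mind."
FINAL_PROMPT = "Merge the agreeing answers into one definitive response."
```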