|
| 1 | +# Project Uni5 |
| 2 | +## From the Lab to Enterprise Grade Orchestration — Instantly. |
| 3 | + |
| 4 | +Most platforms make you choose. Either you're in the research and experimentation |
| 5 | +world with scrappy tooling, or you're in the enterprise world with heavyweight |
| 6 | +infrastructure requirements. The gap between those two is where most open source |
| 7 | +models go to die — the training works but nobody can actually use them in a real |
| 8 | +system without significant engineering. |
| 9 | + |
| 10 | +This platform closes that gap entirely. |
| 11 | + |
| 12 | +A model that has never been deployed anywhere — straight off a training run — |
| 13 | +plugs directly into a full-featured enterprise orchestration stack with zero |
| 14 | +infrastructure overhead. |
| 15 | + |
| 16 | +``` |
| 17 | +Train model → save weights → point platform at path → full orchestration |
| 18 | +``` |
| 19 | + |
| 20 | +No API deployment step. No vLLM setup. No distributed systems knowledge required. |
| 21 | +Just a path to a directory. |
| 22 | + |
| 23 | +The same platform already handles hosted providers (Hyperbolic, TogetherAI), |
| 24 | +local runtimes (Ollama), sandboxed shell execution, multi-agent delegation, |
| 25 | +file serving, and a streaming frontend. The model integration work completes |
| 26 | +the picture. |
| 27 | + |
| 28 | +--- |
| 29 | + |
| 30 | +## Why Single Machine First |
| 31 | + |
| 32 | +The platform's existing architecture maps directly to the single-machine market: |
| 33 | + |
| 34 | +- The sandbox is Firejail on a single host, not a distributed system |
| 35 | +- Tool routing, shell execution, and file handling are all single-machine concepts |
| 36 | +- Ollama (the dominant single-machine runtime) is already a supported provider |
| 37 | +- The open-source community most likely to build on this platform is running |
| 38 | +  consumer or prosumer hardware, not GPU clusters |
| 39 | +- Enterprises with data privacy requirements cannot send data to an external API, |
| 40 | +  so local model execution is a hard requirement for them |
| 41 | + |
| 42 | +The `transformers` adapter is the specific missing piece. It means the gap |
| 43 | +between "I have weights" and "I have a running orchestration platform" is |
| 44 | +closed entirely. |
| 45 | + |
| 46 | +--- |
| 47 | + |
| 48 | +## The Model Integration Stack |
| 49 | + |
| 50 | +Three adapters, three markets, one platform: |
| 51 | + |
| 52 | +| Adapter | Format | Target User | Status | |
| 53 | +|---|---|---|---| |
| 54 | +| `transformers` | safetensors / bin | Researchers, AI labs, fine-tuners | **Build next** | |
| 55 | +| GGUF / llama.cpp | GGUF | Prosumers, quantized model users, Ollama power users | Phase 2 | |
| 56 | +| vLLM | any HF model | AI labs with GPU clusters | Phase 3 | |
| 57 | + |
| 58 | +All three feed into the existing normalization pipeline. The platform surface |
| 59 | +— tool routing, shell execution, file handling, streaming frontend — is |
| 60 | +untouched regardless of which adapter is active. |
| 61 | + |
| 62 | +--- |
| 63 | + |
| 64 | +## Phase 1 — `transformers` Adapter |
| 65 | + |
| 66 | +### What it is |
| 67 | + |
| 68 | +A provider adapter that loads a model from a local directory using the |
| 69 | +HuggingFace `transformers` library and pipes its token stream into the |
| 70 | +existing normalization pipeline via `TextIteratorStreamer`. |
| 71 | + |
| 72 | +### Loading |
| 73 | + |
| 74 | +```python |
| 75 | +from transformers import AutoModelForCausalLM, AutoTokenizer |
| 76 | +import torch |
| 77 | + |
| 78 | +model_path = "/path/to/model" # local directory or HF repo name |
| 79 | + |
| 80 | +tokenizer = AutoTokenizer.from_pretrained(model_path) |
| 81 | + |
| 82 | +model = AutoModelForCausalLM.from_pretrained( |
| 83 | +    model_path, |
| 84 | +    torch_dtype=torch.float32,  # float32 for CPU stability; float16/bfloat16 on GPU |
| 85 | +    device_map="auto",          # needs the `accelerate` package; places weights on available devices |
| 86 | +) |
| 87 | +``` |
| 88 | + |
| 89 | +`Auto` classes read `config.json` and instantiate the correct architecture |
| 90 | +automatically. No model-family-specific code required. |
| 91 | + |
| 92 | +### Streaming |
| 93 | + |
| 94 | +```python |
| 95 | +from transformers import TextIteratorStreamer |
| 96 | +from threading import Thread |
| 97 | + |
| 98 | +streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)  # do not echo the prompt |
| 99 | + |
| 100 | +inputs = tokenizer.apply_chat_template( |
| 101 | +    messages, |
| 102 | +    return_tensors="pt", |
| 103 | +    add_generation_prompt=True, |
| 104 | +).to(model.device)  # keep inputs on the same device as the model |
| 105 | + |
| 106 | +thread = Thread(target=model.generate, kwargs={ |
| 107 | +    "input_ids": inputs, |
| 108 | +    "streamer": streamer, |
| 109 | +    "max_new_tokens": 512, |
| 110 | +}) |
| 111 | +thread.start() |
| 112 | + |
| 113 | +for token in streamer:  # runs inside the adapter's generator method |
| 114 | +    yield token  # feed directly into normalization pipeline |
| 115 | +``` |
| 116 | + |
| 117 | +`skip_special_tokens=False` is important — special tokens (`<|im_start|>`, |
| 118 | +`<tool_call>`, `</s>` etc.) are exactly what the normalization layer needs |
| 119 | +to detect tool call boundaries. |
| 120 | + |
| 121 | +### Integration point |
| 122 | + |
| 123 | +`TextIteratorStreamer` sits in the same position as the existing Ollama or |
| 124 | +OpenAI streaming client. The `for token in streamer` loop becomes the new |
| 125 | +provider adapter. Everything downstream is unchanged. |
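
One way to picture that integration point is as a small streaming contract that every provider satisfies. The sketch below is an assumption about what such a contract could look like, not the platform's actual interface: `ProviderAdapter`, `FakeStreamerAdapter`, and `run_pipeline` are hypothetical names, and the fake adapter replays canned chunks where a real one would iterate a `TextIteratorStreamer`.

```python
from __future__ import annotations

from typing import Iterable, Iterator, Protocol


class ProviderAdapter(Protocol):
    """Hypothetical contract: turn chat messages into a stream of text chunks."""

    def stream(self, messages: list[dict]) -> Iterator[str]:
        ...


class FakeStreamerAdapter:
    """Stand-in for a transformers-backed adapter; replays canned chunks."""

    def __init__(self, chunks: Iterable[str]) -> None:
        self._chunks = list(chunks)

    def stream(self, messages: list[dict]) -> Iterator[str]:
        # A real adapter would start model.generate in a thread and iterate
        # a TextIteratorStreamer here instead of replaying a list.
        yield from self._chunks


def run_pipeline(adapter: ProviderAdapter, messages: list[dict]) -> str:
    """Placeholder for the downstream pipeline: it only ever sees text chunks."""
    return "".join(adapter.stream(messages))
```

Because the downstream code depends only on `stream()`, swapping Ollama, a hosted API, or local weights would change which adapter is constructed and nothing else.
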
| 126 | + |
| 127 | +### Key engineering challenges |
| 128 | + |
| 129 | +**Tool call boundary detection** |
| 130 | +Unlike provider APIs, which emit pre-parsed `type: "call_arguments"` deltas, |
| 131 | +raw token streams require the normalization layer to detect when it has |
| 132 | +entered and exited a tool call block. This requires buffering logic, since a marker may arrive split across chunks. The existing |
| 133 | +regex healing in `_map_chunk_to_event` is the foundation to build from. |
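
A minimal sketch of that buffering, assuming the model wraps calls in `<tool_call>` and `</tool_call>` (other families use different markers). `split_tool_calls` and the event names are hypothetical and are not the existing `_map_chunk_to_event` logic. The key idea is to hold back any buffer suffix that could be the start of a marker, so a marker split across chunks is still recognised.

```python
from typing import Iterable, Iterator, Tuple

OPEN_TAG = "<tool_call>"    # assumed marker; varies by model family
CLOSE_TAG = "</tool_call>"  # assumed marker

Event = Tuple[str, str]  # ("text" | "tool_call" | "tool_call_incomplete", payload)


def _partial_suffix(buffer: str, tag: str) -> int:
    """Length of the longest buffer suffix that is a proper prefix of tag."""
    for size in range(min(len(buffer), len(tag) - 1), 0, -1):
        if buffer.endswith(tag[:size]):
            return size
    return 0


def split_tool_calls(chunks: Iterable[str]) -> Iterator[Event]:
    """Separate plain text from tool call payloads in a chunked stream."""
    buffer = ""
    in_call = False
    for chunk in chunks:
        buffer += chunk
        while True:
            tag = CLOSE_TAG if in_call else OPEN_TAG
            index = buffer.find(tag)
            if index == -1:
                break
            payload = buffer[:index]
            buffer = buffer[index + len(tag):]
            if in_call:
                yield ("tool_call", payload.strip())
            elif payload:
                yield ("text", payload)
            in_call = not in_call
        if not in_call:
            # Emit text eagerly, holding back only a possible partial open tag.
            keep = _partial_suffix(buffer, OPEN_TAG)
            emit = buffer[:len(buffer) - keep]
            if emit:
                yield ("text", emit)
            buffer = buffer[len(buffer) - keep:]
    if buffer:
        # Stream ended mid-block or with held-back text: flush what is left.
        yield ("tool_call_incomplete" if in_call else "text", buffer)
```

Text is forwarded as soon as it cannot be part of a marker, so the frontend keeps streaming. Tool call payloads are held until the closing marker arrives, and a stream that ends mid-call surfaces as `tool_call_incomplete` for the healing logic to handle.
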
| 134 | + |
| 135 | +**Chat template variance** |
| 136 | +Every model family formats the prompt differently. `apply_chat_template` |
| 137 | +handles this automatically using the template stored in |
| 138 | +`tokenizer_config.json`. No manual prompt formatting required per model. |
| 139 | + |
| 140 | +**CPU stability** |
| 141 | +Use `torch.float32` on CPU. Half-precision (`float16`) inference on CPU is slow and, |
| 142 | +depending on the PyTorch version, can hit unsupported ops or produce NaN values. On a GPU, `float16` or `bfloat16` halves memory usage relative to `float32`. |
| 143 | + |
| 144 | +### Deliverables |
| 145 | + |
| 146 | +- [ ] `TransformersProviderAdapter` class implementing the existing provider |
| 147 | + interface |
| 148 | +- [ ] Token stream → normalization pipeline bridge |
| 149 | +- [ ] Tool call boundary detection and buffering |
| 150 | +- [ ] Chat template application |
| 151 | +- [ ] Config: accept local path or HF repo name interchangeably |
| 152 | +- [ ] Document minimum hardware requirements per model size class |
| 153 | + |
| 154 | +--- |
| 155 | + |
| 156 | +## Phase 2 — GGUF / llama.cpp Adapter |
| 157 | + |
| 158 | +### What it is |
| 159 | + |
| 160 | +An adapter for quantized models in GGUF format — the format used by llama.cpp |
| 161 | +and Ollama under the hood. Targets the large market of users running |
| 162 | +compressed models on consumer hardware. |
| 163 | + |
| 164 | +### Why it matters |
| 165 | + |
| 166 | +Quantization makes large models accessible on normal hardware (approximate memory for weights only; the KV cache and activations add more): |
| 167 | + |
| 168 | +| Model size | float16 VRAM | 4-bit GGUF VRAM | |
| 169 | +|---|---|---| |
| 170 | +| 7B | ~14 GB | ~4 GB | |
| 171 | +| 13B | ~26 GB | ~8 GB | |
| 172 | +| 70B | ~140 GB | ~40 GB | |
| 173 | + |
| 174 | +A 70B model in 4-bit GGUF runs on a Mac Studio with 64GB unified memory. |
| 175 | +A 7B model runs on a gaming laptop. |
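
The figures in the table follow from simple arithmetic: parameter count times bits per weight, divided by eight. The sketch below assumes roughly 4.5 bits per weight for a 4-bit GGUF file, because quantized blocks also store scale factors. The exact value depends on the quantization variant, and `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


for size in (7, 13, 70):
    fp16 = weight_memory_gb(size, 16)
    q4 = weight_memory_gb(size, 4.5)  # assumed average for 4-bit GGUF variants
    print(f"{size}B: float16 ~{fp16:.0f} GB, 4-bit ~{q4:.1f} GB")
```

These are lower bounds. Context length, batch size, and runtime overhead all add to the real footprint.
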
| 176 | + |
| 177 | +### Loading |
| 178 | + |
| 179 | +```python |
| 180 | +from llama_cpp import Llama |
| 181 | + |
| 182 | +model = Llama( |
| 183 | +    model_path="/path/to/model.gguf", |
| 184 | +    n_gpu_layers=-1,  # offload all layers to GPU if available |
| 185 | +    verbose=False, |
| 186 | +) |
| 187 | + |
| 188 | +for token in model("prompt here", stream=True): |
| 189 | +    yield token["choices"][0]["text"] |
| 190 | +``` |
| 191 | + |
| 192 | +### Deliverables |
| 193 | + |
| 194 | +- [ ] `GGUFProviderAdapter` class |
| 195 | +- [ ] Layer offload configuration (CPU only / partial GPU / full GPU) |
| 196 | +- [ ] Same tool call boundary detection as Phase 1 |
| 197 | +- [ ] Document quantization format recommendations (Q4_K_M is a common |
| 198 | +      default that balances file size and output quality) |
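
The layer offload deliverable could reduce to a small mapping from a platform config value to the `n_gpu_layers` argument of `llama_cpp.Llama`. The mode names and the `resolve_n_gpu_layers` helper below are hypothetical. The only library behaviour relied on is that `0` keeps all layers on the CPU and `-1` offloads all of them.

```python
def resolve_n_gpu_layers(mode: str, partial_layers: int = 0) -> int:
    """Map a hypothetical offload mode to llama-cpp-python's n_gpu_layers."""
    if mode == "cpu":
        return 0  # keep every layer on the CPU
    if mode == "full":
        return -1  # offload all layers to the GPU
    if mode == "partial":
        if partial_layers <= 0:
            raise ValueError("partial mode needs a positive layer count")
        return partial_layers  # offload only the first N layers
    raise ValueError(f"unknown offload mode: {mode!r}")
```

Rejecting unknown modes up front keeps a typo in the config from silently falling back to CPU-only inference.
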
| 199 | + |
| 200 | +--- |
| 201 | + |
| 202 | +## Phase 3 — vLLM Adapter |
| 203 | + |
| 204 | +### What it is |
| 205 | + |
| 206 | +An adapter for AI lab customers who have GPU clusters and need production-grade |
| 207 | +throughput — tensor parallelism, continuous batching, PagedAttention KV cache. |
| 208 | + |
| 209 | +### Key insight |
| 210 | + |
| 211 | +vLLM in server mode exposes an OpenAI-compatible API: |
| 212 | + |
| 213 | +```bash |
| 214 | +python -m vllm.entrypoints.openai.api_server \ |
| 215 | +    --model meta-llama/Meta-Llama-3-70B-Instruct \ |
| 216 | +    --tensor-parallel-size 8 |
| 217 | +``` |
| 218 | + |
| 219 | +This means the existing OpenAI-compatible provider adapter may already cover |
| 220 | +this case with zero new code. Verify compatibility before building anything. |
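
A compatibility audit could start by replaying a captured streaming response through a small parser and comparing the result with what the existing adapter produces. The sketch below assumes the OpenAI-style streaming format (`data:` lines carrying JSON chunks, ending with `data: [DONE]`). `iter_content_deltas` is a hypothetical helper, and tool call deltas (`delta.tool_calls`) are deliberately left out.

```python
import json
from typing import Iterable, Iterator


def iter_content_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = (choice.get("delta") or {}).get("content")
            if content:
                yield content
```

Feeding it lines captured from both a hosted provider and a vLLM server would show whether the two streams differ in ways the existing adapter cares about.
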
| 221 | + |
| 222 | +### Deliverables |
| 223 | + |
| 224 | +- [ ] Audit existing OpenAI adapter against vLLM endpoint behaviour |
| 225 | +- [ ] Document vLLM deployment configuration for platform users |
| 226 | +- [ ] If gaps exist, implement `vLLMProviderAdapter` |
| 227 | +- [ ] Tensor parallel size and GPU memory utilization configuration |
| 228 | + |
| 229 | +--- |
| 230 | + |
| 231 | +## Token Pattern Profiling |
| 232 | + |
| 233 | +Before building tool call detection, profile real token streams to understand |
| 234 | +what the normalization layer needs to handle. |
| 235 | + |
| 236 | +### Approach |
| 237 | + |
| 238 | +Use small models locally. They generally share tokenizer conventions and chat |
| 239 | +templates with larger models in the same family, and they run on modest hardware: |
| 240 | + |
| 241 | +``` |
| 242 | +Qwen2.5-1.5B-Instruct — same family as Qwen3-80B you ran via TogetherAI |
| 243 | +Llama-3.2-1B — Meta family baseline |
| 244 | +SmolLM2 — extremely lightweight, good for rapid iteration |
| 245 | +``` |
| 246 | + |
| 247 | +### What to capture |
| 248 | + |
| 249 | +```python |
| 250 | +for token in streamer: |
| 251 | +    print(repr(token))  # repr reveals special tokens, whitespace, control chars |
| 252 | +``` |
| 253 | + |
| 254 | +Specifically observe: |
| 255 | + |
| 256 | +- How the model signals entry into a tool call block |
| 257 | +- Whether tool call JSON arrives in one burst or character by character |
| 258 | +- How the model signals exit from tool call back to natural language |
| 259 | +- Special token conventions (`<tool_call>`, `<|python_tag|>`, `[TOOL_CALLS]` |
| 260 | + etc — these vary significantly across families) |
| 261 | +- Behaviour on malformed / partial tool calls |
| 262 | + |
| 263 | +Document findings per model family. This becomes the detection rule set for |
| 264 | +the normalization layer. |
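
To turn raw captures into a per-family catalogue, the chunks can be logged with `repr()` and scanned for marker-like tokens. This is a sketch under stated assumptions: `profile_stream` is a hypothetical helper, and the regular expression only recognises three shapes (`<...>`, `<|...|>`, and `[UPPER_CASE]`), so it will miss other conventions and may flag ordinary markup.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed shapes for marker-like tokens: <...>, <|...|>, and [UPPER_CASE].
MARKER_PATTERN = re.compile(r"<\|?/?[A-Za-z_][\w\-]*\|?>|\[[A-Z_]+\]")


def profile_stream(chunks: Iterable[str]) -> Tuple[List[str], Counter]:
    """Return repr()-formatted log lines and a count of marker-like tokens."""
    log_lines = []
    text_parts = []
    for index, chunk in enumerate(chunks):
        log_lines.append(f"{index:05d} {chunk!r}")
        text_parts.append(chunk)
    # Count on the joined text so markers split across chunks are still found.
    markers = Counter(MARKER_PATTERN.findall("".join(text_parts)))
    return log_lines, markers
```

The log lines show how each marker was chunked in transit, while the counter gives a quick list of markers to document for that model family.
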
| 265 | + |
| 266 | +--- |
| 267 | + |
| 268 | +## The Full Platform Story |
| 269 | + |
| 270 | +Once all three phases are complete, the platform covers five deployment scenarios: |
| 271 | + |
| 272 | +``` |
| 273 | +Got freshly trained weights? → transformers adapter |
| 274 | +Got a quantized GGUF model? → GGUF adapter |
| 275 | +Got a GPU cluster? → vLLM adapter |
| 276 | +Want a hosted provider? → already works (Hyperbolic, TogetherAI, etc) |
| 277 | +Running Ollama locally? → already works |
| 278 | +``` |
| 279 | + |
| 280 | +Five scenarios. One platform. **Uni5.** |
| 281 | + |
| 282 | +Same tool routing. Same shell execution. Same multi-agent delegation. Same |
| 283 | +streaming frontend. Same file handling. The model source becomes a |
| 284 | +configuration detail, not an architectural decision. |
| 285 | + |
| 286 | +That is the proposition: **From the lab to enterprise grade orchestration — instantly.** |
| 287 | + |
| 288 | +LangChain built an abstraction layer. Project Uni5 is an orchestration platform. |
| 289 | +The difference is everything. |
| 290 | + |
| 291 | +--- |
| 292 | + |
| 293 | +## Immediate Next Actions |
| 294 | + |
| 295 | +1. **Set up profiling environment** — install `transformers`, `torch`, |
| 296 | +   `accelerate` (required by `device_map="auto"`), and `llama-cpp-python` in a local venv |
| 297 | +2. **Download a small model** — `Qwen2.5-1.5B-Instruct` from HuggingFace |
| 298 | +3. **Run raw token capture** — pipe `TextIteratorStreamer` output to a log |
| 299 | + file with `repr()` formatting, run a prompt that triggers a tool call |
| 300 | +4. **Document token patterns** — catalogue special tokens and tool call |
| 301 | + boundary signals for at least two model families |
| 302 | +5. **Design the provider adapter interface** — define the contract that |
| 303 | + `TransformersProviderAdapter` must satisfy to plug into the existing |
| 304 | + normalization pipeline |
| 305 | +6. **Build `TransformersProviderAdapter`** — implement, test against the |
| 306 | + existing tool routing stack |
| 307 | +7. **Repeat for GGUF** once Phase 1 is stable |