Commit c133522

feat: add vLLM orchestration support and documentation
Francisco committed
1 parent 6e53b60 commit c133522

17 files changed

Lines changed: 2026 additions & 18 deletions

api_unhashed_reqs.txt

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -14,7 +14,7 @@ websockets==15.0.1
1414
transformers>=4.33.0,<5.0.0
1515
# ------------------------------------------------------------------ #
1616
# Internal packages
17-
projectdavid[embeddings]==1.75.1
17+
projectdavid[embeddings]==1.75.2
1818

1919
# web search tools
2020

docs/Platform_vision_and_plan.md

Lines changed: 307 additions & 0 deletions
@@ -0,0 +1,307 @@
1+
# Project Uni5
2+
## From the Lab to Enterprise Grade Orchestration — Instantly.
3+
4+
Most platforms make you choose. Either you're in the research and experimentation
5+
world with scrappy tooling, or you're in the enterprise world with heavyweight
6+
infrastructure requirements. The gap between those two is where most open source
7+
models go to die — the training works but nobody can actually use them in a real
8+
system without significant engineering.
9+
10+
This platform closes that gap entirely.
11+
12+
A model that has never been deployed anywhere — straight off a training run —
13+
plugs directly into a full-featured enterprise orchestration stack with zero
14+
infrastructure overhead.
15+
16+
```
17+
Train model → save weights → point platform at path → full orchestration
18+
```
19+
20+
No API deployment step. No vLLM setup. No distributed systems knowledge required.
21+
Just a path to a directory.
22+
23+
The same platform already handles hosted providers (Hyperbolic, TogetherAI),
24+
local runtimes (Ollama), sandboxed shell execution, multi-agent delegation,
25+
file serving, and a streaming frontend. The model integration work completes
26+
the picture.
27+
28+
---
29+
30+
## Why Single Machine First
31+
32+
The platform's existing architecture maps directly to the single machine market:
33+
34+
- Sandbox is Firejail on a single host, not a distributed system
35+
- Tool routing, shell execution, and file handling are all single machine concepts
36+
- Ollama (the dominant single machine runtime) is already a supported provider
37+
- The open source community most likely to build on this platform is running
38+
consumer or prosumer hardware, not GPU clusters
39+
- Enterprises with data privacy requirements cannot send data to an external API —
40+
local model execution is a hard requirement for them
41+
42+
The `transformers` adapter is the specific missing piece. It means the gap
43+
between "I have weights" and "I have a running orchestration platform" is
44+
closed entirely.
45+
46+
---
47+
48+
## The Model Integration Stack
49+
50+
Three adapters, three markets, one platform:
51+
52+
| Adapter | Format | Target User | Status |
53+
|---|---|---|---|
54+
| `transformers` | safetensors / bin | Researchers, AI labs, fine-tuners | **Build next** |
55+
| GGUF / llama.cpp | GGUF | Prosumers, quantized model users, Ollama power users | Phase 2 |
56+
| vLLM | any HF model | AI labs with GPU clusters | Phase 3 |
57+
58+
All three feed into the existing normalization pipeline. The platform surface
59+
— tool routing, shell execution, file handling, streaming frontend — is
60+
untouched regardless of which adapter is active.
61+
62+
---
63+
64+
## Phase 1 — `transformers` Adapter
65+
66+
### What it is
67+
68+
A provider adapter that loads a model from a local directory using the
69+
HuggingFace `transformers` library and pipes its token stream into the
70+
existing normalization pipeline via `TextIteratorStreamer`.
71+
72+
### Loading
73+
74+
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = "/path/to/model"   # local directory or HF repo name

tokenizer = AutoTokenizer.from_pretrained(model_path)

# float32 for CPU stability; float16 halves memory when a CUDA GPU is present
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=dtype,
    device_map="auto",          # auto-selects CPU / GPU / MPS; requires `accelerate`
)
```
88+
89+
`Auto` classes read `config.json` and instantiate the correct architecture
90+
automatically. No model-family-specific code required.
91+
92+
### Streaming
93+
94+
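```python
from threading import Thread

from transformers import TextIteratorStreamer


def stream_tokens(model, tokenizer, messages, max_new_tokens=512):
    """Yield decoded text chunks from a local model as they are generated."""
    # Keep special tokens so the normalization layer can see tool call markers.
    # skip_prompt=True stops the streamer from echoing the rendered prompt.
    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=False
    )

    inputs = tokenizer.apply_chat_template(
        messages,
        return_tensors="pt",
        add_generation_prompt=True,
    ).to(model.device)

    # generate() blocks, so it runs in a worker thread while we read the streamer
    thread = Thread(target=model.generate, kwargs={
        "input_ids": inputs,
        "streamer": streamer,
        "max_new_tokens": max_new_tokens,
    })
    thread.start()

    for token in streamer:
        yield token   # feed directly into normalization pipeline
    thread.join()
```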
116+
117+
`skip_special_tokens=False` is important — special tokens (`<|im_start|>`,
118+
`<tool_call>`, `</s>` etc.) are exactly what the normalization layer needs
119+
to detect tool call boundaries.
120+
121+
### Integration point
122+
123+
`TextIteratorStreamer` sits in the same position as the existing Ollama or
124+
OpenAI streaming client. The `for token in streamer` loop becomes the new
125+
provider adapter. Everything downstream is unchanged.
126+
127+
### Key engineering challenges
128+
129+
**Tool call boundary detection**
130+
Unlike provider APIs which emit pre-parsed `type: "call_arguments"` deltas,
131+
raw token streams require the normalization layer to detect when it has
132+
entered and exited a tool call block. Buffering logic needed. The existing
133+
regex healing in `_map_chunk_to_event` is the foundation to build from.
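
The buffering described above could be sketched roughly as follows. This is an illustration only, not the platform's implementation: it assumes Qwen-style `<tool_call>` / `</tool_call>` markers, and `split_tool_calls` and the event names are hypothetical. The key point is that a marker can be split across chunks, so plain text is only released once it cannot be the start of a marker.

```python
from typing import Iterable, Iterator, Tuple

OPEN_TAG = "<tool_call>"     # assumed Qwen-style markers; other families differ
CLOSE_TAG = "</tool_call>"


def _safe_split(buffer: str, tag: str) -> int:
    """Return how many leading chars can be emitted without cutting
    a partial occurrence of tag at the end of buffer."""
    for k in range(min(len(tag) - 1, len(buffer)), 0, -1):
        if buffer.endswith(tag[:k]):
            return len(buffer) - k
    return len(buffer)


def split_tool_calls(chunks: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield ("content", text) and ("tool_call", body) events from text chunks.

    An unterminated call at end of stream is reported as "tool_call_incomplete".
    """
    buffer = ""
    in_call = False
    for chunk in chunks:
        buffer += chunk
        while True:
            tag = CLOSE_TAG if in_call else OPEN_TAG
            idx = buffer.find(tag)
            if idx == -1:
                break
            before, buffer = buffer[:idx], buffer[idx + len(tag):]
            if in_call:
                yield ("tool_call", before.strip())
            elif before:
                yield ("content", before)
            in_call = not in_call
        if not in_call:
            # Stream plain text right away, holding back a possible partial tag.
            cut = _safe_split(buffer, OPEN_TAG)
            if cut:
                yield ("content", buffer[:cut])
                buffer = buffer[cut:]
    if buffer:
        yield ("tool_call_incomplete" if in_call else "content", buffer)
```

Tool call bodies are held until the closing marker arrives, so downstream code always receives a complete payload to parse, while ordinary text still streams with minimal delay.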
134+
135+
**Chat template variance**
136+
Every model family formats the prompt differently. `apply_chat_template`
137+
handles this automatically using the template stored in
138+
`tokenizer_config.json`. No manual prompt formatting required per model.
139+
140+
**CPU stability**
141+
Use `torch.float32` on CPU: half-precision support in CPU kernels is limited, and the narrow range of `float16` can overflow to inf/NaN.
142+
For machines with a GPU, `float16` or `bfloat16` halves memory usage.
143+
144+
### Deliverables
145+
146+
- [ ] `TransformersProviderAdapter` class implementing the existing provider
147+
interface
148+
- [ ] Token stream → normalization pipeline bridge
149+
- [ ] Tool call boundary detection and buffering
150+
- [ ] Chat template application
151+
- [ ] Config: accept local path or HF repo name interchangeably
152+
- [ ] Document minimum hardware requirements per model size class
153+
154+
---
155+
156+
## Phase 2 — GGUF / llama.cpp Adapter
157+
158+
### What it is
159+
160+
An adapter for quantized models in GGUF format — the format used by llama.cpp
161+
and Ollama under the hood. Targets the large market of users running
162+
compressed models on consumer hardware.
163+
164+
### Why it matters
165+
166+
Quantization makes large models accessible on normal hardware:
167+
168+
| Model size | float16 VRAM | 4-bit GGUF VRAM |
169+
|---|---|---|
170+
| 7B | ~14 GB | ~4 GB |
171+
| 13B | ~26 GB | ~8 GB |
172+
| 70B | ~140 GB | ~40 GB |
173+
174+
A 70B model in 4-bit GGUF runs on a Mac Studio with 64GB unified memory.
175+
A 7B model runs on a gaming laptop.
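
The figures in the table follow from simple arithmetic: parameter count times bits per weight, divided by eight. A minimal sketch, assuming roughly 4.5 effective bits per weight for a 4-bit GGUF quantization (an assumed value; the real figure depends on the quantization variant) and counting weights only, without KV cache or runtime overhead:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


for size in (7, 13, 70):
    fp16 = weight_memory_gb(size, 16)
    q4 = weight_memory_gb(size, 4.5)   # assumed effective bits per weight
    print(f"{size}B: float16 ~{fp16:.0f} GB, 4-bit ~{q4:.0f} GB")
```

The table values sit slightly above these estimates because a running model also needs room for activations and the KV cache.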
176+
177+
### Loading
178+
179+
```python
from llama_cpp import Llama

model = Llama(
    model_path="/path/to/model.gguf",
    n_gpu_layers=-1,      # offload all layers to GPU if available
    verbose=False,
)


def stream_tokens(prompt, max_tokens=512):
    """Yield text chunks from llama.cpp as they are generated."""
    # max_tokens is set explicitly rather than relying on the library default
    for chunk in model(prompt, max_tokens=max_tokens, stream=True):
        yield chunk["choices"][0]["text"]
```
191+
192+
### Deliverables
193+
194+
- [ ] `GGUFProviderAdapter` class
195+
- [ ] Layer offload configuration (CPU only / partial GPU / full GPU)
196+
- [ ] Same tool call boundary detection as Phase 1
197+
- [ ] Document quantization format recommendations (GGUF Q4_K_M is the
198+
standard sweet spot)
199+
200+
---
201+
202+
## Phase 3 — vLLM Adapter
203+
204+
### What it is
205+
206+
An adapter for AI lab customers who have GPU clusters and need production-grade
207+
throughput — tensor parallelism, continuous batching, PagedAttention KV cache.
208+
209+
### Key insight
210+
211+
vLLM in server mode exposes an OpenAI-compatible API:
212+
213+
```bash
214+
python -m vllm.entrypoints.openai.api_server \
215+
--model meta-llama/Llama-3-70b \
216+
--tensor-parallel-size 8
217+
```
218+
219+
This means the existing OpenAI-compatible provider adapter may already cover
220+
this case with zero new code. Verify compatibility before building anything.
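
One way to start that compatibility check is to stream a chat completion from the vLLM server with nothing but the standard library and inspect the deltas. The sketch below assumes the OpenAI-style server-sent-event framing (`data: {...}` lines ending with `data: [DONE]`); `parse_sse_line` and `stream_chat` are hypothetical helper names, and the base URL is whatever address the vLLM server is reachable on.

```python
import json
import urllib.request
from typing import Iterator, List, Optional


def parse_sse_line(line: str) -> Optional[str]:
    """Extract the text delta from one line of an OpenAI-style stream.

    Returns None for blank lines, the [DONE] sentinel, and chunks that
    carry no text content (for example role-only deltas).
    """
    line = line.strip()
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None
    choices = json.loads(payload).get("choices") or []
    if not choices:
        return None
    return choices[0].get("delta", {}).get("content")


def stream_chat(base_url: str, model: str, messages: List[dict]) -> Iterator[str]:
    """Stream text deltas from an OpenAI-compatible /v1/chat/completions endpoint."""
    body = json.dumps({"model": model, "messages": messages, "stream": True})
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        for raw in response:          # the response object iterates line by line
            text = parse_sse_line(raw.decode("utf-8"))
            if text:
                yield text
```

If the existing OpenAI adapter consumes the same deltas without changes, that is good evidence no dedicated vLLM adapter is needed for the text path. Tool call deltas would need a separate check.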
221+
222+
### Deliverables
223+
224+
- [ ] Audit existing OpenAI adapter against vLLM endpoint behaviour
225+
- [ ] Document vLLM deployment configuration for platform users
226+
- [ ] If gaps exist, implement `vLLMProviderAdapter`
227+
- [ ] Tensor parallel size and GPU memory utilization configuration
228+
229+
---
230+
231+
## Token Pattern Profiling
232+
233+
Before building tool call detection, profile real token streams to understand
234+
what the normalization layer needs to handle.
235+
236+
### Approach
237+
238+
Use small quantized models locally — same tokenizer conventions and chat
239+
templates as their full-size counterparts, but run on any hardware:
240+
241+
```
242+
Qwen2.5-1.5B-Instruct — same family as Qwen3-80B you ran via TogetherAI
243+
Llama-3.2-1B — Meta family baseline
244+
SmolLM2 — extremely lightweight, good for rapid iteration
245+
```
246+
247+
### What to capture
248+
249+
```python
250+
for token in streamer:
251+
print(repr(token)) # repr reveals special tokens, whitespace, control chars
252+
```
253+
254+
Specifically observe:
255+
256+
- How the model signals entry into a tool call block
257+
- Whether tool call JSON arrives in one burst or character by character
258+
- How the model signals exit from tool call back to natural language
259+
- Special token conventions (`<tool_call>`, `<|python_tag|>`, `[TOOL_CALLS]`
260+
etc — these vary significantly across families)
261+
- Behaviour on malformed / partial tool calls
262+
263+
Document findings per model family. This becomes the detection rule set for
264+
the normalization layer.
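
To turn captured streams into a per-family catalogue, a small counting helper may be enough to begin with. This is a sketch under assumptions: the regular expression only covers the marker styles named above, and `catalogue_special_tokens` is a hypothetical name, so extend the pattern as new families are profiled.

```python
import re
from collections import Counter
from typing import Iterable

# Covers the marker styles mentioned above; extend per model family.
SPECIAL_TOKEN_RE = re.compile(
    r"<\|[^|<>]+\|>|</?tool_call>|\[TOOL_CALLS\]|</s>"
)


def catalogue_special_tokens(chunks: Iterable[str]) -> Counter:
    """Count special-token-like markers in a captured token stream."""
    # Join first so markers that were split across chunks still match.
    text = "".join(chunks)
    return Counter(SPECIAL_TOKEN_RE.findall(text))
```

Running this over the logged output for each model gives a quick table of which markers a family actually emits and how often.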
265+
266+
---
267+
268+
## The Full Platform Story
269+
270+
Once all three phases are complete the platform covers every deployment scenario:
271+
272+
```
273+
Got freshly trained weights? → transformers adapter
274+
Got a quantized GGUF model? → GGUF adapter
275+
Got a GPU cluster? → vLLM adapter
276+
Want a hosted provider? → already works (Hyperbolic, TogetherAI, etc)
277+
Running Ollama locally? → already works
278+
```
279+
280+
Five scenarios. One platform. **Uni5.**
281+
282+
Same tool routing. Same shell execution. Same multi-agent delegation. Same
283+
streaming frontend. Same file handling. The model source becomes a
284+
configuration detail, not an architectural decision.
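
As an illustration of what "configuration detail" could mean in practice, the sketch below selects an adapter from a config dictionary through a small registry. Every name here (`ProviderAdapter`, `register`, `build_adapter`, `EchoAdapter`) is hypothetical and not taken from the existing codebase; the real provider interface may look different.

```python
from typing import Callable, Dict, Iterator, List, Protocol


class ProviderAdapter(Protocol):
    """Hypothetical contract: turn chat messages into a stream of text chunks."""

    def stream(self, messages: List[dict]) -> Iterator[str]: ...


ADAPTERS: Dict[str, Callable[..., ProviderAdapter]] = {}


def register(name: str):
    """Class decorator that makes an adapter selectable by name."""
    def decorator(factory):
        ADAPTERS[name] = factory
        return factory
    return decorator


@register("echo")
class EchoAdapter:
    """Stand-in adapter used only to exercise the registry."""

    def __init__(self, **config):
        self.config = config

    def stream(self, messages: List[dict]) -> Iterator[str]:
        for message in messages:
            yield message["content"]


def build_adapter(config: dict) -> ProviderAdapter:
    """Instantiate the adapter named by config["provider"] with the remaining keys."""
    settings = dict(config)
    name = settings.pop("provider")
    if name not in ADAPTERS:
        raise ValueError(f"unknown provider: {name!r}")
    return ADAPTERS[name](**settings)
```

With this shape, switching from one model source to another is a change to the `provider` key and its settings, while everything that consumes `stream()` stays the same.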
285+
286+
That is the proposition: **From the lab to enterprise grade orchestration — instantly.**
287+
288+
LangChain built an abstraction layer. Project Uni5 is an orchestration platform.
289+
The difference is everything.
290+
291+
---
292+
293+
## Immediate Next Actions
294+
295+
1. **Set up profiling environment** — install `transformers`, `torch`,
296+
`llama-cpp-python` in a local venv
297+
2. **Download a small model**: `Qwen2.5-1.5B-Instruct` from HuggingFace
298+
3. **Run raw token capture** — pipe `TextIteratorStreamer` output to a log
299+
file with `repr()` formatting, run a prompt that triggers a tool call
300+
4. **Document token patterns** — catalogue special tokens and tool call
301+
boundary signals for at least two model families
302+
5. **Design the provider adapter interface** — define the contract that
303+
`TransformersProviderAdapter` must satisfy to plug into the existing
304+
normalization pipeline
305+
6. **Build `TransformersProviderAdapter`** — implement, test against the
306+
existing tool routing stack
307+
7. **Repeat for GGUF** once Phase 1 is stable

docs/locallama_post.md

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
1+
Title:
2+
3+
Show r/LocalLLaMA: I built an open source OpenAI API alternative that runs fully local — code interpreter, RAG, agents, computer use, now with vLLM + multimodal support
4+
5+
Body:
6+
7+
Solo dev, been building quietly for a while. It's a containerised orchestration platform with a Python SDK that gives you full OpenAI API feature parity — but runs entirely on your hardware with Ollama or vLLM.
8+
Features: code interpreter, web search, deep research, computer use, file search + vector stores, conversation state, GDPR data handling, signed URLs for tool files.
9+
Just shipped vLLM raw inference integration and multimodal input this week.
10+
No VC. No hype. Just working software. [link]
11+
12+
13+
14+
/v1/completions ← our raw profiling target
15+
/v1/completions/render ← rendered prompt preview
16+
/inference/v1/generate ← bonus raw generate endpoint
17+
/v1/messages ← Anthropic-style endpoint (vLLM added this!)

sandbox_reqs_unhashed.txt

Lines changed: 1 addition & 1 deletion
@@ -10,6 +10,6 @@ websockets==15.0.1
1010
# transformers>=4.33.0,<5.0.0
1111
# ------------------------------------------------------------------ #
1212
# Internal packages
13-
projectdavid==1.75.1
13+
projectdavid==1.75.2
1414
jwt
1515

src/api/entities_api/cli/generate_docker_compose.py

Lines changed: 30 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -138,6 +138,33 @@ def generate_dev_docker_compose() -> None:
138138
networks:
139139
- my_custom_network
140140
141+
vllm:
142+
image: vllm/vllm-openai:latest
143+
container_name: vllm_server
144+
restart: unless-stopped
145+
runtime: nvidia
146+
environment:
147+
- NVIDIA_VISIBLE_DEVICES=all
148+
- HF_TOKEN=${HF_TOKEN}
149+
volumes:
150+
- ${HF_CACHE_PATH:-~/.cache/huggingface}:/root/.cache/huggingface
151+
ports:
152+
- "8001:8000"
153+
command: >
154+
--model ${VLLM_MODEL:-Qwen/Qwen2.5-3B-Instruct}
155+
--dtype float16
156+
--max-model-len 4096
157+
--gpu-memory-utilization 0.85
158+
networks:
159+
- my_custom_network
160+
deploy:
161+
resources:
162+
reservations:
163+
devices:
164+
- driver: nvidia
165+
count: all
166+
capabilities: [gpu]
167+
141168
api:
142169
build:
143170
context: .
@@ -162,6 +189,7 @@ def generate_dev_docker_compose() -> None:
162189
- OTEL_METRICS_EXPORTER=none
163190
- OTEL_LOGS_EXPORTER=none
164191
- OLLAMA_BASE_URL=http://ollama:11434/v1
192+
- VLLM_BASE_URL=http://vllm_server:8000
165193
# Override the host-side SHARED_PATH with the container-internal mount point
166194
# so the purge daemon writes to the same directory the samba container serves
167195
- SHARED_PATH=/app/shared_data
@@ -191,6 +219,8 @@ def generate_dev_docker_compose() -> None:
191219
condition: service_started
192220
ollama:
193221
condition: service_started
222+
vllm:
223+
condition: service_started
194224
networks:
195225
- my_custom_network
196226
