Skip to content

Add v2 local HTTP server#669

Merged
HenryNdubuaku merged 1 commit into
v2from
local-server-v3
May 29, 2026
Merged

Add v2 local HTTP server#669
HenryNdubuaku merged 1 commit into
v2from
local-server-v3

Conversation

@ncylich

@ncylich ncylich commented May 27, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Add a v2-native OpenAI-compatible local HTTP server for prepared Cactus bundles.
  • Add cactus serve with explicit bundle validation and direct Python FFI loading.
  • Implement model listing, chat completions, streaming responses, tool-call mapping, warm model slot management, and WAV transcription endpoints.

Testing

  • source ./venv/bin/activate && cactus build && python -m pytest python/tests/test_server.py -q
  • source ./venv/bin/activate && cactus build && python -m pytest python/tests/test_server_live.py -q

Notes

  • Live HTTP tests launch cactus serve weights/gemma-4-e2b-it and exercise real generation, streaming, multi-turn chat, concurrent requests, and transcription through the server.
  • The live-test model path is intentionally relative and fails with explicit preparation instructions if it is missing or not a prepared v2 bundle.

@ncylich ncylich force-pushed the local-server-v3 branch from e4d6ee2 to 661a3e2 Compare May 27, 2026 18:57
Introduce an OpenAI-compatible local server for prepared Cactus v2 bundles with chat completions, streaming responses, model discovery, tool-call mapping, WAV transcription, and warm model slot management.

Wire the server into the CLI as cactus serve, add serving dependencies, and keep bundle loading on the direct Python FFI path instead of shelling out through cactus run.

Add focused mocked server tests and live HTTP e2e tests that launch the server, exercise real generation, streaming, multi-turn chat, concurrency, and transcription paths.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
@ncylich ncylich force-pushed the local-server-v3 branch from 661a3e2 to 5e1bb68 Compare May 27, 2026 19:06
@HenryNdubuaku HenryNdubuaku merged commit da383d3 into v2 May 29, 2026
2 of 3 checks passed
ncylich added a commit that referenced this pull request May 29, 2026
python/tests/test_server.py imports fastapi at module level, but the
workflow installs only python/[dev], so pytest collection fails with
ModuleNotFoundError and the job exits 2 before any test runs. fastapi
is correctly placed under the serve extras in python/pyproject.toml;
the workflow just wasn't installing them. Add [serve] alongside [dev]
so the server tests can collect and run.

This has been broken on every Python Tests run since the v2 HTTP
server PR (#669) landed test_server.py.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
ncylich added a commit that referenced this pull request May 29, 2026
test_server_live.py boots a real server against a converted bundle at
weights/gemma-4-e2b-it and pytest.fails (not skips) when the bundle is
absent. CI doesn't ship pre-converted models, so every assertion in
that file errors out at fixture setup. This was also failing on every
Python Tests run since #669 added the file; the previous commit (the
[serve] extras fix) just got us far enough to surface it.

Match the existing ignore for tests/test_model.py, which is excluded
for the same reason.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
HenryNdubuaku pushed a commit that referenced this pull request Jun 2, 2026
…ings (#676)

* engine: wire image/audio embeddings and tighten failure modes

The v2 transpiler refactor left Model::get_embeddings,
get_image_embeddings, and get_audio_embeddings as stubs. The former
returned an empty vector silently (a footgun: rag.cpp callers fell
through to dimension-mismatch warnings and chat completion silently
skipped RAG/tool-RAG). The latter two threw with a "not wired up yet"
message even though the vision_encoder and audio_encoder components
they need are already built and exercised by completion and
transcription.

Engine

- Model::get_embeddings now throws "Text embeddings not wired up for
  transpiled bundles yet" instead of returning {}. Matches the
  sibling pattern and surfaces the failure to callers instead of
  silently producing empty results.

- Implement Model::get_image_embeddings and Model::get_audio_embeddings:
  drive the existing run_vision_encoder / run_audio_encoder_messages
  paths, then dequantize the encoder output (FP32 / FP16 / INT8 -> FP32),
  mean-pool over the leading dims into the last (hidden_dim), and
  L2-normalize. A small anonymous-namespace helper
  (pool_and_normalize_media_feature) keeps the two methods symmetric.
  Both call load_component_graph after extracting the feature so the
  transcribe_whisper_seq2seq / transcribe_parakeet_tdt paths (which
  call bind_runtime_buffers directly and assume the encoder graph is
  persistently loaded) keep working when an embed call precedes a
  transcribe call.

- Make Model::run_vision_encoder family-aware. It was Gemma4-only:
  unconditionally writing pixel_values + pixel_position_ids, which
  produced NaN outputs on LFM2-VL (expects pixel_attention_mask, not
  pixel_position_ids) and Qwen3-VL. The new dispatch mirrors the one
  in run_chunk_prefill_path: LFM2-VL gets pixel_values +
  pixel_attention_mask; Qwen3-VL gets pixel_values only; Gemma4 keeps
  pixel_values + pixel_position_ids. All inputs go through
  write_typed_buffer / a precision-aware int writer instead of
  write_bytes_input so the per-buffer precision is honored. This also
  fixes a latent bug on the legacy lm_encoder_media_step fallback path
  (model.cpp run_chat_prefill loop) for non-Gemma4 multimodal models.

- rag.cpp: wrap the four unprotected get_embeddings call sites in
  retrieve_rag_context and select_relevant_tools with try/catch that
  logs a warning and returns the same fallback those functions
  already use (empty context / unfiltered tool set). Without this,
  enabling RAG or tool_rag_top_k>0 would now throw out through
  cactus_complete.

Python

- python/cactus/bindings/cactus.py: cactus_embed, cactus_image_embed,
  and cactus_audio_embed were passing the element count (4096) as
  buffer_size, but the C side treats it as bytes. The 4096-float
  buffer is actually 16384 bytes. Masked when hidden_dim <= 1024
  (LFM2-VL exactly at boundary); broke Gemma4 (1536 dims -> 6144
  bytes -> rejected as "Buffer too small") and Qwen3-VL (2048 dims).
  Fixed by passing ctypes.sizeof(buf).

- python/tests/test_model.py: un-skip test_image_embedding and
  test_audio_embedding now that the wrappers are implemented. Keep
  test_text_embedding skipped, narrow the skip reason to text-only.

Verified end-to-end on LFM2-VL-450M, Gemma-4-E2B-it, Qwen3-VL-2B-Instruct
(image embed -> dim 1024 / 1536 / 2048, all L2-normalized, no NaNs) and
Whisper-small (audio embed -> dim 768). Each model's multimodal
completion or transcription still works when called after the embed
extracts its feature, confirming the load-state restore.

Full pytest: 187 passed, 1 skipped (the remaining text-embedding stub).

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* ci: install [dev,serve] so test_server can collect

python/tests/test_server.py imports fastapi at module level, but the
workflow installs only python/[dev], so pytest collection fails with
ModuleNotFoundError and the job exits 2 before any test runs. fastapi
is correctly placed under the serve extras in python/pyproject.toml;
the workflow just wasn't installing them. Add [serve] alongside [dev]
so the server tests can collect and run.

This has been broken on every Python Tests run since the v2 HTTP
server PR (#669) landed test_server.py.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* ci: ignore test_server_live.py in Python Tests

test_server_live.py boots a real server against a converted bundle at
weights/gemma-4-e2b-it and pytest.fails (not skips) when the bundle is
absent. CI doesn't ship pre-converted models, so every assertion in
that file errors out at fixture setup. This was also failing on every
Python Tests run since #669 added the file; the previous commit (the
[serve] extras fix) just got us far enough to surface it.

Match the existing ignore for tests/test_model.py, which is excluded
for the same reason.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* engine: wire nomic text embeddings end-to-end (transpile, runtime, server)

The v2 transpiler refactor left Model::get_embeddings throwing "Text
embeddings not wired up for transpiled bundles yet", so RAG, the corpus
index, and the FFI cactus_embed had no working text-embedding path. This
ports the nomic-embed-text-v2-moe model (the same one main shipped) onto
the v2 transpiled-bundle architecture and turns the stub into a real
encoder run, verified against the HuggingFace reference.

Transpile

- Add a text_embedding task plus NomicTextEmbeddingAdapter, an
  export-friendly reimplementation of the nomic-bert encoder. It reuses
  the HF submodule weights but reimplements the forward to survive
  torch.export: rotary is recomputed from seq_len (no lazy cos/sin
  cache), the additive attention mask is built with scalar sub/mul (no
  rsub), and the MoE is dense -- two fused matmuls over the packed
  expert weights gated by a top-k softmax built from iterative
  amax/masking (torch.topk, index_add_, and slicing a quantized weight
  do not lower). The graph emits last_hidden_state; pooling and
  normalization happen in the engine.

- Route nomic through the component pipeline: _family_key detection,
  canonicalize_model_interface + build_component_module_specs dispatch,
  the "text_embedding" component in the bundle manifest order, task
  auto-inference in component_plan.py and hf_model.py, a fixed-length
  text-embedding input builder, and a loader that forces
  trust_remote_code (the transformers-native nomic_bert is a different
  architecture and does not match the converted weights).

- canonicalize/cleanup.py: stop fp16-legalizing the embedding op's
  weight input. The embedding kernel dequantizes CQ/FP16 weights
  directly; the inserted precision_cast tried CQ4 -> FP16 at runtime
  and failed.

Convert

- NomicAdapter now emits one fused tensor per HF parameter (Wqkv and
  experts.mlp.w1/w2 are no longer split into q/k/v or per-expert
  files) so the transpiled graph, which binds weights by HF name, gets
  a 1:1 match. experts.mlp.w2 is stored transposed so the second
  expert matmul consumes it as a direct linear weight. The tiny MoE
  router stays FP16 -- 4-bit quantizing [num_experts, hidden] corrupts
  routing.

- Tokenizer conversion handles Unigram (XLM-RoBERTa, used by nomic):
  classify it as SentencePiece, emit per-token Viterbi scores into
  vocab.txt (id<TAB>token<TAB>score), and write the unigram runtime
  config (sp_model_type, sp_add_dummy_prefix, metaspace). Previously it
  was misdetected as BPE and mis-tokenized ("Paris is the capital of
  France." -> 31 tokens vs the reference 9).

Engine

- Implement Model::get_embeddings for transpiled bundles: load the
  text_embedding component, wrap the tokens with BOS/EOS (matching the
  reference add_special_tokens), run the graph, mean-pool over the real
  tokens, and optionally L2-normalize. Add an embedding-only init path
  (no decode route) and map model_type bert/nomic to ModelType::NOMIC
  so the config loader stops requiring Gemma4 fields.

- sp.cpp: the SentencePiece (non-BPE) path now honors
  sp_add_dummy_prefix, prepending the metaspace marker so unigram
  segmentation matches the reference.

Server

- Add an OpenAI-compatible POST /v1/embeddings endpoint (string or list
  input) backed by cactus_embed, plus EMBED_MODEL_TYPES. create_app can
  now serve a non-LLM bundle as its default model so an embedding-only
  server boots.

Tests

- python/tests/test_nomic_text_embedding.py: task/family routing.
- python/tests/test_model.py: TestNomicEmbedding covers determinism,
  retrieval discrimination, and HF parity (asserts cosine vs the
  reference last-hidden mean-pooled embedding).
- python/tests/test_server_live.py: live /v1/embeddings (string, list,
  rejects non-embedding models).
- Redistribute the convert NomicAdapter tests out of the catch-all
  test_nomic_adapter.py into their topical homes (test_policy.py,
  test_naming_qdq.py) and move the unrelated LFM2 tests into a new
  test_lfm2_adapter.py; delete test_nomic_adapter.py.

Verified end-to-end on nomic-embed-text-v2-moe: cactus vs HF cosine
0.92-0.94 (the gap is purely CQ4 4-bit weight quantization -- FP parity
is ~1.0, and cactus-vs-HF is identical whether HF runs fp16 or fp32),
with retrieval ranking preserved (query.relevant 0.72 >> query.unrelated
0.29). Convert/transpile produces a text_embedding component with all
146 weights bound. Full server_live suite green.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* test: drop obsolete VLM text-embedding skip

TestVLMModel.test_text_embedding was a permanently-skipped stub that
called cactus_embed on an LFM2-VL bundle, which has no text_embedding
component. Text embeddings now have real coverage in TestNomicEmbedding
(shape/determinism, retrieval discrimination, HF parity), so remove the
dead test and its skip constant. The remaining skips are all conditional
runtime guards (missing image/audio assets, transformers not installed,
no embedding bundle present).

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

---------

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants