Add v2 local HTTP server#669
Merged
Merged
Conversation
Introduce an OpenAI-compatible local server for prepared Cactus v2 bundles with chat completions, streaming responses, model discovery, tool-call mapping, WAV transcription, and warm model slot management. Wire the server into the CLI as cactus serve, add serving dependencies, and keep bundle loading on the direct Python FFI path instead of shelling out through cactus run. Add focused mocked server tests and live HTTP e2e tests that launch the server, exercise real generation, streaming, multi-turn chat, concurrency, and transcription paths. Signed-off-by: Noah Cylich <noahcylich@gmail.com>
ncylich
added a commit
that referenced
this pull request
May 29, 2026
python/tests/test_server.py imports fastapi at module level, but the workflow installs only python/[dev], so pytest collection fails with ModuleNotFoundError and the job exits 2 before any test runs. fastapi is correctly placed under the serve extras in python/pyproject.toml; the workflow just wasn't installing them. Add [serve] alongside [dev] so the server tests can collect and run. This has been broken on every Python Tests run since the v2 HTTP server PR (#669) landed test_server.py. Signed-off-by: Noah Cylich <noahcylich@gmail.com>
ncylich
added a commit
that referenced
this pull request
May 29, 2026
test_server_live.py boots a real server against a converted bundle at weights/gemma-4-e2b-it and pytest.fails (not skips) when the bundle is absent. CI doesn't ship pre-converted models, so every assertion in that file errors out at fixture setup. This was also failing on every Python Tests run since #669 added the file; the previous commit (the [serve] extras fix) just got us far enough to surface it. Match the existing ignore for tests/test_model.py, which is excluded for the same reason. Signed-off-by: Noah Cylich <noahcylich@gmail.com>
HenryNdubuaku
pushed a commit
that referenced
this pull request
Jun 2, 2026
…ings (#676) * engine: wire image/audio embeddings and tighten failure modes The v2 transpiler refactor left Model::get_embeddings, get_image_embeddings, and get_audio_embeddings as stubs. The former returned an empty vector silently (a footgun: rag.cpp callers fell through to dimension-mismatch warnings and chat completion silently skipped RAG/tool-RAG). The latter two threw with a "not wired up yet" message even though the vision_encoder and audio_encoder components they need are already built and exercised by completion and transcription. Engine - Model::get_embeddings now throws "Text embeddings not wired up for transpiled bundles yet" instead of returning {}. Matches the sibling pattern and surfaces the failure to callers instead of silently producing empty results. - Implement Model::get_image_embeddings and Model::get_audio_embeddings: drive the existing run_vision_encoder / run_audio_encoder_messages paths, then dequantize the encoder output (FP32 / FP16 / INT8 -> FP32), mean-pool over the leading dims into the last (hidden_dim), and L2-normalize. A small anonymous-namespace helper (pool_and_normalize_media_feature) keeps the two methods symmetric. Both call load_component_graph after extracting the feature so the transcribe_whisper_seq2seq / transcribe_parakeet_tdt paths (which call bind_runtime_buffers directly and assume the encoder graph is persistently loaded) keep working when an embed call precedes a transcribe call. - Make Model::run_vision_encoder family-aware. It was Gemma4-only: unconditionally writing pixel_values + pixel_position_ids, which produced NaN outputs on LFM2-VL (expects pixel_attention_mask, not pixel_position_ids) and Qwen3-VL. The new dispatch mirrors the one in run_chunk_prefill_path: LFM2-VL gets pixel_values + pixel_attention_mask; Qwen3-VL gets pixel_values only; Gemma4 keeps pixel_values + pixel_position_ids. All inputs go through write_typed_buffer / a precision-aware int writer instead of write_bytes_input so the per-buffer precision is honored. This also fixes a latent bug on the legacy lm_encoder_media_step fallback path (model.cpp run_chat_prefill loop) for non-Gemma4 multimodal models. - rag.cpp: wrap the four unprotected get_embeddings call sites in retrieve_rag_context and select_relevant_tools with try/catch that logs a warning and returns the same fallback those functions already use (empty context / unfiltered tool set). Without this, enabling RAG or tool_rag_top_k>0 would now throw out through cactus_complete. Python - python/cactus/bindings/cactus.py: cactus_embed, cactus_image_embed, and cactus_audio_embed were passing the element count (4096) as buffer_size, but the C side treats it as bytes. The 4096-float buffer is actually 16384 bytes. Masked when hidden_dim <= 1024 (LFM2-VL exactly at boundary); broke Gemma4 (1536 dims -> 6144 bytes -> rejected as "Buffer too small") and Qwen3-VL (2048 dims). Fixed by passing ctypes.sizeof(buf). - python/tests/test_model.py: un-skip test_image_embedding and test_audio_embedding now that the wrappers are implemented. Keep test_text_embedding skipped, narrow the skip reason to text-only. Verified end-to-end on LFM2-VL-450M, Gemma-4-E2B-it, Qwen3-VL-2B-Instruct (image embed -> dim 1024 / 1536 / 2048, all L2-normalized, no NaNs) and Whisper-small (audio embed -> dim 768). Each model's multimodal completion or transcription still works when called after the embed extracts its feature, confirming the load-state restore. Full pytest: 187 passed, 1 skipped (the remaining text-embedding stub). Signed-off-by: Noah Cylich <noahcylich@gmail.com> * ci: install [dev,serve] so test_server can collect python/tests/test_server.py imports fastapi at module level, but the workflow installs only python/[dev], so pytest collection fails with ModuleNotFoundError and the job exits 2 before any test runs. fastapi is correctly placed under the serve extras in python/pyproject.toml; the workflow just wasn't installing them. Add [serve] alongside [dev] so the server tests can collect and run. This has been broken on every Python Tests run since the v2 HTTP server PR (#669) landed test_server.py. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * ci: ignore test_server_live.py in Python Tests test_server_live.py boots a real server against a converted bundle at weights/gemma-4-e2b-it and pytest.fails (not skips) when the bundle is absent. CI doesn't ship pre-converted models, so every assertion in that file errors out at fixture setup. This was also failing on every Python Tests run since #669 added the file; the previous commit (the [serve] extras fix) just got us far enough to surface it. Match the existing ignore for tests/test_model.py, which is excluded for the same reason. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * engine: wire nomic text embeddings end-to-end (transpile, runtime, server) The v2 transpiler refactor left Model::get_embeddings throwing "Text embeddings not wired up for transpiled bundles yet", so RAG, the corpus index, and the FFI cactus_embed had no working text-embedding path. This ports the nomic-embed-text-v2-moe model (the same one main shipped) onto the v2 transpiled-bundle architecture and turns the stub into a real encoder run, verified against the HuggingFace reference. Transpile - Add a text_embedding task plus NomicTextEmbeddingAdapter, an export-friendly reimplementation of the nomic-bert encoder. It reuses the HF submodule weights but reimplements the forward to survive torch.export: rotary is recomputed from seq_len (no lazy cos/sin cache), the additive attention mask is built with scalar sub/mul (no rsub), and the MoE is dense -- two fused matmuls over the packed expert weights gated by a top-k softmax built from iterative amax/masking (torch.topk, index_add_, and slicing a quantized weight do not lower). The graph emits last_hidden_state; pooling and normalization happen in the engine. - Route nomic through the component pipeline: _family_key detection, canonicalize_model_interface + build_component_module_specs dispatch, the "text_embedding" component in the bundle manifest order, task auto-inference in component_plan.py and hf_model.py, a fixed-length text-embedding input builder, and a loader that forces trust_remote_code (the transformers-native nomic_bert is a different architecture and does not match the converted weights). - canonicalize/cleanup.py: stop fp16-legalizing the embedding op's weight input. The embedding kernel dequantizes CQ/FP16 weights directly; the inserted precision_cast tried CQ4 -> FP16 at runtime and failed. Convert - NomicAdapter now emits one fused tensor per HF parameter (Wqkv and experts.mlp.w1/w2 are no longer split into q/k/v or per-expert files) so the transpiled graph, which binds weights by HF name, gets a 1:1 match. experts.mlp.w2 is stored transposed so the second expert matmul consumes it as a direct linear weight. The tiny MoE router stays FP16 -- 4-bit quantizing [num_experts, hidden] corrupts routing. - Tokenizer conversion handles Unigram (XLM-RoBERTa, used by nomic): classify it as SentencePiece, emit per-token Viterbi scores into vocab.txt (id<TAB>token<TAB>score), and write the unigram runtime config (sp_model_type, sp_add_dummy_prefix, metaspace). Previously it was misdetected as BPE and mis-tokenized ("Paris is the capital of France." -> 31 tokens vs the reference 9). Engine - Implement Model::get_embeddings for transpiled bundles: load the text_embedding component, wrap the tokens with BOS/EOS (matching the reference add_special_tokens), run the graph, mean-pool over the real tokens, and optionally L2-normalize. Add an embedding-only init path (no decode route) and map model_type bert/nomic to ModelType::NOMIC so the config loader stops requiring Gemma4 fields. - sp.cpp: the SentencePiece (non-BPE) path now honors sp_add_dummy_prefix, prepending the metaspace marker so unigram segmentation matches the reference. Server - Add an OpenAI-compatible POST /v1/embeddings endpoint (string or list input) backed by cactus_embed, plus EMBED_MODEL_TYPES. create_app can now serve a non-LLM bundle as its default model so an embedding-only server boots. Tests - python/tests/test_nomic_text_embedding.py: task/family routing. - python/tests/test_model.py: TestNomicEmbedding covers determinism, retrieval discrimination, and HF parity (asserts cosine vs the reference last-hidden mean-pooled embedding). - python/tests/test_server_live.py: live /v1/embeddings (string, list, rejects non-embedding models). - Redistribute the convert NomicAdapter tests out of the catch-all test_nomic_adapter.py into their topical homes (test_policy.py, test_naming_qdq.py) and move the unrelated LFM2 tests into a new test_lfm2_adapter.py; delete test_nomic_adapter.py. Verified end-to-end on nomic-embed-text-v2-moe: cactus vs HF cosine 0.92-0.94 (the gap is purely CQ4 4-bit weight quantization -- FP parity is ~1.0, and cactus-vs-HF is identical whether HF runs fp16 or fp32), with retrieval ranking preserved (query.relevant 0.72 >> query.unrelated 0.29). Convert/transpile produces a text_embedding component with all 146 weights bound. Full server_live suite green. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * test: drop obsolete VLM text-embedding skip TestVLMModel.test_text_embedding was a permanently-skipped stub that called cactus_embed on an LFM2-VL bundle, which has no text_embedding component. Text embeddings now have real coverage in TestNomicEmbedding (shape/determinism, retrieval discrimination, HF parity), so remove the dead test and its skip constant. The remaining skips are all conditional runtime guards (missing image/audio assets, transformers not installed, no embedding bundle present). Signed-off-by: Noah Cylich <noahcylich@gmail.com> --------- Signed-off-by: Noah Cylich <noahcylich@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
cactus servewith explicit bundle validation and direct Python FFI loading.Testing
source ./venv/bin/activate && cactus build && python -m pytest python/tests/test_server.py -qsource ./venv/bin/activate && cactus build && python -m pytest python/tests/test_server_live.py -qNotes
cactus serve weights/gemma-4-e2b-itand exercise real generation, streaming, multi-turn chat, concurrent requests, and transcription through the server.