Wire text/image/audio embeddings on the v2 transpiler; add /v1/embeddings by ncylich · Pull Request #676 · cactus-compute/cactus

ncylich · 2026-05-29T23:07:39Z

Summary

Brings the three embedding entry points in Model up on the v2 transpiler. The refactor had left them as stubs: get_embeddings returned {} silently (so RAG/tool-RAG callers in rag.cpp fell through to dimension-mismatch warnings instead of failing), and get_image_embeddings / get_audio_embeddings threw "not wired up yet" even though the vision_encoder / audio_encoder components were already built and exercised by completion and transcription.

This PR:

Ports nomic-embed-text-v2-moe (the same model main shipped) onto the v2 transpiled-bundle architecture and turns get_embeddings into a real encoder run, verified against the HuggingFace reference.
Adds an OpenAI-compatible POST /v1/embeddings endpoint.
Implements image/audio embeddings and fixes a latent vision-encoder family bug.

Text embeddings (nomic-embed-text-v2-moe)

Transpile

New text_embedding task + NomicTextEmbeddingAdapter, an export-friendly reimplementation of the nomic-bert encoder. It reuses the HF submodule weights but reimplements the forward to survive torch.export: rotary is recomputed from seq_len (no lazy cos/sin cache), the additive attention mask uses scalar sub/mul (no rsub), and the MoE is dense — two fused matmuls over the packed expert weights gated by a top-k softmax built from iterative amax/masking (torch.topk, index_add_, and slicing a quantized weight do not lower). The graph emits last_hidden_state; pooling/normalization happen in the engine.
Routing: _family_key detection, canonicalize_model_interface + build_component_module_specs dispatch, the text_embedding component in the bundle manifest, task auto-inference (component_plan.py, hf_model.py), a fixed-length input builder, and a loader that forces trust_remote_code (the transformers-native nomic_bert is a different architecture and won't match the converted weights).
canonicalize/cleanup.py: stop fp16-legalizing the embedding op's weight input — the kernel dequantizes CQ/FP16 weights directly, and the inserted precision_cast tried CQ4 → FP16 at runtime and failed.
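
The dense MoE gating described above can be sketched in plain Python. This is an illustration only, not the adapter's code: the function names are hypothetical, the choice to apply softmax over all experts before selection (without renormalizing afterwards) is an assumption, and a tensor implementation would build the mask by comparing against the running maximum instead of calling `list.index`.

```python
import math


def topk_gate_weights(logits, k):
    """Return per-expert gate weights with all but the top-k entries zeroed.

    Selection uses k rounds of "take the max, then mask it out" instead of
    a sort or top-k primitive.
    """
    # Softmax over all experts (assumption: softmax precedes selection).
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    remaining = list(probs)
    keep = [0.0] * len(probs)
    for _ in range(k):
        best = max(remaining)           # the "amax" step
        idx = remaining.index(best)     # first position attaining the max
        keep[idx] = 1.0
        remaining[idx] = float("-inf")  # mask so the next round picks another
    return [p * g for p, g in zip(probs, keep)]


def dense_moe(x, experts, logits, k):
    """Evaluate every expert and combine outputs with the masked weights."""
    weights = topk_gate_weights(logits, k)
    return sum(w * expert(x) for w, expert in zip(weights, experts))
```

Every expert is evaluated even though only k of them contribute; that is the cost of keeping the graph free of data-dependent indexing.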

Convert

NomicAdapter now emits one fused tensor per HF parameter (Wqkv and experts.mlp.w1/w2 are no longer split into q/k/v or per-expert files) so the transpiled graph — which binds weights by HF name — gets a 1:1 match. experts.mlp.w2 is stored transposed so the second expert matmul consumes it as a direct linear weight. The tiny MoE router stays FP16 (4-bit quantizing [num_experts, hidden] corrupts routing).
Tokenizer conversion handles Unigram (XLM-RoBERTa, used by nomic): classify it as SentencePiece, emit per-token Viterbi scores into vocab.txt, and write the unigram runtime config (sp_model_type, sp_add_dummy_prefix, metaspace). It was previously misdetected as BPE and mis-tokenized ("Paris is the capital of France." → 31 tokens vs the reference 9).
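
To show why the per-token scores and the dummy prefix both matter, here is a minimal Viterbi segmenter for a Unigram vocabulary. It is a sketch under stated assumptions, not the converter or the engine tokenizer: `viterbi_segment` is a hypothetical name, the toy vocabulary and its scores are invented for illustration, and real SentencePiece normalization does more than replace spaces.

```python
def viterbi_segment(text, scores, add_dummy_prefix=True, marker="\u2581"):
    """Split text into the highest-scoring sequence of vocabulary pieces."""
    # Simplified normalization: spaces become the metaspace marker, and a
    # leading marker is prepended when add_dummy_prefix is set.
    s = text.replace(" ", marker)
    if add_dummy_prefix:
        s = marker + s
    n = len(s)
    neg_inf = float("-inf")
    best = [neg_inf] * (n + 1)  # best[i]: best total score for s[:i]
    back = [0] * (n + 1)        # back[i]: start offset of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = s[start:end]
            if piece in scores and best[start] > neg_inf:
                cand = best[start] + scores[piece]
                if cand > best[end]:
                    best[end] = cand
                    back[end] = start
    if best[n] == neg_inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces = []
    i = n
    while i > 0:
        pieces.append(s[back[i]:i])
        i = back[i]
    return pieces[::-1]


# Toy vocabulary with made-up log-probability scores.
TOY_SCORES = {
    "\u2581Paris": -2.0, "\u2581is": -1.0, "\u2581": -3.0,
    "Par": -4.0, "is": -3.0,
    "P": -5.0, "a": -5.0, "r": -5.0, "i": -5.0, "s": -5.0,
}
```

With this vocabulary, "Paris is" segments into two pieces when the dummy prefix is applied and into three when it is not, which is the kind of divergence the runtime config is meant to prevent.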

Engine

Implement Model::get_embeddings for transpiled bundles: load the text_embedding component, wrap tokens with BOS/EOS (matching add_special_tokens), run the graph, mean-pool over real tokens, optionally L2-normalize. Add an embedding-only init path (no decode route) and map model_type bert/nomic → ModelType::NOMIC so the config loader stops requiring Gemma4 fields.
sp.cpp: the SentencePiece (non-BPE) path honors sp_add_dummy_prefix, prepending the metaspace marker so unigram segmentation matches the reference.
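
The pooling step in Model::get_embeddings can be sketched as follows. This is a plain-Python illustration rather than the C++ engine code: the function names are hypothetical, and representing "real tokens" as a 0/1 mask over rows of the hidden-state matrix is an assumption about how padding is tracked.

```python
import math


def mean_pool(hidden, mask):
    """Average the token vectors whose mask entry is 1, skipping padding."""
    count = sum(mask)
    if count == 0:
        raise ValueError("no real tokens to pool")
    pooled = [0.0] * len(hidden[0])
    for row, keep in zip(hidden, mask):
        if keep:
            for j, value in enumerate(row):
                pooled[j] += value
    return [value / count for value in pooled]


def l2_normalize(vec, eps=1e-12):
    """Scale vec to unit length; eps guards against a zero vector."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def embed(hidden, mask, normalize=True):
    """Mean-pool last_hidden_state, then optionally L2-normalize."""
    pooled = mean_pool(hidden, mask)
    return l2_normalize(pooled) if normalize else pooled
```

For example, `embed([[3.0, 0.0], [0.0, 4.0], [100.0, 100.0]], [1, 1, 0])` ignores the padded third row, pools to `[1.5, 2.0]`, and normalizes to approximately `[0.6, 0.8]`.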

Server

POST /v1/embeddings (string or list input) backed by cactus_embed, plus EMBED_MODEL_TYPES. create_app can now serve a non-LLM bundle as its default model, so an embedding-only server boots.
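
A client call against the new endpoint might look like the sketch below. The helper names, base URL, and model name are placeholders, and the response fields (`data`, `index`, `embedding`) are assumed from the OpenAI embeddings format that the endpoint is described as compatible with; the PR text does not list them.

```python
import json
import urllib.request


def build_embeddings_request(base_url, model, inputs):
    """Build (but do not send) a POST /v1/embeddings request.

    `inputs` may be a single string or a list of strings.
    """
    payload = {"model": model, "input": inputs}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings_response(body):
    """Extract vectors from an OpenAI-style response, ordered by index."""
    doc = json.loads(body)
    items = sorted(doc["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

Sending the request is then a matter of passing it to `urllib.request.urlopen` and feeding the response body to `parse_embeddings_response`.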

HF parity

cactus vs HF cosine 0.92–0.94. The gap is purely CQ4 4-bit weight quantization: FP parity of the adapter is ~1.0, and cactus-vs-HF is identical whether HF runs fp16 or fp32. Retrieval ranking is preserved (query·relevant 0.72 ≫ query·unrelated 0.29). Convert/transpile produces a text_embedding component with all 146 weights bound.
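
The two checks reported here, embedding parity and retrieval ranking, both reduce to cosine similarity. A minimal sketch, with hypothetical helper names and no claim to match the assertions in TestNomicEmbedding:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("dimension mismatch")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero vector")
    return dot / (norm_a * norm_b)


def ranking_preserved(query, relevant, unrelated):
    """True when the relevant document scores above the unrelated one."""
    return cosine(query, relevant) > cosine(query, unrelated)
```

Parity compares the cactus vector with the HF vector for the same input; ranking compares one query vector against two document vectors from the same backend.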

Image / audio embeddings + vision-encoder fix

Model::get_embeddings now throws (mirroring the sibling methods) instead of returning {}, which surfaces the failure to cactus_embed and cactus_rag_query (both already wrap the call in try/catch).
rag.cpp: wrap the four unprotected get_embeddings callsites in retrieve_rag_context / select_relevant_tools with try/catch so chat completion degrades gracefully (empty RAG context / unfiltered tool set) instead of throwing out.
Implement Model::get_image_embeddings / Model::get_audio_embeddings: run the existing encoder component, dequantize FP32/FP16/INT8 → FP32, mean-pool over leading dims into hidden_dim, L2-normalize. Both call load_component_graph after extracting the output so transcribe_whisper_seq2seq / transcribe_parakeet_tdt keep working when an embed precedes a transcribe.
Make Model::run_vision_encoder family-aware. It was Gemma4-only: it unconditionally wrote pixel_values + pixel_position_ids, producing NaN outputs on LFM2-VL (which expects pixel_attention_mask) and Qwen3-VL. Dispatch now mirrors run_chunk_prefill_path (LFM2-VL / Qwen3-VL / default), with precision-aware writes instead of raw write_bytes_input. This also fixes a latent bug on the legacy lm_encoder_media_step fallback path for non-Gemma4 multimodal models.
python/cactus/bindings/cactus.py: cactus_embed / cactus_image_embed / cactus_audio_embed passed the element count (4096) as buffer_size, but the C side treats it as bytes. Masked when hidden_dim ≤ 1024; broke Gemma4 (1536) and Qwen3-VL (2048). Fixed by passing ctypes.sizeof(buf).
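
The buffer-size mismatch is easy to reproduce with ctypes alone. In this sketch the `fits` function is a hypothetical stand-in for the C-side check, written only to show the arithmetic; the 4096-element float buffer and the hidden dimensions come from the description above.

```python
import ctypes

EMBED_CAPACITY = 4096  # number of float elements in the output buffer

buf = (ctypes.c_float * EMBED_CAPACITY)()

element_count = len(buf)        # 4096 elements
byte_size = ctypes.sizeof(buf)  # 16384 bytes (4 bytes per float)


def fits(hidden_dim, buffer_size_bytes):
    """Stand-in for a C-side check that treats buffer_size as bytes."""
    needed = hidden_dim * ctypes.sizeof(ctypes.c_float)
    return needed <= buffer_size_bytes


# Passing the element count: 1024 dims (4096 bytes) squeaks through,
# while 1536 dims (6144 bytes) is rejected.
assert fits(1024, element_count)
assert not fits(1536, element_count)

# Passing ctypes.sizeof(buf): every dimension up to 4096 fits.
assert fits(1536, byte_size)
assert fits(2048, byte_size)
```

Using `ctypes.sizeof(buf)` keeps the byte count tied to the buffer's element type, so it stays correct if the element type or capacity changes.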

Tests

python/tests/test_nomic_text_embedding.py: task/family routing.
python/tests/test_model.py: TestNomicEmbedding (determinism, retrieval discrimination, HF-parity cosine assertion); image/audio embedding tests un-skipped.
python/tests/test_server_live.py: live /v1/embeddings (string, list, rejects non-embedding models).
Redistribute the convert NomicAdapter tests out of the catch-all test_nomic_adapter.py into their topical homes (test_policy.py, test_naming_qdq.py), move the unrelated LFM2 tests into a new test_lfm2_adapter.py, and delete test_nomic_adapter.py.

Verification

Model	Embed dim	L2 norm	Completion/transcription after embed
nomic-embed-text-v2-moe (text)	768	1.0	HF cosine 0.92–0.94
LFM2-VL-450M (image)	1024	1.0	✅
Gemma-4-E2B-it (image)	1536	1.0	✅
Qwen3-VL-2B-Instruct (image)	2048	1.0	✅
Whisper-small (audio)	768	1.0	✅

Test plan

pytest python/cactus/convert/tests/ python/tests/test_nomic_text_embedding.py — green
pytest python/tests/test_model.py::TestNomicEmbedding -s — determinism, retrieval, HF parity
pytest python/tests/test_server_live.py — 19 pass (chat, transcription, embeddings)
cactus convert nomic-ai/nomic-embed-text-v2-moe → text_embedding component, 146 weights bound
Manual: image embed + completion on LFM2-VL, Gemma-4, Qwen3-VL; audio embed + transcription on Whisper-small

The v2 transpiler refactor left Model::get_embeddings, get_image_embeddings, and get_audio_embeddings as stubs. The former returned an empty vector silently (a footgun: rag.cpp callers fell through to dimension-mismatch warnings and chat completion silently skipped RAG/tool-RAG). The latter two threw with a "not wired up yet" message even though the vision_encoder and audio_encoder components they need are already built and exercised by completion and transcription.

Engine

- Model::get_embeddings now throws "Text embeddings not wired up for transpiled bundles yet" instead of returning {}. Matches the sibling pattern and surfaces the failure to callers instead of silently producing empty results.
- Implement Model::get_image_embeddings and Model::get_audio_embeddings: drive the existing run_vision_encoder / run_audio_encoder_messages paths, then dequantize the encoder output (FP32 / FP16 / INT8 -> FP32), mean-pool over the leading dims into the last (hidden_dim), and L2-normalize. A small anonymous-namespace helper (pool_and_normalize_media_feature) keeps the two methods symmetric. Both call load_component_graph after extracting the feature so the transcribe_whisper_seq2seq / transcribe_parakeet_tdt paths (which call bind_runtime_buffers directly and assume the encoder graph is persistently loaded) keep working when an embed call precedes a transcribe call.
- Make Model::run_vision_encoder family-aware. It was Gemma4-only: unconditionally writing pixel_values + pixel_position_ids, which produced NaN outputs on LFM2-VL (expects pixel_attention_mask, not pixel_position_ids) and Qwen3-VL. The new dispatch mirrors the one in run_chunk_prefill_path: LFM2-VL gets pixel_values + pixel_attention_mask; Qwen3-VL gets pixel_values only; Gemma4 keeps pixel_values + pixel_position_ids. All inputs go through write_typed_buffer / a precision-aware int writer instead of write_bytes_input so the per-buffer precision is honored. This also fixes a latent bug on the legacy lm_encoder_media_step fallback path (model.cpp run_chat_prefill loop) for non-Gemma4 multimodal models.
- rag.cpp: wrap the four unprotected get_embeddings call sites in retrieve_rag_context and select_relevant_tools with try/catch that logs a warning and returns the same fallback those functions already use (empty context / unfiltered tool set). Without this, enabling RAG or tool_rag_top_k>0 would now throw out through cactus_complete.

Python

- python/cactus/bindings/cactus.py: cactus_embed, cactus_image_embed, and cactus_audio_embed were passing the element count (4096) as buffer_size, but the C side treats it as bytes. The 4096-float buffer is actually 16384 bytes. Masked when hidden_dim <= 1024 (LFM2-VL exactly at boundary); broke Gemma4 (1536 dims -> 6144 bytes -> rejected as "Buffer too small") and Qwen3-VL (2048 dims). Fixed by passing ctypes.sizeof(buf).
- python/tests/test_model.py: un-skip test_image_embedding and test_audio_embedding now that the wrappers are implemented. Keep test_text_embedding skipped, narrow the skip reason to text-only.

Verified end-to-end on LFM2-VL-450M, Gemma-4-E2B-it, Qwen3-VL-2B-Instruct (image embed -> dim 1024 / 1536 / 2048, all L2-normalized, no NaNs) and Whisper-small (audio embed -> dim 768). Each model's multimodal completion or transcription still works when called after the embed extracts its feature, confirming the load-state restore.

Full pytest: 187 passed, 1 skipped (the remaining text-embedding stub).

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

python/tests/test_server.py imports fastapi at module level, but the workflow installs only python/[dev], so pytest collection fails with ModuleNotFoundError and the job exits 2 before any test runs. fastapi is correctly placed under the serve extras in python/pyproject.toml; the workflow just wasn't installing them. Add [serve] alongside [dev] so the server tests can collect and run. This has been broken on every Python Tests run since the v2 HTTP server PR (#669) landed test_server.py.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

test_server_live.py boots a real server against a converted bundle at weights/gemma-4-e2b-it and pytest.fails (not skips) when the bundle is absent. CI doesn't ship pre-converted models, so every assertion in that file errors out at fixture setup. This was also failing on every Python Tests run since #669 added the file; the previous commit (the [serve] extras fix) just got us far enough to surface it. Match the existing ignore for tests/test_model.py, which is excluded for the same reason.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

…rver)

The v2 transpiler refactor left Model::get_embeddings throwing "Text embeddings not wired up for transpiled bundles yet", so RAG, the corpus index, and the FFI cactus_embed had no working text-embedding path. This ports the nomic-embed-text-v2-moe model (the same one main shipped) onto the v2 transpiled-bundle architecture and turns the stub into a real encoder run, verified against the HuggingFace reference.

Transpile

- Add a text_embedding task plus NomicTextEmbeddingAdapter, an export-friendly reimplementation of the nomic-bert encoder. It reuses the HF submodule weights but reimplements the forward to survive torch.export: rotary is recomputed from seq_len (no lazy cos/sin cache), the additive attention mask is built with scalar sub/mul (no rsub), and the MoE is dense -- two fused matmuls over the packed expert weights gated by a top-k softmax built from iterative amax/masking (torch.topk, index_add_, and slicing a quantized weight do not lower). The graph emits last_hidden_state; pooling and normalization happen in the engine.
- Route nomic through the component pipeline: _family_key detection, canonicalize_model_interface + build_component_module_specs dispatch, the "text_embedding" component in the bundle manifest order, task auto-inference in component_plan.py and hf_model.py, a fixed-length text-embedding input builder, and a loader that forces trust_remote_code (the transformers-native nomic_bert is a different architecture and does not match the converted weights).
- canonicalize/cleanup.py: stop fp16-legalizing the embedding op's weight input. The embedding kernel dequantizes CQ/FP16 weights directly; the inserted precision_cast tried CQ4 -> FP16 at runtime and failed.

Convert

- NomicAdapter now emits one fused tensor per HF parameter (Wqkv and experts.mlp.w1/w2 are no longer split into q/k/v or per-expert files) so the transpiled graph, which binds weights by HF name, gets a 1:1 match. experts.mlp.w2 is stored transposed so the second expert matmul consumes it as a direct linear weight. The tiny MoE router stays FP16 -- 4-bit quantizing [num_experts, hidden] corrupts routing.
- Tokenizer conversion handles Unigram (XLM-RoBERTa, used by nomic): classify it as SentencePiece, emit per-token Viterbi scores into vocab.txt (id<TAB>token<TAB>score), and write the unigram runtime config (sp_model_type, sp_add_dummy_prefix, metaspace). Previously it was misdetected as BPE and mis-tokenized ("Paris is the capital of France." -> 31 tokens vs the reference 9).

Engine

- Implement Model::get_embeddings for transpiled bundles: load the text_embedding component, wrap the tokens with BOS/EOS (matching the reference add_special_tokens), run the graph, mean-pool over the real tokens, and optionally L2-normalize. Add an embedding-only init path (no decode route) and map model_type bert/nomic to ModelType::NOMIC so the config loader stops requiring Gemma4 fields.
- sp.cpp: the SentencePiece (non-BPE) path now honors sp_add_dummy_prefix, prepending the metaspace marker so unigram segmentation matches the reference.

Server

- Add an OpenAI-compatible POST /v1/embeddings endpoint (string or list input) backed by cactus_embed, plus EMBED_MODEL_TYPES. create_app can now serve a non-LLM bundle as its default model so an embedding-only server boots.

Tests

- python/tests/test_nomic_text_embedding.py: task/family routing.
- python/tests/test_model.py: TestNomicEmbedding covers determinism, retrieval discrimination, and HF parity (asserts cosine vs the reference last-hidden mean-pooled embedding).
- python/tests/test_server_live.py: live /v1/embeddings (string, list, rejects non-embedding models).
- Redistribute the convert NomicAdapter tests out of the catch-all test_nomic_adapter.py into their topical homes (test_policy.py, test_naming_qdq.py) and move the unrelated LFM2 tests into a new test_lfm2_adapter.py; delete test_nomic_adapter.py.

Verified end-to-end on nomic-embed-text-v2-moe: cactus vs HF cosine 0.92-0.94 (the gap is purely CQ4 4-bit weight quantization -- FP parity is ~1.0, and cactus-vs-HF is identical whether HF runs fp16 or fp32), with retrieval ranking preserved (query.relevant 0.72 >> query.unrelated 0.29). Convert/transpile produces a text_embedding component with all 146 weights bound. Full server_live suite green.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

TestVLMModel.test_text_embedding was a permanently-skipped stub that called cactus_embed on an LFM2-VL bundle, which has no text_embedding component. Text embeddings now have real coverage in TestNomicEmbedding (shape/determinism, retrieval discrimination, HF parity), so remove the dead test and its skip constant. The remaining skips are all conditional runtime guards (missing image/audio assets, transformers not installed, no embedding bundle present).

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

ncylich added 4 commits May 29, 2026 16:06

ncylich changed the title ~~Wire image/audio embeddings; throw on text embeddings; fix latent vision-encoder family bug~~ Wire text/image/audio embeddings on the v2 transpiler; add /v1/embeddings Jun 2, 2026

HenryNdubuaku merged commit add442d into v2 Jun 2, 2026
5 checks passed

HenryNdubuaku merged 5 commits into v2 from test-model-fixes
