Wire text/image/audio embeddings on the v2 transpiler; add /v1/embeddings #676
Merged
The v2 transpiler refactor left Model::get_embeddings,
get_image_embeddings, and get_audio_embeddings as stubs. The first
silently returned an empty vector (a footgun: rag.cpp callers fell
through to dimension-mismatch warnings, and chat completion silently
skipped RAG/tool-RAG). The other two threw with a "not wired up yet"
message even though the vision_encoder and audio_encoder components
they need are already built and exercised by completion and
transcription.
Engine
- Model::get_embeddings now throws "Text embeddings not wired up for
transpiled bundles yet" instead of returning {}. Matches the
sibling pattern and surfaces the failure to callers instead of
silently producing empty results.
- Implement Model::get_image_embeddings and Model::get_audio_embeddings:
drive the existing run_vision_encoder / run_audio_encoder_messages
paths, then dequantize the encoder output (FP32 / FP16 / INT8 -> FP32),
mean-pool over the leading dims into the last (hidden_dim), and
L2-normalize. A small anonymous-namespace helper
(pool_and_normalize_media_feature) keeps the two methods symmetric.
Both call load_component_graph after extracting the feature so the
transcribe_whisper_seq2seq / transcribe_parakeet_tdt paths (which
call bind_runtime_buffers directly and assume the encoder graph is
persistently loaded) keep working when an embed call precedes a
transcribe call.
- Make Model::run_vision_encoder family-aware. It was Gemma4-only:
unconditionally writing pixel_values + pixel_position_ids, which
produced NaN outputs on LFM2-VL (expects pixel_attention_mask, not
pixel_position_ids) and Qwen3-VL. The new dispatch mirrors the one
in run_chunk_prefill_path: LFM2-VL gets pixel_values +
pixel_attention_mask; Qwen3-VL gets pixel_values only; Gemma4 keeps
pixel_values + pixel_position_ids. All inputs go through
write_typed_buffer / a precision-aware int writer instead of
write_bytes_input so the per-buffer precision is honored. This also
fixes a latent bug on the legacy lm_encoder_media_step fallback path
(model.cpp run_chat_prefill loop) for non-Gemma4 multimodal models.
- rag.cpp: wrap the four unprotected get_embeddings call sites in
retrieve_rag_context and select_relevant_tools with try/catch that
logs a warning and returns the same fallback those functions
already use (empty context / unfiltered tool set). Without this,
enabling RAG or tool_rag_top_k>0 would now throw out through
cactus_complete.
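The pool-and-normalize step described for the image and audio paths can be sketched in a few lines. This is only an illustration in Python, not the engine's C++ helper: the function name, the flat row-major layout, and the assumption that the values have already been dequantized to FP32 are all hypothetical.

```python
import math


def pool_and_normalize(flat, shape):
    """Mean-pool a row-major FP32 tensor over its leading dims, then L2-normalize."""
    hidden_dim = shape[-1]
    if hidden_dim <= 0 or len(flat) == 0 or len(flat) % hidden_dim != 0:
        raise ValueError("buffer length does not match the declared shape")
    rows = len(flat) // hidden_dim  # product of all leading dims

    # Sum every row of length hidden_dim, then divide by the row count.
    pooled = [0.0] * hidden_dim
    for r in range(rows):
        base = r * hidden_dim
        for j in range(hidden_dim):
            pooled[j] += flat[base + j]
    pooled = [v / rows for v in pooled]

    # L2-normalize; leave an all-zero vector untouched to avoid dividing by zero.
    norm = math.sqrt(sum(v * v for v in pooled))
    return pooled if norm == 0.0 else [v / norm for v in pooled]


# Shape (1, 2, 2): rows [3, 0] and [3, 8] pool to [3, 4], which has norm 5.
embedding = pool_and_normalize([3.0, 0.0, 3.0, 8.0], (1, 2, 2))
```

Collapsing all leading dims into a single row count keeps the same code usable for both a (patches, hidden) vision output and a (batch, frames, hidden) audio output.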
Python
- python/cactus/bindings/cactus.py: cactus_embed, cactus_image_embed,
and cactus_audio_embed were passing the element count (4096) as
buffer_size, but the C side treats it as bytes. The 4096-float
buffer is actually 16384 bytes. Masked when hidden_dim <= 1024
(LFM2-VL exactly at boundary); broke Gemma4 (1536 dims -> 6144
bytes -> rejected as "Buffer too small") and Qwen3-VL (2048 dims).
Fixed by passing ctypes.sizeof(buf).
- python/tests/test_model.py: un-skip test_image_embedding and
test_audio_embedding now that the wrappers are implemented. Keep
test_text_embedding skipped, but narrow the skip reason to text-only.
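The element-count versus byte-count mix-up is easy to reproduce with ctypes alone. The sketch below does not call the real bindings; `fits` is a hypothetical stand-in for a C-side capacity check that interprets `buffer_size` as bytes.

```python
import ctypes

# A 4096-float output buffer, as described for the Python wrappers.
buf = (ctypes.c_float * 4096)()

element_count = len(buf)        # number of floats in the array
byte_size = ctypes.sizeof(buf)  # size in bytes, which a byte-based C check expects


def fits(hidden_dim, buffer_size_bytes):
    """Hypothetical capacity check: does a hidden_dim-float vector fit?"""
    needed = hidden_dim * ctypes.sizeof(ctypes.c_float)
    return needed <= buffer_size_bytes


# Passing the element count under-reports the capacity by a factor of four.
masked_at_boundary = fits(1024, element_count)   # 4096 bytes needed, 4096 reported
rejected_before_fix = fits(1536, element_count)  # 6144 bytes needed, 4096 reported
accepted_after_fix = fits(1536, byte_size)       # 6144 bytes needed, 16384 reported
```

With the byte size passed through, any hidden size up to 4096 floats fits the same buffer.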
Verified end-to-end on LFM2-VL-450M, Gemma-4-E2B-it, Qwen3-VL-2B-Instruct
(image embed -> dim 1024 / 1536 / 2048, all L2-normalized, no NaNs) and
Whisper-small (audio embed -> dim 768). Each model's multimodal
completion or transcription still works when called after the embed
extracts its feature, confirming the load-state restore.
Full pytest: 187 passed, 1 skipped (the remaining text-embedding stub).
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
python/tests/test_server.py imports fastapi at module level, but the workflow installs only python/[dev], so pytest collection fails with ModuleNotFoundError and the job exits 2 before any test runs. fastapi is correctly placed under the serve extras in python/pyproject.toml; the workflow just wasn't installing them. Add [serve] alongside [dev] so the server tests can collect and run. This has been broken on every Python Tests run since the v2 HTTP server PR (#669) landed test_server.py.
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
test_server_live.py boots a real server against a converted bundle at weights/gemma-4-e2b-it and calls pytest.fail (rather than skipping) when the bundle is absent. CI doesn't ship pre-converted models, so every test in that file errors out at fixture setup. This was also failing on every Python Tests run since #669 added the file; the previous commit (the [serve] extras fix) just got us far enough to surface it. Match the existing ignore for tests/test_model.py, which is excluded for the same reason.
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
…rver)
The v2 transpiler refactor left Model::get_embeddings throwing "Text
embeddings not wired up for transpiled bundles yet", so RAG, the corpus
index, and the FFI cactus_embed had no working text-embedding path. This
ports the nomic-embed-text-v2-moe model (the same one main shipped) onto
the v2 transpiled-bundle architecture and turns the stub into a real
encoder run, verified against the HuggingFace reference.
Transpile
- Add a text_embedding task plus NomicTextEmbeddingAdapter, an
export-friendly reimplementation of the nomic-bert encoder. It reuses
the HF submodule weights but reimplements the forward to survive
torch.export: rotary is recomputed from seq_len (no lazy cos/sin
cache), the additive attention mask is built with scalar sub/mul (no
rsub), and the MoE is dense -- two fused matmuls over the packed
expert weights gated by a top-k softmax built from iterative
amax/masking (torch.topk, index_add_, and slicing a quantized weight
do not lower). The graph emits last_hidden_state; pooling and
normalization happen in the engine.
- Route nomic through the component pipeline: _family_key detection,
canonicalize_model_interface + build_component_module_specs dispatch,
the "text_embedding" component in the bundle manifest order, task
auto-inference in component_plan.py and hf_model.py, a fixed-length
text-embedding input builder, and a loader that forces
trust_remote_code (the transformers-native nomic_bert is a different
architecture and does not match the converted weights).
- canonicalize/cleanup.py: stop fp16-legalizing the embedding op's
weight input. The embedding kernel dequantizes CQ/FP16 weights
directly; the inserted precision_cast tried CQ4 -> FP16 at runtime
and failed.
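The "top-k softmax built from iterative amax/masking" idea can be illustrated without any tensor library. This is a minimal sketch, not the adapter's code: it assumes the softmax is taken over all router logits and the k largest probabilities are kept without renormalizing, and the scalar "experts" are stand-ins for the fused expert matmuls.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def topk_gate(logits, k):
    """Dense gate: softmax probability for the k largest experts, 0 elsewhere.

    Selection repeats "take the max, then mask it out" k times instead of
    calling a top-k primitive.
    """
    if not 0 < k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    probs = softmax(logits)
    remaining = list(probs)
    gate = [0.0] * len(probs)
    for _ in range(k):
        best = max(remaining)        # the amax step
        idx = remaining.index(best)  # position of the current maximum
        gate[idx] = probs[idx]
        remaining[idx] = -1.0        # mask: probabilities are >= 0, so -1 never wins
    return gate


def dense_moe(x, experts, gate):
    """Run every expert and weight its output; unselected experts contribute 0."""
    return sum(g * f(x) for g, f in zip(gate, experts))


gate = topk_gate([0.0, 2.0, 1.0, -1.0], k=2)
experts = [lambda x: x, lambda x: 2 * x, lambda x: x + 1, lambda x: 0.0]
y = dense_moe(3.0, experts, gate)
```

Evaluating every expert is wasteful compared with sparse routing, but it keeps the graph free of data-dependent indexing, which is the property the commit message is after.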
Convert
- NomicAdapter now emits one fused tensor per HF parameter (Wqkv and
experts.mlp.w1/w2 are no longer split into q/k/v or per-expert
files) so the transpiled graph, which binds weights by HF name, gets
a 1:1 match. experts.mlp.w2 is stored transposed so the second
expert matmul consumes it as a direct linear weight. The tiny MoE
router stays FP16 -- 4-bit quantizing [num_experts, hidden] corrupts
routing.
- Tokenizer conversion handles Unigram (XLM-RoBERTa, used by nomic):
classify it as SentencePiece, emit per-token Viterbi scores into
vocab.txt (id<TAB>token<TAB>score), and write the unigram runtime
config (sp_model_type, sp_add_dummy_prefix, metaspace). Previously it
was misdetected as BPE and mis-tokenized ("Paris is the capital of
France." -> 31 tokens vs the reference 9).
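To make the unigram path concrete, here is a simplified sketch of reading the `id<TAB>token<TAB>score` lines and running Viterbi segmentation over them. The vocabulary and scores are invented for illustration, unknown-character handling is omitted, and the function names are hypothetical; the real tokenizer is more involved.

```python
MARKER = "\u2581"  # the metaspace character used in place of spaces


def parse_vocab(text):
    """Parse id<TAB>token<TAB>score lines into {token: (id, score)}."""
    vocab = {}
    for line in text.splitlines():
        if not line:
            continue
        tok_id, token, score = line.split("\t")
        vocab[token] = (int(tok_id), float(score))
    return vocab


def viterbi_segment(text, vocab, add_dummy_prefix=True):
    """Return the highest-scoring split of text into vocabulary pieces."""
    s = text.replace(" ", MARKER)
    if add_dummy_prefix:
        s = MARKER + s
    n = len(s)
    neg_inf = float("-inf")
    best = [neg_inf] * (n + 1)  # best[i]: best score of any split of s[:i]
    back = [0] * (n + 1)        # back[i]: start index of the last piece in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = s[start:end]
            if best[start] > neg_inf and piece in vocab:
                score = best[start] + vocab[piece][1]
                if score > best[end]:
                    best[end], back[end] = score, start
    if best[n] == neg_inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces, pos = [], n
    while pos > 0:
        pieces.append(s[back[pos]:pos])
        pos = back[pos]
    return pieces[::-1]


rows = [(0, MARKER + "Paris", -2.0), (1, MARKER, -3.0), (2, "P", -4.0),
        (3, "aris", -4.0), (4, MARKER + "is", -1.5), (5, "is", -2.5)]
vocab = parse_vocab("\n".join(f"{i}\t{tok}\t{score}" for i, tok, score in rows))
with_prefix = viterbi_segment("Paris is", vocab)
without_prefix = viterbi_segment("Paris is", vocab, add_dummy_prefix=False)
```

In this toy vocabulary the dummy prefix is what lets the whole-word piece match at the start of the text; without it the first word falls back to smaller pieces.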
Engine
- Implement Model::get_embeddings for transpiled bundles: load the
text_embedding component, wrap the tokens with BOS/EOS (matching the
reference add_special_tokens), run the graph, mean-pool over the real
tokens, and optionally L2-normalize. Add an embedding-only init path
(no decode route) and map model_type bert/nomic to ModelType::NOMIC
so the config loader stops requiring Gemma4 fields.
- sp.cpp: the SentencePiece (non-BPE) path now honors
sp_add_dummy_prefix, prepending the metaspace marker so unigram
segmentation matches the reference.
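The text path differs from the media path mainly in how the input is framed and which positions are pooled. The sketch below is an assumption-laden illustration: the token ids are made up, the fixed length is arbitrary, and it assumes the "real tokens" include BOS/EOS and exclude only padding.

```python
import math


def build_fixed_length_input(tokens, bos_id, eos_id, pad_id, max_len):
    """Wrap tokens with BOS/EOS and pad to max_len; mask marks real positions."""
    ids = [bos_id] + list(tokens) + [eos_id]
    if len(ids) > max_len:
        raise ValueError("sequence does not fit the fixed input length")
    pad = max_len - len(ids)
    mask = [1] * len(ids) + [0] * pad
    return ids + [pad_id] * pad, mask


def masked_mean_pool(hidden, mask, normalize=True):
    """Average hidden rows where mask is 1, then optionally L2-normalize."""
    count = sum(mask)
    if count == 0:
        raise ValueError("mask selects no tokens")
    dim = len(hidden[0])
    pooled = [sum(row[j] for row, m in zip(hidden, mask) if m) / count
              for j in range(dim)]
    if normalize:
        norm = math.sqrt(sum(v * v for v in pooled))
        if norm > 0.0:
            pooled = [v / norm for v in pooled]
    return pooled


ids, mask = build_fixed_length_input([10, 11], bos_id=0, eos_id=2, pad_id=1, max_len=6)
# Stand-in for last_hidden_state: the padded rows hold junk that must be ignored.
hidden = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 4.0], [100.0, 100.0], [100.0, 100.0]]
raw = masked_mean_pool(hidden, mask, normalize=False)
unit = masked_mean_pool(hidden, mask)
```

Pooling through the mask is what keeps a fixed-length graph from leaking padding positions into the embedding.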
Server
- Add an OpenAI-compatible POST /v1/embeddings endpoint (string or list
input) backed by cactus_embed, plus EMBED_MODEL_TYPES. create_app can
now serve a non-LLM bundle as its default model so an embedding-only
server boots.
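A rough sketch of the "string or list input" handling follows. The response shape mirrors the public OpenAI embeddings format; whether the server in this PR returns exactly these fields is an assumption, and `embed_fn` is a hypothetical stand-in for the call into cactus_embed.

```python
def normalize_input(value):
    """Accept a single string or a non-empty list of strings."""
    if isinstance(value, str):
        return [value]
    if isinstance(value, list) and value and all(isinstance(v, str) for v in value):
        return list(value)
    raise ValueError("input must be a string or a non-empty list of strings")


def build_embeddings_response(body, embed_fn, model):
    """Embed each input text and wrap the vectors in an OpenAI-style envelope."""
    texts = normalize_input(body.get("input"))
    data = [
        {"object": "embedding", "index": i, "embedding": embed_fn(text)}
        for i, text in enumerate(texts)
    ]
    return {"object": "list", "data": data, "model": model}


def fake_embed(text):
    # Placeholder embedding so the sketch runs without a model bundle.
    return [float(len(text))]


single = build_embeddings_response({"input": "hi"}, fake_embed, "demo-model")
batch = build_embeddings_response({"input": ["a", "bcd"]}, fake_embed, "demo-model")
```

Normalizing to a list up front means the rest of the handler has one code path, and each result carries its index so clients can match outputs to inputs.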
Tests
- python/tests/test_nomic_text_embedding.py: task/family routing.
- python/tests/test_model.py: TestNomicEmbedding covers determinism,
retrieval discrimination, and HF parity (asserts cosine vs the
reference last-hidden mean-pooled embedding).
- python/tests/test_server_live.py: live /v1/embeddings (string, list,
rejects non-embedding models).
- Redistribute the convert NomicAdapter tests out of the catch-all
test_nomic_adapter.py into their topical homes (test_policy.py,
test_naming_qdq.py) and move the unrelated LFM2 tests into a new
test_lfm2_adapter.py; delete test_nomic_adapter.py.
Verified end-to-end on nomic-embed-text-v2-moe: cactus vs HF cosine
0.92-0.94 (the gap is purely CQ4 4-bit weight quantization -- FP parity
is ~1.0, and cactus-vs-HF is identical whether HF runs fp16 or fp32),
with retrieval ranking preserved (query.relevant 0.72 >> query.unrelated
0.29). Convert/transpile produces a text_embedding component with all
146 weights bound. Full server_live suite green.
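For readers who want to reproduce this kind of parity or ranking check, cosine similarity is all that is needed. The vectors below are toy values chosen for illustration, not outputs of either implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (na * nb)


# Toy embeddings: the relevant document points roughly the same way as the query.
query = [1.0, 0.2, 0.0]
relevant = [0.9, 0.3, 0.1]
unrelated = [0.1, 0.2, 1.0]

ranking_preserved = cosine(query, relevant) > cosine(query, unrelated)
```

When both vectors are already L2-normalized, the cosine reduces to a plain dot product.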
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
TestVLMModel.test_text_embedding was a permanently-skipped stub that called cactus_embed on an LFM2-VL bundle, which has no text_embedding component. Text embeddings now have real coverage in TestNomicEmbedding (shape/determinism, retrieval discrimination, HF parity), so remove the dead test and its skip constant. The remaining skips are all conditional runtime guards (missing image/audio assets, transformers not installed, no embedding bundle present).
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Summary
Brings the three embedding entry points in Model up on the v2 transpiler. The refactor had left them as stubs: get_embeddings returned {} silently (so RAG/tool-RAG callers in rag.cpp fell through to dimension-mismatch warnings instead of failing), and get_image_embeddings / get_audio_embeddings threw "not wired up yet" even though the vision_encoder / audio_encoder components were already built and exercised by completion and transcription.
This PR:
- Ports the nomic-embed-text-v2-moe model (the same one main shipped) onto the v2 transpiled-bundle architecture and turns get_embeddings into a real encoder run, verified against the HuggingFace reference.
- Adds an OpenAI-compatible POST /v1/embeddings endpoint.
Text embeddings (nomic-embed-text-v2-moe)
Transpile
- Add a text_embedding task + NomicTextEmbeddingAdapter, an export-friendly reimplementation of the nomic-bert encoder. It reuses the HF submodule weights but reimplements the forward to survive torch.export: rotary is recomputed from seq_len (no lazy cos/sin cache), the additive attention mask uses scalar sub/mul (no rsub), and the MoE is dense -- two fused matmuls over the packed expert weights gated by a top-k softmax built from iterative amax/masking (torch.topk, index_add_, and slicing a quantized weight do not lower). The graph emits last_hidden_state; pooling/normalization happen in the engine.
- Route nomic through the component pipeline: _family_key detection, canonicalize_model_interface + build_component_module_specs dispatch, the text_embedding component in the bundle manifest, task auto-inference (component_plan.py, hf_model.py), a fixed-length input builder, and a loader that forces trust_remote_code (the transformers-native nomic_bert is a different architecture and won't match the converted weights).
- canonicalize/cleanup.py: stop fp16-legalizing the embedding op's weight input. The kernel dequantizes CQ/FP16 weights directly, and the inserted precision_cast tried CQ4 → FP16 at runtime and failed.
Convert
- NomicAdapter now emits one fused tensor per HF parameter (Wqkv and experts.mlp.w1/w2 are no longer split into q/k/v or per-expert files) so the transpiled graph, which binds weights by HF name, gets a 1:1 match. experts.mlp.w2 is stored transposed so the second expert matmul consumes it as a direct linear weight. The tiny MoE router stays FP16 (4-bit quantizing [num_experts, hidden] corrupts routing).
- Tokenizer conversion handles Unigram (XLM-RoBERTa, used by nomic): classify it as SentencePiece, emit per-token Viterbi scores into vocab.txt, and write the unigram runtime config (sp_model_type, sp_add_dummy_prefix, metaspace). It was previously misdetected as BPE and mis-tokenized ("Paris is the capital of France." → 31 tokens vs the reference 9).
Engine
- Implement Model::get_embeddings for transpiled bundles: load the text_embedding component, wrap tokens with BOS/EOS (matching add_special_tokens), run the graph, mean-pool over real tokens, optionally L2-normalize. Add an embedding-only init path (no decode route) and map model_type bert/nomic → ModelType::NOMIC so the config loader stops requiring Gemma4 fields.
- sp.cpp: the SentencePiece (non-BPE) path honors sp_add_dummy_prefix, prepending the metaspace marker so unigram segmentation matches the reference.
Server
- OpenAI-compatible POST /v1/embeddings (string or list input) backed by cactus_embed, plus EMBED_MODEL_TYPES. create_app can now serve a non-LLM bundle as its default model, so an embedding-only server boots.
HF parity
cactus vs HF cosine 0.92–0.94. The gap is purely CQ4 4-bit weight quantization: FP parity of the adapter is ~1.0, and cactus-vs-HF is identical whether HF runs fp16 or fp32. Retrieval ranking is preserved (query·relevant 0.72 ≫ query·unrelated 0.29). Convert/transpile produces a text_embedding component with all 146 weights bound.
Image / audio embeddings + vision-encoder fix
- Model::get_embeddings now throws (mirrors the sibling methods) instead of returning {}; surfaces the failure to cactus_embed and cactus_rag_query (both already wrap in try/catch).
- rag.cpp: wrap the four unprotected get_embeddings call sites in retrieve_rag_context / select_relevant_tools with try/catch so chat completion degrades gracefully (empty RAG context / unfiltered tool set) instead of throwing out.
- Implement Model::get_image_embeddings / Model::get_audio_embeddings: run the existing encoder component, dequantize FP32/FP16/INT8 → FP32, mean-pool over leading dims into hidden_dim, L2-normalize. Both call load_component_graph after extracting the output so transcribe_whisper_seq2seq / transcribe_parakeet_tdt keep working when an embed precedes a transcribe.
- Make Model::run_vision_encoder family-aware. It was Gemma4-only: it unconditionally wrote pixel_values + pixel_position_ids, producing NaN outputs on LFM2-VL (expects pixel_attention_mask) and Qwen3-VL. Dispatch now mirrors run_chunk_prefill_path (LFM2-VL / Qwen3-VL / default), with precision-aware writes instead of raw write_bytes_input. Also fixes a latent bug on the legacy lm_encoder_media_step fallback path for non-Gemma4 multimodal models.
- python/cactus/bindings/cactus.py: cactus_embed / cactus_image_embed / cactus_audio_embed passed the element count (4096) as buffer_size, but the C side treats it as bytes. Masked when hidden_dim ≤ 1024; broke Gemma4 (1536) and Qwen3-VL (2048). Fixed by passing ctypes.sizeof(buf).
Tests
- python/tests/test_nomic_text_embedding.py: task/family routing.
- python/tests/test_model.py: TestNomicEmbedding (determinism, retrieval discrimination, HF-parity cosine assertion); image/audio embedding tests un-skipped.
- python/tests/test_server_live.py: live /v1/embeddings (string, list, rejects non-embedding models).
- Redistribute the convert NomicAdapter tests out of the catch-all test_nomic_adapter.py into their topical homes (test_policy.py, test_naming_qdq.py), move the unrelated LFM2 tests into a new test_lfm2_adapter.py, and delete test_nomic_adapter.py.
Verification
Test plan
- pytest python/cactus/convert/tests/ python/tests/test_nomic_text_embedding.py (green)
- pytest python/tests/test_model.py::TestNomicEmbedding -s (determinism, retrieval, HF parity)
- pytest python/tests/test_server_live.py (19 pass: chat, transcription, embeddings)
- cactus convert nomic-ai/nomic-embed-text-v2-moe → text_embedding component, 146 weights bound