Wire text/image/audio embeddings on the v2 transpiler; add /v1/embeddings #676

Merged
HenryNdubuaku merged 5 commits into v2 from test-model-fixes on Jun 2, 2026
@ncylich ncylich commented May 29, 2026

Collaborator

Summary

Brings the three embedding entry points in Model up on the v2 transpiler. The refactor had left them as stubs: get_embeddings returned {} silently (so RAG/tool-RAG callers in rag.cpp fell through to dimension-mismatch warnings instead of failing), and get_image_embeddings / get_audio_embeddings threw "not wired up yet" even though the vision_encoder / audio_encoder components were already built and exercised by completion and transcription.

This PR:

  • Ports nomic-embed-text-v2-moe (the same model main shipped) onto the v2 transpiled-bundle architecture and turns get_embeddings into a real encoder run, verified against the HuggingFace reference.
  • Adds an OpenAI-compatible POST /v1/embeddings endpoint.
  • Implements image/audio embeddings and fixes a latent vision-encoder family bug.

Text embeddings (nomic-embed-text-v2-moe)

Transpile

  • New text_embedding task + NomicTextEmbeddingAdapter, an export-friendly reimplementation of the nomic-bert encoder. It reuses the HF submodule weights but reimplements the forward to survive torch.export: rotary is recomputed from seq_len (no lazy cos/sin cache), the additive attention mask uses scalar sub/mul (no rsub), and the MoE is dense — two fused matmuls over the packed expert weights gated by a top-k softmax built from iterative amax/masking (torch.topk, index_add_, and slicing a quantized weight do not lower). The graph emits last_hidden_state; pooling/normalization happen in the engine.
  • Routing: _family_key detection, canonicalize_model_interface + build_component_module_specs dispatch, the text_embedding component in the bundle manifest, task auto-inference (component_plan.py, hf_model.py), a fixed-length input builder, and a loader that forces trust_remote_code (the transformers-native nomic_bert is a different architecture and won't match the converted weights).
  • canonicalize/cleanup.py: stop fp16-legalizing the embedding op's weight input — the kernel dequantizes CQ/FP16 weights directly, and the inserted precision_cast tried CQ4 → FP16 at runtime and failed.
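
As a rough illustration of the dense MoE gating described above, the sketch below selects the top-k experts with k rounds of max-and-mask instead of a topk primitive, then blends the output of every expert by the resulting gate. It is plain Python rather than the adapter's torch code. The function names, the scalar "experts", the first-match tie-breaking, and the choice to softmax over all experts before selecting (without renormalizing) are assumptions for illustration; the PR does not state those details.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def topk_gate_by_masking(logits, k):
    """Return a dense gate vector: top-k probabilities kept, the rest zero.

    Selection uses k rounds of max + masking, standing in for the
    iterative amax/masking the PR describes (assumed details, see lead-in).
    """
    probs = softmax(logits)
    remaining = list(probs)
    gate = [0.0] * len(probs)
    for _ in range(k):
        best = max(remaining)            # plays the role of amax
        idx = remaining.index(best)      # first match wins (assumed tie-break)
        gate[idx] = probs[idx]
        remaining[idx] = float("-inf")   # mask it so the next round finds the runner-up
    return gate


def dense_moe(x, expert_fns, router_logits, k):
    """Run every expert and blend the outputs by the gate (no sparse dispatch)."""
    gate = topk_gate_by_masking(router_logits, k)
    outputs = [fn(x) for fn in expert_fns]
    return sum(g * y for g, y in zip(gate, outputs))
```

Unselected experts still run but contribute nothing because their gate is zero, which is what lets the graph avoid data-dependent indexing.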

Convert

  • NomicAdapter now emits one fused tensor per HF parameter (Wqkv and experts.mlp.w1/w2 are no longer split into q/k/v or per-expert files) so the transpiled graph — which binds weights by HF name — gets a 1:1 match. experts.mlp.w2 is stored transposed so the second expert matmul consumes it as a direct linear weight. The tiny MoE router stays FP16 (4-bit quantizing [num_experts, hidden] corrupts routing).
  • Tokenizer conversion handles Unigram (XLM-RoBERTa, used by nomic): classify it as SentencePiece, emit per-token Viterbi scores into vocab.txt, and write the unigram runtime config (sp_model_type, sp_add_dummy_prefix, metaspace). It was previously misdetected as BPE and mis-tokenized ("Paris is the capital of France." → 31 tokens vs the reference 9).
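
To make the Unigram point concrete, here is a minimal sketch of Viterbi segmentation over per-token scores, reading vocabulary lines in the id, token, score tab-separated layout that the commit message gives for vocab.txt. The toy vocabulary, the function names, and the simplified whitespace handling (spaces become the metaspace marker and the dummy prefix is always prepended when enabled) are assumptions for illustration, not the converter's or sp.cpp's actual logic.

```python
METASPACE = "\u2581"  # the SentencePiece word-boundary marker

# Toy vocabulary in "id<TAB>token<TAB>score" form (scores are made up).
VOCAB_LINES = [
    f"0\t{METASPACE}Paris\t-2.0",
    f"1\t{METASPACE}is\t-1.5",
    f"2\t{METASPACE}\t-3.0",
    "3\tP\t-5.0",
    "4\ta\t-5.0",
    "5\tr\t-5.0",
    "6\ti\t-5.0",
    "7\ts\t-5.0",
]


def parse_vocab(lines):
    """Map token -> (id, score) from tab-separated vocabulary lines."""
    vocab = {}
    for line in lines:
        idx, token, score = line.rstrip("\n").split("\t")
        vocab[token] = (int(idx), float(score))
    return vocab


def viterbi_segment(text, vocab, add_dummy_prefix=True):
    """Return the highest-scoring split of text into vocabulary pieces."""
    s = text.replace(" ", METASPACE)
    if add_dummy_prefix:
        s = METASPACE + s
    n = len(s)
    best = [float("-inf")] * (n + 1)  # best[i]: best score for s[:i]
    back = [0] * (n + 1)              # back[i]: start of the last piece ending at i
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = s[start:end]
            if piece in vocab and best[start] != float("-inf"):
                cand = best[start] + vocab[piece][1]
                if cand > best[end]:
                    best[end], back[end] = cand, start
    if best[n] == float("-inf"):
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces, i = [], n
    while i > 0:
        pieces.append(s[back[i]:i])
        i = back[i]
    return pieces[::-1]
```

With this toy vocabulary, disabling the dummy prefix forces "Paris" to fall back to single characters, which is the kind of divergence from the reference that the sp_add_dummy_prefix handling is meant to avoid.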

Engine

  • Implement Model::get_embeddings for transpiled bundles: load the text_embedding component, wrap tokens with BOS/EOS (matching add_special_tokens), run the graph, mean-pool over real tokens, optionally L2-normalize. Add an embedding-only init path (no decode route) and map model_type bert/nomic to ModelType::NOMIC so the config loader stops requiring Gemma4 fields.
  • sp.cpp: the SentencePiece (non-BPE) path honors sp_add_dummy_prefix, prepending the metaspace marker so unigram segmentation matches the reference.
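
The pooling step that the engine applies to last_hidden_state can be sketched as follows. This is a Python illustration of masked mean-pooling followed by optional L2 normalization, not the C++ in Model::get_embeddings. The function names and the use of an attention mask to mark "real" tokens are assumptions, and the sketch treats every unmasked position (including any BOS/EOS) as real.

```python
import math


def mean_pool(hidden_states, attention_mask):
    """Average the per-token vectors whose mask entry is truthy."""
    kept = [row for row, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("no unmasked tokens to pool")
    dim = len(kept[0])
    return [sum(row[d] for row in kept) / len(kept) for d in range(dim)]


def l2_normalize(vec, eps=1e-12):
    """Scale vec to unit Euclidean length (eps guards against a zero vector)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def pool_embedding(hidden_states, attention_mask, normalize=True):
    """Masked mean-pool, then optionally L2-normalize."""
    pooled = mean_pool(hidden_states, attention_mask)
    return l2_normalize(pooled) if normalize else pooled
```

Padding rows are excluded before averaging, so a fixed-length input does not dilute the embedding of a short text.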

Server

  • POST /v1/embeddings (string or list input) backed by cactus_embed, plus EMBED_MODEL_TYPES. create_app can now serve a non-LLM bundle as its default model, so an embedding-only server boots.
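
A client for the new endpoint might look like the sketch below. It assumes the response follows the OpenAI embeddings shape (a "data" list of objects with "index" and "embedding" fields), which the PR implies by calling the endpoint OpenAI-compatible but does not spell out. The helper names, host, port, and model name in the usage note are hypothetical.

```python
import json
import urllib.request


def build_embeddings_request(base_url, model, inputs):
    """Build a POST request for /v1/embeddings; inputs is a string or list of strings."""
    body = json.dumps({"model": model, "input": inputs}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(payload):
    """Return embedding vectors ordered by their index field (assumed OpenAI shape)."""
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

To send it, pass the request to urllib.request.urlopen and decode the JSON body, for example with a base URL such as http://localhost:8000 and a model name matching the served bundle.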

HF parity

Cosine similarity between cactus and the HF reference embeddings is 0.92–0.94. The gap comes purely from CQ4 4-bit weight quantization: FP parity of the adapter is ~1.0, and the cactus-vs-HF cosine is identical whether HF runs fp16 or fp32. Retrieval ranking is preserved (query·relevant 0.72 ≫ query·unrelated 0.29). Convert/transpile produces a text_embedding component with all 146 weights bound.
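
For reference, the two checks quoted above (parity cosine and ranking preservation) reduce to the computation sketched here. The helper names are hypothetical and this is not the code in TestNomicEmbedding.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ranking_preserved(query, relevant, unrelated):
    """True when the relevant document scores higher than the unrelated one."""
    return cosine(query, relevant) > cosine(query, unrelated)
```

A parity test would compare cosine(cactus_vec, hf_vec) against a threshold, while the retrieval test only needs the ordering to hold.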

Image / audio embeddings + vision-encoder fix

  • Model::get_embeddings now throws (mirrors the sibling methods) instead of returning {}; surfaces the failure to cactus_embed and cactus_rag_query (both already wrap in try/catch).
  • rag.cpp: wrap the four unprotected get_embeddings callsites in retrieve_rag_context / select_relevant_tools with try/catch so chat completion degrades gracefully (empty RAG context / unfiltered tool set) instead of throwing out.
  • Implement Model::get_image_embeddings / Model::get_audio_embeddings: run the existing encoder component, dequantize FP32/FP16/INT8 → FP32, mean-pool over leading dims into hidden_dim, L2-normalize. Both call load_component_graph after extracting the output so transcribe_whisper_seq2seq / transcribe_parakeet_tdt keep working when an embed precedes a transcribe.
  • Make Model::run_vision_encoder family-aware. Was Gemma4-only — unconditionally wrote pixel_values + pixel_position_ids, producing NaN outputs on LFM2-VL (expects pixel_attention_mask) and Qwen3-VL. Dispatch now mirrors run_chunk_prefill_path (LFM2-VL / Qwen3-VL / default), with precision-aware writes instead of raw write_bytes_input. Also fixes a latent bug on the legacy lm_encoder_media_step fallback path for non-Gemma4 multimodal models.
  • python/cactus/bindings/cactus.py: cactus_embed / cactus_image_embed / cactus_audio_embed passed the element count (4096) as buffer_size, but the C side treats it as bytes. Masked when hidden_dim ≤ 1024; broke Gemma4 (1536) and Qwen3-VL (2048). Fixed by passing ctypes.sizeof(buf).
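
The family-aware dispatch in run_vision_encoder can be pictured as a small lookup, shown here in Python rather than the engine's C++. The input names per family come from the commit message (LFM2-VL: pixel_values + pixel_attention_mask; Qwen3-VL: pixel_values only; default/Gemma4: pixel_values + pixel_position_ids). The family key strings and the function name are hypothetical.

```python
# Inputs each vision-encoder family expects, per the commit message.
# The key strings are illustrative, not the engine's identifiers.
FAMILY_INPUTS = {
    "lfm2_vl": ("pixel_values", "pixel_attention_mask"),
    "qwen3_vl": ("pixel_values",),
}

# Gemma4 behavior, also used as the fallback for any other family.
DEFAULT_INPUTS = ("pixel_values", "pixel_position_ids")


def vision_encoder_inputs(family):
    """Return the tuple of input names to write for the given model family."""
    return FAMILY_INPUTS.get(family, DEFAULT_INPUTS)
```

Writing only the inputs a family declares avoids feeding pixel_position_ids to an encoder that expects pixel_attention_mask, the mismatch the PR identifies as the source of the NaN outputs.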

Tests

  • python/tests/test_nomic_text_embedding.py: task/family routing.
  • python/tests/test_model.py: TestNomicEmbedding (determinism, retrieval discrimination, HF-parity cosine assertion); image/audio embedding tests un-skipped.
  • python/tests/test_server_live.py: live /v1/embeddings (string, list, rejects non-embedding models).
  • Redistribute the convert NomicAdapter tests out of the catch-all test_nomic_adapter.py into their topical homes (test_policy.py, test_naming_qdq.py), move the unrelated LFM2 tests into a new test_lfm2_adapter.py, and delete test_nomic_adapter.py.

Verification

| Model | Embed dim | Norm | NaNs | Completion/Transcribe after |
|---|---|---|---|---|
| nomic-embed-text-v2-moe (text) | 768 | 1.0 | 0 | HF cosine 0.92–0.94 |
| LFM2-VL-450M (image) | 1024 | 1.0 | 0 | |
| Gemma-4-E2B-it (image) | 1536 | 1.0 | 0 | |
| Qwen3-VL-2B-Instruct (image) | 2048 | 1.0 | 0 | |
| Whisper-small (audio) | 768 | 1.0 | 0 | |

Test plan

  • pytest python/cactus/convert/tests/ python/tests/test_nomic_text_embedding.py — green
  • pytest python/tests/test_model.py::TestNomicEmbedding -s — determinism, retrieval, HF parity
  • pytest python/tests/test_server_live.py — 19 pass (chat, transcription, embeddings)
  • cactus convert nomic-ai/nomic-embed-text-v2-moe → text_embedding component, 146 weights bound
  • Manual: image embed + completion on LFM2-VL, Gemma-4, Qwen3-VL; audio embed + transcription on Whisper-small

ncylich added 4 commits May 29, 2026 16:06
The v2 transpiler refactor left Model::get_embeddings,
get_image_embeddings, and get_audio_embeddings as stubs. The former
returned an empty vector silently (a footgun: rag.cpp callers fell
through to dimension-mismatch warnings and chat completion silently
skipped RAG/tool-RAG). The latter two threw with a "not wired up yet"
message even though the vision_encoder and audio_encoder components
they need are already built and exercised by completion and
transcription.

Engine

- Model::get_embeddings now throws "Text embeddings not wired up for
  transpiled bundles yet" instead of returning {}. Matches the
  sibling pattern and surfaces the failure to callers instead of
  silently producing empty results.

- Implement Model::get_image_embeddings and Model::get_audio_embeddings:
  drive the existing run_vision_encoder / run_audio_encoder_messages
  paths, then dequantize the encoder output (FP32 / FP16 / INT8 -> FP32),
  mean-pool over the leading dims into the last (hidden_dim), and
  L2-normalize. A small anonymous-namespace helper
  (pool_and_normalize_media_feature) keeps the two methods symmetric.
  Both call load_component_graph after extracting the feature so the
  transcribe_whisper_seq2seq / transcribe_parakeet_tdt paths (which
  call bind_runtime_buffers directly and assume the encoder graph is
  persistently loaded) keep working when an embed call precedes a
  transcribe call.

- Make Model::run_vision_encoder family-aware. It was Gemma4-only:
  unconditionally writing pixel_values + pixel_position_ids, which
  produced NaN outputs on LFM2-VL (expects pixel_attention_mask, not
  pixel_position_ids) and Qwen3-VL. The new dispatch mirrors the one
  in run_chunk_prefill_path: LFM2-VL gets pixel_values +
  pixel_attention_mask; Qwen3-VL gets pixel_values only; Gemma4 keeps
  pixel_values + pixel_position_ids. All inputs go through
  write_typed_buffer / a precision-aware int writer instead of
  write_bytes_input so the per-buffer precision is honored. This also
  fixes a latent bug on the legacy lm_encoder_media_step fallback path
  (model.cpp run_chat_prefill loop) for non-Gemma4 multimodal models.

- rag.cpp: wrap the four unprotected get_embeddings call sites in
  retrieve_rag_context and select_relevant_tools with try/catch that
  logs a warning and returns the same fallback those functions
  already use (empty context / unfiltered tool set). Without this,
  enabling RAG or tool_rag_top_k>0 would now throw out through
  cactus_complete.

Python

- python/cactus/bindings/cactus.py: cactus_embed, cactus_image_embed,
  and cactus_audio_embed were passing the element count (4096) as
  buffer_size, but the C side treats it as bytes. The 4096-float
  buffer is actually 16384 bytes. Masked when hidden_dim <= 1024
  (LFM2-VL exactly at boundary); broke Gemma4 (1536 dims -> 6144
  bytes -> rejected as "Buffer too small") and Qwen3-VL (2048 dims).
  Fixed by passing ctypes.sizeof(buf).

- python/tests/test_model.py: un-skip test_image_embedding and
  test_audio_embedding now that the wrappers are implemented. Keep
  test_text_embedding skipped, narrow the skip reason to text-only.
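
The buffer-size arithmetic behind the bindings fix can be checked directly with ctypes. In the sketch below, the fits helper mimics a C-side check that interprets the reported size as bytes; that helper and its name are assumptions for illustration, while the 4096-element buffer and the ctypes.sizeof fix come from the commit message.

```python
import ctypes

EMBED_CAPACITY = 4096  # element count of the float buffer used by the bindings


def make_embedding_buffer(capacity=EMBED_CAPACITY):
    """Allocate a float buffer and return it with its true size in bytes."""
    buf = (ctypes.c_float * capacity)()
    return buf, ctypes.sizeof(buf)


def fits(hidden_dim, reported_size_bytes):
    """Hypothetical stand-in for the C-side check: size is treated as bytes."""
    needed = hidden_dim * ctypes.sizeof(ctypes.c_float)
    return needed <= reported_size_bytes
```

Passing len(buf) reports 4096 "bytes", which only just covers a 1024-dim embedding, whereas ctypes.sizeof(buf) reports the full 16384 bytes and leaves room for the 1536- and 2048-dim models.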

Verified end-to-end on LFM2-VL-450M, Gemma-4-E2B-it, Qwen3-VL-2B-Instruct
(image embed -> dim 1024 / 1536 / 2048, all L2-normalized, no NaNs) and
Whisper-small (audio embed -> dim 768). Each model's multimodal
completion or transcription still works when called after the embed
extracts its feature, confirming the load-state restore.

Full pytest: 187 passed, 1 skipped (the remaining text-embedding stub).

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
python/tests/test_server.py imports fastapi at module level, but the
workflow installs only python/[dev], so pytest collection fails with
ModuleNotFoundError and the job exits 2 before any test runs. fastapi
is correctly placed under the serve extras in python/pyproject.toml;
the workflow just wasn't installing them. Add [serve] alongside [dev]
so the server tests can collect and run.

This has been broken on every Python Tests run since the v2 HTTP
server PR (#669) landed test_server.py.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
test_server_live.py boots a real server against a converted bundle at
weights/gemma-4-e2b-it and pytest.fails (not skips) when the bundle is
absent. CI doesn't ship pre-converted models, so every assertion in
that file errors out at fixture setup. This was also failing on every
Python Tests run since #669 added the file; the previous commit (the
[serve] extras fix) just got us far enough to surface it.

Match the existing ignore for tests/test_model.py, which is excluded
for the same reason.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
…rver)

The v2 transpiler refactor left Model::get_embeddings throwing "Text
embeddings not wired up for transpiled bundles yet", so RAG, the corpus
index, and the FFI cactus_embed had no working text-embedding path. This
ports the nomic-embed-text-v2-moe model (the same one main shipped) onto
the v2 transpiled-bundle architecture and turns the stub into a real
encoder run, verified against the HuggingFace reference.

Transpile

- Add a text_embedding task plus NomicTextEmbeddingAdapter, an
  export-friendly reimplementation of the nomic-bert encoder. It reuses
  the HF submodule weights but reimplements the forward to survive
  torch.export: rotary is recomputed from seq_len (no lazy cos/sin
  cache), the additive attention mask is built with scalar sub/mul (no
  rsub), and the MoE is dense -- two fused matmuls over the packed
  expert weights gated by a top-k softmax built from iterative
  amax/masking (torch.topk, index_add_, and slicing a quantized weight
  do not lower). The graph emits last_hidden_state; pooling and
  normalization happen in the engine.

- Route nomic through the component pipeline: _family_key detection,
  canonicalize_model_interface + build_component_module_specs dispatch,
  the "text_embedding" component in the bundle manifest order, task
  auto-inference in component_plan.py and hf_model.py, a fixed-length
  text-embedding input builder, and a loader that forces
  trust_remote_code (the transformers-native nomic_bert is a different
  architecture and does not match the converted weights).

- canonicalize/cleanup.py: stop fp16-legalizing the embedding op's
  weight input. The embedding kernel dequantizes CQ/FP16 weights
  directly; the inserted precision_cast tried CQ4 -> FP16 at runtime
  and failed.

Convert

- NomicAdapter now emits one fused tensor per HF parameter (Wqkv and
  experts.mlp.w1/w2 are no longer split into q/k/v or per-expert
  files) so the transpiled graph, which binds weights by HF name, gets
  a 1:1 match. experts.mlp.w2 is stored transposed so the second
  expert matmul consumes it as a direct linear weight. The tiny MoE
  router stays FP16 -- 4-bit quantizing [num_experts, hidden] corrupts
  routing.

- Tokenizer conversion handles Unigram (XLM-RoBERTa, used by nomic):
  classify it as SentencePiece, emit per-token Viterbi scores into
  vocab.txt (id<TAB>token<TAB>score), and write the unigram runtime
  config (sp_model_type, sp_add_dummy_prefix, metaspace). Previously it
  was misdetected as BPE and mis-tokenized ("Paris is the capital of
  France." -> 31 tokens vs the reference 9).

Engine

- Implement Model::get_embeddings for transpiled bundles: load the
  text_embedding component, wrap the tokens with BOS/EOS (matching the
  reference add_special_tokens), run the graph, mean-pool over the real
  tokens, and optionally L2-normalize. Add an embedding-only init path
  (no decode route) and map model_type bert/nomic to ModelType::NOMIC
  so the config loader stops requiring Gemma4 fields.

- sp.cpp: the SentencePiece (non-BPE) path now honors
  sp_add_dummy_prefix, prepending the metaspace marker so unigram
  segmentation matches the reference.

Server

- Add an OpenAI-compatible POST /v1/embeddings endpoint (string or list
  input) backed by cactus_embed, plus EMBED_MODEL_TYPES. create_app can
  now serve a non-LLM bundle as its default model so an embedding-only
  server boots.

Tests

- python/tests/test_nomic_text_embedding.py: task/family routing.
- python/tests/test_model.py: TestNomicEmbedding covers determinism,
  retrieval discrimination, and HF parity (asserts cosine vs the
  reference last-hidden mean-pooled embedding).
- python/tests/test_server_live.py: live /v1/embeddings (string, list,
  rejects non-embedding models).
- Redistribute the convert NomicAdapter tests out of the catch-all
  test_nomic_adapter.py into their topical homes (test_policy.py,
  test_naming_qdq.py) and move the unrelated LFM2 tests into a new
  test_lfm2_adapter.py; delete test_nomic_adapter.py.

Verified end-to-end on nomic-embed-text-v2-moe: cactus vs HF cosine
0.92-0.94 (the gap is purely CQ4 4-bit weight quantization -- FP parity
is ~1.0, and cactus-vs-HF is identical whether HF runs fp16 or fp32),
with retrieval ranking preserved (query.relevant 0.72 >> query.unrelated
0.29). Convert/transpile produces a text_embedding component with all
146 weights bound. Full server_live suite green.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
ncylich changed the title from "Wire image/audio embeddings; throw on text embeddings; fix latent vision-encoder family bug" to "Wire text/image/audio embeddings on the v2 transpiler; add /v1/embeddings" on Jun 2, 2026
TestVLMModel.test_text_embedding was a permanently-skipped stub that
called cactus_embed on an LFM2-VL bundle, which has no text_embedding
component. Text embeddings now have real coverage in TestNomicEmbedding
(shape/determinism, retrieval discrimination, HF parity), so remove the
dead test and its skip constant. The remaining skips are all conditional
runtime guards (missing image/audio assets, transformers not installed,
no embedding bundle present).

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
@HenryNdubuaku HenryNdubuaku merged commit add442d into v2 Jun 2, 2026
5 checks passed