Describe the feature
The runtime supports image-modality embedding operations end-to-end:
- [...] `MultiModalEncodeImageFromBase64`).
- `queryModality` field on `IntelligentRoute` embedding rules (PR [Operator] Expose queryModality on IntelligentRoute EmbeddingSignal CRD #1880); reconcile-time validation in [Operator] Validate embedding modality contracts on IntelligentRoute reconcile #1895.

Image-modality embedding rules fire correctly when a chat completion arrives via the gRPC ExtProc path with an OpenAI-shaped `image_url` content array.
However, the HTTP API surface for direct embedding-related operations doesn't accept image input on any endpoint. Every embedding-adjacent handler in `pkg/apiserver/` accepts text-only request schemas, even though the underlying service methods and the multimodal FFI could compute image embeddings if given them.
Scope of the gap (file:line citations against `main` as of 2026-05-15):

| Endpoint | Handler | Request type |
|---|---|---|
| `POST /api/v1/classify/intent` | `route_classify.go:18` (`handleIntentClassification`) | `services.IntentRequest{Text, Messages, Options}` (`pkg/services/classification_signal_contract.go:34`) |
| `POST /api/v1/classify/batch` | `route_classify.go:96` (`handleBatchClassification`) | `BatchClassificationRequest{Texts, TaskType}` (`pkg/apiserver/config.go:49`) |
| `POST /api/v1/eval` | `route_classify.go:38` (`handleEvalClassification`) | `IntentRequest` |
| `POST /api/v1/embeddings` | `route_embeddings.go:14` (`handleEmbeddings`) | `EmbeddingRequest{Texts: []string, Model}` (`pkg/apiserver/config.go:86`) |
| `POST /api/v1/similarity` | `route_embeddings.go:131` (`handleSimilarity`) | `SimilarityRequest{Text1, Text2, Model}` (`pkg/apiserver/config.go:114`) |
| `POST /api/v1/similarity/batch` | `route_embeddings.go:190` (`handleBatchSimilarity`) | `BatchSimilarityRequest{Query, Candidates, TopK}` (`pkg/apiserver/config.go:132`) |

The runtime evaluator gates the image-modality path on `imageURL != ""` (`pkg/classification/classifier_signal_context.go:182`). Because none of these HTTP request schemas surface an image field, that gate never opens from the HTTP path, and image-modality embedding rules never fire from any of these endpoints regardless of what rules ship in the config. (`/api/v1/eval` reuses `IntentRequest` verbatim, so one extension closes both endpoints in a single change.)
Primary layer
`global level`
Why this layer?
The signal layer's image-modality plumbing already exists, the runtime FFI exists, and the gap is at the HTTP entry point that fans into both. Extending the HTTP request types is a request-API change rather than a signal-layer feature, which puts it in the "intentionally cross-cutting behavior" bucket the template describes for `global level`. If maintainers prefer `signal` because the motivation is unblocking image-modality embedding signals end-to-end, the re-tagging is fine; the engineering work is unchanged.
Why do you need this feature?
- Author/operator validation of image-modality embedding rules. A pack like `config/signal/embedding/image-routing.yaml` (shipped in [Router][Docs] Add opt-in image-modality embedding pack #1896) defines three image-modality rules. Confirming those rules fire on representative images today requires either (a) standing up a full Envoy + ExtProc + backend chain to send chat completions, or (b) writing a custom gRPC ExtProc client. Both are heavier than running `curl` against `/api/v1/classify/intent`.
- Computing the embedding vector of an image for downstream use (indexing, storage, retrieval). The multimodal model is loaded; the FFI supports it; the HTTP API doesn't expose it.
- Cross-modal similarity ("which of these phrases is most similar to this image?"). Common in vision-language workflows; the runtime supports it via `ClassifyDetailedMultimodal`; no HTTP surface exposes it.
- Image-to-image similarity. Same shape as above between two images.
Additional context
Proposed shape (aligned with the codebase's existing image-content convention):
The runtime's existing image accept set is documented at `pkg/extproc/utils_fast.go:182-200`: inline `data:image/...;base64,...` URIs only, no http/https URLs (intentional, the ExtProc path closes an SSRF-class concern there). The new HTTP fields should match that accept set. A `string` field carrying the data URI is the lightest option; an object-typed `Image { URL string }` mirroring OpenAI Chat Completions is also defensible. The shape below uses the string form; happy to switch to the typed object if maintainers prefer it for cross-product tooling alignment.
- `IntentRequest` (covers `/api/v1/classify/intent` and `/api/v1/eval`): `ClassifyIntent` populates the `imageURL` argument that `EvaluateAllSignalsWithContext` already takes; nothing downstream changes.
- `BatchClassificationRequest`: add a parallel `Images []string` field alongside `Texts`. Exactly one of `Texts` / `Images` set per request in v1; mixed batches are deferred.
- `EmbeddingRequest`: same shape, add `Images []string` parallel to `Texts`.
- `SimilarityRequest`: generalize to `{Text1, Text2, Image1, Image2}` with exactly-one-of `{text, image}` per side. Enables text-text (existing), image-image, and cross-modal text-image similarity.
- `BatchSimilarityRequest` (`/api/v1/similarity/batch`): its shape is `{Query, Candidates []string, TopK}` (top-k retrieval). Generalize `Query` to accept text OR image, add a sibling `CandidateImages []string` field, with the same exactly-one-of constraint on the corpus side. Mixed text+image candidates in a single batch are deferred.

Open question (please steer):
Three plausible shapes:
- Additive (drafted above): extend existing request types with optional image fields. Smallest diff. Mixes concerns inside each request type but each addition is narrow.
- `/api/v1/classify/multimodal-intent`, `/api/v1/classify/multimodal-batch`, `/api/v1/embeddings/multimodal`, `/api/v1/similarity/multimodal`, `/api/v1/similarity/batch/multimodal`. Cleaner separation; more endpoints to discover; doubles route registration.
- Typed-union request body on a new sibling endpoint set (`InputA`, `InputB` where each is `oneof {Text, Image}`): cleanest semantics; largest single-PR diff; sets a convention that doesn't match the rest of the apiserver today.

The additive shape is the smallest delta from today's surface. Happy to redo the draft in either of the others if maintainers prefer.
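
To make the additive option concrete, here is a minimal Go sketch of the extended request types and the exactly-one-of check. It is an illustration under assumptions, not the apiserver's actual code: the `Image` field on `IntentRequest`, the `Validate` methods, and the `exactlyOne` helper are hypothetical names, the structs are trimmed to the fields relevant here, and JSON wire names are deliberately left open.

```go
package main

import (
	"errors"
	"fmt"
)

// IntentRequest is a trimmed stand-in for the existing request type.
// Image is the hypothetical addition: an inline data URI string.
type IntentRequest struct {
	Text  string
	Image string // hypothetical field name
}

// EmbeddingRequest gains an Images field parallel to Texts.
type EmbeddingRequest struct {
	Texts  []string
	Images []string
	Model  string
}

// SimilarityRequest carries one text-or-image input per side.
type SimilarityRequest struct {
	Text1, Text2   string
	Image1, Image2 string
}

var errExactlyOne = errors.New("exactly one of text or image must be set")

// exactlyOne reports whether exactly one of the two flags is true.
func exactlyOne(a, b bool) bool { return a != b }

// Validate enforces the v1 rule: Texts or Images, never both, never neither.
func (r EmbeddingRequest) Validate() error {
	if !exactlyOne(len(r.Texts) > 0, len(r.Images) > 0) {
		return errExactlyOne
	}
	return nil
}

// Validate enforces exactly-one-of {text, image} on each side independently,
// which admits text-text, image-image, and text-image pairs.
func (r SimilarityRequest) Validate() error {
	if !exactlyOne(r.Text1 != "", r.Image1 != "") {
		return fmt.Errorf("side 1: %w", errExactlyOne)
	}
	if !exactlyOne(r.Text2 != "", r.Image2 != "") {
		return fmt.Errorf("side 2: %w", errExactlyOne)
	}
	return nil
}

func main() {
	crossModal := SimilarityRequest{Text1: "a photo of a cat", Image2: "data:image/png;base64,AAAA"}
	fmt.Println("cross-modal:", crossModal.Validate())

	mixed := EmbeddingRequest{Texts: []string{"hi"}, Images: []string{"data:image/png;base64,AAAA"}}
	fmt.Println("mixed batch:", mixed.Validate())
}
```

Keeping the check per request type means existing text-only clients are unaffected: a request that sets only the text fields validates exactly as before.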
Staged delivery (if maintainers prefer focused PRs):
1. `/api/v1/classify/intent` + `/api/v1/eval` (one PR; same request type) - immediately unblocks fixture-based testing for [Router][Docs] Add opt-in image-modality embedding pack #1896.
2. `/api/v1/classify/batch` - same plumbing, batched form.
3. `/api/v1/embeddings` - enables image-embedding extraction for downstream pipelines.
4. `/api/v1/similarity*` (both pairwise and batch) - enables cross-modal similarity.

Each step is independently shippable once the previous one has landed.
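
For the `/api/v1/similarity*` step, the top-k retrieval itself does not depend on modality once both sides are embedded into the same vector space. The sketch below ranks candidate vectors against a query vector. It assumes cosine similarity, which this issue does not specify, and the `Match`, `cosine`, and `topK` names are hypothetical; the vectors are toy values standing in for FFI output.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Match pairs a candidate's position in the input slice with its score.
type Match struct {
	Index int
	Score float64
}

// cosine returns the cosine similarity of a and b, or 0 when the vectors
// differ in length, are empty, or either has zero norm.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topK scores every candidate against the query and returns the k best,
// highest score first. k <= 0 returns all candidates.
func topK(query []float32, candidates [][]float32, k int) []Match {
	matches := make([]Match, len(candidates))
	for i, c := range candidates {
		matches[i] = Match{Index: i, Score: cosine(query, c)}
	}
	sort.SliceStable(matches, func(i, j int) bool {
		return matches[i].Score > matches[j].Score
	})
	if k > 0 && k < len(matches) {
		matches = matches[:k]
	}
	return matches
}

func main() {
	imageVec := []float32{1, 0}                        // stand-in for an image embedding
	phraseVecs := [][]float32{{0, 1}, {1, 0}, {1, 1}} // stand-ins for phrase embeddings
	for _, m := range topK(imageVec, phraseVecs, 2) {
		fmt.Printf("candidate %d score %.3f\n", m.Index, m.Score)
	}
}
```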
Out of scope for this issue:
- Audio modality. `MultiModalEncodeAudio` is exposed at `candle-binding/semantic-router.go:1106` (takes a pre-computed Mel spectrogram), but the byte-stream variants (`FromBytes` / `FromBase64` / `FromURL`) that would let the HTTP API accept inline audio are not yet exposed. The existing validator already rejects audio rules at config-load for this reason (`pkg/config/validator_embedding.go:64-67`); a separate issue can track exposing the byte-stream variants if there's demand.
- Remote (http/https) image URLs. The runtime explicitly rejects http URLs in the ExtProc image path (`pkg/extproc/utils_fast.go:183`: "Only inline data URIs are accepted (no HTTP URLs)"); the HTTP API should match. If remote-URL fetching becomes desirable later, it warrants its own design conversation (allow-lists, size caps, content-type sniffing) separate from this gap.
- Multi-image batching efficiency. The first version can iterate per-image. Batched FFI calls are a perf optimization, not a correctness requirement.

Motivating PR: #1896 ships an opt-in image-modality embedding pack at `config/signal/embedding/image-routing.yaml`. Its "What's NOT in this PR" section names this gap on a single endpoint (`/api/v1/classify/intent`) and explicitly defers the shape proposal to a follow-on issue. This is that follow-on, scoped across the full embedding-related HTTP surface (6 endpoints once `/api/v1/classify/batch` is included) rather than just one, because the gap is structural.
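
The accept set the new HTTP fields should match (inline `data:image/...;base64,...` URIs only, no remote URLs) could be enforced at the handler boundary with a check along these lines. This is a sketch under assumptions: `parseImageDataURI` is a hypothetical helper, not the function in `pkg/extproc/utils_fast.go`, and it assumes a lowercase scheme and standard (padded) base64.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
)

// parseImageDataURI accepts only "data:image/<subtype>;base64,<payload>"
// and returns the decoded image bytes. Anything else, including http and
// https URLs, is rejected without any network access.
func parseImageDataURI(uri string) ([]byte, error) {
	const prefix = "data:image/"
	if !strings.HasPrefix(uri, prefix) {
		return nil, errors.New("only inline data:image/... URIs are accepted")
	}
	header, payload, ok := strings.Cut(uri[len(prefix):], ",")
	if !ok {
		return nil, errors.New("data URI is missing the ',' separator")
	}
	if !strings.HasSuffix(header, ";base64") {
		return nil, errors.New("data URI must use ;base64 encoding")
	}
	if strings.TrimSuffix(header, ";base64") == "" {
		return nil, errors.New("data URI is missing the image subtype")
	}
	data, err := base64.StdEncoding.DecodeString(payload)
	if err != nil {
		return nil, fmt.Errorf("invalid base64 payload: %w", err)
	}
	if len(data) == 0 {
		return nil, errors.New("empty image payload")
	}
	return data, nil
}

func main() {
	img := "data:image/png;base64," + base64.StdEncoding.EncodeToString([]byte{0x89, 'P', 'N', 'G'})
	data, err := parseImageDataURI(img)
	fmt.Println(len(data), err) // 4 <nil>

	_, err = parseImageDataURI("https://example.com/cat.png")
	fmt.Println(err) // only inline data:image/... URIs are accepted
}
```

Rejecting by prefix before any decoding keeps the SSRF-class concern closed by construction: the handler never resolves a hostname, so allow-lists and fetch size caps stay out of scope as described above.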