feature: extend HTTP API to accept image input for embedding-related endpoints #1911

@shraderdm

Describe the feature

The runtime supports image-modality embedding operations end-to-end:

Image-modality embedding rules fire correctly when a chat completion arrives via the gRPC ExtProc path with an OpenAI-shaped image_url content array.

However, the HTTP API surface for direct embedding-related operations doesn't accept image input on any endpoint. Every embedding-adjacent handler in pkg/apiserver/ accepts text-only request schemas, even though the underlying service methods and the multimodal FFI could compute image embeddings if given image input.

Scope of the gap (file:line citations against main as of 2026-05-15):

| Endpoint | Handler | Request type | Image-capable today? |
|---|---|---|---|
| POST /api/v1/classify/intent | route_classify.go:18 (handleIntentClassification) | services.IntentRequest{Text, Messages, Options} (pkg/services/classification_signal_contract.go:34) | No |
| POST /api/v1/classify/batch | route_classify.go:96 (handleBatchClassification) | BatchClassificationRequest{Texts, TaskType} (pkg/apiserver/config.go:49) | No |
| POST /api/v1/eval | route_classify.go:38 (handleEvalClassification) | reuses IntentRequest | No |
| POST /api/v1/embeddings | route_embeddings.go:14 (handleEmbeddings) | EmbeddingRequest{Texts: []string, Model} (pkg/apiserver/config.go:86) | No |
| POST /api/v1/similarity | route_embeddings.go:131 (handleSimilarity) | SimilarityRequest{Text1, Text2, Model} (pkg/apiserver/config.go:114) | No |
| POST /api/v1/similarity/batch | route_embeddings.go:190 (handleBatchSimilarity) | BatchSimilarityRequest{Query, Candidates, TopK} (pkg/apiserver/config.go:132) | No |

The runtime evaluator gates the image-modality path on imageURL != "" (pkg/classification/classifier_signal_context.go:182). Because none of these HTTP request schemas exposes an image field, that gate never opens from the HTTP path, and image-modality embedding rules never fire from any of these endpoints, regardless of which rules ship in the config. (/api/v1/eval reuses IntentRequest verbatim, so a single extension to that type closes the gap on both endpoints.)

Primary layer

global level

Why this layer?

The signal layer's image-modality plumbing already exists, the runtime FFI exists, and the gap is at the HTTP entry point that fans into both. Extending the HTTP request types is a request-API change rather than a signal-layer feature, which puts it in the "intentionally cross-cutting behavior" bucket the template describes for global level. If maintainers prefer signal because the motivation is unblocking image-modality embedding signals end-to-end, the re-tagging is fine; the engineering work is unchanged.

Why do you need this feature?

  1. Author/operator validation of image-modality embedding rules. A pack like config/signal/embedding/image-routing.yaml (shipped in [Router][Docs] Add opt-in image-modality embedding pack #1896) defines three image-modality rules. Confirming those rules fire on representative images today requires either (a) standing up a full Envoy + ExtProc + backend chain to send chat completions, or (b) writing a custom gRPC ExtProc client. Both are heavier than running curl against /api/v1/classify/intent.

  2. Computing the embedding vector of an image for downstream use (indexing, storage, retrieval). The multimodal model is loaded; the FFI supports it; the HTTP API doesn't expose it.

  3. Cross-modal similarity ("which of these phrases is most similar to this image?"). Common in vision-language workflows; the runtime supports it via ClassifyDetailedMultimodal; no HTTP surface exposes it.

  4. Image-to-image similarity. Same shape as above between two images.

Additional context

Proposed shape (aligned with the codebase's existing image-content convention):

The runtime's existing image accept set is documented at pkg/extproc/utils_fast.go:182-200: inline data:image/...;base64,... URIs only, no http/https URLs (intentional, the ExtProc path closes an SSRF-class concern there). The new HTTP fields should match that accept set. A string field carrying the data URI is the lightest option; an object-typed Image { URL string } mirroring OpenAI Chat Completions is also defensible. The shape below uses the string form; happy to switch to the typed object if maintainers prefer it for cross-product tooling alignment.
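A minimal sketch of what "match that accept set" could look like at the HTTP boundary. validateImageDataURI is a hypothetical helper, not a function in the codebase, and the exact rules in pkg/extproc/utils_fast.go may differ (for example in which media-type parameters are tolerated); this version assumes the strict form data:image/<subtype>;base64,<payload>.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
)

// validateImageDataURI is a hypothetical helper (assumption, not from the
// codebase). It accepts only inline data:image/<subtype>;base64,<payload>
// URIs and rejects everything else, including http/https URLs.
func validateImageDataURI(s string) error {
	const prefix = "data:image/"
	if !strings.HasPrefix(s, prefix) {
		return errors.New("image must be an inline data:image/... URI")
	}
	meta, payload, ok := strings.Cut(s[len(prefix):], ",")
	if !ok {
		return errors.New("data URI is missing the ',' separator")
	}
	if !strings.HasSuffix(meta, ";base64") || strings.TrimSuffix(meta, ";base64") == "" {
		return errors.New("data URI must declare an image subtype and ;base64 encoding")
	}
	if payload == "" {
		return errors.New("data URI has an empty payload")
	}
	if _, err := base64.StdEncoding.DecodeString(payload); err != nil {
		return fmt.Errorf("payload is not valid base64: %w", err)
	}
	return nil
}

func main() {
	// "aGVsbG8=" is just a short valid base64 string, not a real image.
	fmt.Println(validateImageDataURI("data:image/png;base64,aGVsbG8="))
	fmt.Println(validateImageDataURI("https://example.com/cat.png"))
}
```

Rejecting remote URLs before any decoding keeps the handler from ever needing an outbound fetch, which is the property the ExtProc path relies on.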

IntentRequest (covers /api/v1/classify/intent and /api/v1/eval):

type IntentRequest struct {
    Text     string          `json:"text"`
    Messages []IntentMessage `json:"messages,omitempty"`
    Image    string          `json:"image,omitempty"`     // NEW: data:image/...;base64,... URI
    Options  *IntentOptions  `json:"options,omitempty"`
}

ClassifyIntent populates the imageURL argument that EvaluateAllSignalsWithContext already takes; nothing downstream changes.
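To illustrate why the additive field is backward compatible, the sketch below decodes both a text-only body and a body carrying the proposed image field into the extended struct. IntentMessage and IntentOptions are stand-in definitions (their real fields are not shown in this issue), and decodeIntentRequest is a hypothetical helper rather than the handler's actual decoding path.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Stand-in types: the real field sets live in pkg/services and are assumed here.
type IntentMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type IntentOptions struct{}

// IntentRequest mirrors the shape proposed above.
type IntentRequest struct {
	Text     string          `json:"text"`
	Messages []IntentMessage `json:"messages,omitempty"`
	Image    string          `json:"image,omitempty"` // proposed: data:image/...;base64,... URI
	Options  *IntentOptions  `json:"options,omitempty"`
}

// decodeIntentRequest is a hypothetical helper for this sketch.
func decodeIntentRequest(body []byte) (IntentRequest, error) {
	var req IntentRequest
	err := json.Unmarshal(body, &req)
	return req, err
}

func main() {
	// Existing clients that never send "image" keep working: Image stays "".
	textOnly, _ := decodeIntentRequest([]byte(`{"text":"route this"}`))
	withImage, _ := decodeIntentRequest([]byte(`{"text":"","image":"data:image/png;base64,aGVsbG8="}`))

	// An empty Image is what keeps the imageURL != "" gate closed for text-only calls.
	fmt.Println(textOnly.Image == "", withImage.Image != "")
}
```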

BatchClassificationRequest: add a parallel Images []string field alongside Texts. Exactly one of Texts / Images set per request in v1; mixed batches are deferred.

EmbeddingRequest: same shape, add Images []string parallel to Texts.
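The exactly-one-of rule for Texts / Images could be enforced with a small check like the one below. This is a sketch only: the struct repeats just the fields named in this issue (JSON tags omitted because they are not quoted here), and validateBatch is a hypothetical name. The same check would presumably apply to EmbeddingRequest if it adopts the same constraint.

```go
package main

import (
	"errors"
	"fmt"
)

// BatchClassificationRequest as proposed: Images is the new field.
type BatchClassificationRequest struct {
	Texts    []string
	Images   []string // proposed: one data URI per entry
	TaskType string
}

// validateBatch is a hypothetical helper enforcing the v1 rule that
// exactly one of Texts / Images is set; mixed batches are rejected.
func validateBatch(req BatchClassificationRequest) error {
	hasTexts := len(req.Texts) > 0
	hasImages := len(req.Images) > 0
	switch {
	case hasTexts && hasImages:
		return errors.New("set either texts or images, not both")
	case !hasTexts && !hasImages:
		return errors.New("one of texts or images is required")
	}
	return nil
}

func main() {
	fmt.Println(validateBatch(BatchClassificationRequest{Texts: []string{"a", "b"}}))
	fmt.Println(validateBatch(BatchClassificationRequest{
		Texts:  []string{"a"},
		Images: []string{"data:image/png;base64,aGVsbG8="},
	}))
}
```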

SimilarityRequest: generalize to {Text1, Text2, Image1, Image2} with exactly-one-of {text, image} per side. Enables text-text (existing), image-image, and cross-modal text-image similarity.
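As a rough sketch of the per-side constraint, the code below resolves each side of a generalized SimilarityRequest to either text or image and reports which of the three pairings results. pickSide, pairKind, and sideInput are hypothetical names introduced for illustration; how the handler would then dispatch to text or multimodal embedding calls is not shown.

```go
package main

import "fmt"

// SimilarityRequest as proposed: Image1 and Image2 are the new fields.
type SimilarityRequest struct {
	Text1, Text2   string
	Image1, Image2 string // proposed: data URIs
	Model          string
}

// sideInput is a hypothetical resolved form of one side of the comparison.
type sideInput struct {
	Value   string
	IsImage bool
}

// pickSide enforces exactly-one-of {text, image} for a single side.
func pickSide(side, text, image string) (sideInput, error) {
	switch {
	case text != "" && image != "":
		return sideInput{}, fmt.Errorf("%s: set text or image, not both", side)
	case text == "" && image == "":
		return sideInput{}, fmt.Errorf("%s: one of text or image is required", side)
	case image != "":
		return sideInput{Value: image, IsImage: true}, nil
	default:
		return sideInput{Value: text}, nil
	}
}

// pairKind validates both sides and names the resulting pairing.
func pairKind(req SimilarityRequest) (string, error) {
	a, err := pickSide("side 1", req.Text1, req.Image1)
	if err != nil {
		return "", err
	}
	b, err := pickSide("side 2", req.Text2, req.Image2)
	if err != nil {
		return "", err
	}
	switch {
	case a.IsImage && b.IsImage:
		return "image-image", nil
	case !a.IsImage && !b.IsImage:
		return "text-text", nil
	default:
		return "text-image", nil
	}
}

func main() {
	kind, err := pairKind(SimilarityRequest{
		Text1:  "a photo of a cat",
		Image2: "data:image/png;base64,aGVsbG8=",
	})
	fmt.Println(kind, err)
}
```

Existing text-text callers are unaffected because a request with only Text1 and Text2 set resolves to the text-text pairing.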

BatchSimilarityRequest (/api/v1/similarity/batch): its shape is {Query, Candidates []string, TopK} (top-k retrieval). Generalize Query to accept text OR image, add a sibling CandidateImages []string field, with the same exactly-one-of constraint on the corpus side. Mixed text+image candidates in a single batch are deferred.

Open question (please steer):

Three plausible shapes:

  1. Additive (drafted above): extend existing request types with optional image fields. Smallest diff. Mixes concerns inside each request type but each addition is narrow.
  2. Sibling endpoints: keep existing endpoints text-only, add /api/v1/classify/multimodal-intent, /api/v1/classify/multimodal-batch, /api/v1/embeddings/multimodal, /api/v1/similarity/multimodal, /api/v1/similarity/batch/multimodal. Cleaner separation; more endpoints to discover; doubles route registration.
  3. Typed-union request body on a new sibling endpoint set (InputA, InputB where each is oneof {Text, Image}): cleanest semantics; largest single-PR diff; sets a convention that doesn't match the rest of the apiserver today.

The additive shape is the smallest delta from today's surface. Happy to redo the draft in either of the others if maintainers prefer.

Staged delivery (if maintainers prefer focused PRs):

  1. /api/v1/classify/intent + /api/v1/eval (one PR; same request type) - immediately unblocks fixture-based testing for [Router][Docs] Add opt-in image-modality embedding pack #1896.
  2. /api/v1/classify/batch - same plumbing, batched form.
  3. /api/v1/embeddings - enables image-embedding extraction for downstream pipelines.
  4. /api/v1/similarity* (both pairwise and batch) - enables cross-modal similarity.

Each step can ship independently of the ones that follow it.

Out of scope for this issue:

  • Audio modality. MultiModalEncodeAudio is exposed at candle-binding/semantic-router.go:1106 (takes a pre-computed Mel spectrogram), but the byte-stream variants (FromBytes / FromBase64 / FromURL) that would let the HTTP API accept inline audio are not yet exposed. The existing validator already rejects audio rules at config-load for this reason (pkg/config/validator_embedding.go:64-67); a separate issue can track exposing the byte-stream variants if there's demand.
  • Remote (http/https) image URLs. The runtime explicitly rejects http URLs in the ExtProc image path (pkg/extproc/utils_fast.go:183: "Only inline data URIs are accepted (no HTTP URLs)"); the HTTP API should match. If remote-URL fetching becomes desirable later, it warrants its own design conversation (allow-lists, size caps, content-type sniffing) separate from this gap.
  • Multi-image batching efficiency. The first version can iterate per-image. Batched FFI calls are a perf optimization, not a correctness requirement.

Motivating PR: #1896 ships an opt-in image-modality embedding pack at config/signal/embedding/image-routing.yaml. Its "What's NOT in this PR" section names this gap on a single endpoint (/api/v1/classify/intent) and explicitly defers the shape proposal to a follow-on issue. This is that follow-on, scoped across the full embedding-related HTTP surface (6 endpoints once /api/v1/classify/batch is included) rather than just one, because the gap is structural.
