feature: extend HTTP API to accept image input for embedding-related endpoints #1911

### Describe the feature

The runtime supports image-modality embedding operations end-to-end:

- Candle-binding FFI for image encoding (PR #1414, `MultiModalEncodeImageFromBase64`).
- Request-path image extractor + signal-evaluator wiring (PRs #1867, #1868).
- CRD-side `queryModality` field on `IntelligentRoute` embedding rules (PR #1880); reconcile-time validation in #1895.

Image-modality embedding rules fire correctly when a chat completion arrives via the gRPC ExtProc path with an OpenAI-shaped `image_url` content array.

However, the HTTP API surface for direct embedding-related operations doesn't accept image input on any endpoint. Every embedding-adjacent handler in `pkg/apiserver/` accepts a text-only request schema, even though the underlying service methods and the multimodal FFI could compute image embeddings if given image input.

**Scope of the gap** (file:line citations against `main` as of 2026-05-15):

| Endpoint | Handler | Request type | Image-capable today? |
|---|---|---|---|
| `POST /api/v1/classify/intent` | `route_classify.go:18` (`handleIntentClassification`) | `services.IntentRequest{Text, Messages, Options}` (`pkg/services/classification_signal_contract.go:34`) | No |
| `POST /api/v1/classify/batch` | `route_classify.go:96` (`handleBatchClassification`) | `BatchClassificationRequest{Texts, TaskType}` (`pkg/apiserver/config.go:49`) | No |
| `POST /api/v1/eval` | `route_classify.go:38` (`handleEvalClassification`) | reuses `IntentRequest` | No |
| `POST /api/v1/embeddings` | `route_embeddings.go:14` (`handleEmbeddings`) | `EmbeddingRequest{Texts: []string, Model}` (`pkg/apiserver/config.go:86`) | No |
| `POST /api/v1/similarity` | `route_embeddings.go:131` (`handleSimilarity`) | `SimilarityRequest{Text1, Text2, Model}` (`pkg/apiserver/config.go:114`) | No |
| `POST /api/v1/similarity/batch` | `route_embeddings.go:190` (`handleBatchSimilarity`) | `BatchSimilarityRequest{Query, Candidates, TopK}` (`pkg/apiserver/config.go:132`) | No |

The runtime evaluator gates the image-modality path on `imageURL != ""` (`pkg/classification/classifier_signal_context.go:182`). Because none of these HTTP request schemas surface an image field, that gate never opens from the HTTP path, and image-modality embedding rules never fire from any of these endpoints regardless of what rules ship in the config. (`/api/v1/eval` reuses `IntentRequest` verbatim, so one extension closes both endpoints in a single change.)

### Primary layer

`global level`

### Why this layer?

The signal layer's image-modality plumbing already exists, the runtime FFI exists, and the gap is at the HTTP entry point that fans into both. Extending the HTTP request types is a request-API change rather than a signal-layer feature, which puts it in the "intentionally cross-cutting behavior" bucket the template describes for `global level`. If maintainers prefer `signal` because the motivation is unblocking image-modality embedding signals end-to-end, the re-tagging is fine; the engineering work is unchanged.

### Why do you need this feature?

1. **Author/operator validation of image-modality embedding rules.** A pack like `config/signal/embedding/image-routing.yaml` (shipped in #1896) defines three image-modality rules. Confirming those rules fire on representative images today requires either (a) standing up a full Envoy + ExtProc + backend chain to send chat completions, or (b) writing a custom gRPC ExtProc client. Both are heavier than running `curl` against `/api/v1/classify/intent`.

2. **Computing the embedding vector of an image for downstream use** (indexing, storage, retrieval). The multimodal model is loaded; the FFI supports it; the HTTP API doesn't expose it.

3. **Cross-modal similarity** ("which of these phrases is most similar to this image?"). Common in vision-language workflows; the runtime supports it via `ClassifyDetailedMultimodal`; no HTTP surface exposes it.

4. **Image-to-image similarity.** Same shape as above between two images.

### Additional context

**Proposed shape (aligned with the codebase's existing image-content convention):**

The runtime's existing image accept set is documented at `pkg/extproc/utils_fast.go:182-200`: inline `data:image/...;base64,...` URIs only, with no http/https URLs. This is intentional: the ExtProc path closes an SSRF-class concern there. The new HTTP fields should match that accept set. A `string` field carrying the data URI is the lightest option; an object-typed `Image { URL string }` mirroring OpenAI Chat Completions is also defensible. The shape below uses the string form; happy to switch to the typed object if maintainers prefer it for cross-product tooling alignment.
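To make the accept set concrete, here is a minimal sketch of a handler-side check. `validateImageDataURI` is a hypothetical helper, not the function in `utils_fast.go`, and its exact rules (for example, requiring standard padded base64) are assumptions that may be stricter or looser than the ExtProc extractor.

```go
package main

import (
    "encoding/base64"
    "errors"
    "fmt"
    "strings"
)

// validateImageDataURI is a hypothetical helper (not taken from the repository).
// It accepts only inline "data:image/<subtype>;base64,<payload>" URIs and
// rejects everything else, including http/https URLs.
func validateImageDataURI(s string) error {
    const prefix = "data:image/"
    if !strings.HasPrefix(s, prefix) {
        return errors.New("image must be an inline data:image/... URI")
    }
    // Split "<subtype>;base64" from the payload at the first comma.
    meta, payload, ok := strings.Cut(s[len(prefix):], ",")
    if !ok {
        return errors.New("data URI is missing the ',' separator")
    }
    if !strings.HasSuffix(meta, ";base64") || strings.TrimSuffix(meta, ";base64") == "" {
        return errors.New("data URI must declare an image subtype and ;base64 encoding")
    }
    if payload == "" {
        return errors.New("data URI has an empty payload")
    }
    // Assumption: standard, padded base64. The real extractor may differ.
    if _, err := base64.StdEncoding.DecodeString(payload); err != nil {
        return fmt.Errorf("payload is not valid base64: %w", err)
    }
    return nil
}

func main() {
    for _, in := range []string{
        "data:image/png;base64,aGVsbG8=",
        "https://example.com/cat.png",
        "data:image/png;base64,%%%",
    } {
        fmt.Printf("%q -> %v\n", in, validateImageDataURI(in))
    }
}
```

Reusing the ExtProc extractor directly, if it is exported or can be, would likely be preferable to a second validator, since it keeps the two paths from drifting apart.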

`IntentRequest` (covers `/api/v1/classify/intent` and `/api/v1/eval`):

```go
type IntentRequest struct {
    Text     string          `json:"text"`
    Messages []IntentMessage `json:"messages,omitempty"`
    Image    string          `json:"image,omitempty"`     // NEW: data:image/...;base64,... URI
    Options  *IntentOptions  `json:"options,omitempty"`
}
```

`ClassifyIntent` populates the `imageURL` argument that `EvaluateAllSignalsWithContext` already takes; nothing downstream changes.
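From the client side, a request against the proposed shape might be assembled as in the sketch below. `intentRequest`, `toDataURI`, and `buildIntentBody` are hypothetical names used for illustration; the struct mirrors only the `text` and `image` JSON fields from the draft above and omits `Messages` and `Options`.

```go
package main

import (
    "encoding/base64"
    "encoding/json"
    "fmt"
)

// intentRequest mirrors only the text and image fields of the proposed
// IntentRequest; it is a client-side stand-in, not the server type.
type intentRequest struct {
    Text  string `json:"text"`
    Image string `json:"image,omitempty"`
}

// toDataURI wraps raw image bytes in an inline data URI.
// The caller supplies the MIME type (for example "image/png").
func toDataURI(mimeType string, raw []byte) string {
    return "data:" + mimeType + ";base64," + base64.StdEncoding.EncodeToString(raw)
}

// buildIntentBody returns the JSON body for POST /api/v1/classify/intent.
// The image field is omitted entirely when no image bytes are given.
func buildIntentBody(text, mimeType string, raw []byte) ([]byte, error) {
    req := intentRequest{Text: text}
    if len(raw) > 0 {
        req.Image = toDataURI(mimeType, raw)
    }
    return json.Marshal(req)
}

func main() {
    // Placeholder bytes stand in for a real image file read from disk.
    body, err := buildIntentBody("what is in this picture?", "image/png", []byte("hello"))
    if err != nil {
        panic(err)
    }
    fmt.Println(string(body))
}
```

Because `image` carries `omitempty`, existing text-only clients would keep sending exactly the bodies they send today.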

`BatchClassificationRequest`: add a parallel `Images []string` field alongside `Texts`. In v1, exactly one of `Texts` / `Images` may be set per request; mixed batches are deferred.

`EmbeddingRequest`: same shape, add `Images []string` parallel to `Texts`.
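The exactly-one-of constraint on `Texts` / `Images` could be enforced with a check along these lines. This is a sketch only: `embeddingRequest` is a local stand-in for the real type, the `validate` method is hypothetical, and treating an empty slice as "not set" is an assumption.

```go
package main

import (
    "errors"
    "fmt"
)

// embeddingRequest is a local stand-in for the proposed EmbeddingRequest.
// Texts and Model exist today per the issue; Images is the proposed addition.
type embeddingRequest struct {
    Texts  []string
    Images []string // each entry is a data:image/...;base64,... URI
    Model  string
}

// validate enforces exactly one of Texts / Images per request.
// Assumption: an empty or nil slice counts as "not set".
func (r embeddingRequest) validate() error {
    hasTexts, hasImages := len(r.Texts) > 0, len(r.Images) > 0
    switch {
    case hasTexts && hasImages:
        return errors.New("set either texts or images, not both (mixed batches are deferred)")
    case !hasTexts && !hasImages:
        return errors.New("one of texts or images is required")
    }
    return nil
}

func main() {
    reqs := []embeddingRequest{
        {Texts: []string{"hello"}},
        {Images: []string{"data:image/png;base64,aGVsbG8="}},
        {Texts: []string{"hello"}, Images: []string{"data:image/png;base64,aGVsbG8="}},
        {},
    }
    for i, r := range reqs {
        fmt.Printf("request %d: %v\n", i, r.validate())
    }
}
```

The same check would apply unchanged to `BatchClassificationRequest`, since both types gain the same parallel field.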

`SimilarityRequest`: generalize to `{Text1, Text2, Image1, Image2}` with exactly-one-of `{text, image}` per side. Enables text-text (existing), image-image, and cross-modal text-image similarity.
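A sketch of how the per-side rule might resolve into a modality pairing follows. The field names come from the proposal above; `sideModality` and `pairing` are hypothetical helpers, and how each pairing is dispatched to the runtime is left out.

```go
package main

import "fmt"

// similarityRequest is a local stand-in for the generalized SimilarityRequest.
type similarityRequest struct {
    Text1  string
    Text2  string
    Image1 string // data:image/...;base64,... URI
    Image2 string // data:image/...;base64,... URI
    Model  string
}

// sideModality reports which input one side carries, enforcing
// exactly-one-of {text, image} for that side.
func sideModality(side int, text, image string) (string, error) {
    switch {
    case text != "" && image != "":
        return "", fmt.Errorf("side %d: set text or image, not both", side)
    case text != "":
        return "text", nil
    case image != "":
        return "image", nil
    default:
        return "", fmt.Errorf("side %d: text or image is required", side)
    }
}

// pairing returns "text-text", "text-image", "image-text", or "image-image".
func (r similarityRequest) pairing() (string, error) {
    a, err := sideModality(1, r.Text1, r.Image1)
    if err != nil {
        return "", err
    }
    b, err := sideModality(2, r.Text2, r.Image2)
    if err != nil {
        return "", err
    }
    return a + "-" + b, nil
}

func main() {
    img := "data:image/png;base64,aGVsbG8="
    for _, r := range []similarityRequest{
        {Text1: "a cat", Text2: "a dog"},
        {Text1: "a cat", Image2: img},
        {Image1: img, Image2: img},
        {Text1: "a cat", Image1: img, Text2: "a dog"},
    } {
        p, err := r.pairing()
        fmt.Println(p, err)
    }
}
```

Existing text-text callers fall into the `text-text` case without any change to their request bodies.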

`BatchSimilarityRequest` (`/api/v1/similarity/batch`): its shape is `{Query, Candidates []string, TopK}` (top-k retrieval). Generalize `Query` to accept text OR image, add a sibling `CandidateImages []string` field, with the same exactly-one-of constraint on the corpus side. Mixed text+image candidates in a single batch are deferred.

**Open question (please steer):**

Three plausible shapes:

1. **Additive** (drafted above): extend existing request types with optional image fields. Smallest diff. Mixes concerns inside each request type but each addition is narrow.
2. **Sibling endpoints**: keep existing endpoints text-only, add `/api/v1/classify/multimodal-intent`, `/api/v1/classify/multimodal-batch`, `/api/v1/embeddings/multimodal`, `/api/v1/similarity/multimodal`, `/api/v1/similarity/batch/multimodal`. Cleaner separation; more endpoints to discover; doubles route registration.
3. **Typed-union request body on a new sibling endpoint set** (`InputA`, `InputB` where each is `oneof {Text, Image}`): cleanest semantics; largest single-PR diff; sets a convention that doesn't match the rest of the apiserver today.

The additive shape is the smallest delta from today's surface. Happy to redo the draft in either of the others if maintainers prefer.

**Staged delivery (if maintainers prefer focused PRs):**

1. `/api/v1/classify/intent` + `/api/v1/eval` (one PR; same request type) - immediately unblocks fixture-based testing for #1896.
2. `/api/v1/classify/batch` - same plumbing, batched form.
3. `/api/v1/embeddings` - enables image-embedding extraction for downstream pipelines.
4. `/api/v1/similarity*` (both pairwise and batch) - enables cross-modal similarity.

Each step is independently shippable and builds on the one before it.

**Out of scope for this issue:**

- **Audio modality.** `MultiModalEncodeAudio` is exposed at `candle-binding/semantic-router.go:1106` (takes a pre-computed Mel spectrogram), but the byte-stream variants (`FromBytes` / `FromBase64` / `FromURL`) that would let the HTTP API accept inline audio are not yet exposed. The existing validator already rejects audio rules at config-load for this reason (`pkg/config/validator_embedding.go:64-67`); a separate issue can track exposing the byte-stream variants if there's demand.
- **Remote (http/https) image URLs.** The runtime explicitly rejects http URLs in the ExtProc image path (`pkg/extproc/utils_fast.go:183`: *"Only inline data URIs are accepted (no HTTP URLs)"*); the HTTP API should match. If remote-URL fetching becomes desirable later, it warrants its own design conversation (allow-lists, size caps, content-type sniffing) separate from this gap.
- **Multi-image batching efficiency.** The first version can iterate per-image. Batched FFI calls are a perf optimization, not a correctness requirement.

**Motivating PR:** #1896 ships an opt-in image-modality embedding pack at `config/signal/embedding/image-routing.yaml`. Its "What's NOT in this PR" section names this gap on a single endpoint (`/api/v1/classify/intent`) and explicitly defers the shape proposal to a follow-on issue. This is that follow-on, scoped across the full embedding-related HTTP surface (6 endpoints once `/api/v1/classify/batch` is included) rather than just one, because the gap is structural.


