Commit 25abc9e
feat(inference): allow local embeddings route (#1774)
* feat(inference): allow local embeddings route
Route OpenAI-compatible embeddings through the local inference proxy so
sandboxed vector workloads reach a configured provider via the same
route classification and auth path that chat, completion, and model
discovery already use.
- Add openai_embeddings to the OpenAI-compatible protocol set so
providers (openai, nvidia) advertise embeddings routing.
- Classify POST /v1/embeddings as the openai_embeddings protocol in the
sandbox L7 patterns.
- Serve embeddings buffered with an accurate Content-Length, since the
response is a single JSON object rather than an SSE token stream. The
streaming path appends an SSE error frame on a size-cap or idle-timeout
truncation, which would corrupt a one-object body the client parses
whole. protocol_returns_buffered_body() selects the path.
- Probe an embeddings-only backend against /v1/embeddings during
validation, after the chat and completion protocols so a multi-protocol
route still prefers those.
- Extract two shared helpers. http_status_text() backs both response
formatters and adds 401/422/429/503 for embeddings passthrough and
router error mapping; write_inference_router_error() backs the streaming
and buffered routing paths.
- Return an OpenAI-shaped embeddings body from the mock route.
Tests cover profile lookup, L7 pattern detection, the mock body, and
buffered Content-Length framing with no chunked transfer-encoding and no
SSE error frame.
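
For illustration, the buffered framing described above can be sketched as follows. This is a minimal sketch, not the code in this change: status_text() is a hypothetical stand-in for http_status_text(), frame_buffered_response() is an invented name, and the reason phrases and Content-Type header are assumptions.

```rust
/// Hypothetical stand-in for the shared status-text helper.
/// The real table is not reproduced here; these phrases are assumptions.
fn status_text(status: u16) -> &'static str {
    match status {
        200 => "OK",
        400 => "Bad Request",
        401 => "Unauthorized",
        422 => "Unprocessable Content",
        429 => "Too Many Requests",
        502 => "Bad Gateway",
        503 => "Service Unavailable",
        _ => "Unknown",
    }
}

/// Frame a fully buffered JSON body with an exact Content-Length.
/// No chunked transfer-encoding is used, so nothing can be appended
/// to the body after the headers have been written.
fn frame_buffered_response(status: u16, body: &[u8]) -> Vec<u8> {
    let head = format!(
        "HTTP/1.1 {} {}\r\nContent-Type: application/json\r\nContent-Length: {}\r\n\r\n",
        status,
        status_text(status),
        body.len()
    );
    let mut out = head.into_bytes();
    out.extend_from_slice(body);
    out
}
```

Because the length is declared up front, a client that parses the body as one JSON object either receives all of it or sees a short read, never a trailing SSE frame.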
Signed-off-by: Shiju <shiju@nvidia.com>
* fix(inference): cap buffered inference response body
The buffered proxy path read the whole upstream response into memory with
no size bound. The route timeout bounds elapsed time but not memory, so a
misbehaving or oversized upstream could force unbounded allocation in the
sandbox proxy. The streaming path already caps each response at 32 MiB;
the buffered path did not.
Cap the buffered read at the same 32 MiB. An advertised over-cap body is
rejected from its Content-Length before any bytes are read, and chunks
accumulate under the same bound so a chunked or mislabeled body cannot
slip past. An over-cap response fails as an upstream protocol error,
surfaced as HTTP 502 at the proxy boundary, and is never partially
returned.
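
The two checks can be sketched roughly as below, assuming a blocking std::io::Read upstream for simplicity. read_capped(), ReadError, and MAX_BUFFERED_BODY are hypothetical names, and the real proxy path may be asynchronous and structured differently.

```rust
use std::io::Read;

/// 32 MiB, matching the cap the streaming path already applies.
const MAX_BUFFERED_BODY: usize = 32 * 1024 * 1024;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum ReadError {
    /// Advertised or accumulated body exceeds the cap.
    OverCap,
    /// Any other read failure from the upstream.
    Io(String),
}

/// Read a whole upstream body, never holding more than `cap` bytes.
fn read_capped<R: Read>(
    mut upstream: R,
    content_length: Option<u64>,
    cap: usize,
) -> Result<Vec<u8>, ReadError> {
    // Reject an advertised over-cap body before reading any bytes.
    if let Some(len) = content_length {
        if len > cap as u64 {
            return Err(ReadError::OverCap);
        }
    }
    let mut body = Vec::new();
    let mut chunk = [0u8; 8192];
    loop {
        let n = upstream
            .read(&mut chunk)
            .map_err(|e| ReadError::Io(e.to_string()))?;
        if n == 0 {
            return Ok(body);
        }
        // Enforce the same bound while accumulating, so a chunked or
        // mislabeled body cannot slip past the header check.
        if body.len() + n > cap {
            return Err(ReadError::OverCap);
        }
        body.extend_from_slice(&chunk[..n]);
    }
}
```

An over-cap result returns no partial body, which is what lets the caller map it to a single upstream protocol error.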
Tests
- cargo test -p openshell-router \
proxy_to_backend_rejects_over_cap_response_body
Signed-off-by: Shiju <shiju@nvidia.com>
* fix(inference): validate embeddings models against all advertised protocols
A managed route resolves to its provider profile's full protocol set, so
an embeddings model such as text-embedding-3-small lists
openai_chat_completions alongside openai_embeddings. Route verification
probed only the first writable protocol and stopped on its failure. It
sent a chat probe with the embedding model, the provider rejected it as
wrong-shape, and the route failed validation before the embeddings probe
ran. Embeddings-only configs could not be verified.
Try the advertised protocols in preference order. A request-shape
rejection (HTTP 400, 404, 405, 422) falls through to the next protocol,
so an embeddings model validates against /v1/embeddings even when the
chat probe rejects it. Credential, rate-limit, connectivity, and
upstream-health failures stay terminal and stop validation at the first
probe, so a bad key or a down backend is reported as itself rather than
masked by a later probe.
validation_probe becomes validation_probes, which returns the ordered
list, and the per-probe fallback retry (max_completion_tokens versus
max_tokens) moves into a shared helper.
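
The fall-through rule can be sketched as below. The types and the verify_route() name are hypothetical, and reporting the last request-shape rejection when every probe falls through is an assumption of this sketch, not something stated above.

```rust
/// Hypothetical result of a single validation probe.
#[derive(Debug, Clone, PartialEq)]
enum ProbeOutcome {
    Ok,
    Status(u16),
    ConnectError,
}

/// Hypothetical overall verdict for a route.
#[derive(Debug, PartialEq)]
enum Verdict {
    Verified(&'static str),
    Failed { protocol: &'static str, outcome: ProbeOutcome },
    NoProtocols,
}

/// Statuses that mean "wrong request shape for this model", not "broken route".
fn is_request_shape_rejection(status: u16) -> bool {
    matches!(status, 400 | 404 | 405 | 422)
}

/// Try protocols in preference order. Shape rejections fall through to
/// the next protocol; every other failure is terminal at the first probe.
fn verify_route<F>(protocols: &[&'static str], mut probe: F) -> Verdict
where
    F: FnMut(&'static str) -> ProbeOutcome,
{
    let mut last = Verdict::NoProtocols;
    for &protocol in protocols {
        match probe(protocol) {
            ProbeOutcome::Ok => return Verdict::Verified(protocol),
            ProbeOutcome::Status(s) if is_request_shape_rejection(s) => {
                // Remember the rejection and try the next protocol.
                last = Verdict::Failed { protocol, outcome: ProbeOutcome::Status(s) };
            }
            // Credentials, rate limits, connectivity, upstream health.
            outcome => return Verdict::Failed { protocol, outcome },
        }
    }
    last
}
```

With the order [openai_chat_completions, openai_embeddings], a 400 on the chat probe falls through to the embeddings probe, while a 401 stops validation at the chat probe and is reported as the failure.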
Tests
- cargo test -p openshell-router \
verify_embeddings_model_falls_through_chat_probe
- cargo test -p openshell-router verify_stops_on_credentials_failure
Signed-off-by: Shiju <shiju@nvidia.com>
* fix(inference): serve model discovery responses buffered
GET /v1/models returns a single JSON model list, the same response shape
as embeddings. The sandbox inference proxy was routing it through the SSE
streaming path. A streaming size-cap or idle-timeout truncation appends
an SSE error frame to the body, which corrupts a payload the client
parses as one JSON object.
Make response framing a property of the protocol. A new ResponseFraming
field on InferenceApiPattern is set once per pattern in default_patterns.
model_discovery and openai_embeddings are now Buffered, while chat
completions, completions, responses, and Anthropic messages stay
Streaming. The proxy dispatch gates on pattern.is_buffered(), which
replaces the stringly-typed protocol_returns_buffered_body predicate so
the streaming-versus-buffered decision lives in one place and cannot
drift across the sites that read it.
Model discovery now flows through the same buffered path as embeddings,
framed with an accurate Content-Length and bounded by the buffered-read
size cap that path already enforces.
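
A rough sketch of framing as a per-pattern property follows. Only ResponseFraming, InferenceApiPattern, default_patterns, and is_buffered() are names from this change; the field names, the reduced pattern list, and the chat completions path are assumptions made for illustration.

```rust
/// How the proxy frames the response for a matched pattern.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ResponseFraming {
    /// SSE token stream, relayed incrementally.
    Streaming,
    /// Single JSON object, served with an accurate Content-Length.
    Buffered,
}

/// Field names here are assumed; the real struct may carry more.
struct InferenceApiPattern {
    method: &'static str,
    path: &'static str,
    protocol: &'static str,
    framing: ResponseFraming,
}

impl InferenceApiPattern {
    /// The single predicate the proxy dispatch gates on.
    fn is_buffered(&self) -> bool {
        self.framing == ResponseFraming::Buffered
    }
}

/// Reduced pattern table for illustration; framing is set once per pattern.
fn default_patterns() -> Vec<InferenceApiPattern> {
    vec![
        InferenceApiPattern {
            method: "POST",
            path: "/v1/chat/completions", // assumed path
            protocol: "openai_chat_completions",
            framing: ResponseFraming::Streaming,
        },
        InferenceApiPattern {
            method: "POST",
            path: "/v1/embeddings",
            protocol: "openai_embeddings",
            framing: ResponseFraming::Buffered,
        },
        InferenceApiPattern {
            method: "GET",
            path: "/v1/models",
            protocol: "model_discovery",
            framing: ResponseFraming::Buffered,
        },
    ]
}
```

Keeping the decision on the pattern means a new protocol must state its framing where it is declared, instead of relying on a string comparison elsewhere.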
Tests
- cargo test -p openshell-sandbox protocol_framing_classification
- cargo test -p openshell-sandbox \
inference_model_discovery_served_buffered_with_content_length
Signed-off-by: Shiju <shiju@nvidia.com>
* docs(inference): document embeddings route in supported patterns
Signed-off-by: Shiju <shiju@nvidia.com>
---------
Signed-off-by: Shiju <shiju@nvidia.com>
1 parent 3558888 commit 25abc9e
7 files changed
Lines changed: 1048 additions & 124 deletions
File tree
- crates
- openshell-core/src
- openshell-router/src
- openshell-sandbox/src
- l7
- openshell-server/src
- docs/sandboxes