Commit 25abc9e

feat(inference): allow local embeddings route (#1774)

* feat(inference): allow local embeddings route

Route OpenAI-compatible embeddings through the local inference proxy so
sandboxed vector workloads reach a configured provider via the same
route classification and auth path that chat, completion, and model
discovery already use.

- Add openai_embeddings to the OpenAI-compatible protocol set so
  providers (openai, nvidia) advertise embeddings routing.
- Classify POST /v1/embeddings as the openai_embeddings protocol in the
  sandbox L7 patterns.
- Serve embeddings buffered with an accurate Content-Length, since the
  response is a single JSON object rather than an SSE token stream. The
  streaming path appends an SSE error frame on a size-cap or
  idle-timeout truncation, which would corrupt a one-object body the
  client parses whole. protocol_returns_buffered_body() selects the
  path.
- Probe an embeddings-only backend against /v1/embeddings during
  validation, after the chat and completion protocols so a
  multi-protocol route still prefers those.
- Extract two shared helpers. http_status_text() backs both response
  formatters and adds 401/422/429/503 for embeddings passthrough and
  router error mapping; write_inference_router_error() backs the
  streaming and buffered routing paths.
- Return an OpenAI-shaped embeddings body from the mock route.

Tests cover profile lookup, L7 pattern detection, the mock body, and
buffered Content-Length framing with no chunked transfer-encoding and no
SSE error frame.

Signed-off-by: Shiju <shiju@nvidia.com>

* fix(inference): cap buffered inference response body

The buffered proxy path read the whole upstream response into memory
with no size bound. The route timeout bounds elapsed time but not
memory, so a misbehaving or oversized upstream could force unbounded
allocation in the sandbox proxy. The streaming path already caps each
response at 32 MiB; the buffered path did not.

Cap the buffered read at the same 32 MiB. An advertised over-cap body is
rejected from its Content-Length before any bytes are read, and chunks
accumulate under the same bound so a chunked or mislabeled body cannot
slip past. An over-cap response fails as an upstream protocol error,
surfaced as HTTP 502 at the proxy boundary, and is never partially
returned.

Tests
- cargo test -p openshell-router \
    proxy_to_backend_rejects_over_cap_response_body

Signed-off-by: Shiju <shiju@nvidia.com>

* fix(inference): validate embeddings models against all advertised protocols

A managed route resolves to its provider profile's full protocol set, so
an embeddings model such as text-embedding-3-small lists
openai_chat_completions alongside openai_embeddings. Route verification
probed only the first writable protocol and stopped on its failure. It
sent a chat probe with the embedding model, the provider rejected it as
wrong-shape, and the route failed validation before the embeddings probe
ran. Embeddings-only configs could not be verified.

Try the advertised protocols in preference order. A request-shape
rejection (HTTP 400, 404, 405, 422) falls through to the next protocol,
so an embeddings model validates against /v1/embeddings even when the
chat probe rejects it. Credential, rate-limit, connectivity, and
upstream-health failures stay terminal and stop validation at the first
probe, so a bad key or a down backend is reported as itself rather than
masked by a later probe.

validation_probe becomes validation_probes, which returns the ordered
list, and the per-probe fallback retry (max_completion_tokens versus
max_tokens) moves into a shared helper.

Tests
- cargo test -p openshell-router \
    verify_embeddings_model_falls_through_chat_probe
- cargo test -p openshell-router verify_stops_on_credentials_failure

Signed-off-by: Shiju <shiju@nvidia.com>

* fix(inference): serve model discovery responses buffered

GET /v1/models returns a single JSON model list, the same response shape
as embeddings. The sandbox inference proxy was routing it through the
SSE streaming path. A streaming size-cap or idle-timeout truncation
appends an SSE error frame to the body, which corrupts a payload the
client parses as one JSON object.

Make response framing a property of the protocol. A new ResponseFraming
field on InferenceApiPattern is set once per pattern in
default_patterns. model_discovery and openai_embeddings are now
Buffered, while chat completions, completions, responses, and Anthropic
messages stay Streaming. The proxy dispatch gates on
pattern.is_buffered(), which replaces the stringly-typed
protocol_returns_buffered_body predicate so the streaming-versus-buffered
decision lives in one place and cannot drift across the sites that read
it.

Model discovery now flows through the same buffered path as embeddings,
framed with an accurate Content-Length and bounded by the buffered-read
size cap that path already enforces.

Tests
- cargo test -p openshell-sandbox protocol_framing_classification
- cargo test -p openshell-sandbox \
    inference_model_discovery_served_buffered_with_content_length

Signed-off-by: Shiju <shiju@nvidia.com>

* docs(inference): document embeddings route in supported patterns

Signed-off-by: Shiju <shiju@nvidia.com>

---------

Signed-off-by: Shiju <shiju@nvidia.com>

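
The probe ordering described in the third commit can be sketched roughly as follows. This is a minimal illustration rather than the code in openshell-router: `ProbeOutcome`, `VerifyError`, `classify_status`, and `verify_route` are hypothetical names, the probe is modeled as a closure returning only an HTTP status (connectivity failures are not represented), and treating every status outside 2xx and the four fall-through codes as terminal is an assumption.

```rust
/// What one validation probe tells the verifier (hypothetical type).
#[derive(Debug, PartialEq)]
enum ProbeOutcome {
    /// The backend accepted the probe; the route validates.
    Accepted,
    /// The backend rejected the request shape; try the next protocol.
    ShapeRejected,
    /// Credential, rate-limit, or upstream-health failure; stop here.
    Terminal,
}

/// Maps an HTTP status to a probe outcome. The fall-through set
/// (400, 404, 405, 422) is the one listed in the commit message.
fn classify_status(status: u16) -> ProbeOutcome {
    match status {
        200..=299 => ProbeOutcome::Accepted,
        400 | 404 | 405 | 422 => ProbeOutcome::ShapeRejected,
        // Assumption: anything else (401, 429, 5xx, ...) is terminal.
        _ => ProbeOutcome::Terminal,
    }
}

/// Why verification failed (hypothetical type).
#[derive(Debug, PartialEq)]
enum VerifyError {
    /// A probe failed for a reason a later probe must not mask.
    Terminal { protocol: String, status: u16 },
    /// Every advertised protocol rejected the request shape.
    NoProtocolAccepted,
}

/// Tries the protocols in preference order and returns the first one
/// the backend accepts. `probe` stands in for the real HTTP request.
fn verify_route<'a, F>(protocols: &[&'a str], mut probe: F) -> Result<&'a str, VerifyError>
where
    F: FnMut(&str) -> u16,
{
    for &protocol in protocols {
        let status = probe(protocol);
        match classify_status(status) {
            ProbeOutcome::Accepted => return Ok(protocol),
            ProbeOutcome::ShapeRejected => continue,
            ProbeOutcome::Terminal => {
                return Err(VerifyError::Terminal {
                    protocol: protocol.to_string(),
                    status,
                })
            }
        }
    }
    Err(VerifyError::NoProtocolAccepted)
}
```

With `["openai_chat_completions", "openai_embeddings"]` and a backend that answers 400 to the chat probe and 200 to the embeddings probe, `verify_route` returns `Ok("openai_embeddings")`. A 401 on the chat probe returns the terminal error without ever sending the embeddings probe.
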
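
The buffered-read cap from the second commit might look something like the sketch below. It uses a blocking `std::io::Read` source for brevity, whereas the real proxy presumably reads from an async HTTP body; `BodyError`, `read_capped`, and `read_buffered_body` are hypothetical names, and only the 32 MiB figure and the two checks (advertised length first, then accumulated bytes) come from the commit message.

```rust
use std::io::{self, Read};

/// 32 MiB, the cap the commit message applies to both paths.
const MAX_BUFFERED_BODY: usize = 32 * 1024 * 1024;

/// Failure modes of a capped read (hypothetical type). The commit
/// message says an over-cap body surfaces as HTTP 502 at the proxy.
#[derive(Debug, PartialEq)]
enum BodyError {
    /// The body exceeds the cap, as advertised or as actually received.
    TooLarge,
    /// The underlying read failed.
    Io(io::ErrorKind),
}

/// Reads a whole body into memory, refusing anything over `cap` bytes.
fn read_capped<R: Read>(
    mut body: R,
    content_length: Option<usize>,
    cap: usize,
) -> Result<Vec<u8>, BodyError> {
    // Reject an advertised over-cap body before reading any bytes.
    if content_length.map_or(false, |len| len > cap) {
        return Err(BodyError::TooLarge);
    }
    let mut out = Vec::new();
    let mut chunk = [0u8; 8192];
    loop {
        let n = match body.read(&mut chunk) {
            Ok(0) => return Ok(out), // end of body
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(BodyError::Io(e.kind())),
        };
        // Enforce the bound on accumulated bytes too, so a chunked or
        // mislabeled body cannot slip past the Content-Length check.
        if out.len() + n > cap {
            return Err(BodyError::TooLarge);
        }
        out.extend_from_slice(&chunk[..n]);
    }
}

/// Convenience wrapper that applies the 32 MiB cap.
fn read_buffered_body<R: Read>(
    body: R,
    content_length: Option<usize>,
) -> Result<Vec<u8>, BodyError> {
    read_capped(body, content_length, MAX_BUFFERED_BODY)
}
```

Because the function returns either the complete body or an error, a caller never sees a partial payload, which matches the "never partially returned" behavior described above.
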
1 parent 3558888 commit 25abc9e

7 files changed

Lines changed: 1048 additions & 124 deletions

crates/openshell-core/src/inference.rs

Lines changed: 12 additions & 0 deletions
@@ -56,6 +56,7 @@ const OPENAI_PROTOCOLS: &[&str] = &[
     "openai_chat_completions",
     "openai_completions",
     "openai_responses",
+    "openai_embeddings",
     "model_discovery",
 ];
 

@@ -305,6 +306,17 @@ mod tests {
         assert!(profile_for("OpenAI").is_some()); // case insensitive
     }
 
+    #[test]
+    fn openai_compatible_profiles_include_embeddings() {
+        for provider_type in ["openai", "nvidia"] {
+            let profile = profile_for(provider_type).expect("provider profile should exist");
+            assert!(
+                profile.protocols.contains(&"openai_embeddings"),
+                "{provider_type} should route OpenAI-compatible embeddings"
+            );
+        }
+    }
+
     #[test]
     fn profile_for_unknown_types() {
         assert!(profile_for("github").is_none());
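
The protocol name added in this file is the key that the sandbox pattern classification and the framing decision described in the commit message operate on. The sketch below shows one way those pieces could fit together. `ResponseFraming`, `InferenceApiPattern`, `default_patterns`, and `is_buffered` are names taken from the commit message, but the field layout, the `detect` helper, the exact-match lookup, and the chat, completions, and responses paths are assumptions made for illustration, not the repository's actual definitions.

```rust
/// Protocol set as it stands after this diff.
const OPENAI_PROTOCOLS: &[&str] = &[
    "openai_chat_completions",
    "openai_completions",
    "openai_responses",
    "openai_embeddings",
    "model_discovery",
];

/// How the proxy frames the response for a protocol.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ResponseFraming {
    /// SSE token stream, forwarded incrementally.
    Streaming,
    /// Single JSON object, sent whole with an accurate Content-Length.
    Buffered,
}

/// One classification rule (field layout is an assumption).
#[derive(Debug, Clone, Copy)]
struct InferenceApiPattern {
    method: &'static str,
    path: &'static str,
    protocol: &'static str,
    framing: ResponseFraming,
}

impl InferenceApiPattern {
    fn is_buffered(&self) -> bool {
        self.framing == ResponseFraming::Buffered
    }
}

/// Framing is set once per pattern, so the streaming-versus-buffered
/// decision lives in one place.
fn default_patterns() -> Vec<InferenceApiPattern> {
    use ResponseFraming::{Buffered, Streaming};
    let pattern = |method, path, protocol, framing| InferenceApiPattern {
        method,
        path,
        protocol,
        framing,
    };
    vec![
        pattern("POST", "/v1/chat/completions", "openai_chat_completions", Streaming),
        pattern("POST", "/v1/completions", "openai_completions", Streaming),
        pattern("POST", "/v1/responses", "openai_responses", Streaming),
        pattern("POST", "/v1/embeddings", "openai_embeddings", Buffered),
        pattern("GET", "/v1/models", "model_discovery", Buffered),
    ]
}

/// Hypothetical lookup: exact match on method and path only.
fn detect(method: &str, path: &str) -> Option<InferenceApiPattern> {
    default_patterns()
        .into_iter()
        .find(|p| p.method == method && p.path == path)
}
```

In this model a request is routed only if `detect` finds a pattern and the provider's protocol set contains `pattern.protocol`, which is why embeddings routing needs the `"openai_embeddings"` entry added above.
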
