Retry gateway model-name discovery for KAITO llama.cpp#167
Draft
sozercan wants to merge 1 commit into
Draft
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This draft PR retries gateway model-name discovery for KAITO
llama.cppdeployments when the controller has to fall back tospec.model.idbecause the model server is not ready to answer/v1/modelsyet.Concretely, it:
reconcileGatewaymodelNameResolutionresult so model-name resolution can carry both the resolved name and whether the controller should try again laterModelDeploymentreconcile after 1 minute when the fallback path is hit for KAITOllama.cppWhy
054ba06changed gateway model-name resolution so KAITOllama.cppdeployments do not trustspec.model.servedNameand instead prefer runtime discovery from/v1/models. That makes sense because AIKit / LocalAI can expose a served model name derived from the downloaded GGUF file rather than the original Hugging Face ID.The remaining gap is timing:
ModelDeploymentreachesRunning/v1/modelsis not ready yet, so discovery failsspec.model.idHTTPRoutegets created or updated with the fallback header matchAt that point there is no guaranteed later reconcile to correct the route header once the model server becomes ready.
Relevant current behavior:
ModelDeploymentand ownedInferencePools, but notHTTPRouteairunway.ai/httproute-createdSo this PR is trying to close a controller timing hole where the route can stay pinned to the fallback model name even though the runtime later exposes the correct model name.
What Changed
Controller behavior
reconcileGatewaynow returns(bool, error)where the boolean means "please requeue for model-name discovery"ctrl.Result{RequeueAfter: time.Minute}when gateway reconciliation requests a retrykaitollamacppspec.gateway.modelNameoverrideModel-name resolution
modelNameResolutionwith:nameretryspec.gateway.modelNamestill wins and suppresses retry/v1/modelsdiscovery returns the discovered runtime model name and no retryspec.model.idcan now also request a later retryTests
Added coverage for:
llama.cppfallback requesting retryreconcileGatewaysignatureInvestigation Context
This PR came out of reconstructing some local unstaged changes after the original session history was lost.
While investigating, I also checked the current live cluster state. That cluster does not have this patch deployed yet, and the live failure I reproduced there appears to be a separate problem:
curl http://102.133.128.103/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "NVIDIA-Nemotron-3-Nano-4B-UD-IQ2_M", "messages": [{"role": "user", "content": "Hello"}]}'503 Service Unavailableupstream connect error or disconnect/reset before headers. reset reason: connection terminationGET /v1/modelsfrom the model service returned200withNVIDIA-Nemotron-3-Nano-4B-UD-IQ2_MPOST /v1/chat/completionsalso failedOOMKilled/ exit code137CrashLoopBackOffresources: {}Default capability (no GPU detected)That means the currently observed cluster failure is most likely an upstream model crash / resource issue, not something caused by this PR.
I am including that context here because it is easy to conflate the two:
llama.cppgateway model-name discoveryValidation
Ran locally in
controller/:go build ./...go test ./...Open Questions For Review
Because this is a draft, a few follow-ups are worth deciding explicitly:
spec.gateway.httpRouteRefand owns route updates externally?