Retry gateway model-name discovery for KAITO llama.cpp #167

Draft
sozercan wants to merge 1 commit into main from fix/kaito-llamacpp-gateway-model-name-retry

Summary

This draft PR retries gateway model-name discovery for KAITO llama.cpp deployments when the controller has to fall back to spec.model.id because the model server is not ready to answer /v1/models yet.

Concretely, it:

  • returns a retry signal from reconcileGateway
  • adds a modelNameResolution result so model-name resolution can carry both the resolved name and whether the controller should try again later
  • requeues the ModelDeployment reconcile after 1 minute when the fallback path is hit for KAITO llama.cpp
  • adds unit coverage for the retry and explicit-override cases

Why

054ba06 changed gateway model-name resolution so KAITO llama.cpp deployments do not trust spec.model.servedName and instead prefer runtime discovery from /v1/models. That makes sense because AIKit / LocalAI can expose a served model name derived from the downloaded GGUF file rather than the original Hugging Face ID.

The remaining gap is timing:

  1. the ModelDeployment reaches Running
  2. gateway reconciliation runs
  3. /v1/models is not ready yet, so discovery fails
  4. the controller falls back to spec.model.id
  5. the HTTPRoute gets created or updated with the fallback header match

At that point there is no guaranteed later reconcile to correct the route header once the model server becomes ready.

Relevant current behavior:

  • the controller watches ModelDeployment and owned InferencePools, but not HTTPRoute
  • auto-created routes are annotated with airunway.ai/httproute-created
  • missing routes are intentionally not recreated after user deletion
  • existing routes are updated on reconcile, but only if a later reconcile actually happens

This PR closes that controller timing gap, in which the route stays pinned to the fallback model name even though the runtime later exposes the correct one.
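For reviewers less familiar with the discovery step, here is a minimal sketch of reading a served model name from an OpenAI-compatible /v1/models response. It is not the controller's actual code: the names firstModelID and discoverModelName are hypothetical, and picking the first entry in the list is an assumption made for illustration.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// modelList is the subset of an OpenAI-compatible /v1/models response
// that this sketch reads.
type modelList struct {
	Data []struct {
		ID string `json:"id"`
	} `json:"data"`
}

// firstModelID returns the first model id found in a /v1/models body.
// Choosing the first entry is an assumption of this sketch.
func firstModelID(body []byte) (string, bool) {
	var list modelList
	if err := json.Unmarshal(body, &list); err != nil {
		return "", false
	}
	if len(list.Data) == 0 || list.Data[0].ID == "" {
		return "", false
	}
	return list.Data[0].ID, true
}

// discoverModelName (hypothetical) queries baseURL + "/v1/models" and
// returns an error while the model server is not ready to answer.
func discoverModelName(ctx context.Context, client *http.Client, baseURL string) (string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/v1/models", nil)
	if err != nil {
		return "", err
	}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("GET /v1/models: unexpected status %d", resp.StatusCode)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	name, ok := firstModelID(body)
	if !ok {
		return "", fmt.Errorf("GET /v1/models: no model id in response")
	}
	return name, nil
}
```

Any error from a function like discoverModelName corresponds to step 3 in the sequence above: discovery fails and the controller falls back to spec.model.id.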

What Changed

Controller behavior

  • reconcileGateway now returns (bool, error) where the boolean means "please requeue for model-name discovery"
  • the main reconcile loop uses ctrl.Result{RequeueAfter: time.Minute} when gateway reconciliation requests a retry
  • the retry signal is limited to:
    • provider = kaito
    • engine = llamacpp
    • no explicit spec.gateway.modelName override

Model-name resolution

  • introduced modelNameResolution with:
    • name
    • retry
  • explicit spec.gateway.modelName still wins and suppresses retry
  • successful /v1/models discovery returns the discovered runtime model name and no retry
  • fallback to spec.model.id now also requests a later retry when the deployment is KAITO llama.cpp
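The precedence rules above can be illustrated with a small sketch. The modelNameResolution fields match the description in this PR; resolveInput, discoverFunc, and resolveModelName are hypothetical names, and attempting discovery regardless of provider is a simplification of this sketch rather than a statement about the real resolver.

```go
package main

// modelNameResolution carries the resolved gateway model name and
// whether the controller should try discovery again later.
type modelNameResolution struct {
	name  string
	retry bool
}

// resolveInput (hypothetical) is a flattened view of the fields read here.
type resolveInput struct {
	provider        string // e.g. "kaito"
	engine          string // e.g. "llamacpp"
	gatewayOverride string // spec.gateway.modelName
	modelID         string // spec.model.id
}

// discoverFunc (hypothetical) stands in for the /v1/models lookup;
// ok is false when the model server is not ready yet.
type discoverFunc func() (name string, ok bool)

func resolveModelName(in resolveInput, discover discoverFunc) modelNameResolution {
	// 1. An explicit override always wins and suppresses retry.
	if in.gatewayOverride != "" {
		return modelNameResolution{name: in.gatewayOverride}
	}
	// 2. Successful discovery returns the runtime name, no retry.
	if name, ok := discover(); ok {
		return modelNameResolution{name: name}
	}
	// 3. Fallback to spec.model.id; only KAITO llama.cpp asks for a retry.
	retry := in.provider == "kaito" && in.engine == "llamacpp"
	return modelNameResolution{name: in.modelID, retry: retry}
}
```

Keeping the retry flag next to the name lets the caller decide about requeueing without re-deriving why the fallback was used.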

Tests

Added coverage for:

  • KAITO llama.cpp fallback requesting retry
  • explicit gateway model-name override suppressing retry
  • existing gateway tests updated for the new reconcileGateway signature

Investigation Context

This PR came out of reconstructing some local unstaged changes after the original session history was lost.

While investigating, I also checked the current live cluster state. That cluster does not have this patch deployed yet, and the live failure I reproduced there appears to be a separate problem:

  • current test request:
    • curl http://102.133.128.103/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "NVIDIA-Nemotron-3-Nano-4B-UD-IQ2_M", "messages": [{"role": "user", "content": "Hello"}]}'
  • gateway response:
    • 503 Service Unavailable
    • upstream connect error or disconnect/reset before headers. reset reason: connection termination
  • direct service checks:
    • GET /v1/models from the model service returned 200 with NVIDIA-Nemotron-3-Nano-4B-UD-IQ2_M
    • direct POST /v1/chat/completions also failed
  • pod state at the time of investigation:
    • model pod restarted multiple times
    • last termination reason was OOMKilled / exit code 137
    • pod briefly entered CrashLoopBackOff
    • current pod spec showed resources: {}
    • runner log showed Default capability (no GPU detected)

That means the currently observed cluster failure is most likely an upstream model crash / resource issue, not something caused by this PR.

I am including that context here because it is easy to conflate the two:

  • this PR addresses a controller retry / route-correction gap for KAITO llama.cpp gateway model-name discovery
  • the current live cluster failure looked like a runtime OOM / inference crash path

Validation

Ran locally in controller/:

  • go build ./...
  • go test ./...

Open Questions For Review

Because this is a draft, a few follow-ups are worth deciding explicitly:

  • Should the retry be capped instead of continuing indefinitely while discovery keeps failing?
  • Should retry be skipped when the user provides spec.gateway.httpRouteRef and owns route updates externally?
  • Is 1 minute the right retry delay, or should it be shorter / longer for AIKit startup behavior?
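To make the first and third questions concrete, here is one possible shape for a capped retry with a growing delay. This is only a discussion aid and not part of the PR: nextRetryDelay is a hypothetical helper, the constants are illustrative placeholders, and where the attempt count would be stored (status, annotation, or elsewhere) is left open.

```go
package main

import "time"

// Illustrative placeholders, not values proposed by this PR.
const (
	baseDelay   = 15 * time.Second
	maxDelay    = time.Minute
	maxAttempts = 5
)

// nextRetryDelay (hypothetical) doubles the delay per attempt, clamps it
// at maxDelay, and reports false once the attempt budget is used up.
func nextRetryDelay(attempt int) (time.Duration, bool) {
	if attempt >= maxAttempts {
		return 0, false // give up; keep the fallback route as-is
	}
	delay := baseDelay
	for i := 0; i < attempt; i++ {
		delay *= 2
		if delay >= maxDelay {
			return maxDelay, true
		}
	}
	if delay > maxDelay {
		return maxDelay, true
	}
	return delay, true
}
```

Under these placeholder values, early retries happen sooner than the current fixed 1 minute, and the controller stops requeueing after five failed discovery attempts.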
