fix(gorch): add HAProxy timeout annotations to prevent 504 errors#692
Conversation
The Guardrails Orchestrator routes were returning 504 Gateway Timeout errors after 30 seconds when LLM-based guardrails detection took longer to process. This occurred because OpenShift's HAProxy router enforces a default 30-second timeout on routes without explicit annotations. This fix adds a 5-minute timeout annotation to three LLM-facing routes: - Orchestrator route (port 8032) - main API endpoint for chat/completions-detection - Gateway route - handles sidecar gateway LLM traffic - Built-in detector route - handles built-in detector LLM calls Changes: - Added Annotations field to RouteConfig struct (controllers/utils/route.go) - Updated route template to conditionally render annotations (controllers/gorch/templates/route.tmpl.yaml) - Set haproxy.router.openshift.io/timeout: 5m on three route reconciliation functions The 5-minute timeout allows slow LLM inference to complete while preventing truly hung connections from lingering indefinitely. This is a backward-compatible change that takes effect when the operator reconciles existing GuardrailsOrchestrator resources. Fixes RHOAIENG-33054
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
📝 WalkthroughWalkthroughAdded Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
controllers/gorch/route.go (1)
25-29: Deduplicate repeated timeout annotation literals.The same annotation key/value and map literal is repeated in three places; extracting constants + helper reduces drift risk.
♻️ Proposed refactor
const ( routeTemplatePath = "route.tmpl.yaml" + haproxyTimeoutAnnotationKey = "haproxy.router.openshift.io/timeout" + llmRouteTimeout = "5m" ) + +func llmRouteAnnotations() map[string]string { + return map[string]string{ + haproxyTimeoutAnnotationKey: llmRouteTimeout, + } +} @@ - Annotations: map[string]string{ - // Fix for RHOAIENG-33054: Set HAProxy timeout to 5 minutes - // Gateway route handles LLM traffic that can exceed default 30s timeout - "haproxy.router.openshift.io/timeout": "5m", - }, + Annotations: llmRouteAnnotations(), @@ - Annotations: map[string]string{ - // Fix for RHOAIENG-33054: Set HAProxy timeout to 5 minutes - // LLM-based guardrails detection can take longer than the default 30s - "haproxy.router.openshift.io/timeout": "5m", - }, + Annotations: llmRouteAnnotations(), @@ - Annotations: map[string]string{ - // Fix for RHOAIENG-33054: Set HAProxy timeout to 5 minutes - // Built-in detector route handles LLM traffic that can exceed default 30s timeout - "haproxy.router.openshift.io/timeout": "5m", - }, + Annotations: llmRouteAnnotations(),Also applies to: 41-45, 61-65
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@controllers/gorch/route.go` around lines 25 - 29, The repeated annotation literal ("haproxy.router.openshift.io/timeout": "5m") used in the Annotations map is duplicated in three places; extract it into a shared constant (e.g., HAProxyTimeoutAnnotationKey and HAProxyTimeoutAnnotationValue) and replace the inline map entry with a reuse pattern (either a small helper function like buildHAProxyTimeoutAnnotations() that returns the map entry or a utility function addHAProxyTimeoutAnnotation(annotations map[string]string)) and update all occurrences (the Annotations map literals around the route creation) to use the constant/helper to avoid drift.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@controllers/gorch/route.go`:
- Around line 25-29: The repeated annotation literal
("haproxy.router.openshift.io/timeout": "5m") used in the Annotations map is
duplicated in three places; extract it into a shared constant (e.g.,
HAProxyTimeoutAnnotationKey and HAProxyTimeoutAnnotationValue) and replace the
inline map entry with a reuse pattern (either a small helper function like
buildHAProxyTimeoutAnnotations() that returns the map entry or a utility
function addHAProxyTimeoutAnnotation(annotations map[string]string)) and update
all occurrences (the Annotations map literals around the route creation) to use
the constant/helper to avoid drift.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 338453e1-ebd2-4ffb-a0b2-9e2491d0e330
📒 Files selected for processing (3)
controllers/gorch/route.gocontrollers/gorch/templates/route.tmpl.yamlcontrollers/utils/route.go
|
@sheltoncyril: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
The Guardrails Orchestrator routes were returning 504 Gateway Timeout errors after 30 seconds when LLM-based guardrails detection took longer to process. This occurred because OpenShift's HAProxy router enforces a default 30-second timeout on routes without explicit annotations.
This fix adds a 5-minute timeout annotation to three LLM-facing routes:
Changes:
The 5-minute timeout allows slow LLM inference to complete while preventing truly hung connections from lingering indefinitely. This is a backward-compatible change that takes effect when the operator reconciles existing GuardrailsOrchestrator resources.
Fixes RHOAIENG-33054
Summary by CodeRabbit