
fix: use correct GPU node selector labels and improve prod probe timeouts#112

Open
anfredette wants to merge 3 commits into redhat-et:main from anfredette:yaml-gen-test

Conversation

anfredette (Member) commented Mar 10, 2026

The generated YAML used short GPU names (e.g., "L4") for the nvidia.com/gpu.product node selector, but K8s nodes use labels like "NVIDIA-L4". Add node_selector_label field to GPU catalog and use it in YAML generation. Also increase production liveness probe delay from 120s to 600s to allow time for model loading, and update vLLM image to latest.
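The fix described above can be sketched as follows. The catalog structure and helper name here are illustrative assumptions, not the PR's actual code; the point is that the full Kubernetes node label travels with the catalog entry instead of being derived from the short display name.

```python
# Illustrative sketch (not the PR's code): each catalog entry carries the full
# node label that GPU feature discovery sets on nodes, alongside the short name.
gpu_catalog = {
    "L4": {"node_selector_label": "NVIDIA-L4"},
    "H100": {"node_selector_label": "NVIDIA-H100-80GB-HBM3"},  # assumed label value
}

def node_selector_for(gpu_type: str) -> dict:
    # Use the full label for nvidia.com/gpu.product, never the short name,
    # or the pod's nodeSelector will not match any node.
    label = gpu_catalog[gpu_type]["node_selector_label"]
    return {"nvidia.com/gpu.product": label}
```

For example, `node_selector_for("L4")` yields `{"nvidia.com/gpu.product": "NVIDIA-L4"}`, which matches the node labels the PR description mentions.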

Summary by CodeRabbit

  • Chores

    • Updated vLLM to the latest release.
  • Configuration

    • Added explicit GPU node selector labels to all GPU catalog entries for clearer resource targeting.
    • Extended production health checks: liveness initial delay increased to 600s with higher failure threshold; readiness initial delay increased to 120s.
  • Bug Fixes

    • Deployments now surface clear errors for unknown or unsupported GPU types and validate GPU pricing/configuration.
  • Style

    • Minor UI string and layout formatting improvements for clearer messages.

coderabbitai bot commented Mar 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 86d721ef-d3ae-45df-b91b-3fca362f6c84

📥 Commits

Reviewing files that changed from the base of the PR and between 80e0b5d and a27b1d2.

📒 Files selected for processing (4)
  • ui/app.py
  • ui/components/deployment.py
  • ui/components/deployment_management.py
  • ui/components/settings.py

📝 Walkthrough

Walkthrough

Adds a new node_selector_label to GPU entries in the catalog and model layer, surfaces it in the deployment generator context, enforces strict GPU pricing lookup (raises on unknown GPU), updates VLLM version to "latest", and adjusts production liveness/readiness probe timings and nodeSelector key in the KServe template.

Changes

Cohort / File(s) Summary
GPU Catalog Data
data/configuration/model_catalog.json
Added node_selector_label to each gpu_types entry; values mirror GPU alias names for L4, A10G, A100-40, A100-80, H100, H200, B200, MI300X.
Model / Data Model
src/neuralnav/knowledge_base/model_catalog.py
Added node_selector_label attribute on GPUType, defaulting to the primary alias when absent; included in to_dict() output.
Deployment Generator
src/neuralnav/configuration/generator.py
Bumped VLLM_VERSION v0.6.2→latest; GPU lookup now errors on unknown types and derives gpu_hourly_rate and gpu_node_selector_label from ModelCatalog; gpu_node_selector_label exposed in template context.
KServe Inference Template
src/neuralnav/configuration/templates/kserve-inferenceservice.yaml.j2
Production probe timing changes: liveness initialDelaySeconds 120→600 and added failureThreshold: 5; readiness initialDelaySeconds 60→120. In production path, nodeSelector key changed from gpu_type to gpu_node_selector_label.
UI — layout & strings
ui/app.py, ui/components/deployment.py, ui/components/deployment_management.py, ui/components/settings.py
Cosmetic/formatting changes: multiline splitting of tab labels, reformatted multi-line strings for messages and commands, removed one imported symbol from public imports; no functional behavior changes.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 60.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped: CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title accurately describes the two main changes: adding correct GPU node selector labels and improving production probe timeouts.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
src/neuralnav/knowledge_base/model_catalog.py (1)

56-57: Prefer the first alias as the fallback selector label.

If a fixture or external catalog entry omits node_selector_label, this falls back to short names like L4 and reintroduces the scheduling bug this PR is fixing. self.aliases[0] is the safer default here.

Suggested change
         self.gpu_type = data["gpu_type"]
         self.aliases = data.get("aliases", [data["gpu_type"]])  # Default to primary name
-        self.node_selector_label = data.get("node_selector_label", data["gpu_type"])
+        self.node_selector_label = data.get("node_selector_label", self.aliases[0])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/neuralnav/knowledge_base/model_catalog.py` around lines 56 - 57, The
node_selector_label fallback uses data["gpu_type"] which can be short names and
reintroduce the scheduling bug; change the fallback to use the first alias
instead. Update the initialization in model_catalog.ModelEntry (or wherever
self.aliases and self.node_selector_label are set) so self.aliases =
data.get("aliases", [data["gpu_type"]]) remains, but set
self.node_selector_label = data.get("node_selector_label", self.aliases[0]) to
prefer the first alias when node_selector_label is omitted. Ensure this uses the
already-computed self.aliases value so the safer alias default is applied.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/neuralnav/configuration/generator.py`:
- Around line 123-126: The GPU catalog lookup currently silently defaults when
missing; change the behavior in the generator where
self._catalog.get_gpu_type(gpu_config.gpu_type) is called so that if gpu_info is
None you raise a clear exception (e.g., ValueError or RuntimeError) including
the missing gpu_type string instead of assigning defaults to gpu_hourly_rate or
gpu_node_selector_label; update the code that sets gpu_hourly_rate and
gpu_node_selector_label (the gpu_info, gpu_hourly_rate, gpu_node_selector_label
logic) to fail fast with a descriptive error message so downstream YAML is not
emitted for unknown GPUs.

In `@src/neuralnav/configuration/templates/kserve-inferenceservice.yaml.j2`:
- Around line 98-99: The template hardcodes NVIDIA keys (nvidia.com/gpu and
nvidia.com/gpu.product) which prevents AMD GPUs; add a vendor identifier to the
template context (e.g., vendor_prefix) and replace the three hardcoded
occurrences with dynamic keys: use nodeSelector key "{{ vendor_prefix
}}.com/gpu.product" instead of nvidia.com/gpu.product, replace resource
request/limit keys "nvidia.com/gpu" with "{{ vendor_prefix }}.com/gpu", and
replace the toleration key "nvidia.com/gpu" with "{{ vendor_prefix }}.com/gpu";
update the code that renders this Jinja template to pass vendor_prefix (derived
from the catalog GPU type) alongside gpu_node_selector_label so all GPU
vendor-specific keys are consistent.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2eb6d700-e5cc-4655-8681-a09881c3b7af

📥 Commits

Reviewing files that changed from the base of the PR and between 2eb667f and 379ad55.

📒 Files selected for processing (4)
  • data/configuration/model_catalog.json
  • src/neuralnav/configuration/generator.py
  • src/neuralnav/configuration/templates/kserve-inferenceservice.yaml.j2
  • src/neuralnav/knowledge_base/model_catalog.py

Comment on lines 98 to +99

   nodeSelector:
-    nvidia.com/gpu.product: {{ gpu_type }}
+    nvidia.com/gpu.product: {{ gpu_node_selector_label }}
coderabbitai bot commented Mar 10, 2026

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n== GPU catalog entries with selector labels ==\n'
rg -n '"gpu_type":|"node_selector_label":' data/configuration/model_catalog.json

printf '\n== Hardcoded accelerator selector/resource keys ==\n'
rg -n 'nvidia\.com/gpu(\.product)?' src/neuralnav/configuration src/neuralnav/knowledge_base

Repository: redhat-et/neuralnav

Length of output: 1580


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n== Full kserve-inferenceservice.yaml.j2 template ==\n'
cat -n src/neuralnav/configuration/templates/kserve-inferenceservice.yaml.j2 | head -120

printf '\n== Context around lines 98-99 (node selector) ==\n'
sed -n '90,110p' src/neuralnav/configuration/templates/kserve-inferenceservice.yaml.j2 | cat -n

printf '\n== Check template variables passed to this template ==\n'
rg -B5 -A5 'kserve-inferenceservice.yaml.j2' src/neuralnav/configuration

Repository: redhat-et/neuralnav

Length of output: 6173


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n== Template context preparation function ==\n'
sed -n '/_prepare_template_context/,/^[[:space:]]*def /p' src/neuralnav/configuration/generator.py | head -80

printf '\n== Check for any gpu_selector_key usage ==\n'
rg -n 'gpu_selector_key' src/neuralnav/configuration

Repository: redhat-et/neuralnav

Length of output: 3349


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n== Full context dict in _prepare_template_context ==\n'
sed -n '/_prepare_template_context/,/^[[:space:]]*def /p' src/neuralnav/configuration/generator.py | tail -50

printf '\n== Check ModelCatalog GPU info structure ==\n'
rg -B10 -A10 'class.*GPU.*Info' src/neuralnav/configuration

Repository: redhat-et/neuralnav

Length of output: 2075


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n== Find ModelCatalog and GPUInfo definitions ==\n'
fd -e py | xargs grep -l 'class.*ModelCatalog\|class.*GPUInfo' | head -5

printf '\n== Check model_catalog.json structure (GPU entries) ==\n'
sed -n '1275,1385p' data/configuration/model_catalog.json | head -40

printf '\n== Full context dict (end of function) ==\n'
sed -n '/_prepare_template_context/,/return context/p' src/neuralnav/configuration/generator.py | tail -30

Repository: redhat-et/neuralnav

Length of output: 2768


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n== Complete _prepare_template_context function ==\n'
python3 << 'PYEOF'
with open('src/neuralnav/configuration/generator.py', 'r') as f:
    content = f.read()
    start = content.find('def _prepare_template_context(')
    if start == -1:
        print("Function not found")
    else:
        # Find the return statement
        remaining = content[start:]
        # Look for "return context" at the beginning of a line
        import re
        match = re.search(r'\n\s*return context\n', remaining)
        if match:
            end = start + match.end()
            func = content[start:end]
            print(func)
        else:
            print("Return statement not found")
PYEOF

printf '\n== Check if gpu_selector_key is used anywhere ==\n'
rg -n 'gpu_selector_key' src/

Repository: redhat-et/neuralnav

Length of output: 5645


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n== Check model_catalog.py for GPU info structure ==\n'
sed -n '/class.*GPU/,/^class /p' src/neuralnav/knowledge_base/model_catalog.py | head -40

printf '\n== Check if AMD entries need different resource keys ==\n'
rg -i 'amd\.com|AMD.*gpu' src/neuralnav data/ || echo "No AMD resource keys found"

Repository: redhat-et/neuralnav

Length of output: 1903


Vendor-specific selector and resource keys prevent AMD GPU deployments.

The node selector fix addresses the key name on line 99, but the template has three vendor-specific hardcodes that block AMD catalog entries like MI300X:

  • Line 99: nvidia.com/gpu.product (node selector key)
  • Lines 64, 68: nvidia.com/gpu (resource requests/limits)
  • Line 101: nvidia.com/gpu (toleration key)

AMD GPUs require different key names. Extract the vendor prefix from the catalog GPU type and use it consistently across the template. This requires adding a vendor identifier to the template context (not just the label value).


anfredette (Member, Author) replied:

Skipping this for now. The NVIDIA and AMD GPU operators use fundamentally different resource keys and conventions — it's not as simple as swapping a vendor prefix string. For example, AMD uses amd.com/gpu with different toleration and node selector patterns. Adding a vendor_prefix variable would give a false sense of AMD support without actually working end-to-end. We'll address this properly when AMD deployments are needed.

coderabbitai bot replied:

@anfredette, understood — that's a fair point. A superficial vendor_prefix substitution would be misleading since AMD's device plugin uses different toleration patterns, node selector keys, and resource naming conventions than the NVIDIA GPU operator. Deferring until AMD deployments are concretely needed is the right call.


📝 Learnings were identified and not saved (knowledge base disabled).

coderabbitai bot left a comment

Actionable comments posted: 3


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3871ce84-3717-47a7-9e0e-2584478a73d8

📥 Commits

Reviewing files that changed from the base of the PR and between 379ad55 and 1f21b47.

📒 Files selected for processing (4)
  • data/configuration/model_catalog.json
  • src/neuralnav/configuration/generator.py
  • src/neuralnav/configuration/templates/kserve-inferenceservice.yaml.j2
  • src/neuralnav/knowledge_base/model_catalog.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/neuralnav/configuration/templates/kserve-inferenceservice.yaml.j2


 # vLLM version to use
-VLLM_VERSION = "v0.6.2"
+VLLM_VERSION = "latest"
coderabbitai bot commented:

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

What is the current recommended way to pin vLLM container images for production deployments: a specific release tag, an immutable digest, or latest? Please check the official vLLM image/release documentation.

💡 Result:

For production, pin the vLLM image to a specific released version tag (e.g., vllm/vllm-openai:vX.Y.Z) and, if you need strict immutability/reproducibility, pin that tag’s digest (e.g., vllm/vllm-openai:vX.Y.Z@sha256:...). Avoid relying on latest.

  • The official Docker docs use :latest in run examples, but when showing how to extend the image they use a versioned tag and explicitly warn to match the vLLM version to the base image (implying version pinning for correctness/reproducibility). [1]
  • vLLM maintainers note that Docker Hub images/tags can be delayed/missing, and that the Docker Hub latest tag is not always promptly updated to the newest GitHub release—creating reproducibility issues. [2]

Sources: official “Using Docker” docs [1]; vLLM forum maintainer response on Docker Hub tagging/latest lag [2].

[1] https://docs.vllm.ai/en/stable/deployment/docker/
[2] https://discuss.vllm.ai/t/vllm-openai-dockerhub-missing-0-16-tags/2399


Pin the vLLM image to a specific release version tag instead of latest.

Using latest makes generated manifests non-reproducible and creates supply-chain risk. The Docker Hub latest tag is not always promptly updated to the newest vLLM release, and pods can pull different images on restart or scale-out without code changes. For production YAML, pin to a tested release tag (e.g., vX.Y.Z) or, for strict immutability, include the image digest.
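A minimal sketch of the pinning approach the reviewer suggests. The tag value and the image-reference validation are illustrative assumptions, not this repository's code:

```python
# Hypothetical sketch: pin the vLLM image to a release tag (or a digest) rather
# than "latest", so generated manifests are reproducible across pod restarts.
VLLM_VERSION = "v0.6.2"  # illustrative pinned tag; use a tested release
VLLM_IMAGE = f"vllm/vllm-openai:{VLLM_VERSION}"

def validate_image_ref(image: str) -> str:
    # Reject mutable or missing tags at generation time so "latest" never
    # reaches a rendered manifest.
    last_segment = image.rsplit("/", 1)[-1]
    if image.endswith(":latest") or ":" not in last_segment:
        raise ValueError(f"Image ref must be pinned to a tag or digest: {image}")
    return image
```

For strict immutability, the same check could be extended to require an `@sha256:` digest suffix instead of a tag.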

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/neuralnav/configuration/generator.py` at line 24, The VLLM_VERSION
constant is currently set to "latest" which makes manifests non-reproducible;
change the VLLM_VERSION value in generator.py to a specific release tag or image
digest (e.g., "vX.Y.Z" or "sha256:<digest>") instead of "latest", and update any
code that composes the image string to use this pinned value so generated YAML
references the fixed tag/digest; ensure the chosen tag/digest corresponds to a
tested vLLM release before committing.

Comment on lines +125 to +129

    if gpu_info is None:
        raise ValueError(
            f"Unknown GPU type '{gpu_config.gpu_type}'. "
            f"Add it to the GPU catalog in data/configuration/model_catalog.json."
        )
coderabbitai bot commented:

⚠️ Potential issue | 🟠 Major

Use a typed generation error that the API can map to 4xx.

Fail-fast is the right behavior, but raising a bare ValueError here now bubbles into the configuration routes’ generic Exception handlers and returns HTTP 500 for what is effectively invalid input/catalog drift. Please raise a domain-specific exception here, or ensure callers translate this path to 400/422 instead of 500.
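One way to realize this, sketched with assumed names (the exception classes follow the reviewer's suggestion and are not code from this PR):

```python
# Hypothetical sketch of a typed generation error; an API-layer exception
# handler can map it to HTTP 400/422 instead of letting it surface as a 500.
class GenerationError(Exception):
    """Base class for configuration-generation failures."""

class InvalidGPUTypeError(GenerationError):
    """Raised when the requested GPU type is not in the catalog."""
    def __init__(self, gpu_type: str):
        super().__init__(f"Unknown GPU type '{gpu_type}'")
        self.gpu_type = gpu_type

def lookup_gpu(catalog: dict, gpu_type: str) -> dict:
    info = catalog.get(gpu_type)
    if info is None:
        raise InvalidGPUTypeError(gpu_type)  # fail fast; no silent defaults
    return info
```

A global exception handler registered for `GenerationError` can then return a 4xx response for invalid input while leaving unexpected errors to the generic 500 path.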

🧰 Tools
🪛 Ruff (0.15.5)

[warning] 126-129: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/neuralnav/configuration/generator.py` around lines 125 - 129, Replace the
bare ValueError raised when gpu_info is None with a domain-specific typed
exception so the API can map it to a 4xx error; for example, define or reuse an
exception like InvalidGenerationInputError or InvalidGPUTypeError (subclassing a
common GenerationError or BadRequestError) and raise that instead of ValueError
in the block that checks gpu_info (referencing gpu_config.gpu_type and the
gpu_info lookup in generator.py); ensure the new exception is imported/defined
in the module so callers or global handlers can translate it to HTTP 400/422.

Comment on lines 54 to 58

    def __init__(self, data: dict):
        self.gpu_type = data["gpu_type"]
        self.aliases = data.get("aliases", [data["gpu_type"]])  # Default to primary name
        self.node_selector_label = data.get("node_selector_label", self.aliases[0])
        self.memory_gb = data["memory_gb"]
coderabbitai bot commented:

⚠️ Potential issue | 🟠 Major

Require node_selector_label instead of silently defaulting it.

This fallback can reintroduce the exact broken scheduling path this PR is fixing: if a future GPU entry omits node_selector_label, YAML generation will quietly fall back to an alias and may produce an unschedulable nvidia.com/gpu.product selector. Please fail fast when the field is missing.

Suggested change
     def __init__(self, data: dict):
         self.gpu_type = data["gpu_type"]
         self.aliases = data.get("aliases", [data["gpu_type"]])  # Default to primary name
-        self.node_selector_label = data.get("node_selector_label", self.aliases[0])
+        try:
+            self.node_selector_label = data["node_selector_label"]
+        except KeyError as e:
+            raise ValueError(
+                f"GPU type '{self.gpu_type}' is missing required node_selector_label"
+            ) from e
         self.memory_gb = data["memory_gb"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/neuralnav/knowledge_base/model_catalog.py` around lines 54 - 58, The
constructor currently silently defaults node_selector_label from aliases which
can reintroduce broken scheduling; update the __init__ in the class that parses
the GPU entry (the __init__ that sets gpu_type, aliases, node_selector_label,
memory_gb) to require data["node_selector_label"] and raise a clear exception
(e.g., ValueError) if it's missing instead of using data.get(...,
self.aliases[0]); do not fallback to aliases and ensure the error message names
the missing node_selector_label and the gpu_type (or data) to aid debugging.

anfredette (Member, Author) commented:

Re: nitpick on model_catalog.py (node_selector_label fallback):

Fixed — updated the fallback to use self.aliases[0] instead of data["gpu_type"], so it picks up the full name (e.g., "NVIDIA-L4") rather than the short name ("L4") if node_selector_label is omitted from the catalog.

fix: use correct GPU node selector labels and improve prod probe timeouts

The generated YAML used short GPU names (e.g., "L4") for the
nvidia.com/gpu.product node selector, but K8s nodes use labels like
"NVIDIA-L4". Add node_selector_label field to GPU catalog and use it
in YAML generation. Also increase production liveness probe delay
from 120s to 600s to allow time for model loading, and update vLLM
image to latest.

Signed-off-by: Andre Fredette <afredette@redhat.com>
coderabbitai bot left a comment

🧹 Nitpick comments (1)
src/neuralnav/configuration/templates/kserve-inferenceservice.yaml.j2 (1)

81-85: Liveness probe delay increased significantly; verify recovery time is acceptable.

The 600-second initial delay allows large models to load, but after startup, an unhealthy pod still takes up to 150 seconds (5 failures × 30s period) to be restarted. This seems reasonable for LLM workloads where brief transient issues shouldn't trigger restarts.

However, consider whether startupProbe would be more appropriate here—it would allow a long startup window while keeping faster liveness detection after the container becomes ready.

Alternative: Use startupProbe for model loading
      startupProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 30
        failureThreshold: 20  # 30 + 20*30 = 630s max startup time
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        periodSeconds: 30
        timeoutSeconds: 10
        failureThreshold: 3

This separates startup concerns from runtime health, allowing faster restart of genuinely unhealthy pods post-startup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/neuralnav/configuration/templates/kserve-inferenceservice.yaml.j2` around
lines 81 - 85, The liveness probe currently uses a very large
initialDelaySeconds (600) which delays restarts; replace that long
initialDelaySeconds on the livenessProbe with a separate startupProbe so model
load time is handled by startup checks while liveness maintains faster failure
detection. Specifically, add a startupProbe block (httpGet path/port matching
existing probe) with initialDelaySeconds ~30, periodSeconds ~30 and a larger
failureThreshold (e.g., 20) to cover long loads, then set the livenessProbe (the
existing periodSeconds/timeoutSeconds/failureThreshold fields) to shorter
failureThreshold (e.g., 3) and remove the 600 initialDelaySeconds from
livenessProbe to restore quicker restarts after readiness; update the YAML
template keys initialDelaySeconds, startupProbe and livenessProbe accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f408d394-33fa-426a-a45e-5f77429ba528

📥 Commits

Reviewing files that changed from the base of the PR and between 1f21b47 and 80e0b5d.

📒 Files selected for processing (4)
  • data/configuration/model_catalog.json
  • src/neuralnav/configuration/generator.py
  • src/neuralnav/configuration/templates/kserve-inferenceservice.yaml.j2
  • src/neuralnav/knowledge_base/model_catalog.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • data/configuration/model_catalog.json

Signed-off-by: Andre Fredette <afredette@redhat.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
@@ -257,14 +259,21 @@ def _run_inference_test(deployment_id: str, prompt: str, max_tokens: int, temper
start_time = time.time()

curl_cmd = [
A collaborator commented:

Use requests?
