fix: image tag, branch name, monitoring docs #31
zdtsw wants to merge 1 commit into opendatahub-io:main
Conversation
- correct tag on kserve image
- correct branch on git repo
- add pre-req for monitoring
- mention monitoring in README
- add new make target for deploy monitoring (helm prom)

Signed-off-by: Wen Zhou <wenzhou@redhat.com>
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zdtsw. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
📝 Walkthrough

This pull request aligns default version references across the project from rhoai-3.4 to rhoai-3.4-ea.1 in workflow inputs, scripts, and documentation links. It introduces monitoring capabilities via two new Makefile targets (`enable-monitoring` and `disable-monitoring`).

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Security & Code Quality Observations
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main changes: updating image tags, branch names, and adding monitoring documentation across multiple files. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
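The version alignment described in the walkthrough (rhoai-3.4 → rhoai-3.4-ea.1) amounts to a trailing-tag substitution. A minimal shell sketch of that rewrite — the image repository in the sample line is purely illustrative, not taken from this PR:

```shell
# Illustrative tag bump: rewrite a trailing rhoai-3.4 tag to rhoai-3.4-ea.1.
# The image repository below is hypothetical.
old="rhoai-3.4"
new="rhoai-3.4-ea.1"
line="image: quay.io/example/kserve-controller:${old}"
# Anchor the pattern at end-of-line so only the tag is rewritten.
updated="$(printf '%s\n' "$line" | sed "s/${old}\$/${new}/")"
echo "$updated"
```

In the repo itself the same idea applies across workflow inputs, scripts, and doc links, which is why the review checks each file for stale references.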
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Comment @coderabbitai help to get the list of available commands and usage tips.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
monitoring-stack/README.md (1)
Lines 62-69: ⚠️ Potential issue | 🟡 Minor — Make verification instructions platform-aware.
Line 67 assumes a self-hosted Prometheus service name and can fail for managed Prometheus setups.
Suggested doc fix
```diff
-# Check Prometheus is scraping metrics
-kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
-# Open http://localhost:9090 and query: vllm_num_requests_running
+# Self-hosted kube-prometheus-stack:
+kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
+# Open http://localhost:9090 and query: vllm_num_requests_running
+
+# Managed Prometheus (e.g., Azure):
+# Validate via provider metrics explorer / managed Grafana.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@monitoring-stack/README.md` around lines 62 - 69, Update the verification instructions to be platform-aware by replacing the hard-coded svc/prometheus-operated reference with guidance to first discover the Prometheus endpoint (e.g., run kubectl get svc -n monitoring or check your cloud/managed Prometheus console) and then either port-forward to the discovered service (kubectl port-forward -n monitoring svc/<prometheus-service> 9090:9090) or open the managed Prometheus URL in your cloud provider UI to query vllm_num_requests_running; mention both self-hosted and managed workflows so the doc covers both cases.
🧹 Nitpick comments (1)
Makefile (1)
Lines 188-204: Wait for controller rollout after toggling monitoring.

These targets set env vars but do not wait for the deployment rollout, so success can be reported before the new config is active.
Suggested fix
```diff
 enable-monitoring: check-kubeconfig
 	@echo "=== Enabling monitoring in KServe ==="
@@
 	kubectl set env deployment/kserve-controller-manager -n $(KSERVE_NAMESPACE) LLMISVC_MONITORING_DISABLED=false
+	kubectl rollout status deployment/kserve-controller-manager -n $(KSERVE_NAMESPACE) --timeout=120s
@@
 disable-monitoring: check-kubeconfig
@@
 	kubectl set env deployment/kserve-controller-manager -n $(KSERVE_NAMESPACE) LLMISVC_MONITORING_DISABLED=true
+	kubectl rollout status deployment/kserve-controller-manager -n $(KSERVE_NAMESPACE) --timeout=120s
 	@echo "Monitoring disabled."
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Makefile` around lines 188 - 204, The enable-monitoring and disable-monitoring Makefile targets currently set the LLMISVC_MONITORING_DISABLED env on the kserve-controller-manager deployment but do not wait for the controller rollout; after kubectl set env in the enable-monitoring and disable-monitoring targets, run kubectl rollout status deployment/kserve-controller-manager -n $(KSERVE_NAMESPACE) (or an equivalent wait) to block until the new replica set is ready so the Makefile only reports success once the controller has finished rolling out; keep the existing pre-check (kubectl get deployment kserve-controller-manager) and use the same deployment name and env var (LLMISVC_MONITORING_DISABLED) to locate the change.
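The point of the fix is that the env toggle and the rollout wait always travel together. A hedged sketch of that pairing as a small shell helper — the namespace value is illustrative (the Makefile uses `$(KSERVE_NAMESPACE)`), and the helper only builds the command strings so the pairing is visible:

```shell
# Sketch: emit the env toggle and the rollout wait together so the wait is
# never forgotten. NS is illustrative; the Makefile uses $(KSERVE_NAMESPACE).
NS="kserve"
toggle_monitoring_cmds() {
  # $1 is the value for LLMISVC_MONITORING_DISABLED (true or false)
  printf '%s\n' \
    "kubectl set env deployment/kserve-controller-manager -n ${NS} LLMISVC_MONITORING_DISABLED=$1" \
    "kubectl rollout status deployment/kserve-controller-manager -n ${NS} --timeout=120s"
}
toggle_monitoring_cmds false
```

Running either command without the other is the exact failure mode the review describes: `kubectl set env` returns as soon as the spec is patched, before the new replica set is serving.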
ℹ️ Review info
Configuration used: Repository: opendatahub-io/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (10)

- .github/workflows/kserve-update-chart.yaml
- Makefile
- README.md
- charts/kserve/README.md
- charts/kserve/generate-chart.sh
- docs/deploying-llm-d-on-managed-kubernetes.md
- monitoring-stack/README.md
- monitoring-stack/aks/README.md
- monitoring-stack/cks/README.md
- test/conformance/verify-llm-d-deployment.sh
| Option | Default | Repository | Description |
|-------------------|-----------------|--------------------------|------------------------------------------------|
| `--overlay PATH` | (required) | - | Path to local KServe Kustomize overlay |
| `--tag TAG` | `rhoai-3.4-ea.1` | - | RHOAI release version for container images stored in quay.io |
Inconsistent --tag default across docs causes ambiguous chart generation behavior.
Line 116 documents --tag default as rhoai-3.4-ea.1, but Lines 150-152 document workflow image_tag default as 3.4.0-ea.1 and pass it directly to --tag. Please normalize to one canonical default (or explicitly document why they differ) to avoid wrong image mapping during automation.
Suggested doc fix
```diff
-| `--tag TAG` | `rhoai-3.4-ea.1` | - | RHOAI release version for container images stored in quay.io |
+| `--tag TAG` | `3.4.0-ea.1` | - | RHOAI image tag used for quay.io → registry.redhat.io replacement |
```

Also applies to: lines 150-152
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@charts/kserve/README.md` at line 116, The docs show two different defaults
for the CLI flag `--tag` and the workflow variable `image_tag` (`rhoai-3.4-ea.1`
vs `3.4.0-ea.1`), which will confuse automation; pick one canonical default and
update both places (the README entry for `--tag TAG` and the workflow
documentation referencing `image_tag`) to use the same exact string, or
explicitly note why they differ if intentional, ensuring the workflow passes the
exact canonical `image_tag` value to `--tag`.
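The mismatch can also be caught mechanically rather than by eye. A minimal sketch comparing the two documented defaults (the values are copied from the review comment; in CI you would extract them from the README and workflow file instead of hard-coding them):

```shell
# Compare the README's --tag default with the workflow's image_tag default.
readme_default="rhoai-3.4-ea.1"
workflow_default="3.4.0-ea.1"
if [ "$readme_default" = "$workflow_default" ]; then
  echo "defaults match"
else
  echo "defaults differ: ${readme_default} vs ${workflow_default}"
fi
```

With the current values this reports a difference, which is exactly the inconsistency the comment asks to resolve.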
```makefile
.PHONY: deploy deploy-all undeploy undeploy-kserve status help check-kubeconfig sync clear-cache
.PHONY: deploy-cert-manager deploy-istio deploy-lws deploy-kserve deploy-opendatahub-prerequisites deploy-cert-manager-pki
.PHONY: test conformance
.PHONY: enable-monitoring disable-monitoring
```
Add all and clean phony targets to satisfy checkmake policy.
Current PHONY list still leaves minphony warnings for required targets.
Suggested fix
```diff
-.PHONY: test conformance
+.PHONY: all clean test conformance
 .PHONY: enable-monitoring disable-monitoring
+
+all: help
+
+clean: clear-cache
```

🧰 Tools
🪛 checkmake (0.2.2)
[warning] 4-4: Missing required phony target "all"
(minphony)
[warning] 4-4: Missing required phony target "clean"
(minphony)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Makefile` around lines 1 - 4, The Makefile's .PHONY declaration is missing
the required "all" and "clean" targets, triggering minphony warnings from
checkmake; update the .PHONY lines (the existing .PHONY entries at the top of
the Makefile) to include all and clean (e.g., add "all" and "clean" to one of
the .PHONY lists) so the policy-required phony targets are declared and the
warnings are resolved.
````diff
 ## 3. Verify Monitoring Works

 After deploying an LLMInferenceService:

 ```bash
-# Check PodMonitors created by KServe
+# Check PodMonitors were created (replace with your namespace, e.g., llm-inference)
 kubectl get podmonitors -n <llmisvc-namespace>

-# Check targets in Prometheus
+# Check Prometheus is scraping targets
 kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
-# Open http://localhost:9090/targets
+# Open http://localhost:9090/targets and look for vLLM endpoints
 ```
````
Split verification steps by Prometheus option to avoid broken commands.
Line 82 is self-hosted-specific (svc/prometheus-operated) and will not work for Azure Managed Prometheus users following Option 2.
Suggested doc fix
```diff
-# Check Prometheus is scraping targets
-kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
-# Open http://localhost:9090/targets and look for vLLM endpoints
+# Self-hosted kube-prometheus-stack:
+kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
+# Open http://localhost:9090/targets and look for vLLM endpoints
+
+# Azure Managed Prometheus:
+# Use Azure Monitor workspace -> Metrics Explorer (or Azure Managed Grafana)
+# to verify vLLM metrics are being scraped.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@monitoring-stack/aks/README.md` around lines 73 - 84, Split the "Verify
Monitoring Works" section into two subsections (Self-hosted Prometheus and Azure
Managed Prometheus) so commands aren’t broken for Option 2 users: keep the
PodMonitor check (kubectl get podmonitors -n <llmisvc-namespace>) as common, add
a Self-hosted subsection that uses port-forward to svc/prometheus-operated and
instructs opening /targets, and add an Azure Managed Prometheus subsection that
explains how to verify scraping via the Azure portal or the managed Prometheus
query UI instead of using svc/prometheus-operated; update headings and the
surrounding text in the README.md "Verify Monitoring Works" section accordingly.
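The doc split the prompt describes can be sketched as a tiny dispatcher that prints the right verification step for the Prometheus option in use; the exact hint wording here is illustrative, not from the README:

```shell
# Print a verification hint for the chosen Prometheus option.
verify_hint() {
  case "$1" in
    self-hosted) echo "kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090" ;;
    managed)     echo "verify scraping in the Azure Monitor workspace metrics explorer" ;;
    *)           echo "unknown option: $1" ;;
  esac
}
verify_hint self-hosted
verify_hint managed
```

Keeping the PodMonitor check common and branching only on the Prometheus backend is the structure the review asks the README to adopt.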
**Grafana:**
```bash
# Port forward
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# Get password
kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 -d
```

Open http://localhost:3000 (user: admin)
Harden Grafana credential guidance to avoid secret exposure (CWE-200, CWE-521).
Lines 79-81 encourage printing the admin password and logging in with default admin user, which can leak secrets via terminal history/screen capture and leaves weak default-operational posture. Recommend documenting immediate password rotation and preferred use of a pre-created secret (grafana.admin.existingSecret) instead of exposing credentials in plaintext output.
Suggested doc hardening
````diff
-# Get password
-kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 -d
-```
-Open http://localhost:3000 (user: admin)
+# Get initial password (avoid sharing terminal output; rotate immediately after first login)
+kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 -d
+```
+Open http://localhost:3000 (user: admin), then immediately rotate credentials.
+
+For non-dev environments, prefer provisioning Grafana credentials via a pre-created Kubernetes Secret
+and configure kube-prometheus-stack to use that existing secret.
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@monitoring-stack/cks/README.md` around lines 73 - 81, The README exposes the
Grafana admin password via terminal output and suggests using the default admin
user; update the guidance to avoid printing secrets and require immediate
rotation: replace the raw kubectl get secret snippet with instructions to
retrieve the initial password securely (avoid sharing terminal output), log in
once to rotate the password immediately, and document preferring a pre-created
Kubernetes Secret by referencing the helm value grafana.admin.existingSecret
(and the prometheus-grafana secret name) so operators provision credentials via
an existing Secret for non-dev environments.
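One way to follow the hardening advice is to decode the secret into a shell variable and never echo it. A sketch under the assumption that the value is base64-encoded exactly as `kubectl -o jsonpath` would return it — the sample password here is fake, used only so the snippet runs without a cluster:

```shell
# Simulate the base64 value kubectl would return for .data.admin-password;
# against a real cluster this would come from:
#   kubectl get secret -n monitoring prometheus-grafana -o jsonpath='{.data.admin-password}'
encoded="$(printf '%s' 's3cr3t-pass' | base64)"

# Decode into a variable instead of printing to the terminal.
GRAFANA_PASSWORD="$(printf '%s' "$encoded" | base64 -d)"

# Use it programmatically (e.g., curl -u "admin:${GRAFANA_PASSWORD}" ...)
# and report only non-sensitive facts:
echo "password length: ${#GRAFANA_PASSWORD}"
```

This keeps the credential out of scrollback and screen shares; the follow-up step of rotating the password and switching to a pre-created Secret still applies.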
PR needs rebase.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Close in favor of https://github.com/opendatahub-io/rhaii-on-xks/pull/40/changes
Description
How Has This Been Tested?
Merge criteria:
Summary by CodeRabbit

New Features
- New Makefile targets for monitoring: deploy the monitoring stack (Helm Prometheus) and enable/disable monitoring in KServe.

Documentation
- Documented monitoring prerequisites and verification steps in the READMEs.

Chores
- Aligned default version references from rhoai-3.4 to rhoai-3.4-ea.1 across workflows, scripts, and docs.