
fix: image tag, branch name, monitoring docs #31

Closed
zdtsw wants to merge 1 commit into opendatahub-io:main from zdtsw-forking:chore_2

Conversation

@zdtsw
Member

@zdtsw zdtsw commented Mar 3, 2026

Description

How Has This Been Tested?

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Summary by CodeRabbit

  • New Features

    • Added monitoring enable/disable commands for KServe deployments.
  • Documentation

    • Expanded monitoring setup guides with platform-specific instructions (Azure Kubernetes Service and standard Kubernetes).
    • Updated deployment guides with enhanced monitoring prerequisites and setup steps.
  • Chores

    • Updated default version references from rhoai-3.4 to rhoai-3.4-ea.1 across workflow configurations and scripts.

- correct tag on kserve image
- correct branch on git repo
- add pre-req for monitoring
- mention monitoring in README
- add new make target to deploy monitoring (helm prom)

Signed-off-by: Wen Zhou <wenzhou@redhat.com>
@openshift-ci

openshift-ci bot commented Mar 3, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zdtsw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Mar 3, 2026
@zdtsw zdtsw requested review from aneeshkp and pierDipi March 3, 2026 14:56
@coderabbitai
Contributor

coderabbitai bot commented Mar 3, 2026

📝 Walkthrough

Walkthrough

This pull request aligns default version references across the project from rhoai-3.4 to rhoai-3.4-ea.1 in workflow inputs, scripts, and documentation links. It introduces monitoring capabilities via two new Makefile targets (enable-monitoring and disable-monitoring) that toggle the LLMISVC_MONITORING_DISABLED environment variable on the KServe controller manager. The changes include comprehensive monitoring setup documentation for both AKS and CKS environments, detailing Prometheus installation options, verification steps, and dashboard access. Documentation is updated to clarify image replacement logic, workflow parameters, and monitoring prerequisites across multiple README files.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Security & Code Quality Observations

⚠️ Makefile Monitoring Targets — Best Practice Note

The enable-monitoring and disable-monitoring targets modify the KServe controller deployment via kubectl set env. This is operationally sound, but consider:

  1. RBAC validation: Ensure the deployment service account has permissions to modify environment variables. The targets assume appropriate RBAC context but don't validate it.
  2. Audit trail: Cluster administrators should enable Kubernetes API audit logging to track these environment mutations for compliance and debugging.
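The RBAC point above can be checked up front with `kubectl auth can-i` — a minimal sketch, assuming the deployment and namespace names used by the new targets (`kserve-controller-manager` in a `kserve` namespace; adjust to your cluster):

```shell
# Sketch: confirm the current kubeconfig context may patch the KServe
# controller deployment before running the make targets.
# The deployment/namespace names below are assumptions from this PR.
kubectl auth can-i patch deployment/kserve-controller-manager -n kserve
```

`kubectl auth can-i` prints `yes` or `no` and sets its exit code accordingly, so the same check could gate the Makefile targets themselves.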

ℹ️ Documentation Accuracy

Monitoring setup documentation across monitoring-stack/ directories is extensive and technical. Key review points:

  • AKS/CKS Prometheus options: Verify that CRD references (ServiceMonitor/PodMonitor) and namespace placeholders align with actual KServe and Prometheus deployments.
  • Grafana credentials retrieval: Ensure the port-forward and secret extraction commands match the Helm chart outputs.
  • Namespace assumptions: Several code blocks use <NAMESPACE> and kserve placeholders—confirm these are documented and don't mislead users into hardcoding values.

✅ Configuration Alignment

Default value updates (rhoai-3.4 → rhoai-3.4-ea.1) are consistent across workflow, script, and documentation, reducing risk of version mismatch.

📋 Minor Enhancement Opportunity

The new Makefile targets don't include rollback or status-verification steps beyond prerequisite checks. Consider documenting the manual rollback procedure (reverting the env change) in the README for operational clarity.
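The manual rollback mentioned above could be documented roughly as follows (a sketch; the deployment and namespace names mirror the new targets, and `kubectl set env` unsets a variable when its name is followed by a trailing `-`):

```shell
# Sketch: revert the monitoring toggle manually.
# Option A: set the variable back to the previous value.
kubectl set env deployment/kserve-controller-manager -n kserve LLMISVC_MONITORING_DISABLED=true

# Option B: remove the variable entirely (trailing "-" unsets it),
# falling back to the controller's built-in default.
kubectl set env deployment/kserve-controller-manager -n kserve LLMISVC_MONITORING_DISABLED-

# Wait for the rollout so the revert is actually in effect.
kubectl rollout status deployment/kserve-controller-manager -n kserve --timeout=120s
```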

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: updating image tags, branch names, and adding monitoring documentation across multiple files.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
monitoring-stack/README.md (1)

62-69: ⚠️ Potential issue | 🟡 Minor

Make verification instructions platform-aware.

Line 67 assumes a self-hosted Prometheus service name and can fail for managed Prometheus setups.

Suggested doc fix
-# Check Prometheus is scraping metrics
-kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
-# Open http://localhost:9090 and query: vllm_num_requests_running
+# Self-hosted kube-prometheus-stack:
+kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
+# Open http://localhost:9090 and query: vllm_num_requests_running
+
+# Managed Prometheus (e.g., Azure):
+# Validate via provider metrics explorer / managed Grafana.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@monitoring-stack/README.md` around lines 62 - 69, Update the verification
instructions to be platform-aware by replacing the hard-coded
svc/prometheus-operated reference with guidance to first discover the Prometheus
endpoint (e.g., run kubectl get svc -n monitoring or check your cloud/managed
Prometheus console) and then either port-forward to the discovered service
(kubectl port-forward -n monitoring svc/<prometheus-service> 9090:9090) or open
the managed Prometheus URL in your cloud provider UI to query
vllm_num_requests_running; mention both self-hosted and managed workflows so the
doc covers both cases.
🧹 Nitpick comments (1)
Makefile (1)

188-204: Wait for controller rollout after toggling monitoring.

These targets set env vars but do not wait for the deployment rollout, so success can be reported before the new config is active.

Suggested fix
 enable-monitoring: check-kubeconfig
	@echo "=== Enabling monitoring in KServe ==="
@@
 	kubectl set env deployment/kserve-controller-manager -n $(KSERVE_NAMESPACE) LLMISVC_MONITORING_DISABLED=false
+	kubectl rollout status deployment/kserve-controller-manager -n $(KSERVE_NAMESPACE) --timeout=120s
@@
 disable-monitoring: check-kubeconfig
@@
 	kubectl set env deployment/kserve-controller-manager -n $(KSERVE_NAMESPACE) LLMISVC_MONITORING_DISABLED=true
+	kubectl rollout status deployment/kserve-controller-manager -n $(KSERVE_NAMESPACE) --timeout=120s
	@echo "Monitoring disabled."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 188 - 204, The enable-monitoring and
disable-monitoring Makefile targets currently set the
LLMISVC_MONITORING_DISABLED env on the kserve-controller-manager deployment but
do not wait for the controller rollout; after kubectl set env in the
enable-monitoring and disable-monitoring targets, run kubectl rollout status
deployment/kserve-controller-manager -n $(KSERVE_NAMESPACE) (or an equivalent
wait) to block until the new replica set is ready so the Makefile only reports
success once the controller has finished rolling out; keep the existing
pre-check (kubectl get deployment kserve-controller-manager) and use the same
deployment name and env var (LLMISVC_MONITORING_DISABLED) to locate the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@charts/kserve/README.md`:
- Line 116: The docs show two different defaults for the CLI flag `--tag` and
the workflow variable `image_tag` (`rhoai-3.4-ea.1` vs `3.4.0-ea.1`), which will
confuse automation; pick one canonical default and update both places (the
README entry for `--tag TAG` and the workflow documentation referencing
`image_tag`) to use the same exact string, or explicitly note why they differ if
intentional, ensuring the workflow passes the exact canonical `image_tag` value
to `--tag`.

In `@Makefile`:
- Around line 1-4: The Makefile's .PHONY declaration is missing the required
"all" and "clean" targets, triggering minphony warnings from checkmake; update
the .PHONY lines (the existing .PHONY entries at the top of the Makefile) to
include all and clean (e.g., add "all" and "clean" to one of the .PHONY lists)
so the policy-required phony targets are declared and the warnings are resolved.

In `@monitoring-stack/aks/README.md`:
- Around line 73-84: Split the "Verify Monitoring Works" section into two
subsections (Self-hosted Prometheus and Azure Managed Prometheus) so commands
aren’t broken for Option 2 users: keep the PodMonitor check (kubectl get
podmonitors -n <llmisvc-namespace>) as common, add a Self-hosted subsection that
uses port-forward to svc/prometheus-operated and instructs opening /targets, and
add an Azure Managed Prometheus subsection that explains how to verify scraping
via the Azure portal or the managed Prometheus query UI instead of using
svc/prometheus-operated; update headings and the surrounding text in the
README.md "Verify Monitoring Works" section accordingly.

In `@monitoring-stack/cks/README.md`:
- Around line 73-81: The README exposes the Grafana admin password via terminal
output and suggests using the default admin user; update the guidance to avoid
printing secrets and require immediate rotation: replace the raw kubectl get
secret snippet with instructions to retrieve the initial password securely
(avoid sharing terminal output), log in once to rotate the password immediately,
and document preferring a pre-created Kubernetes Secret by referencing the helm
value grafana.admin.existingSecret (and the prometheus-grafana secret name) so
operators provision credentials via an existing Secret for non-dev environments.

---

Outside diff comments:
In `@monitoring-stack/README.md`:
- Around line 62-69: Update the verification instructions to be platform-aware
by replacing the hard-coded svc/prometheus-operated reference with guidance to
first discover the Prometheus endpoint (e.g., run kubectl get svc -n monitoring
or check your cloud/managed Prometheus console) and then either port-forward to
the discovered service (kubectl port-forward -n monitoring
svc/<prometheus-service> 9090:9090) or open the managed Prometheus URL in your
cloud provider UI to query vllm_num_requests_running; mention both self-hosted
and managed workflows so the doc covers both cases.

---

Nitpick comments:
In `@Makefile`:
- Around line 188-204: The enable-monitoring and disable-monitoring Makefile
targets currently set the LLMISVC_MONITORING_DISABLED env on the
kserve-controller-manager deployment but do not wait for the controller rollout;
after kubectl set env in the enable-monitoring and disable-monitoring targets,
run kubectl rollout status deployment/kserve-controller-manager -n
$(KSERVE_NAMESPACE) (or an equivalent wait) to block until the new replica set
is ready so the Makefile only reports success once the controller has finished
rolling out; keep the existing pre-check (kubectl get deployment
kserve-controller-manager) and use the same deployment name and env var
(LLMISVC_MONITORING_DISABLED) to locate the change.

ℹ️ Review info

Configuration used: Repository: opendatahub-io/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between d5e28da and ff92135.

📒 Files selected for processing (10)
  • .github/workflows/kserve-update-chart.yaml
  • Makefile
  • README.md
  • charts/kserve/README.md
  • charts/kserve/generate-chart.sh
  • docs/deploying-llm-d-on-managed-kubernetes.md
  • monitoring-stack/README.md
  • monitoring-stack/aks/README.md
  • monitoring-stack/cks/README.md
  • test/conformance/verify-llm-d-deployment.sh

| Option | Default | Repository | Description |
|-------------------|-----------------|--------------------------|------------------------------------------------|
| `--overlay PATH` | (required) | - | Path to local KServe Kustomize overlay |
| `--tag TAG` | `rhoai-3.4-ea.1` | - | RHOAI release version for container images stored in quay.io |
Contributor


⚠️ Potential issue | 🟡 Minor

Inconsistent --tag default across docs causes ambiguous chart generation behavior.

Line 116 documents --tag default as rhoai-3.4-ea.1, but Lines 150-152 document workflow image_tag default as 3.4.0-ea.1 and pass it directly to --tag. Please normalize to one canonical default (or explicitly document why they differ) to avoid wrong image mapping during automation.

Suggested doc fix
-| `--tag TAG`       | `rhoai-3.4-ea.1`    | -                        | RHOAI release version for container images stored in quay.io    |
+| `--tag TAG`       | `3.4.0-ea.1`        | -                        | RHOAI image tag used for quay.io → registry.redhat.io replacement    |

Also applies to: 150-152

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@charts/kserve/README.md` at line 116, The docs show two different defaults
for the CLI flag `--tag` and the workflow variable `image_tag` (`rhoai-3.4-ea.1`
vs `3.4.0-ea.1`), which will confuse automation; pick one canonical default and
update both places (the README entry for `--tag TAG` and the workflow
documentation referencing `image_tag`) to use the same exact string, or
explicitly note why they differ if intentional, ensuring the workflow passes the
exact canonical `image_tag` value to `--tag`.

Comment on lines 1 to +4
.PHONY: deploy deploy-all undeploy undeploy-kserve status help check-kubeconfig sync clear-cache
.PHONY: deploy-cert-manager deploy-istio deploy-lws deploy-kserve deploy-opendatahub-prerequisites deploy-cert-manager-pki
.PHONY: test conformance
.PHONY: enable-monitoring disable-monitoring
Contributor


⚠️ Potential issue | 🟡 Minor

Add all and clean phony targets to satisfy checkmake policy.

Current PHONY list still leaves minphony warnings for required targets.

Suggested fix
-.PHONY: test conformance
+.PHONY: all clean test conformance
 .PHONY: enable-monitoring disable-monitoring
+
+all: help
+
+clean: clear-cache
🧰 Tools
🪛 checkmake (0.2.2)

[warning] 4-4: Missing required phony target "all"

(minphony)


[warning] 4-4: Missing required phony target "clean"

(minphony)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 1 - 4, The Makefile's .PHONY declaration is missing
the required "all" and "clean" targets, triggering minphony warnings from
checkmake; update the .PHONY lines (the existing .PHONY entries at the top of
the Makefile) to include all and clean (e.g., add "all" and "clean" to one of
the .PHONY lists) so the policy-required phony targets are declared and the
warnings are resolved.

Comment on lines +73 to 84
## 3. Verify Monitoring Works

After deploying an LLMInferenceService:

```diff
-# Check PodMonitors created by KServe
+# Check PodMonitors were created (replace with your namespace, e.g., llm-inference)
 kubectl get podmonitors -n <llmisvc-namespace>

-# Check targets in Prometheus
+# Check Prometheus is scraping targets
 kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
-# Open http://localhost:9090/targets
+# Open http://localhost:9090/targets and look for vLLM endpoints
```
Contributor


⚠️ Potential issue | 🟡 Minor

Split verification steps by Prometheus option to avoid broken commands.

Line 82 is self-hosted-specific (svc/prometheus-operated) and will not work for Azure Managed Prometheus users following Option 2.

Suggested doc fix
-# Check Prometheus is scraping targets
-kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
-# Open http://localhost:9090/targets and look for vLLM endpoints
+# Self-hosted kube-prometheus-stack:
+kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
+# Open http://localhost:9090/targets and look for vLLM endpoints
+
+# Azure Managed Prometheus:
+# Use Azure Monitor workspace -> Metrics Explorer (or Azure Managed Grafana)
+# to verify vLLM metrics are being scraped.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@monitoring-stack/aks/README.md` around lines 73 - 84, Split the "Verify
Monitoring Works" section into two subsections (Self-hosted Prometheus and Azure
Managed Prometheus) so commands aren’t broken for Option 2 users: keep the
PodMonitor check (kubectl get podmonitors -n <llmisvc-namespace>) as common, add
a Self-hosted subsection that uses port-forward to svc/prometheus-operated and
instructs opening /targets, and add an Azure Managed Prometheus subsection that
explains how to verify scraping via the Azure portal or the managed Prometheus
query UI instead of using svc/prometheus-operated; update headings and the
surrounding text in the README.md "Verify Monitoring Works" section accordingly.

Comment on lines +73 to 81
**Grafana:**
```bash
# Port forward
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# Get password
kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 -d
```

Open http://localhost:3000 (user: admin)
Contributor


⚠️ Potential issue | 🟠 Major

Harden Grafana credential guidance to avoid secret exposure (CWE-200, CWE-521).

Lines 79-81 encourage printing the admin password and logging in with default admin user, which can leak secrets via terminal history/screen capture and leaves weak default-operational posture. Recommend documenting immediate password rotation and preferred use of a pre-created secret (grafana.admin.existingSecret) instead of exposing credentials in plaintext output.

Suggested doc hardening
-# Get password
-kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 -d
-```
-Open http://localhost:3000 (user: admin)
+# Get initial password (avoid sharing terminal output; rotate immediately after first login)
+kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 -d
+```
+Open http://localhost:3000 (user: admin), then immediately rotate credentials.
+
+For non-dev environments, prefer provisioning Grafana credentials via a pre-created Kubernetes Secret
+and configure kube-prometheus-stack to use that existing secret.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@monitoring-stack/cks/README.md` around lines 73 - 81, The README exposes the
Grafana admin password via terminal output and suggests using the default admin
user; update the guidance to avoid printing secrets and require immediate
rotation: replace the raw kubectl get secret snippet with instructions to
retrieve the initial password securely (avoid sharing terminal output), log in
once to rotate the password immediately, and document preferring a pre-created
Kubernetes Secret by referencing the helm value grafana.admin.existingSecret
(and the prometheus-grafana secret name) so operators provision credentials via
an existing Secret for non-dev environments.
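The `grafana.admin.existingSecret` approach suggested above could be sketched as a kube-prometheus-stack values fragment like this (the Secret name and key names are illustrative assumptions, not taken from the PR):

```yaml
# Illustrative values.yaml fragment for kube-prometheus-stack:
# point Grafana at an operator-provisioned Secret instead of the
# chart-generated prometheus-grafana secret. Names are assumptions.
grafana:
  admin:
    existingSecret: grafana-admin-credentials
    userKey: admin-user
    passwordKey: admin-password
```

The referenced Secret would be created beforehand in the monitoring namespace, e.g. via `kubectl create secret generic grafana-admin-credentials -n monitoring --from-literal=admin-user=<user> --from-literal=admin-password=<password>`.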

@openshift-ci

openshift-ci bot commented Mar 18, 2026

PR needs rebase.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@zdtsw
Member Author

zdtsw commented Mar 19, 2026

@zdtsw zdtsw closed this Mar 19, 2026
