docs: MLflow tracing for Claude Code on RHOAI#105
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughNew documentation for MLflow tracing integration with Claude Code agent runtimes on Red Hat OpenShift AI. The guide validates trace capture across three inference backends (Vertex AI, vLLM, OGX→vLLM), defines the expected trace schema, provides backend-specific execution results, and supplies step-by-step setup instructions including MLflow configuration and RBAC changes. ChangesMLflow Tracing Documentation
🎯 1 (Trivial) | ⏱️ ~3 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/claude-agent/mlflow-tracing.md`:
- Around line 107-118: The fenced code block containing the trace schema tree
(the block starting with "claude_code_conversation (root)" and the subsequent
tool lines) lacks a language identifier and fails the markdown linter; update
that triple-backtick fence to include the language tag "text" (i.e., change ```
to ```text) so the block is recognized as plain text in mlflow-tracing.md.
- Around line 17-22: The fenced code block in
docs/claude-agent/mlflow-tracing.md containing the log snippet lacks a language
identifier which fails the markdown linter; update that block by adding a
language tag such as text or log after the opening backticks (i.e., change the
``` to ```text or ```log) so the linter accepts the block and the log output
lines (INFO Using native /v1/messages passthrough, base_url=..., model=..., HTTP
200) remain unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 2040aeeb-aeae-425b-92b7-0bec3d4586b3
⛔ Files ignored due to path filters (5)
docs/claude-agent/screenshots/ogx-trace.pngis excluded by!**/*.pngdocs/claude-agent/screenshots/vertex-summary.pngis excluded by!**/*.pngdocs/claude-agent/screenshots/vertex-trace.pngis excluded by!**/*.pngdocs/claude-agent/screenshots/vllm-summary.pngis excluded by!**/*.pngdocs/claude-agent/screenshots/vllm-trace.pngis excluded by!**/*.png
📒 Files selected for processing (1)
docs/claude-agent/mlflow-tracing.md
8fb9834 to
d896503
Compare
cca0d2d to
695fd30
Compare
|
Hey @Nehanth - a new repo-level ruleset is added that now requires Unit Tests and lint checks to pass before merge, plus approval from the agentic-starter-kits-maintainers team. This PR is currently blocked because the Unit Tests check hasn't run on it. A rebase onto main should pick up the updated workflow and trigger the required checks. Please rebase when you get a chance. |
695fd30 to
8fb69c9
Compare
Done! |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
docs/claude-agent/mlflow-tracing.md (1)
204-204: 🏗️ Heavy liftConsider scoping down RBAC permissions.
Granting the
editrole to the default service account provides broad read/write access to most resources in the namespace. For production deployments, consider creating a dedicated service account with minimal permissions required for MLflow integration (e.g., permissions to create/update experiments, runs, and access required storage). The exact permissions depend on thekubernetes-namespacedauth plugin requirements.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/claude-agent/mlflow-tracing.md` at line 204, The current instruction uses "oc adm policy add-role-to-user edit -z default -n <your-namespace>" which grants the broad edit role to the default service account; replace this with guidance to create and bind a dedicated service account with least-privilege RBAC for MLflow (instead of using the default SA). Update the docs to show creating a service account (e.g., "mlflow-sa"), a Role or ClusterRole containing only needed verbs/resources for experiments/runs and storage access, and a RoleBinding that binds that Role to "mlflow-sa"; mention that the exact rules should be derived from the kubernetes-namespaced auth plugin requirements and provide the example placeholders for Role rules and the RoleBinding that users must tailor for production.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/claude-agent/mlflow-tracing.md`:
- Around line 191-199: Update the MLflow version requirement in the Dockerfile
documentation snippet: replace the fork reference or the version constraint that
implies "3.11" with a concrete minimum of 3.11.1 so users get the
kubernetes-namespaced auth plugin; specifically change the pip install target
that currently uses "'mlflow[kubernetes] @
git+https://github.com/red-hat-data-services/mlflow.git@rhoai-3.4'" or any
mention of ">=3.11" to use "mlflow[kubernetes]>=3.11.1" (also update the
explanatory sentence that references RHOAI shipping 3.11 to mention 3.11.1), and
check the later reference around the second mention (line ~273) to ensure it
matches the same >=3.11.1 constraint.
---
Nitpick comments:
In `@docs/claude-agent/mlflow-tracing.md`:
- Line 204: The current instruction uses "oc adm policy add-role-to-user edit -z
default -n <your-namespace>" which grants the broad edit role to the default
service account; replace this with guidance to create and bind a dedicated
service account with least-privilege RBAC for MLflow (instead of using the
default SA). Update the docs to show creating a service account (e.g.,
"mlflow-sa"), a Role or ClusterRole containing only needed verbs/resources for
experiments/runs and storage access, and a RoleBinding that binds that Role to
"mlflow-sa"; mention that the exact rules should be derived from the
kubernetes-namespaced auth plugin requirements and provide the example
placeholders for Role rules and the RoleBinding that users must tailor for
production.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 38228af3-b2be-41d4-a99d-e0e02786305e
⛔ Files ignored due to path filters (6)
docs/claude-agent/screenshots/ogx-summary.pngis excluded by!**/*.pngdocs/claude-agent/screenshots/ogx-trace.pngis excluded by!**/*.pngdocs/claude-agent/screenshots/vertex-summary.pngis excluded by!**/*.pngdocs/claude-agent/screenshots/vertex-trace.pngis excluded by!**/*.pngdocs/claude-agent/screenshots/vllm-summary.pngis excluded by!**/*.pngdocs/claude-agent/screenshots/vllm-trace.pngis excluded by!**/*.png
📒 Files selected for processing (1)
docs/claude-agent/mlflow-tracing.md
aakankshaduggal
left a comment
There was a problem hiding this comment.
Nice work — the doc covers RHAIENG-4751 through 4754 cleanly, and the "same prompt across 3 backends" approach is a great way to prove backend-agnostic tracing. A few things to address:
-
File location vs repo restructure — there's an active discussion on restructuring the repo (see the thread on the restructure proposal). Should this live under
agents/claude-code/instead ofdocs/claude-agent/to align with the new structure? -
ogx-summary.png is missing — PR body notes "to be added." Should be included before merge.
-
MLFLOW_TRACKING_INSECURE_TLS=true— worth adding a note that this is for dev/test setups and production deployments should use proper TLS certificates. -
ANTHROPIC_API_KEY=fakein step 4 — this works but could confuse readers. A brief note explaining why (OGX doesn't validate API keys for self-hosted models) would help. -
Hardcoded
redhat-ods-applicationsnamespace in the MLflow tracking URI — this varies by RHOAI installation, worth calling out.
Reviewed by Claude with @aakankshaduggal's supervision
|
Thanks for the review @aakankshaduggal! All points addressed in the latest push:
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@agents/claude-code/mlflow-tracing.md`:
- Around line 203-205: The doc currently shows granting the broad `edit` role to
the `default` service account via the command `oc adm policy add-role-to-user
edit -z default -n <your-namespace>`; change the guidance to instruct creating a
dedicated service account (e.g., `mlflow-sa`) and a minimal Role and RoleBinding
that only grant MLflow-required verbs/resources (list the specific API
groups/resources/verbs MLflow needs) instead of using `edit`, and replace the
single-line example with instructions to create the service account and bind
only that minimal role to it.
- Around line 247-248: The generated entrypoint currently hardcodes
env["MLFLOW_TRACKING_INSECURE_TLS"] = "true"; change this to read from the
environment with a safe default (e.g., use os.getenv or equivalent to set
MLFLOW_TRACKING_INSECURE_TLS to "true" only if explicitly set, defaulting to
"false") and update the runtime settings write (the code that writes to sf using
s) to reflect that value; also add a brief comment or docstring next to where
env is populated explaining this flag should only be enabled for dev/test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 548ce6d6-04e5-4046-b279-f02834cc8c1d
⛔ Files ignored due to path filters (6)
agents/claude-code/screenshots/ogx-summary.pngis excluded by!**/*.pngagents/claude-code/screenshots/ogx-trace.pngis excluded by!**/*.pngagents/claude-code/screenshots/vertex-summary.pngis excluded by!**/*.pngagents/claude-code/screenshots/vertex-trace.pngis excluded by!**/*.pngagents/claude-code/screenshots/vllm-summary.pngis excluded by!**/*.pngagents/claude-code/screenshots/vllm-trace.pngis excluded by!**/*.png
📒 Files selected for processing (1)
agents/claude-code/mlflow-tracing.md
tarun-etikala
left a comment
There was a problem hiding this comment.
Thanks for the thorough investigation across all four RHAIENG tickets, @Nehanth. The tracing validation across three backends is valuable. A few things to address before merging.
Structure: doesn't match repo conventions
- JIRA tickets as headings: doc uses ticket numbers as section headings. Please restructure around what the reader needs to do, not which ticket produced the finding. The setup guide (current RHAIENG-4754, Steps 1–6) should be the primary content. The investigation findings (4751, 4752/4753) belong in the JIRA ticket descriptions they're useful context but not actionable docs for someone setting up tracing. Recommendations based on findings could be added here
- Voice: Repo docs use second person imperative ("Edit
.env", "Runmake deploy"). This doc uses first person plural throughout ("We deployed", "We ran"). Please rewrite to match. - Redundancy: The same backend comparison tables (Vertex AI, vLLM, OGX — identical trace IDs, tokens, latencies) in both the 4751 and 4752/4753 sections (~50 lines duplicated). Consolidate into a single "Results" section.
sanafayyaz315
left a comment
There was a problem hiding this comment.
Looks good overall. Once the structural changes @tarun-etikala recommended are addressed (restructure around reader actions instead of JIRA tickets, fix voice, consolidate duplicated tables), this should be good to merge.
Documents MLflow autolog integration with Claude Code across Vertex AI, vLLM, and OGX backends. Covers RHAIENG-4751, 4752, 4753, and 4754 — telemetry investigation, tool call tracing prototype, session-level metrics, and RHOAI 3.5 setup guide and recommendation. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
… fix API key placeholder, link to repo file
4b14907 to
aa62a46
Compare
aakankshaduggal
left a comment
There was a problem hiding this comment.
Thanks @Nehanth, lgtm! 🚢
Summary
Adds
agents/claude-code/with documentation covering MLflow tracing for Claude Code on RHOAI (RHAIENG-4751, 4752, 4753, 4754).mlflow autolog claudework across all backends (Vertex AI, vLLM, OGX) with the same trace schema.Screenshots
Includes MLflow trace screenshots for all three backends showing both Inputs/Outputs detail and session waterfall views.