
docs: MLflow tracing for Claude Code on RHOAI #105

Merged: Nehanth merged 10 commits into main from mlflow-tracing-docs on May 21, 2026

@Nehanth Nehanth commented May 18, 2026

Summary

Adds agents/claude-code/ with documentation covering MLflow tracing for Claude Code on RHOAI (RHAIENG-4751, 4752, 4753, 4754).

  • RHAIENG-4751 — OGX telemetry investigation. Agent-level OTel spans via mlflow autolog claude work across all backends (Vertex AI, vLLM, OGX) with the same trace schema.
  • RHAIENG-4752 & 4753 — Tool call trace prototype and session-level metrics. Validated with "build me a tetris game" across all three backends.
  • RHAIENG-4754 — Step-by-step setup guide for hooking Claude Code, OGX, and MLflow together on RHOAI 3.4, with recommendation to productize for RHOAI 3.5.

Screenshots

Includes MLflow trace screenshots for all three backends showing both Inputs/Outputs detail and session waterfall views.


coderabbitai Bot commented May 18, 2026


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Walkthrough

New documentation for MLflow tracing integration with Claude Code agent runtimes on Red Hat OpenShift AI. The guide validates trace capture across three inference backends (Vertex AI, vLLM, OGX→vLLM), defines the expected trace schema, provides backend-specific execution results, and supplies step-by-step setup instructions including MLflow configuration and RBAC changes.

Changes

MLflow Tracing Documentation

| Layer / File(s) | Summary |
|---|---|
| Overview and tracing validation across backends (agents/claude-code/mlflow-tracing.md) | Introduction and context for containerized Claude Code tracing on OpenShift AI with end-to-end evidence from running the same prompt across Vertex AI, vLLM direct, and OGX→vLLM backends with captured session traces. |
| Trace schema and backend-specific results (agents/claude-code/mlflow-tracing.md) | Expected trace schema with root conversation span and tool/LLM inference spans; captured per-span and session fields; backend-specific results from a "Tetris game" prompt including token counts, latency, span counts, and trace IDs. |
| Observability setup and RHOAI configuration (agents/claude-code/mlflow-tracing.md) | Prerequisites, Red Hat MLflow fork installation with Kubernetes auth plugin, RBAC configuration, environment variables for MLflow and OGX, entrypoint wiring for mlflow autolog claude, verification steps, and guidance for upgrading to upstream MLflow >=3.11. |

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 5 passed

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding MLflow tracing documentation for Claude Code on RHOAI, which matches the primary purpose of the changeset. |
| Description check | ✅ Passed | The description is directly related to the changeset, providing a clear summary of the documentation added, the issues addressed, and key content including setup guides, screenshots, and validation details. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/claude-agent/mlflow-tracing.md`:
- Around line 107-118: The fenced code block containing the trace schema tree
(the block starting with "claude_code_conversation  (root)" and the subsequent
tool lines) lacks a language identifier and fails the markdown linter; update
that triple-backtick fence to include the language tag "text" (i.e., change ```
to ```text) so the block is recognized as plain text in mlflow-tracing.md.
- Around line 17-22: The fenced code block in
docs/claude-agent/mlflow-tracing.md containing the log snippet lacks a language
identifier which fails the markdown linter; update that block by adding a
language tag such as text or log after the opening backticks (i.e., change the
``` to ```text or ```log) so the linter accepts the block and the log output
lines (INFO Using native /v1/messages passthrough, base_url=..., model=..., HTTP
200) remain unchanged.
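
The two linter findings above share one fix: give every bare opening fence a language tag. A minimal sketch of that edit in Python follows; the function name `tag_bare_fences` and the default tag `text` are illustrative assumptions, not part of the PR or of any linter's tooling.

```python
FENCE = "`" * 3  # built programmatically so this example contains no literal fence


def tag_bare_fences(markdown: str, lang: str = "text") -> str:
    """Add `lang` to every opening fence that has no language identifier.

    Closing fences and fences that already carry a tag are left unchanged.
    Note: a trailing newline in the input is not preserved by this sketch.
    """
    out = []
    in_block = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            if not in_block and stripped == FENCE:
                # Bare opening fence: append the tag, keep any indentation.
                line = line.replace(FENCE, FENCE + lang, 1)
            in_block = not in_block  # every fence line toggles block state
        out.append(line)
    return "\n".join(out)
```

Running it over the file's contents and writing the result back would address both flagged blocks in one pass; the log snippet could instead be tagged by hand with `log` if that reads better.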

📥 Commits

Reviewing files that changed from the base of the PR and between 237a0b5 and d4781b9.

⛔ Files ignored due to path filters (5)
  • docs/claude-agent/screenshots/ogx-trace.png is excluded by !**/*.png
  • docs/claude-agent/screenshots/vertex-summary.png is excluded by !**/*.png
  • docs/claude-agent/screenshots/vertex-trace.png is excluded by !**/*.png
  • docs/claude-agent/screenshots/vllm-summary.png is excluded by !**/*.png
  • docs/claude-agent/screenshots/vllm-trace.png is excluded by !**/*.png
📒 Files selected for processing (1)
  • docs/claude-agent/mlflow-tracing.md


tarun-etikala commented May 18, 2026

Hey @Nehanth - a new repo-level ruleset has been added that requires the Unit Tests and lint checks to pass before merge, plus approval from the agentic-starter-kits-maintainers team.

This PR is currently blocked because the Unit Tests check hasn't run on it. A rebase onto main should pick up the updated workflow and trigger the required checks. Please rebase when you get a chance.

@Nehanth Nehanth force-pushed the mlflow-tracing-docs branch from 695fd30 to 8fb69c9 on May 18, 2026 19:43
@Nehanth Nehanth requested a review from a team as a code owner May 18, 2026 19:43

Nehanth commented May 18, 2026

Done!


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
docs/claude-agent/mlflow-tracing.md (1)

204-204: 🏗️ Heavy lift

Consider scoping down RBAC permissions.

Granting the edit role to the default service account provides broad read/write access to most resources in the namespace. For production deployments, consider creating a dedicated service account with minimal permissions required for MLflow integration (e.g., permissions to create/update experiments, runs, and access required storage). The exact permissions depend on the kubernetes-namespaced auth plugin requirements.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/claude-agent/mlflow-tracing.md` at line 204, The current instruction
uses "oc adm policy add-role-to-user edit -z default -n <your-namespace>" which
grants the broad edit role to the default service account; replace this with
guidance to create and bind a dedicated service account with least-privilege
RBAC for MLflow (instead of using the default SA). Update the docs to show
creating a service account (e.g., "mlflow-sa"), a Role or ClusterRole containing
only needed verbs/resources for experiments/runs and storage access, and a
RoleBinding that binds that Role to "mlflow-sa"; mention that the exact rules
should be derived from the kubernetes-namespaced auth plugin requirements and
provide the example placeholders for Role rules and the RoleBinding that users
must tailor for production.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/claude-agent/mlflow-tracing.md`:
- Around line 191-199: Update the MLflow version requirement in the Dockerfile
documentation snippet: replace the fork reference or the version constraint that
implies "3.11" with a concrete minimum of 3.11.1 so users get the
kubernetes-namespaced auth plugin; specifically change the pip install target
that currently uses "'mlflow[kubernetes] @
git+https://github.com/red-hat-data-services/mlflow.git@rhoai-3.4'" or any
mention of ">=3.11" to use "mlflow[kubernetes]>=3.11.1" (also update the
explanatory sentence that references RHOAI shipping 3.11 to mention 3.11.1), and
check the later reference around the second mention (line ~273) to ensure it
matches the same >=3.11.1 constraint.
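
If the constraint is tightened as suggested, a startup check can catch an image built with an older package. The sketch below assumes the 3.11.1 minimum named in the comment above and compares only the numeric release segment; it is a simplification of full PEP 440 ordering (pre-release and local suffixes are ignored), and the helper names are illustrative.

```python
import re
from importlib import metadata


def release_tuple(version: str) -> tuple:
    """Return the leading numeric release segment, e.g. '3.11.1' -> (3, 11, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str = "3.11.1") -> bool:
    """Compare release segments numerically, padding the shorter one with zeros."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    pad = lambda t: t + (0,) * (width - len(t))
    return pad(a) >= pad(b)


def installed_mlflow_ok(minimum: str = "3.11.1") -> bool:
    """Check the mlflow distribution in the current environment, if any."""
    try:
        return meets_minimum(metadata.version("mlflow"), minimum)
    except metadata.PackageNotFoundError:
        return False
```

Comparing integers rather than strings matters here: as strings, "3.9" would sort after "3.11".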


📥 Commits

Reviewing files that changed from the base of the PR and between d4781b9 and 8fb69c9.

⛔ Files ignored due to path filters (6)
  • docs/claude-agent/screenshots/ogx-summary.png is excluded by !**/*.png
  • docs/claude-agent/screenshots/ogx-trace.png is excluded by !**/*.png
  • docs/claude-agent/screenshots/vertex-summary.png is excluded by !**/*.png
  • docs/claude-agent/screenshots/vertex-trace.png is excluded by !**/*.png
  • docs/claude-agent/screenshots/vllm-summary.png is excluded by !**/*.png
  • docs/claude-agent/screenshots/vllm-trace.png is excluded by !**/*.png
📒 Files selected for processing (1)
  • docs/claude-agent/mlflow-tracing.md


@aakankshaduggal aakankshaduggal left a comment


Nice work — the doc covers RHAIENG-4751 through 4754 cleanly, and the "same prompt across 3 backends" approach is a great way to prove backend-agnostic tracing. A few things to address:

  1. File location vs repo restructure — there's an active discussion on restructuring the repo (see the thread on the restructure proposal). Should this live under agents/claude-code/ instead of docs/claude-agent/ to align with the new structure?

  2. ogx-summary.png is missing — PR body notes "to be added." Should be included before merge.

  3. MLFLOW_TRACKING_INSECURE_TLS=true — worth adding a note that this is for dev/test setups and production deployments should use proper TLS certificates.

  4. ANTHROPIC_API_KEY=fake in step 4 — this works but could confuse readers. A brief note explaining why (OGX doesn't validate API keys for self-hosted models) would help.

  5. Hardcoded redhat-ods-applications namespace in the MLflow tracking URI — this varies by RHOAI installation, worth calling out.


Reviewed by Claude with @aakankshaduggal's supervision


Nehanth commented May 19, 2026

Thanks for the review @aakankshaduggal! All points addressed in the latest push:

  1. File location — Moved to agents/claude-code/ to align with the repo structure.
  2. ogx-summary.png — Added, all 6 screenshots are now included.
  3. MLFLOW_TRACKING_INSECURE_TLS — Added note: "for dev/test only — production deployments should use proper TLS certificates."
  4. ANTHROPIC_API_KEY=fake — Added note: "OGX does not validate API keys for self-hosted models, any non-empty string works."
  5. Hardcoded namespace — Changed to mlflow.<your-rhoai-namespace>.svc:8443 with a comment noting redhat-ods-applications is common.
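
Point 5 can also be handled at runtime by building the tracking URI from a namespace value instead of hardcoding it. A small sketch under stated assumptions: the `https` scheme is assumed (the thread gives only the host and port), and the function name `tracking_uri` is illustrative.

```python
def tracking_uri(namespace: str, scheme: str = "https", port: int = 8443) -> str:
    """Build the in-cluster MLflow URI for a given RHOAI namespace.

    The scheme is an assumption; only 'mlflow.<namespace>.svc:8443' comes
    from the discussion above.
    """
    if not namespace or "<" in namespace or ">" in namespace:
        # Guard against the documentation placeholder being used verbatim.
        raise ValueError("set a concrete namespace, not the placeholder")
    return f"{scheme}://mlflow.{namespace}.svc:{port}"
```

For the common case mentioned in the reply, `tracking_uri("redhat-ods-applications")` yields a URI that could be exported as `MLFLOW_TRACKING_URI`.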


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@agents/claude-code/mlflow-tracing.md`:
- Around line 203-205: The doc currently shows granting the broad `edit` role to
the `default` service account via the command `oc adm policy add-role-to-user
edit -z default -n <your-namespace>`; change the guidance to instruct creating a
dedicated service account (e.g., `mlflow-sa`) and a minimal Role and RoleBinding
that only grant MLflow-required verbs/resources (list the specific API
groups/resources/verbs MLflow needs) instead of using `edit`, and replace the
single-line example with instructions to create the service account and bind
only that minimal role to it.
- Around line 247-248: The generated entrypoint currently hardcodes
env["MLFLOW_TRACKING_INSECURE_TLS"] = "true"; change this to read from the
environment with a safe default (e.g., use os.getenv or equivalent to set
MLFLOW_TRACKING_INSECURE_TLS to "true" only if explicitly set, defaulting to
"false") and update the runtime settings write (the code that writes to sf using
s) to reflect that value; also add a brief comment or docstring next to where
env is populated explaining this flag should only be enabled for dev/test.
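
The second finding above, reading the flag from the environment with a safe default, might look roughly like this. It is a sketch rather than the entrypoint from the PR: the helper names are illustrative, and normalizing to the strings "true" and "false" is a choice made here, stricter than whatever values MLflow itself may accept.

```python
import os

FLAG = "MLFLOW_TRACKING_INSECURE_TLS"


def resolve_insecure_tls(environ=None) -> str:
    """Return "true" only when the flag is explicitly set to true, else "false".

    Skipping TLS verification should only be enabled for dev/test clusters.
    """
    source = os.environ if environ is None else environ
    raw = source.get(FLAG, "false")
    return "true" if raw.strip().lower() == "true" else "false"


def build_child_env(environ=None) -> dict:
    """Copy the environment and pin the flag to its resolved, normalized value."""
    source = dict(os.environ if environ is None else environ)
    env = dict(source)
    env[FLAG] = resolve_insecure_tls(source)
    return env
```

With this shape, a production deployment that never sets the variable gets certificate verification by default, and a dev cluster opts in by setting the variable to `true` on the pod.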

📥 Commits

Reviewing files that changed from the base of the PR and between 8fb69c9 and 5f74dd8.

⛔ Files ignored due to path filters (6)
  • agents/claude-code/screenshots/ogx-summary.png is excluded by !**/*.png
  • agents/claude-code/screenshots/ogx-trace.png is excluded by !**/*.png
  • agents/claude-code/screenshots/vertex-summary.png is excluded by !**/*.png
  • agents/claude-code/screenshots/vertex-trace.png is excluded by !**/*.png
  • agents/claude-code/screenshots/vllm-summary.png is excluded by !**/*.png
  • agents/claude-code/screenshots/vllm-trace.png is excluded by !**/*.png
📒 Files selected for processing (1)
  • agents/claude-code/mlflow-tracing.md

@Nehanth Nehanth requested a review from aakankshaduggal May 19, 2026 18:27

@tarun-etikala tarun-etikala left a comment


Thanks for the thorough investigation across all four RHAIENG tickets, @Nehanth. The tracing validation across three backends is valuable. A few things to address before merging.

Structure: doesn't match repo conventions

  • JIRA tickets as headings: The doc uses ticket numbers as section headings. Please restructure around what the reader needs to do, not which ticket produced the finding. The setup guide (current RHAIENG-4754, Steps 1–6) should be the primary content. The investigation findings (4751, 4752/4753) belong in the JIRA ticket descriptions; they're useful context but not actionable docs for someone setting up tracing. Recommendations based on the findings could be added here.
  • Voice: Repo docs use second person imperative ("Edit .env", "Run make deploy"). This doc uses first person plural throughout ("We deployed", "We ran"). Please rewrite to match.
  • Redundancy: The same backend comparison tables (Vertex AI, vLLM, OGX, with identical trace IDs, tokens, and latencies) appear in both the 4751 and 4752/4753 sections (~50 lines duplicated). Consolidate into a single "Results" section.


@sanafayyaz315 sanafayyaz315 left a comment


Looks good overall. Once the structural changes @tarun-etikala recommended are addressed (restructure around reader actions instead of JIRA tickets, fix voice, consolidate duplicated tables), this should be good to merge.

tarun-etikala previously approved these changes May 20, 2026
Nehanth and others added 6 commits May 20, 2026 17:57
Documents MLflow autolog integration with Claude Code across Vertex AI,
vLLM, and OGX backends. Covers RHAIENG-4751, 4752, 4753, and 4754 —
telemetry investigation, tool call tracing prototype, session-level
metrics, and RHOAI 3.5 setup guide and recommendation.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@Nehanth Nehanth force-pushed the mlflow-tracing-docs branch from 4b14907 to aa62a46 on May 20, 2026 22:07

@aakankshaduggal aakankshaduggal left a comment


Thanks @Nehanth, lgtm! 🚢

@Nehanth Nehanth merged commit 10e0129 into main May 21, 2026
8 checks passed
@Nehanth Nehanth deleted the mlflow-tracing-docs branch May 21, 2026 15:27