feat: Migrate RHAI notebooks to RuntimePatches API by hrathina · Pull Request #836 · opendatahub-io/distributed-workloads

hrathina · 2026-05-07T16:46:21Z

Description

How Has This Been Tested?

Merge criteria:

The commits are squashed in a cohesive manner and have meaningful messages.
Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
The developer has manually tested the changes and verified that the changes work

Summary by CodeRabbit

Refactor
- Updated training example notebooks to use a new patch-based configuration API for Kubeflow Trainer jobs. All training functionality remains unchanged.

openshift-ci · 2026-05-07T16:46:29Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign robert-bell for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-05-07T16:46:37Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: fd3f2ef9-f0da-49d9-a20e-99219e44a218

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

Nine Jupyter notebooks in the test trainer resources directory are updated to replace the deprecated PodTemplateOverrides API imports and configuration with a new nested patch-based API (RuntimePatch, TrainingRuntimeSpecPatch, JobSetTemplatePatch, JobSetSpecPatch, ReplicatedJobPatch, JobTemplatePatch, JobSpecPatch, PodTemplatePatch, PodSpecPatch, ContainerPatch). The refactoring preserves functional behavior—PVC mounting and pod spec configuration—while restructuring the job options passed to client.train() to use the new patch object hierarchy instead of override objects.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Security findings

No actionable security issues identified. The API migration does not introduce authentication bypass, privilege escalation, injection vulnerabilities, or credential exposure risks. PVC volume attachment patterns remain equivalent to the prior override-based approach.

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title accurately describes the primary change: migrating notebooks from override-based Kubernetes options to the RuntimePatches API, which is the core modification across all 9 notebooks.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

tests/trainer/resources/rhai_features_fsdp_shared_state.ipynb (1)
167-168: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

FSDP mode mismatch makes this notebook functionally identical to rhai_features_fsdp_full_state.ipynb.

Line 167 has the correct shard_grad_op mode commented out, and line 168 applies full_shard — the same setting used by the full-state notebook. This means the shared-state notebook tests nothing distinct. The markdown title at line 7 also still reads "Full State" rather than "Shared State".
🐛 Fix: restore the intended FSDP mode
-    # training_args_kwargs["fsdp"] = "shard_grad_op auto_wrap"
-    training_args_kwargs["fsdp"] = "full_shard auto_wrap"
+    training_args_kwargs["fsdp"] = "shard_grad_op auto_wrap"
And fix the notebook title cell:
-"# RHAI Features Test - TrainJob Submission with FSDP (Full State)\n",
+"# RHAI Features Test - TrainJob Submission with FSDP (Shared State)\n",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/trainer/resources/rhai_features_fsdp_shared_state.ipynb` around lines
167 - 168, The notebook uses the wrong FSDP mode and title: update the cell that
sets training_args_kwargs["fsdp"] to restore the intended shared-state mode by
setting it to "shard_grad_op auto_wrap" (undo the current "full_shard
auto_wrap"), and edit the markdown title cell (currently "Full State") to read
"Shared State" so the notebook and its configuration reflect the shared-state
test scenario.
tests/trainer/resources/lora.ipynb (1)
568-573: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Same stale pod_logs bug as in osft.ipynb.

Line 568 fetches a new generator, but line 571 discards it and uses the already-exhausted pod_logs generator from the cell at line 524. The log output will always be empty.
🐛 Fix
-logs = client.get_job_logs(name=job_name, follow=False)
-
-# Collect all log lines from the generator into a list
-logs = list(pod_logs)
+logs = list(client.get_job_logs(name=job_name, follow=False))
 log_text = "\n".join(str(line) for line in logs)
 print(log_text)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/trainer/resources/lora.ipynb` around lines 568 - 573, The code fetches
a new generator into "logs" via client.get_job_logs(name=job_name, follow=False)
but then erroneously iterates the previously-exhausted "pod_logs" generator;
replace uses of "pod_logs" with the newly fetched "logs" (or rename the fetched
generator to pod_logs and remove the old exhausted variable) so that the list
construction and join use the fresh generator returned by client.get_job_logs.
tests/trainer/resources/osft.ipynb (1)
494-501: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

pod_logs is already exhausted when this cell executes — the log print is always empty.

Line 496 fetches a new generator into logs, but line 499 immediately overwrites it with list(pod_logs) — where pod_logs was assigned and fully consumed (list(pod_logs)) back at lines 452 and 455 in the previous cell. The result is that this cell always prints empty output.
🐛 Fix: use the generator just fetched
-logs = client.get_job_logs(name=job_name, follow=False)
-
-# Collect all log lines from the generator into a list
-logs = list(pod_logs)
+# Collect all log lines from the generator into a list
+logs = list(client.get_job_logs(name=job_name, follow=False))
 log_text = "\n".join(str(line) for line in logs)
 print(log_text)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/trainer/resources/osft.ipynb` around lines 494 - 501, The cell is
printing an empty log because it ignores the newly fetched generator: it assigns
logs = client.get_job_logs(name=job_name, follow=False) but then overwrites logs
with list(pod_logs), where pod_logs is an already-exhausted generator; replace
uses of pod_logs here with the newly returned generator (logs) so you consume
the fresh generator (e.g., convert logs to a list or join its lines) and then
build log_text from that list; update any references to pod_logs in this cell to
use logs (or rename for clarity) and ensure you don't reuse an exhausted
generator from a previous cell.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/trainer/resources/mnist.ipynb`:
- Line 437: The PVC name pvc_name (assigned from os.getenv("SHARED_PVC_NAME",
"")) is injected unguarded into the k8s manifest as claimName which causes an
opaque API error when empty; update the manifest assembly in mnist.ipynb to only
include the "persistentVolumeClaim": {"claimName": pvc_name} entry when pvc_name
is non-empty (truthy) or omit the entire volumeMount/volume entry
otherwise—locate the code that constructs the dict containing
"persistentVolumeClaim" and wrap or conditionally add that key based on pvc_name
to mirror the validation used in lora.ipynb.

In `@tests/trainer/resources/torchrun_failure.ipynb`:
- Around line 350-352: The test uses hard-coded patch targets
ReplicatedJobPatch(name="node") and ContainerPatch(name="node") which silently
no-op if the loaded TrainingRuntime uses different replica or container names;
update the test to derive the replica and container names from the runtime
metadata (th_runtime) or assert/validate the expected names before creating
patches, then construct ReplicatedJobPatch and ContainerPatch using those
derived names so PVC injection applies reliably (reference identifiers:
ReplicatedJobPatch, ContainerPatch, th_runtime).

---

Outside diff comments:
In `@tests/trainer/resources/lora.ipynb`:
- Around line 568-573: The code fetches a new generator into "logs" via
client.get_job_logs(name=job_name, follow=False) but then erroneously iterates
the previously-exhausted "pod_logs" generator; replace uses of "pod_logs" with
the newly fetched "logs" (or rename the fetched generator to pod_logs and remove
the old exhausted variable) so that the list construction and join use the fresh
generator returned by client.get_job_logs.

In `@tests/trainer/resources/osft.ipynb`:
- Around line 494-501: The cell is printing an empty log because it ignores the
newly fetched generator: it assigns logs = client.get_job_logs(name=job_name,
follow=False) but then overwrites logs with list(pod_logs), where pod_logs is an
already-exhausted generator; replace uses of pod_logs here with the newly
returned generator (logs) so you consume the fresh generator (e.g., convert logs
to a list or join its lines) and then build log_text from that list; update any
references to pod_logs in this cell to use logs (or rename for clarity) and
ensure you don't reuse an exhausted generator from a previous cell.

In `@tests/trainer/resources/rhai_features_fsdp_shared_state.ipynb`:
- Around line 167-168: The notebook uses the wrong FSDP mode and title: update
the cell that sets training_args_kwargs["fsdp"] to restore the intended
shared-state mode by setting it to "shard_grad_op auto_wrap" (undo the current
"full_shard auto_wrap"), and edit the markdown title cell (currently "Full
State") to read "Shared State" so the notebook and its configuration reflect the
shared-state test scenario.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 59113999-7d72-4863-847b-835add079f0a

📥 Commits

Reviewing files that changed from the base of the PR and between 0b30c29 and de00eed.

📒 Files selected for processing (9)

tests/trainer/resources/lora.ipynb
tests/trainer/resources/mnist.ipynb
tests/trainer/resources/osft.ipynb
tests/trainer/resources/rhai_features.ipynb
tests/trainer/resources/rhai_features_deepspeed_stage0.ipynb
tests/trainer/resources/rhai_features_fsdp_full_state.ipynb
tests/trainer/resources/rhai_features_fsdp_shared_state.ipynb
tests/trainer/resources/sft.ipynb
tests/trainer/resources/torchrun_failure.ipynb

coderabbitai · 2026-05-07T16:56:57Z

+    "                                                volumes=[\n",
+    "                                                    {\n",
+    "                                                        \"name\": \"work\",\n",
+    "                                                        \"persistentVolumeClaim\": {\"claimName\": pvc_name},\n",


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

pvc_name is used without an emptiness guard; an empty value produces an opaque Kubernetes API error.

pvc_name = os.getenv("SHARED_PVC_NAME", "") (line 400) defaults to "". The changed code at line 437 injects it directly as claimName. Other notebooks in the same repo (e.g., lora.ipynb) validate the PVC name before use.

pvc_name = os.getenv("SHARED_PVC_NAME", "") print(f"[notebook] Using PVC: {pvc_name}") +if not pvc_name: + raise RuntimeError("SHARED_PVC_NAME environment variable is required")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/trainer/resources/mnist.ipynb` at line 437, The PVC name pvc_name (assigned from os.getenv("SHARED_PVC_NAME", "")) is injected unguarded into the k8s manifest as claimName which causes an opaque API error when empty; update the manifest assembly in mnist.ipynb to only include the "persistentVolumeClaim": {"claimName": pvc_name} entry when pvc_name is non-empty (truthy) or omit the entire volumeMount/volume entry otherwise—locate the code that constructs the dict containing "persistentVolumeClaim" and wrap or conditionally add that key based on pvc_name to mirror the validation used in lora.ipynb.

coderabbitai · 2026-05-07T16:56:57Z

+    "                            ReplicatedJobPatch(\n",
+    "                                name=\"node\",\n",
+    "                                template=JobTemplatePatch(\n",


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Verify whether runtime fixtures and notebooks consistently use "node" as replicated job/container names. set -euo pipefail echo "== Notebooks using ReplicatedJobPatch/ContainerPatch names ==" rg -n -C2 'ReplicatedJobPatch\(|ContainerPatch\(' --type-add 'ipynb:*.ipynb' -t ipynb echo echo "== Runtime manifests likely defining container/replicated job names ==" rg -n -C3 'kind:\s*TrainingRuntime|replicatedJobs:|containers:|name:\s*node' --glob '*.y*ml' echo echo "Expected result:" echo "- Every TrainingRuntime used by these tests defines replicated job and container name 'node'." echo "- If not, patch target names must be made runtime-aware or explicitly validated in notebook code."

Repository: opendatahub-io/distributed-workloads

Length of output: 25132

🏁 Script executed:

#!/bin/bash echo "== Search for all TrainingRuntime kind: definitions ==" rg -n 'kind:\s*TrainingRuntime' --glob '*.y*ml' -A 50 | head -200 echo echo "== Find fixture setup or runtime references in torchrun_failure.ipynb ==" rg -n 'TrainingRuntime\|th_runtime\|runtime.*=' tests/trainer/resources/torchrun_failure.ipynb -B 2 -A 2 | head -100 echo echo "== Check if there's a conftest or setup that defines runtime fixtures ==" find tests -name conftest.py -o -name '*fixture*' -o -name '*setup*' | head -20

Repository: opendatahub-io/distributed-workloads

Length of output: 283

🏁 Script executed:

#!/bin/bash echo "== Extract imports and patch usage from torchrun_failure.ipynb ==" rg -n 'from kubeflow\|import.*patch\|ReplicatedJobPatch\|ContainerPatch' tests/trainer/resources/torchrun_failure.ipynb -B 1 -A 1 | head -150 echo echo "== Search for kubeflow trainer options module in codebase ==" find . -path '*/kubeflow/trainer/options*' -o -path '*/kubeflow/trainer*patch*' 2>/dev/null | head -20 echo echo "== Look for any documentation or test files explaining patch matching ==" rg -l 'ReplicatedJobPatch\|patch.*name' --type py | head -10 echo echo "== Check structure of torchrun_failure.ipynb around patch definitions ==" sed -n '340,380p' tests/trainer/resources/torchrun_failure.ipynb | head -50

Repository: opendatahub-io/distributed-workloads

Length of output: 2506

🏁 Script executed:

#!/bin/bash echo "== Find th_runtime definition and context in torchrun_failure.ipynb ==" rg -n 'th_runtime' tests/trainer/resources/torchrun_failure.ipynb -B 3 -A 3 | head -200

Repository: opendatahub-io/distributed-workloads

Length of output: 930

🏁 Script executed:

#!/bin/bash echo "== Check if there's any test validation or assertion that patch was applied ==" rg -n 'assert\|volume_mounts\|pvc\|PVC' tests/trainer/resources/torchrun_failure.ipynb -B 2 -A 2 | head -100 echo echo "== Search for kubeflow.v1 or similar imports and patch logic ==" rg -n 'from kubeflow\|import.*kubeflow\|class.*Patch' --type py | head -20

Repository: opendatahub-io/distributed-workloads

Length of output: 224

Hard-coded patch target names silently skip PVC injection if runtime structure differs

ReplicatedJobPatch(name="node") and ContainerPatch(name="node") assume runtime internals match a fixed naming convention. If a loaded TrainingRuntime uses different replicated job or container names, the patches silently fail to apply (Kubernetes strategic merge behavior), leaving the PVC mount uninjected and causing cryptic test failures. Derive target names from th_runtime metadata or validate expected names before submission.

Affected locations

Line 350: ReplicatedJobPatch(name="node")

Line 360: ContainerPatch(name="node")

Also applies to: 360-366

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/trainer/resources/torchrun_failure.ipynb` around lines 350 - 352, The test uses hard-coded patch targets ReplicatedJobPatch(name="node") and ContainerPatch(name="node") which silently no-op if the loaded TrainingRuntime uses different replica or container names; update the test to derive the replica and container names from the runtime metadata (th_runtime) or assert/validate the expected names before creating patches, then construct ReplicatedJobPatch and ContainerPatch using those derived names so PVC injection applies reliably (reference identifiers: ReplicatedJobPatch, ContainerPatch, th_runtime).

feat: Migrate RHAI notebooks to RuntimePatches API

de00eed

openshift-ci Bot requested review from astefanutti and efazal May 7, 2026 16:46

hrathina marked this pull request as draft May 7, 2026 16:46

openshift-ci Bot added the do-not-merge/work-in-progress label May 7, 2026

coderabbitai Bot reviewed May 7, 2026

View reviewed changes

hrathina added 2 commits May 7, 2026 17:58

fix: remove trailing newlines from TrainingHub notebooks

d75b423

fix: remove trailing newlines from RHAI notebooks

58f00c0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Migrate RHAI notebooks to RuntimePatches API#836

feat: Migrate RHAI notebooks to RuntimePatches API#836
hrathina wants to merge 3 commits into
opendatahub-io:mainfrom
hrathina:migrate-traininghub-runtimepatches

hrathina commented May 7, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

openshift-ci Bot commented May 7, 2026

Uh oh!

coderabbitai Bot commented May 7, 2026 •

edited

Loading

Review skipped

Walkthrough

Estimated code review effort

Security findings

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 7, 2026

Uh oh!

coderabbitai Bot May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

hrathina commented May 7, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

How Has This Been Tested?

Merge criteria:

Summary by CodeRabbit

Uh oh!

openshift-ci Bot commented May 7, 2026

Uh oh!

coderabbitai Bot commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Estimated code review effort

Security findings

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

hrathina commented May 7, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 7, 2026 •

edited

Loading