Skip to content

feat: Migrate RHAI notebooks to RuntimePatches API#836

Draft
hrathina wants to merge 3 commits into
opendatahub-io:mainfrom
hrathina:migrate-traininghub-runtimepatches
Draft

feat: Migrate RHAI notebooks to RuntimePatches API#836
hrathina wants to merge 3 commits into
opendatahub-io:mainfrom
hrathina:migrate-traininghub-runtimepatches

Conversation

@hrathina

@hrathina hrathina commented May 7, 2026

Copy link
Copy Markdown
Contributor

Description

How Has This Been Tested?

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Summary by CodeRabbit

  • Refactor
    • Updated training example notebooks to use a new patch-based configuration API for Kubeflow Trainer jobs. All training functionality remains unchanged.

@openshift-ci openshift-ci Bot requested review from astefanutti and efazal May 7, 2026 16:46
@openshift-ci

openshift-ci Bot commented May 7, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign robert-bell for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hrathina hrathina marked this pull request as draft May 7, 2026 16:46
@coderabbitai

coderabbitai Bot commented May 7, 2026

Copy link
Copy Markdown

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: fd3f2ef9-f0da-49d9-a20e-99219e44a218

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Nine Jupyter notebooks in the test trainer resources directory are updated to replace the deprecated PodTemplateOverrides API imports and configuration with a new nested patch-based API (RuntimePatch, TrainingRuntimeSpecPatch, JobSetTemplatePatch, JobSetSpecPatch, ReplicatedJobPatch, JobTemplatePatch, JobSpecPatch, PodTemplatePatch, PodSpecPatch, ContainerPatch). The refactoring preserves functional behavior—PVC mounting and pod spec configuration—while restructuring the job options passed to client.train() to use the new patch object hierarchy instead of override objects.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Security findings

No actionable security issues identified. The API migration does not introduce authentication bypass, privilege escalation, injection vulnerabilities, or credential exposure risks. PVC volume attachment patterns remain equivalent to the prior override-based approach.

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately describes the primary change: migrating notebooks from override-based Kubernetes options to the RuntimePatches API, which is the core modification across all 9 notebooks.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tests/trainer/resources/rhai_features_fsdp_shared_state.ipynb (1)

167-168: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

FSDP mode mismatch makes this notebook functionally identical to rhai_features_fsdp_full_state.ipynb.

Line 167 has the correct shard_grad_op mode commented out, and line 168 applies full_shard — the same setting used by the full-state notebook. This means the shared-state notebook tests nothing distinct. The markdown title at line 7 also still reads "Full State" rather than "Shared State".

🐛 Fix: restore the intended FSDP mode
-    # training_args_kwargs["fsdp"] = "shard_grad_op auto_wrap"
-    training_args_kwargs["fsdp"] = "full_shard auto_wrap"
+    training_args_kwargs["fsdp"] = "shard_grad_op auto_wrap"

And fix the notebook title cell:

-"# RHAI Features Test - TrainJob Submission with FSDP (Full State)\n",
+"# RHAI Features Test - TrainJob Submission with FSDP (Shared State)\n",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/trainer/resources/rhai_features_fsdp_shared_state.ipynb` around lines
167 - 168, The notebook uses the wrong FSDP mode and title: update the cell that
sets training_args_kwargs["fsdp"] to restore the intended shared-state mode by
setting it to "shard_grad_op auto_wrap" (undo the current "full_shard
auto_wrap"), and edit the markdown title cell (currently "Full State") to read
"Shared State" so the notebook and its configuration reflect the shared-state
test scenario.
tests/trainer/resources/lora.ipynb (1)

568-573: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Same stale pod_logs bug as in osft.ipynb.

Line 568 fetches a new generator, but line 571 discards it and uses the already-exhausted pod_logs generator from the cell at line 524. The log output will always be empty.

🐛 Fix
-logs = client.get_job_logs(name=job_name, follow=False)
-
-# Collect all log lines from the generator into a list
-logs = list(pod_logs)
+logs = list(client.get_job_logs(name=job_name, follow=False))
 log_text = "\n".join(str(line) for line in logs)
 print(log_text)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/trainer/resources/lora.ipynb` around lines 568 - 573, The code fetches
a new generator into "logs" via client.get_job_logs(name=job_name, follow=False)
but then erroneously iterates the previously-exhausted "pod_logs" generator;
replace uses of "pod_logs" with the newly fetched "logs" (or rename the fetched
generator to pod_logs and remove the old exhausted variable) so that the list
construction and join use the fresh generator returned by client.get_job_logs.
tests/trainer/resources/osft.ipynb (1)

494-501: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

pod_logs is already exhausted when this cell executes — the log print is always empty.

Line 496 fetches a new generator into logs, but line 499 immediately overwrites it with list(pod_logs) — where pod_logs was assigned and fully consumed (list(pod_logs)) back at lines 452 and 455 in the previous cell. The result is that this cell always prints empty output.

🐛 Fix: use the generator just fetched
-logs = client.get_job_logs(name=job_name, follow=False)
-
-# Collect all log lines from the generator into a list
-logs = list(pod_logs)
+# Collect all log lines from the generator into a list
+logs = list(client.get_job_logs(name=job_name, follow=False))
 log_text = "\n".join(str(line) for line in logs)
 print(log_text)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/trainer/resources/osft.ipynb` around lines 494 - 501, The cell is
printing an empty log because it ignores the newly fetched generator: it assigns
logs = client.get_job_logs(name=job_name, follow=False) but then overwrites logs
with list(pod_logs), where pod_logs is an already-exhausted generator; replace
uses of pod_logs here with the newly returned generator (logs) so you consume
the fresh generator (e.g., convert logs to a list or join its lines) and then
build log_text from that list; update any references to pod_logs in this cell to
use logs (or rename for clarity) and ensure you don't reuse an exhausted
generator from a previous cell.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/trainer/resources/mnist.ipynb`:
- Line 437: The PVC name pvc_name (assigned from os.getenv("SHARED_PVC_NAME",
"")) is injected unguarded into the k8s manifest as claimName which causes an
opaque API error when empty; update the manifest assembly in mnist.ipynb to only
include the "persistentVolumeClaim": {"claimName": pvc_name} entry when pvc_name
is non-empty (truthy) or omit the entire volumeMount/volume entry
otherwise—locate the code that constructs the dict containing
"persistentVolumeClaim" and wrap or conditionally add that key based on pvc_name
to mirror the validation used in lora.ipynb.

In `@tests/trainer/resources/torchrun_failure.ipynb`:
- Around line 350-352: The test uses hard-coded patch targets
ReplicatedJobPatch(name="node") and ContainerPatch(name="node") which silently
no-op if the loaded TrainingRuntime uses different replica or container names;
update the test to derive the replica and container names from the runtime
metadata (th_runtime) or assert/validate the expected names before creating
patches, then construct ReplicatedJobPatch and ContainerPatch using those
derived names so PVC injection applies reliably (reference identifiers:
ReplicatedJobPatch, ContainerPatch, th_runtime).

---

Outside diff comments:
In `@tests/trainer/resources/lora.ipynb`:
- Around line 568-573: The code fetches a new generator into "logs" via
client.get_job_logs(name=job_name, follow=False) but then erroneously iterates
the previously-exhausted "pod_logs" generator; replace uses of "pod_logs" with
the newly fetched "logs" (or rename the fetched generator to pod_logs and remove
the old exhausted variable) so that the list construction and join use the fresh
generator returned by client.get_job_logs.

In `@tests/trainer/resources/osft.ipynb`:
- Around line 494-501: The cell is printing an empty log because it ignores the
newly fetched generator: it assigns logs = client.get_job_logs(name=job_name,
follow=False) but then overwrites logs with list(pod_logs), where pod_logs is an
already-exhausted generator; replace uses of pod_logs here with the newly
returned generator (logs) so you consume the fresh generator (e.g., convert logs
to a list or join its lines) and then build log_text from that list; update any
references to pod_logs in this cell to use logs (or rename for clarity) and
ensure you don't reuse an exhausted generator from a previous cell.

In `@tests/trainer/resources/rhai_features_fsdp_shared_state.ipynb`:
- Around line 167-168: The notebook uses the wrong FSDP mode and title: update
the cell that sets training_args_kwargs["fsdp"] to restore the intended
shared-state mode by setting it to "shard_grad_op auto_wrap" (undo the current
"full_shard auto_wrap"), and edit the markdown title cell (currently "Full
State") to read "Shared State" so the notebook and its configuration reflect the
shared-state test scenario.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 59113999-7d72-4863-847b-835add079f0a

📥 Commits

Reviewing files that changed from the base of the PR and between 0b30c29 and de00eed.

📒 Files selected for processing (9)
  • tests/trainer/resources/lora.ipynb
  • tests/trainer/resources/mnist.ipynb
  • tests/trainer/resources/osft.ipynb
  • tests/trainer/resources/rhai_features.ipynb
  • tests/trainer/resources/rhai_features_deepspeed_stage0.ipynb
  • tests/trainer/resources/rhai_features_fsdp_full_state.ipynb
  • tests/trainer/resources/rhai_features_fsdp_shared_state.ipynb
  • tests/trainer/resources/sft.ipynb
  • tests/trainer/resources/torchrun_failure.ipynb

" volumes=[\n",
" {\n",
" \"name\": \"work\",\n",
" \"persistentVolumeClaim\": {\"claimName\": pvc_name},\n",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

pvc_name is used without an emptiness guard; an empty value produces an opaque Kubernetes API error.

pvc_name = os.getenv("SHARED_PVC_NAME", "") (line 400) defaults to "". The changed code at line 437 injects it directly as claimName. Other notebooks in the same repo (e.g., lora.ipynb) validate the PVC name before use.

 pvc_name = os.getenv("SHARED_PVC_NAME", "")
 print(f"[notebook] Using PVC: {pvc_name}")
+if not pvc_name:
+    raise RuntimeError("SHARED_PVC_NAME environment variable is required")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/trainer/resources/mnist.ipynb` at line 437, The PVC name pvc_name
(assigned from os.getenv("SHARED_PVC_NAME", "")) is injected unguarded into the
k8s manifest as claimName which causes an opaque API error when empty; update
the manifest assembly in mnist.ipynb to only include the
"persistentVolumeClaim": {"claimName": pvc_name} entry when pvc_name is
non-empty (truthy) or omit the entire volumeMount/volume entry otherwise—locate
the code that constructs the dict containing "persistentVolumeClaim" and wrap or
conditionally add that key based on pvc_name to mirror the validation used in
lora.ipynb.

Comment on lines +350 to +352
" ReplicatedJobPatch(\n",
" name=\"node\",\n",
" template=JobTemplatePatch(\n",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether runtime fixtures and notebooks consistently use "node" as replicated job/container names.

set -euo pipefail

echo "== Notebooks using ReplicatedJobPatch/ContainerPatch names =="
rg -n -C2 'ReplicatedJobPatch\(|ContainerPatch\(' --type-add 'ipynb:*.ipynb' -t ipynb

echo
echo "== Runtime manifests likely defining container/replicated job names =="
rg -n -C3 'kind:\s*TrainingRuntime|replicatedJobs:|containers:|name:\s*node' --glob '*.y*ml'

echo
echo "Expected result:"
echo "- Every TrainingRuntime used by these tests defines replicated job and container name 'node'."
echo "- If not, patch target names must be made runtime-aware or explicitly validated in notebook code."

Repository: opendatahub-io/distributed-workloads

Length of output: 25132


🏁 Script executed:

#!/bin/bash

echo "== Search for all TrainingRuntime kind: definitions =="
rg -n 'kind:\s*TrainingRuntime' --glob '*.y*ml' -A 50 | head -200

echo
echo "== Find fixture setup or runtime references in torchrun_failure.ipynb =="
rg -n 'TrainingRuntime\|th_runtime\|runtime.*=' tests/trainer/resources/torchrun_failure.ipynb -B 2 -A 2 | head -100

echo
echo "== Check if there's a conftest or setup that defines runtime fixtures =="
find tests -name conftest.py -o -name '*fixture*' -o -name '*setup*' | head -20

Repository: opendatahub-io/distributed-workloads

Length of output: 283


🏁 Script executed:

#!/bin/bash

echo "== Extract imports and patch usage from torchrun_failure.ipynb =="
rg -n 'from kubeflow\|import.*patch\|ReplicatedJobPatch\|ContainerPatch' tests/trainer/resources/torchrun_failure.ipynb -B 1 -A 1 | head -150

echo
echo "== Search for kubeflow trainer options module in codebase =="
find . -path '*/kubeflow/trainer/options*' -o -path '*/kubeflow/trainer*patch*' 2>/dev/null | head -20

echo
echo "== Look for any documentation or test files explaining patch matching =="
rg -l 'ReplicatedJobPatch\|patch.*name' --type py | head -10

echo
echo "== Check structure of torchrun_failure.ipynb around patch definitions =="
sed -n '340,380p' tests/trainer/resources/torchrun_failure.ipynb | head -50

Repository: opendatahub-io/distributed-workloads

Length of output: 2506


🏁 Script executed:

#!/bin/bash

echo "== Find th_runtime definition and context in torchrun_failure.ipynb =="
rg -n 'th_runtime' tests/trainer/resources/torchrun_failure.ipynb -B 3 -A 3 | head -200

Repository: opendatahub-io/distributed-workloads

Length of output: 930


🏁 Script executed:

#!/bin/bash

echo "== Check if there's any test validation or assertion that patch was applied =="
rg -n 'assert\|volume_mounts\|pvc\|PVC' tests/trainer/resources/torchrun_failure.ipynb -B 2 -A 2 | head -100

echo
echo "== Search for kubeflow.v1 or similar imports and patch logic =="
rg -n 'from kubeflow\|import.*kubeflow\|class.*Patch' --type py | head -20

Repository: opendatahub-io/distributed-workloads

Length of output: 224


Hard-coded patch target names silently skip PVC injection if runtime structure differs

ReplicatedJobPatch(name="node") and ContainerPatch(name="node") assume runtime internals match a fixed naming convention. If a loaded TrainingRuntime uses different replicated job or container names, the patches silently fail to apply (Kubernetes strategic merge behavior), leaving the PVC mount uninjected and causing cryptic test failures. Derive target names from th_runtime metadata or validate expected names before submission.

Affected locations
  • Line 350: ReplicatedJobPatch(name="node")
  • Line 360: ContainerPatch(name="node")

Also applies to: 360-366

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/trainer/resources/torchrun_failure.ipynb` around lines 350 - 352, The
test uses hard-coded patch targets ReplicatedJobPatch(name="node") and
ContainerPatch(name="node") which silently no-op if the loaded TrainingRuntime
uses different replica or container names; update the test to derive the replica
and container names from the runtime metadata (th_runtime) or assert/validate
the expected names before creating patches, then construct ReplicatedJobPatch
and ContainerPatch using those derived names so PVC injection applies reliably
(reference identifiers: ReplicatedJobPatch, ContainerPatch, th_runtime).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant