feat: Migrate RHAI notebooks to RuntimePatches API#836
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Central YAML (base), Organization UI (inherited) Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughNine Jupyter notebooks in the test trainer resources directory are updated to replace the deprecated Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Security findingsNo actionable security issues identified. The API migration does not introduce authentication bypass, privilege escalation, injection vulnerabilities, or credential exposure risks. PVC volume attachment patterns remain equivalent to the prior override-based approach. 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
tests/trainer/resources/rhai_features_fsdp_shared_state.ipynb (1)
167-168:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winFSDP mode mismatch makes this notebook functionally identical to
rhai_features_fsdp_full_state.ipynb.Line 167 has the correct
shard_grad_opmode commented out, and line 168 appliesfull_shard— the same setting used by the full-state notebook. This means the shared-state notebook tests nothing distinct. The markdown title at line 7 also still reads "Full State" rather than "Shared State".🐛 Fix: restore the intended FSDP mode
- # training_args_kwargs["fsdp"] = "shard_grad_op auto_wrap" - training_args_kwargs["fsdp"] = "full_shard auto_wrap" + training_args_kwargs["fsdp"] = "shard_grad_op auto_wrap"And fix the notebook title cell:
-"# RHAI Features Test - TrainJob Submission with FSDP (Full State)\n", +"# RHAI Features Test - TrainJob Submission with FSDP (Shared State)\n",🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/trainer/resources/rhai_features_fsdp_shared_state.ipynb` around lines 167 - 168, The notebook uses the wrong FSDP mode and title: update the cell that sets training_args_kwargs["fsdp"] to restore the intended shared-state mode by setting it to "shard_grad_op auto_wrap" (undo the current "full_shard auto_wrap"), and edit the markdown title cell (currently "Full State") to read "Shared State" so the notebook and its configuration reflect the shared-state test scenario.tests/trainer/resources/lora.ipynb (1)
568-573:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winSame stale
pod_logsbug as inosft.ipynb.Line 568 fetches a new generator, but line 571 discards it and uses the already-exhausted
pod_logsgenerator from the cell at line 524. The log output will always be empty.🐛 Fix
-logs = client.get_job_logs(name=job_name, follow=False) - -# Collect all log lines from the generator into a list -logs = list(pod_logs) +logs = list(client.get_job_logs(name=job_name, follow=False)) log_text = "\n".join(str(line) for line in logs) print(log_text)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/trainer/resources/lora.ipynb` around lines 568 - 573, The code fetches a new generator into "logs" via client.get_job_logs(name=job_name, follow=False) but then erroneously iterates the previously-exhausted "pod_logs" generator; replace uses of "pod_logs" with the newly fetched "logs" (or rename the fetched generator to pod_logs and remove the old exhausted variable) so that the list construction and join use the fresh generator returned by client.get_job_logs.tests/trainer/resources/osft.ipynb (1)
494-501:⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
pod_logsis already exhausted when this cell executes — the log print is always empty.Line 496 fetches a new generator into
logs, but line 499 immediately overwrites it withlist(pod_logs)— wherepod_logswas assigned and fully consumed (list(pod_logs)) back at lines 452 and 455 in the previous cell. The result is that this cell always prints empty output.🐛 Fix: use the generator just fetched
-logs = client.get_job_logs(name=job_name, follow=False) - -# Collect all log lines from the generator into a list -logs = list(pod_logs) +# Collect all log lines from the generator into a list +logs = list(client.get_job_logs(name=job_name, follow=False)) log_text = "\n".join(str(line) for line in logs) print(log_text)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/trainer/resources/osft.ipynb` around lines 494 - 501, The cell is printing an empty log because it ignores the newly fetched generator: it assigns logs = client.get_job_logs(name=job_name, follow=False) but then overwrites logs with list(pod_logs), where pod_logs is an already-exhausted generator; replace uses of pod_logs here with the newly returned generator (logs) so you consume the fresh generator (e.g., convert logs to a list or join its lines) and then build log_text from that list; update any references to pod_logs in this cell to use logs (or rename for clarity) and ensure you don't reuse an exhausted generator from a previous cell.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/trainer/resources/mnist.ipynb`:
- Line 437: The PVC name pvc_name (assigned from os.getenv("SHARED_PVC_NAME",
"")) is injected unguarded into the k8s manifest as claimName which causes an
opaque API error when empty; update the manifest assembly in mnist.ipynb to only
include the "persistentVolumeClaim": {"claimName": pvc_name} entry when pvc_name
is non-empty (truthy) or omit the entire volumeMount/volume entry
otherwise—locate the code that constructs the dict containing
"persistentVolumeClaim" and wrap or conditionally add that key based on pvc_name
to mirror the validation used in lora.ipynb.
In `@tests/trainer/resources/torchrun_failure.ipynb`:
- Around line 350-352: The test uses hard-coded patch targets
ReplicatedJobPatch(name="node") and ContainerPatch(name="node") which silently
no-op if the loaded TrainingRuntime uses different replica or container names;
update the test to derive the replica and container names from the runtime
metadata (th_runtime) or assert/validate the expected names before creating
patches, then construct ReplicatedJobPatch and ContainerPatch using those
derived names so PVC injection applies reliably (reference identifiers:
ReplicatedJobPatch, ContainerPatch, th_runtime).
---
Outside diff comments:
In `@tests/trainer/resources/lora.ipynb`:
- Around line 568-573: The code fetches a new generator into "logs" via
client.get_job_logs(name=job_name, follow=False) but then erroneously iterates
the previously-exhausted "pod_logs" generator; replace uses of "pod_logs" with
the newly fetched "logs" (or rename the fetched generator to pod_logs and remove
the old exhausted variable) so that the list construction and join use the fresh
generator returned by client.get_job_logs.
In `@tests/trainer/resources/osft.ipynb`:
- Around line 494-501: The cell is printing an empty log because it ignores the
newly fetched generator: it assigns logs = client.get_job_logs(name=job_name,
follow=False) but then overwrites logs with list(pod_logs), where pod_logs is an
already-exhausted generator; replace uses of pod_logs here with the newly
returned generator (logs) so you consume the fresh generator (e.g., convert logs
to a list or join its lines) and then build log_text from that list; update any
references to pod_logs in this cell to use logs (or rename for clarity) and
ensure you don't reuse an exhausted generator from a previous cell.
In `@tests/trainer/resources/rhai_features_fsdp_shared_state.ipynb`:
- Around line 167-168: The notebook uses the wrong FSDP mode and title: update
the cell that sets training_args_kwargs["fsdp"] to restore the intended
shared-state mode by setting it to "shard_grad_op auto_wrap" (undo the current
"full_shard auto_wrap"), and edit the markdown title cell (currently "Full
State") to read "Shared State" so the notebook and its configuration reflect the
shared-state test scenario.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 59113999-7d72-4863-847b-835add079f0a
📒 Files selected for processing (9)
tests/trainer/resources/lora.ipynbtests/trainer/resources/mnist.ipynbtests/trainer/resources/osft.ipynbtests/trainer/resources/rhai_features.ipynbtests/trainer/resources/rhai_features_deepspeed_stage0.ipynbtests/trainer/resources/rhai_features_fsdp_full_state.ipynbtests/trainer/resources/rhai_features_fsdp_shared_state.ipynbtests/trainer/resources/sft.ipynbtests/trainer/resources/torchrun_failure.ipynb
| " volumes=[\n", | ||
| " {\n", | ||
| " \"name\": \"work\",\n", | ||
| " \"persistentVolumeClaim\": {\"claimName\": pvc_name},\n", |
There was a problem hiding this comment.
pvc_name is used without an emptiness guard; an empty value produces an opaque Kubernetes API error.
pvc_name = os.getenv("SHARED_PVC_NAME", "") (line 400) defaults to "". The changed code at line 437 injects it directly as claimName. Other notebooks in the same repo (e.g., lora.ipynb) validate the PVC name before use.
pvc_name = os.getenv("SHARED_PVC_NAME", "")
print(f"[notebook] Using PVC: {pvc_name}")
+if not pvc_name:
+ raise RuntimeError("SHARED_PVC_NAME environment variable is required")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/trainer/resources/mnist.ipynb` at line 437, The PVC name pvc_name
(assigned from os.getenv("SHARED_PVC_NAME", "")) is injected unguarded into the
k8s manifest as claimName which causes an opaque API error when empty; update
the manifest assembly in mnist.ipynb to only include the
"persistentVolumeClaim": {"claimName": pvc_name} entry when pvc_name is
non-empty (truthy) or omit the entire volumeMount/volume entry otherwise—locate
the code that constructs the dict containing "persistentVolumeClaim" and wrap or
conditionally add that key based on pvc_name to mirror the validation used in
lora.ipynb.
| " ReplicatedJobPatch(\n", | ||
| " name=\"node\",\n", | ||
| " template=JobTemplatePatch(\n", |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify whether runtime fixtures and notebooks consistently use "node" as replicated job/container names.
set -euo pipefail
echo "== Notebooks using ReplicatedJobPatch/ContainerPatch names =="
rg -n -C2 'ReplicatedJobPatch\(|ContainerPatch\(' --type-add 'ipynb:*.ipynb' -t ipynb
echo
echo "== Runtime manifests likely defining container/replicated job names =="
rg -n -C3 'kind:\s*TrainingRuntime|replicatedJobs:|containers:|name:\s*node' --glob '*.y*ml'
echo
echo "Expected result:"
echo "- Every TrainingRuntime used by these tests defines replicated job and container name 'node'."
echo "- If not, patch target names must be made runtime-aware or explicitly validated in notebook code."Repository: opendatahub-io/distributed-workloads
Length of output: 25132
🏁 Script executed:
#!/bin/bash
echo "== Search for all TrainingRuntime kind: definitions =="
rg -n 'kind:\s*TrainingRuntime' --glob '*.y*ml' -A 50 | head -200
echo
echo "== Find fixture setup or runtime references in torchrun_failure.ipynb =="
rg -n 'TrainingRuntime\|th_runtime\|runtime.*=' tests/trainer/resources/torchrun_failure.ipynb -B 2 -A 2 | head -100
echo
echo "== Check if there's a conftest or setup that defines runtime fixtures =="
find tests -name conftest.py -o -name '*fixture*' -o -name '*setup*' | head -20Repository: opendatahub-io/distributed-workloads
Length of output: 283
🏁 Script executed:
#!/bin/bash
echo "== Extract imports and patch usage from torchrun_failure.ipynb =="
rg -n 'from kubeflow\|import.*patch\|ReplicatedJobPatch\|ContainerPatch' tests/trainer/resources/torchrun_failure.ipynb -B 1 -A 1 | head -150
echo
echo "== Search for kubeflow trainer options module in codebase =="
find . -path '*/kubeflow/trainer/options*' -o -path '*/kubeflow/trainer*patch*' 2>/dev/null | head -20
echo
echo "== Look for any documentation or test files explaining patch matching =="
rg -l 'ReplicatedJobPatch\|patch.*name' --type py | head -10
echo
echo "== Check structure of torchrun_failure.ipynb around patch definitions =="
sed -n '340,380p' tests/trainer/resources/torchrun_failure.ipynb | head -50Repository: opendatahub-io/distributed-workloads
Length of output: 2506
🏁 Script executed:
#!/bin/bash
echo "== Find th_runtime definition and context in torchrun_failure.ipynb =="
rg -n 'th_runtime' tests/trainer/resources/torchrun_failure.ipynb -B 3 -A 3 | head -200Repository: opendatahub-io/distributed-workloads
Length of output: 930
🏁 Script executed:
#!/bin/bash
echo "== Check if there's any test validation or assertion that patch was applied =="
rg -n 'assert\|volume_mounts\|pvc\|PVC' tests/trainer/resources/torchrun_failure.ipynb -B 2 -A 2 | head -100
echo
echo "== Search for kubeflow.v1 or similar imports and patch logic =="
rg -n 'from kubeflow\|import.*kubeflow\|class.*Patch' --type py | head -20Repository: opendatahub-io/distributed-workloads
Length of output: 224
Hard-coded patch target names silently skip PVC injection if runtime structure differs
ReplicatedJobPatch(name="node") and ContainerPatch(name="node") assume runtime internals match a fixed naming convention. If a loaded TrainingRuntime uses different replicated job or container names, the patches silently fail to apply (Kubernetes strategic merge behavior), leaving the PVC mount uninjected and causing cryptic test failures. Derive target names from th_runtime metadata or validate expected names before submission.
Affected locations
- Line 350:
ReplicatedJobPatch(name="node") - Line 360:
ContainerPatch(name="node")
Also applies to: 360-366
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/trainer/resources/torchrun_failure.ipynb` around lines 350 - 352, The
test uses hard-coded patch targets ReplicatedJobPatch(name="node") and
ContainerPatch(name="node") which silently no-op if the loaded TrainingRuntime
uses different replica or container names; update the test to derive the replica
and container names from the runtime metadata (th_runtime) or assert/validate
the expected names before creating patches, then construct ReplicatedJobPatch
and ContainerPatch using those derived names so PVC injection applies reliably
(reference identifiers: ReplicatedJobPatch, ContainerPatch, th_runtime).
Description
How Has This Been Tested?
Merge criteria:
Summary by CodeRabbit