Skip to content

Add training failure scenario E2E tests for SFT, OSFT, and LORA#658

Merged
openshift-merge-bot[bot] merged 3 commits into
opendatahub-io:mainfrom
Fiona-Waters:e2e-updates
Mar 10, 2026
Merged

Add training failure scenario E2E tests for SFT, OSFT, and LORA#658
openshift-merge-bot[bot] merged 3 commits into
opendatahub-io:mainfrom
Fiona-Waters:e2e-updates

Conversation

@Fiona-Waters

@Fiona-Waters Fiona-Waters commented Mar 5, 2026

Copy link
Copy Markdown
Contributor

RHOAIENG-48034

Description

This PR adds two new test notebooks for validating that training failures are properly propagated through Training Hub via Kubeflow SDK:

  • failure_scenarios.ipynb — CPU-only validation failure tests across all three algorithms (SFT, OSFT, LORA)
  • torchrun_failure.ipynb — GPU-based runtime failure test that triggers an OOM crash during the torch forward pass

NOTE: The tests were added this way rather than adding individual scenarios to the existing tests, as these new tests run on CPU so it makes more sense to group the failure scenarios together.

Failure Scenarios Notebook (CPU-only, Tier1)

  1. SFT — Invalid dataset path: data_path: "/nonexistent/dataset.jsonl" → expects FileNotFoundError
  2. SFT — Invalid model path: model_path: "/nonexistent/model" → expects FileNotFoundError
  3. SFT — Invalid hyperparameters: max_batch_len: "not_a_number" → expects ValidationError
  4. OSFT — Invalid unfreeze rank ratio: unfreeze_rank_ratio: "not_a_number" → expects TypeError
  5. LORA — Invalid model path: model_path: "/nonexistent/model" → expects RuntimeError

The helper function run_failure_scenario() accepts an algorithm parameter (defaults to SFT for backward compatibility) and handles job submission, log polling, crash detection, and cleanup.

Torchrun Failure Notebook (GPU, KftoCuda)

A single-scenario GPU test that:

  1. Downloads a real model (Qwen2.5-1.5B-Instruct) and dataset (Table-GPT) from S3 or HuggingFace
  2. Submits an SFT training job with an extreme max_batch_len (10M) to trigger an OOM crash during the torch forward pass
  3. Verifies the OOM/RuntimeError appears in logs and the container crashes

This tests a fundamentally different failure mode than the validation scenarios — the model loads successfully, torchrun starts the training loop, and the crash happens during GPU memory allocation for the forward pass.

Verification Steps

The default training-hub ClusterTrainingRuntime uses an older transformers version that may not be compatible with the latest SDK. A training-hub-pr-test runtime has been created on the team IBM cluster with the updated transformers version. To run the failure scenarios test against it:

Step 1 — Temporarily update the runtime in the test. In tests/trainer/sdk_tests/failure_traininghub_tests.go, change:
trainerutils.DefaultTrainingHubRuntime, to:
training-hub-pr-test

Step 2 — Temporarily switch to using your own token. In the same file, change:
userToken := common.GenerateNotebookUserToken(test)
to:
userToken := common.GetNotebookUserToken(test)
This reads the token from the NOTEBOOK_USER_TOKEN env var instead of generating one via oc login.

Step 3 — Run the failure scenarios test (CPU-only):
NOTEBOOK_USER_NAME=
NOTEBOOK_USER_TOKEN=$(oc whoami -t)
NOTEBOOK_IMAGE=
OPENSHIFT_API_URL=$(oc whoami --show-server)
TEST_TIER=Tier1
go test -v -run TestTrainingFailureScenarios ./tests/trainer/ -timeout 60m

Step 4 — Run the torchrun failure test (requires GPU):
NOTEBOOK_USER_NAME=
NOTEBOOK_USER_TOKEN=$(oc whoami -t)
NOTEBOOK_IMAGE=
OPENSHIFT_API_URL=$(oc whoami --show-server)
TEST_TIER=KftoCuda
go test -v -run TestTorchrunTrainingFailure ./tests/trainer/ -timeout 60m

How Has This Been Tested?

These have been tested by running manually on the team IBM cluster.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Summary by CodeRabbit

  • Tests
    • Added end-to-end failure scenario workflows for training jobs covering invalid dataset/model paths, bad hyperparameters, and algorithm config errors.
    • Added GPU-specific torchrun workflows to induce OOM/crash conditions and validate GPU runtime behavior.
    • Added orchestrated test runs that create notebook-backed executions, submit jobs, monitor pod status and logs for expected failures, and perform cleanup.
    • Added new test entry points to execute and aggregate these scenarios.

@openshift-ci openshift-ci Bot requested review from robert-bell and szaher March 5, 2026 16:32
@coderabbitai

coderabbitai Bot commented Mar 5, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds end-to-end failure scenario tests for Kubeflow training jobs. It introduces two new Go test entry points that invoke SDK test helpers, a Go test harness that provisions Kubernetes resources (Namespace, ServiceAccount, RBAC, ConfigMap, RWX PVC, Notebook CRs), and orchestration to run papermill notebooks inside Notebooks and monitor pod logs. Two Jupyter notebooks are added that implement failure-injection workflows: a multi-scenario notebook (invalid dataset/model paths, malformed hyperparameters, OSFT/LORA misconfigurations) and a torchrun OOM-inducing training notebook that forces a crash on first forward pass. No existing public API signatures were removed or changed.

Security Findings

  • CWE-295 (Improper Certificate Validation): Notebooks disable SSL verification for the Kubernetes/OpenShift API client (verify=False). Actionable: load and use the cluster CA bundle or mount trusted CA certs into the notebook pod; do not disable TLS verification even in tests.
  • CWE-798 / CWE-522 (Hard-Coded Credentials and Exposure via Process/Logs): Tests pass NOTEBOOK_USER_TOKEN, AWS/S3 credentials, and similar secrets via environment variables and command-line strings (papermill shell commands), exposing secrets in process listings, pod logs, and artifacts. Actionable: store secrets in Kubernetes Secrets and mount as env vars or files with restrictive permissions; avoid embedding secrets in visible command strings; redact sensitive values from logs and artifacts.
  • CWE-209 (Information Exposure Through Error Messages): Notebooks and shell steps print environment values and repository operations that may contain sensitive data. Actionable: sanitize and redact sensitive env vars before printing; restrict notebook artifact retention and access.
  • CWE-862 / CWE-284 (Excessive Privilege & Missing Least-Privilege RBAC): Test setup creates ServiceAccounts and RBAC bindings without explicit least-privilege constraints. Actionable: scope Roles to minimal verbs/resources and bind only in the test namespace; document required permissions and avoid cluster-wide grants.
  • CWE-732 (Incorrect Permission Assignment for Critical Resource): Creation of RWX PVCs and mounting into Notebook pods can expose shared storage with sensitive data. Actionable: ensure PVC accessModes and storage class enforce isolation; provision ephemeral volumes for sensitive runs or enforce filesystem-level permissions.
  • CWE-611 (Improper Restriction of XML External Entity) — informational: notebooks execute external downloads (S3/HuggingFace). Actionable: validate and checksum remote artifacts, prefer signed/reproducible releases, and avoid executing untrusted code fetched at runtime.
  • Logging and secret leakage in artifacts: papermill-produced notebooks and downloaded logs may retain secrets. Actionable: configure papermill to suppress or redact outputs containing secrets; delete artifacts or store them in access-restricted locations.

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Add training failure scenario E2E tests for SFT, OSFT, and LORA' directly and clearly summarizes the main change: adding three new end-to-end test notebooks for failure scenarios across these algorithm types.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@Fiona-Waters Fiona-Waters requested review from MStokluska and briangallagher and removed request for robert-bell and szaher March 5, 2026 16:32

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/trainer/resources/failure_scenarios.ipynb`:
- Line 115: The notebook currently prints a summary but does not fail the
process when any scenario (result_1, result_2, result_3, result_4, result_5
stored in the results dict) is False; modify the end of the cell that computes
all_passed to terminate with a non-zero exit (e.g., raise a RuntimeError or call
sys.exit(1)) when all_passed is False, keeping the existing NOTEBOOK_STATUS
prints; ensure the change is placed after the all_passed = all(results.values())
line and references the results/all_passed logic so papermill or CI sees a
failure.
- Line 19: The notebook currently disables TLS verification
(configuration.verify_ssl = False and urllib3.disable_warnings) which allows
MITM attacks; update the Kubernetes client setup in this file to enable TLS
verification by removing the verify_ssl=False assignment and the
disable_warnings call, and instead load proper cluster credentials (e.g., use
k8s.config.load_kube_config() or k8s.config.load_incluster_config()) and set
configuration.ssl_ca_cert to the cluster CA certificate path or an
environment-provided CA bundle before creating k8s.ApiClient; ensure the bearer
token is still provided via configuration.api_key but only after TLS is enforced
so ApiClient and any functions that use api_client perform certificate
validation.

In `@tests/trainer/resources/torchrun_failure.ipynb`:
- Line 314: The OOM detection in OOM_ERROR_PATTERNS is too broad (e.g.,
"RuntimeError", "NCCL") and can false-pass; tighten the matcher in the
monitoring loop (the block that builds log_text and iterates OOM_ERROR_PATTERNS)
by removing generic tokens and using more specific, case-insensitive OOM
indicators (e.g., "out of memory", "CUDA out of memory",
"torch.cuda.OutOfMemoryError", "torch.OutOfMemoryError" or a regex with word
boundaries) so that matched_pattern only sets when a true OOM signature is
found; update the OOM_ERROR_PATTERNS constant and the matching logic in the
while loop that checks client.get_job_logs (and keep get_training_pod_status
logic unchanged) to avoid matching generic "RuntimeError" or "NCCL" strings
alone.
- Around line 129-136: The S3 client in the notebook is created with
verify=False which disables TLS certificate validation; update the boto3.client
call that constructs s3 (the variable named "s3" via boto3.client(...,
endpoint_url=endpoint_url, ... , verify=False)) to enable certificate validation
by removing verify=False or setting verify=True and ensure the runtime uses a
trusted CA bundle (e.g., honor AWS_CA_BUNDLE or pass a path to the CA bundle via
verify='/path/to/ca-bundle.pem'); keep endpoint_url and credentials as-is but do
not disable TLS verification.
- Line 21: The notebook currently disables TLS verification unconditionally via
urllib3.disable_warnings(...) and configuration.verify_ssl = False when creating
k8s.Configuration()/k8s.ApiClient(), and similarly uses verify=False for the
boto3 client; change this to respect an explicit environment variable (e.g.,
SKIP_TLS_VERIFY or DISABLE_TLS_VERIFY) instead of hardcoding disablement: read
the env var, and only call urllib3.disable_warnings(...) and set
configuration.verify_ssl = False (and pass verify=False to boto3) when that env
var is truthy; keep default behavior as secure (verification enabled) and
document the env var usage in the notebook near the configuration and ApiClient
creation.

In `@tests/trainer/sdk_tests/failure_traininghub_tests.go`:
- Around line 75-87: The constructed shellCmd string built via fmt.Sprintf
(using support.GetOpenShiftApiUrl, userToken, namespace.Name, rwxPvc.Name,
trainerutils.DefaultTrainingHubRuntime, failureNotebookName) is vulnerable to
command-injection; replace the single interpolated shell string with a safe
exec.Command call (or equivalent) passing each environment variable and command
argument separately, or explicitly validate/escape each interpolated value
before use; update both the shellCmd construction here and the similar block at
156-173 to build the command/args without shell interpolation (or
whitelist/escape inputs) so no untrusted content can inject shell
metacharacters.
- Around line 82-84: The wrapper currently echoes "NOTEBOOK_STATUS: SUCCESS"
unconditionally on the shell wrapper’s success path which can mask real notebook
failures; update the command string that contains "then echo 'NOTEBOOK_STATUS:
SUCCESS'; else echo 'NOTEBOOK_STATUS: FAILURE';" so it does not emit a
wrapper-level SUCCESS marker (emit only the FAILURE on the else path and remove
the then-success echo), and apply the same change to the other identical
occurrence so utils_notebook.go can reliably detect the notebook-produced
success/failure markers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: f43884e2-7ef2-4548-bf63-dafc0948d62a

📥 Commits

Reviewing files that changed from the base of the PR and between 45ec38c and 9097db4.

📒 Files selected for processing (4)
  • tests/trainer/kubeflow_sdk_test.go
  • tests/trainer/resources/failure_scenarios.ipynb
  • tests/trainer/resources/torchrun_failure.ipynb
  • tests/trainer/sdk_tests/failure_traininghub_tests.go

Comment thread tests/trainer/resources/failure_scenarios.ipynb Outdated
Comment thread tests/trainer/resources/failure_scenarios.ipynb Outdated
Comment thread tests/trainer/resources/torchrun_failure.ipynb Outdated
Comment thread tests/trainer/resources/torchrun_failure.ipynb Outdated
Comment thread tests/trainer/resources/torchrun_failure.ipynb Outdated
Comment thread tests/trainer/sdk_tests/failure_traininghub_tests.go
Comment thread tests/trainer/sdk_tests/failure_traininghub_tests.go Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/trainer/resources/torchrun_failure.ipynb (1)

64-64: Global socket timeout side effect.

socket.setdefaulttimeout(10) affects all sockets globally for this process, including the Kubernetes client. Consider using boto3's connect_timeout/read_timeout (already configured in boto_cfg) without the global override.

Suggested fix
-socket.setdefaulttimeout(10)

The boto3 client already has explicit timeouts via BotoConfig(connect_timeout=5, read_timeout=10).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/trainer/resources/torchrun_failure.ipynb` at line 64, Remove the global
call socket.setdefaulttimeout(10) because it sets a process-wide socket timeout
that affects the Kubernetes client; instead rely on the existing boto
configuration (boto_cfg / BotoConfig(connect_timeout=5, read_timeout=10)) which
already sets per-client timeouts—delete the socket.setdefaulttimeout(10) line
and ensure boto_cfg is passed where the boto3 client is created.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/trainer/resources/torchrun_failure.ipynb`:
- Line 64: Remove the global call socket.setdefaulttimeout(10) because it sets a
process-wide socket timeout that affects the Kubernetes client; instead rely on
the existing boto configuration (boto_cfg / BotoConfig(connect_timeout=5,
read_timeout=10)) which already sets per-client timeouts—delete the
socket.setdefaulttimeout(10) line and ensure boto_cfg is passed where the boto3
client is created.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 523552f9-6313-416f-a2ba-61904fafd268

📥 Commits

Reviewing files that changed from the base of the PR and between 9097db4 and 4fa6293.

📒 Files selected for processing (4)
  • tests/trainer/kubeflow_sdk_test.go
  • tests/trainer/resources/failure_scenarios.ipynb
  • tests/trainer/resources/torchrun_failure.ipynb
  • tests/trainer/sdk_tests/failure_traininghub_tests.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/trainer/sdk_tests/failure_traininghub_tests.go

@sutaakar

sutaakar commented Mar 6, 2026

Copy link
Copy Markdown
Contributor

@Fiona-Waters What failure propagation is tested in failure_scenarios.ipynb?
I see the only SDK part checked for failure: client.wait_for_job_status(name=job_name, status={"Running", "Failed"}, timeout=300), though it checks that training starts only.

The rest is checking Pods, which is IMHO unrelated to SDK.

@Fiona-Waters

Fiona-Waters commented Mar 6, 2026

Copy link
Copy Markdown
Contributor Author

@Fiona-Waters What failure propagation is tested in failure_scenarios.ipynb? I see the only SDK part checked for failure: client.wait_for_job_status(name=job_name, status={"Running", "Failed"}, timeout=300), though it checks that training starts only.

The rest is checking Pods, which is IMHO unrelated to SDK.

We want to check that specific training failures are being reported correctly, and that the success or failure of the job/test is based on that and not only on the completion of the training pod. client.get_job_logs is checking that the correct error is reported. We want to test that the underlying error is being caught.
I am using the kubernetes API to check restart counts, as this was a secondary way to check the failure as the pods continue to restart for 10+ minutes which makes for quite a long running test. The train job status doesn't report failed for quite a while.

@sutaakar

sutaakar commented Mar 6, 2026

Copy link
Copy Markdown
Contributor

All those different failure scenarios look to me like something what verifies Training hub internal logic, which could be covered by unit test or component test on Training hub level.
For distributed workloads IMHO it may have sense to have one test checking that some failure in Training hub causes desired effects (Pod fails, proper job status from SDK, proper job log entry when retrieving job log using SDK, job events and such).

@Fiona-Waters

Copy link
Copy Markdown
Contributor Author

All those different failure scenarios look to me like something what verifies Training hub internal logic, which could be covered by unit test or component test on Training hub level. For distributed workloads IMHO it may have sense to have one test checking that some failure in Training hub causes desired effects (Pod fails, proper job status from SDK, proper job log entry when retrieving job log using SDK, job events and such).

Will discuss with the team. Thanks Karel

@briangallagher

Copy link
Copy Markdown
Contributor

I think we can leave the tests roughly as they are but try to:

  • remove the need for using core_v1 and instead use kubeflow SDK apis
  • verify use of client.get_job() and client.get_job_events()

@Fiona-Waters Fiona-Waters force-pushed the e2e-updates branch 2 times, most recently from e1f9d04 to 2e17f0f Compare March 9, 2026 12:08
@Fiona-Waters Fiona-Waters requested a review from sutaakar March 9, 2026 12:09
@Fiona-Waters

Copy link
Copy Markdown
Contributor Author

I think we can leave the tests roughly as they are but try to:

  • remove the need for using core_v1 and instead use kubeflow SDK apis
  • verify use of client.get_job() and client.get_job_events()

Have updated as per these points. Please review @briangallagher @sutaakar

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/trainer/resources/failure_scenarios.ipynb`:
- Around line 75-105: Seed valid control inputs before injecting failures:
create a small known-good data fixture (e.g., write VALID_DATA_PATH with one
JSONL message) and set VALID_MODEL_PATH to a known-good model fixture, then
update the notebook scenarios that call run_failure_scenario to use
VALID_DATA_PATH and VALID_MODEL_PATH in every scenario except the ones
intentionally testing invalid model_path or data_path; ensure PVC_NAME remains
provisioned but the tests isolate failures to the single field under test rather
than relying on missing workspace files. Use run_failure_scenario,
VALID_DATA_PATH, VALID_MODEL_PATH, and PVC_NAME to locate and modify the
scenario calls.
- Around line 9-10: The notebook installs kubeflow-sdk directly from the GitHub
main branch ("%pip install
git+https://github.com/opendatahub-io/kubeflow-sdk.git@main"), which causes
flakiness; update failure_scenarios.ipynb (and torchrun_failure.ipynb) to stop
pulling `@main` by pinning to a released version (e.g., replace the install with
"%pip install kubeflow-sdk==<version>") or, if you need unreleased changes,
vendor the built wheel into the notebook assets and change the install to use
the local wheel path instead; ensure the replacement targets the same install
cell containing the git+ URL so CI/no-internet runs succeed reliably.

In `@tests/trainer/resources/torchrun_failure.ipynb`:
- Line 73: stream_download currently writes directly to dst and leaves partial
files on failure; change it to write to a temporary path (e.g., dst + ".part"),
ensure read/write errors remove the temp file, and atomically rename the temp to
dst only on full success; also update the S3 download loop (the for page ...
loop using paginator and decisions that set dst) to treat files in MODEL_DIR as
requiring post-download validation — after renaming, verify non-zero size (and
if available, a checksum or expected file list) and if validation fails, remove
the file and mark the download as failed so retries can occur; ensure any
existing check for os.path.exists(dst) treats .part files as incomplete and
re-downloads them.

In `@tests/trainer/sdk_tests/failure_traininghub_tests.go`:
- Around line 75-83: The shell command built in the variable shellCmd currently
only emits "NOTEBOOK_STATUS: FAILURE" after papermill runs, so earlier failures
(e.g., pip install) cause timeouts; update the command string in
tests/trainer/sdk_tests/failure_traininghub_tests.go (the shellCmd that builds
the pip install + papermill pipeline) to emit NOTEBOOK_STATUS: FAILURE on
earlier-step failures by appending an immediate failure handler (for example,
use "&&" chaining with "|| { echo 'NOTEBOOK_STATUS: FAILURE'; exit 1; }" or wrap
critical steps like the pip install in a conditional that echoes the failure
marker and exits on error), and apply the same pattern to the boto3 variant
later in the file so any bootstrap failure reports NOTEBOOK_STATUS: FAILURE
immediately.
- Around line 19-23: Replace the hardcoded "resources/..." reads with
filepath.Join("..", "resources", <name>) where os.ReadFile is used so the tests
load notebooks from tests/trainer/resources (update the os.ReadFile calls
referenced in this file). For the bootstrap script that runs pip install (the
block that sets set -e and prints NOTEBOOK_STATUS), ensure NOTEBOOK_STATUS is
always printed before exiting on failure: either move the NOTEBOOK_STATUS:
SUCCESS/FAILURE marker outside the conditional that can be skipped by set -e, or
remove set -e and instead capture pip install errors, print NOTEBOOK_STATUS:
FAILURE, then exit; this fixes the log polling that waits for NOTEBOOK_STATUS.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: f9a16afe-c860-439c-8000-a40b3b711481

📥 Commits

Reviewing files that changed from the base of the PR and between 4fa6293 and 2e17f0f.

📒 Files selected for processing (4)
  • tests/trainer/kubeflow_sdk_test.go
  • tests/trainer/resources/failure_scenarios.ipynb
  • tests/trainer/resources/torchrun_failure.ipynb
  • tests/trainer/sdk_tests/failure_traininghub_tests.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/trainer/kubeflow_sdk_test.go

Comment on lines +9 to +10
"# pip Install kubeflow SDK from main branch for testing\n",
"%pip install git+https://github.com/opendatahub-io/kubeflow-sdk.git@main"

@coderabbitai coderabbitai Bot Mar 9, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Check if the notebook files exist and examine their content
fd -e ipynb "(failure_scenarios|torchrun_failure)" --type f

Repository: opendatahub-io/distributed-workloads

Length of output: 174


🏁 Script executed:

# Once we locate the files, let's read them (they're JSON, so we can parse them)
# First find the exact paths
find . -name "failure_scenarios.ipynb" -o -name "torchrun_failure.ipynb" 2>/dev/null

Repository: opendatahub-io/distributed-workloads

Length of output: 178


🏁 Script executed:

# Search for references to these notebooks in test files to understand usage context
rg "(failure_scenarios|torchrun_failure)" --type go --type py -l

Repository: opendatahub-io/distributed-workloads

Length of output: 132


🏁 Script executed:

# Read the notebook JSON and extract cell content for lines 9-10 area
# Jupyter notebooks have cells array, let's parse it
python3 << 'EOF'
import json

with open('./tests/trainer/resources/failure_scenarios.ipynb', 'r') as f:
    nb = json.load(f)

# Print cells with their source to identify the pip install
for i, cell in enumerate(nb['cells']):
    source = ''.join(cell.get('source', []))
    if 'pip install' in source or 'kubeflow' in source:
        print(f"Cell {i} (type: {cell['cell_type']}):")
        print(source)
        print("---")
EOF

Repository: opendatahub-io/distributed-workloads

Length of output: 6082


🏁 Script executed:

# Read the other notebook
python3 << 'EOF'
import json

with open('./tests/trainer/resources/torchrun_failure.ipynb', 'r') as f:
    nb = json.load(f)

# Print cells with pip install
for i, cell in enumerate(nb['cells']):
    source = ''.join(cell.get('source', []))
    if 'pip install' in source or 'kubeflow' in source:
        print(f"Cell {i} (type: {cell['cell_type']}):")
        print(source)
        print("---")
EOF

Repository: opendatahub-io/distributed-workloads

Length of output: 3071


🏁 Script executed:

# Read the test file to understand how these notebooks are used
cat -n ./tests/trainer/sdk_tests/failure_traininghub_tests.go

Repository: opendatahub-io/distributed-workloads

Length of output: 9436


🏁 Script executed:

python3 << 'EOF'
import json

with open('./tests/trainer/resources/failure_scenarios.ipynb', 'r') as f:
    nb = json.load(f)

# Count total cells and line numbers
total_lines = 0
for i, cell in enumerate(nb['cells']):
    if cell['cell_type'] == 'code':
        source = ''.join(cell.get('source', []))
        lines_in_cell = len(source.split('\n'))
        if i < 10:  # Check first 10 cells
            print(f"Cell {i}: {lines_in_cell} lines")
            if 'kubeflow' in source or 'pip install' in source:
                print(f"  → Contains pip/kubeflow install")
EOF

Repository: opendatahub-io/distributed-workloads

Length of output: 350


🏁 Script executed:

# Verify which line in the cell corresponds to the pip install line referenced in review
python3 << 'EOF'
import json

files_to_check = [
    './tests/trainer/resources/failure_scenarios.ipynb',
    './tests/trainer/resources/torchrun_failure.ipynb'
]

for filepath in files_to_check:
    with open(filepath, 'r') as f:
        nb = json.load(f)
    
    print(f"\n{filepath}:")
    # The review references "Lines: 9-10" which in Jupyter notebooks usually means cell line numbers
    # Let's find which cell contains the main kubeflow install
    for i, cell in enumerate(nb['cells']):
        if cell['cell_type'] == 'code':
            source = ''.join(cell.get('source', []))
            if 'git+https://github.com/opendatahub-io/kubeflow-sdk.git@main' in source:
                print(f"  Found in Cell {i}:")
                lines = source.split('\n')
                for j, line in enumerate(lines, 1):
                    print(f"    Line {j}: {line}")
EOF

Repository: opendatahub-io/distributed-workloads

Length of output: 626


Stop pulling kubeflow-sdk from GitHub @main at runtime.

Both failure_scenarios.ipynb and torchrun_failure.ipynb depend on a moving branch head with no guaranteed egress to GitHub. This introduces unpredictable failures from unrelated upstream commits. Pin to a released version: kubeflow-sdk==<version>. If unreleased SDK changes are needed, vendor the wheel into notebook assets and install from a local path.

🔧 Proposed fix
- %pip install git+https://github.com/opendatahub-io/kubeflow-sdk.git@main
+ %pip install kubeflow-sdk==<version>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/trainer/resources/failure_scenarios.ipynb` around lines 9 - 10, The
notebook installs kubeflow-sdk directly from the GitHub main branch ("%pip
install git+https://github.com/opendatahub-io/kubeflow-sdk.git@main"), which
causes flakiness; update failure_scenarios.ipynb (and torchrun_failure.ipynb) to
stop pulling `@main` by pinning to a released version (e.g., replace the install
with "%pip install kubeflow-sdk==<version>") or, if you need unreleased changes,
vendor the built wheel into the notebook assets and change the install to use
the local wheel path instead; ensure the replacement targets the same install
cell containing the git+ URL so CI/no-internet runs succeed reliably.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to use the main branch for testing.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fiona-Waters, understood — if the tests are validating against SDK changes not yet in a release, pulling from @main is intentional.

One residual risk worth being aware of: if an unrelated commit lands on kubeflow-sdk@main between runs, it can cause flaky failures that are hard to attribute. A mitigation would be to pin to a specific commit SHA rather than the branch tip — e.g.:

- %pip install git+https://github.com/opendatahub-io/kubeflow-sdk.git@main
+ %pip install git+https://github.com/opendatahub-io/kubeflow-sdk.git@<commit-sha>

This still pulls from an unreleased state but gives you a stable, reproducible install. That said, I'll defer to your team's preference on this.


📝 Learnings were identified and not saved (knowledge base disabled). Enable

@sutaakar sutaakar Mar 9, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fiona-Waters it would actually have sense to install SDK conditionally, verifying SDK from image by default. The question is what all SDKs we want to check with the image (ODH nightly, downstream, latest from AIPCC index...). Based on that proper conditional logic can be provided.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. I think maybe that could be addressed in another PR since it would affect more tests?

Comment thread tests/trainer/resources/failure_scenarios.ipynb Outdated
Comment thread tests/trainer/resources/torchrun_failure.ipynb
Comment thread tests/trainer/sdk_tests/failure_traininghub_tests.go
Comment thread tests/trainer/sdk_tests/failure_traininghub_tests.go

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/trainer/sdk_tests/failure_traininghub_tests.go (1)

175-176: Misleading comment: "GPU test" but Notebook CR requests 0 GPUs.

The comment says "GPU test" while passing 0 for the GPU count. The Notebook pod orchestrates the test via SDK calls; the TrainJob it creates requests GPUs. Clarify to avoid confusion:

-// Create Notebook CR — GPU test, use ContainerSizeMedium
+// Create Notebook CR — orchestrates GPU training job, notebook itself needs no GPUs
 common.CreateNotebook(test, namespace, userToken, command, cm.Name, torchrunFailureNotebookName, 0, rwxPvc, common.ContainerSizeMedium)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/trainer/sdk_tests/failure_traininghub_tests.go` around lines 175 - 176,
The comment "GPU test" is misleading because CreateNotebook(...) is called with
GPU count 0; update the comment near the CreateNotebook invocation (referencing
CreateNotebook and torchrunFailureNotebookName) to clarify that the Notebook CR
itself requests 0 GPUs but will orchestrate a TrainJob that requests GPUs via
the SDK, e.g., "Create Notebook CR — Notebook requests 0 GPUs; TrainJob invoked
by the Notebook will request GPUs (GPU test)"; ensure the comment mentions
ContainerSizeMedium and the 0 GPU argument so readers aren't confused.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/trainer/sdk_tests/failure_traininghub_tests.go`:
- Around line 132-142: The test inconsistently ignores the OK flags from
support.GetStorageBucketAccessKeyId(), support.GetStorageBucketSecretKey(), and
support.GetStorageBucketSftDir(); update the code to read the returned booleans
(e.g. accessKeyOK, secretKeyOK, prefixOK) and handle them consistently with
endpointOK/bucketOK — either clear the corresponding variable to "" when not OK
or, if the test requires real S3 credentials, assert/fail fast (e.g. use
test.Expect(accessKeyOK).To(BeTrue(), "...") ) so the test fails with a clear
message; ensure you update handling for variables accessKey, secretKey, and
prefix alongside the existing endpoint and bucket logic.

---

Nitpick comments:
In `@tests/trainer/sdk_tests/failure_traininghub_tests.go`:
- Around line 175-176: The comment "GPU test" is misleading because
CreateNotebook(...) is called with GPU count 0; update the comment near the
CreateNotebook invocation (referencing CreateNotebook and
torchrunFailureNotebookName) to clarify that the Notebook CR itself requests 0
GPUs but will orchestrate a TrainJob that requests GPUs via the SDK, e.g.,
"Create Notebook CR — Notebook requests 0 GPUs; TrainJob invoked by the Notebook
will request GPUs (GPU test)"; ensure the comment mentions ContainerSizeMedium
and the 0 GPU argument so readers aren't confused.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 7d0a05be-780e-4c59-8620-f12c6f152b89

📥 Commits

Reviewing files that changed from the base of the PR and between 2e17f0f and 568fc6e.

📒 Files selected for processing (2)
  • tests/trainer/resources/failure_scenarios.ipynb
  • tests/trainer/sdk_tests/failure_traininghub_tests.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/trainer/resources/failure_scenarios.ipynb

Comment thread tests/trainer/sdk_tests/failure_traininghub_tests.go

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
tests/trainer/resources/failure_scenarios.ipynb (1)

83-105: ⚠️ Potential issue | 🟠 Major

Seed a real dataset fixture for the non-data_path scenarios.

data_path is set to /opt/app-root/src/data.jsonl, but this notebook never creates that file. That means the model-path scenarios can still satisfy the current failure checks because the dataset lookup fails first, so the notebook can go green without proving model-path propagation.

🔧 Suggested fix
VALID_DATA_PATH = "/opt/app-root/src/data.jsonl"
with open(VALID_DATA_PATH, "w") as f:
    f.write('{"messages":[{"role":"user","content":"hi"},{"role":"assistant","content":"hello"}]}\n')
-        "data_path": "/opt/app-root/src/data.jsonl",
+        "data_path": VALID_DATA_PATH,

As per coding guidelines, **: "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/trainer/resources/failure_scenarios.ipynb` around lines 83 - 105, The
notebook fails to create a real dataset file at /opt/app-root/src/data.jsonl so
many scenarios abort on missing data rather than exercising model-path errors;
add a seeded fixture before the failure scenarios by creating VALID_DATA_PATH
(e.g., "/opt/app-root/src/data.jsonl") and writing a single JSONL example to
disk so run_failure_scenario calls (used throughout the cells) can reach
model-path and algorithm-specific checks; place this setup once near the top of
the failure scenario cells (before the first call to run_failure_scenario) so
scenarios like "Invalid Model Path", "OSFT Invalid Unfreeze Rank Ratio", and
"LORA Invalid Model Path" run against the real dataset.
🧹 Nitpick comments (1)
tests/trainer/resources/torchrun_failure.ipynb (1)

142-147: Use the resolved artifact paths instead of duplicating literals.

This cell ignores the model_path and dataset_file values prepared earlier and hardcodes the same paths again. If the download layout changes, the job will fail on missing inputs and stop exercising the intended OOM path.

🔧 Suggested fix
-LOCAL_MODEL_PATH = "/opt/app-root/src/Qwen/Qwen2.5-1.5B-Instruct"
-
 training_parameters = {
-    "model_path": LOCAL_MODEL_PATH,
-    "data_path": "/opt/app-root/src/table-gpt-data/train/train_All_100.jsonl",
+    "model_path": model_path,
+    "data_path": dataset_file,

As per coding guidelines, **: "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/trainer/resources/torchrun_failure.ipynb` around lines 142 - 147, The
cell hardcodes paths into LOCAL_MODEL_PATH and training_parameters instead of
using the resolved artifact variables prepared earlier; update the cell so
training_parameters["model_path"] uses the existing resolved model variable
(e.g., LOCAL_MODEL_PATH or the earlier model_path variable) and
training_parameters["data_path"] uses the resolved dataset variable (e.g.,
dataset_file) rather than duplicating literals, ensuring model_path/data_path
references the previously computed artifact paths when constructing
training_parameters.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/trainer/resources/failure_scenarios.ipynb`:
- Around line 64-68: run_failure_scenario currently returns True even if cleanup
(client.delete_job) fails; update the finally block so any exception during
delete_job causes the scenario to be marked failed (set result = False or
re-raise) rather than silently logging and returning success. Specifically, in
run_failure_scenario's finally: when client.delete_job(job_name) raises, capture
that exception, log it, and set result = False (or raise) so the function does
not return True after leaking a job; reference the job_name variable and the
client.delete_job call to locate the code to change.

In `@tests/trainer/resources/torchrun_failure.ipynb`:
- Around line 218-224: The notebook prints "NOTEBOOK_STATUS: SUCCESS" regardless
of whether client.delete_job(job_name) succeeded, which can hide leaked GPU
jobs; modify the cleanup so that NOTEBOOK_STATUS: SUCCESS is emitted only when
delete_job succeeds — move the success print inside the try block (or on
successful completion call) and on exception either print "NOTEBOOK_STATUS:
FAILURE" (or re-raise the exception) after printing the failure message so the
test fails and the leaked job is not silently ignored; locate and update the
block around client.delete_job(job_name), the except Exception as e handler, and
the final print("NOTEBOOK_STATUS: SUCCESS") accordingly.

---

Duplicate comments:
In `@tests/trainer/resources/failure_scenarios.ipynb`:
- Around line 83-105: The notebook fails to create a real dataset file at
/opt/app-root/src/data.jsonl so many scenarios abort on missing data rather than
exercising model-path errors; add a seeded fixture before the failure scenarios
by creating VALID_DATA_PATH (e.g., "/opt/app-root/src/data.jsonl") and writing a
single JSONL example to disk so run_failure_scenario calls (used throughout the
cells) can reach model-path and algorithm-specific checks; place this setup once
near the top of the failure scenario cells (before the first call to
run_failure_scenario) so scenarios like "Invalid Model Path", "OSFT Invalid
Unfreeze Rank Ratio", and "LORA Invalid Model Path" run against the real
dataset.

---

Nitpick comments:
In `@tests/trainer/resources/torchrun_failure.ipynb`:
- Around line 142-147: The cell hardcodes paths into LOCAL_MODEL_PATH and
training_parameters instead of using the resolved artifact variables prepared
earlier; update the cell so training_parameters["model_path"] uses the existing
resolved model variable (e.g., LOCAL_MODEL_PATH or the earlier model_path
variable) and training_parameters["data_path"] uses the resolved dataset
variable (e.g., dataset_file) rather than duplicating literals, ensuring
model_path/data_path references the previously computed artifact paths when
constructing training_parameters.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 41b118ef-6ede-4a98-b527-de28289b691f

📥 Commits

Reviewing files that changed from the base of the PR and between 568fc6e and fd33bd5.

📒 Files selected for processing (4)
  • tests/trainer/kubeflow_sdk_test.go
  • tests/trainer/resources/failure_scenarios.ipynb
  • tests/trainer/resources/torchrun_failure.ipynb
  • tests/trainer/sdk_tests/failure_traininghub_tests.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/trainer/sdk_tests/failure_traininghub_tests.go

Comment thread tests/trainer/resources/failure_scenarios.ipynb
Comment thread tests/trainer/resources/torchrun_failure.ipynb
@briangallagher

Copy link
Copy Markdown
Contributor

/lgtm

@sutaakar sutaakar left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@openshift-ci

openshift-ci Bot commented Mar 10, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sutaakar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot Bot merged commit 0414f76 into opendatahub-io:main Mar 10, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants