Add training failure scenario E2E tests for SFT, OSFT, and LORA#658
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR adds end-to-end failure scenario tests for Kubeflow training jobs. It introduces two new Go test entry points that invoke SDK test helpers, a Go test harness that provisions Kubernetes resources (Namespace, ServiceAccount, RBAC, ConfigMap, RWX PVC, Notebook CRs), and orchestration to run papermill notebooks inside Notebooks and monitor pod logs. Two Jupyter notebooks are added that implement failure-injection workflows: a multi-scenario notebook (invalid dataset/model paths, malformed hyperparameters, OSFT/LORA misconfigurations) and a torchrun OOM-inducing training notebook that forces a crash on first forward pass. No existing public API signatures were removed or changed. Security Findings
🎯 4 (Complex) | ⏱️ ~75 minutes 🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Tip Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs). Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
9097db4 to
4fa6293
Compare
There was a problem hiding this comment.
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/trainer/resources/failure_scenarios.ipynb`:
- Line 115: The notebook currently prints a summary but does not fail the
process when any scenario (result_1, result_2, result_3, result_4, result_5
stored in the results dict) is False; modify the end of the cell that computes
all_passed to terminate with a non-zero exit (e.g., raise a RuntimeError or call
sys.exit(1)) when all_passed is False, keeping the existing NOTEBOOK_STATUS
prints; ensure the change is placed after the all_passed = all(results.values())
line and references the results/all_passed logic so papermill or CI sees a
failure.
- Line 19: The notebook currently disables TLS verification
(configuration.verify_ssl = False and urllib3.disable_warnings) which allows
MITM attacks; update the Kubernetes client setup in this file to enable TLS
verification by removing the verify_ssl=False assignment and the
disable_warnings call, and instead load proper cluster credentials (e.g., use
k8s.config.load_kube_config() or k8s.config.load_incluster_config()) and set
configuration.ssl_ca_cert to the cluster CA certificate path or an
environment-provided CA bundle before creating k8s.ApiClient; ensure the bearer
token is still provided via configuration.api_key but only after TLS is enforced
so ApiClient and any functions that use api_client perform certificate
validation.
In `@tests/trainer/resources/torchrun_failure.ipynb`:
- Line 314: The OOM detection in OOM_ERROR_PATTERNS is too broad (e.g.,
"RuntimeError", "NCCL") and can false-pass; tighten the matcher in the
monitoring loop (the block that builds log_text and iterates OOM_ERROR_PATTERNS)
by removing generic tokens and using more specific, case-insensitive OOM
indicators (e.g., "out of memory", "CUDA out of memory",
"torch.cuda.OutOfMemoryError", "torch.OutOfMemoryError" or a regex with word
boundaries) so that matched_pattern only sets when a true OOM signature is
found; update the OOM_ERROR_PATTERNS constant and the matching logic in the
while loop that checks client.get_job_logs (and keep get_training_pod_status
logic unchanged) to avoid matching generic "RuntimeError" or "NCCL" strings
alone.
- Around line 129-136: The S3 client in the notebook is created with
verify=False which disables TLS certificate validation; update the boto3.client
call that constructs s3 (the variable named "s3" via boto3.client(...,
endpoint_url=endpoint_url, ... , verify=False)) to enable certificate validation
by removing verify=False or setting verify=True and ensure the runtime uses a
trusted CA bundle (e.g., honor AWS_CA_BUNDLE or pass a path to the CA bundle via
verify='/path/to/ca-bundle.pem'); keep endpoint_url and credentials as-is but do
not disable TLS verification.
- Line 21: The notebook currently disables TLS verification unconditionally via
urllib3.disable_warnings(...) and configuration.verify_ssl = False when creating
k8s.Configuration()/k8s.ApiClient(), and similarly uses verify=False for the
boto3 client; change this to respect an explicit environment variable (e.g.,
SKIP_TLS_VERIFY or DISABLE_TLS_VERIFY) instead of hardcoding disablement: read
the env var, and only call urllib3.disable_warnings(...) and set
configuration.verify_ssl = False (and pass verify=False to boto3) when that env
var is truthy; keep default behavior as secure (verification enabled) and
document the env var usage in the notebook near the configuration and ApiClient
creation.
In `@tests/trainer/sdk_tests/failure_traininghub_tests.go`:
- Around line 75-87: The constructed shellCmd string built via fmt.Sprintf
(using support.GetOpenShiftApiUrl, userToken, namespace.Name, rwxPvc.Name,
trainerutils.DefaultTrainingHubRuntime, failureNotebookName) is vulnerable to
command-injection; replace the single interpolated shell string with a safe
exec.Command call (or equivalent) passing each environment variable and command
argument separately, or explicitly validate/escape each interpolated value
before use; update both the shellCmd construction here and the similar block at
156-173 to build the command/args without shell interpolation (or
whitelist/escape inputs) so no untrusted content can inject shell
metacharacters.
- Around line 82-84: The wrapper currently echoes "NOTEBOOK_STATUS: SUCCESS"
unconditionally on the shell wrapper’s success path which can mask real notebook
failures; update the command string that contains "then echo 'NOTEBOOK_STATUS:
SUCCESS'; else echo 'NOTEBOOK_STATUS: FAILURE';" so it does not emit a
wrapper-level SUCCESS marker (emit only the FAILURE on the else path and remove
the then-success echo), and apply the same change to the other identical
occurrence so utils_notebook.go can reliably detect the notebook-produced
success/failure markers.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: f43884e2-7ef2-4548-bf63-dafc0948d62a
📒 Files selected for processing (4)
tests/trainer/kubeflow_sdk_test.gotests/trainer/resources/failure_scenarios.ipynbtests/trainer/resources/torchrun_failure.ipynbtests/trainer/sdk_tests/failure_traininghub_tests.go
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/trainer/resources/torchrun_failure.ipynb (1)
64-64: Global socket timeout side effect.
socket.setdefaulttimeout(10)affects all sockets globally for this process, including the Kubernetes client. Consider using boto3'sconnect_timeout/read_timeout(already configured inboto_cfg) without the global override.Suggested fix
-socket.setdefaulttimeout(10)The boto3 client already has explicit timeouts via
BotoConfig(connect_timeout=5, read_timeout=10).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/trainer/resources/torchrun_failure.ipynb` at line 64, Remove the global call socket.setdefaulttimeout(10) because it sets a process-wide socket timeout that affects the Kubernetes client; instead rely on the existing boto configuration (boto_cfg / BotoConfig(connect_timeout=5, read_timeout=10)) which already sets per-client timeouts—delete the socket.setdefaulttimeout(10) line and ensure boto_cfg is passed where the boto3 client is created.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tests/trainer/resources/torchrun_failure.ipynb`:
- Line 64: Remove the global call socket.setdefaulttimeout(10) because it sets a
process-wide socket timeout that affects the Kubernetes client; instead rely on
the existing boto configuration (boto_cfg / BotoConfig(connect_timeout=5,
read_timeout=10)) which already sets per-client timeouts—delete the
socket.setdefaulttimeout(10) line and ensure boto_cfg is passed where the boto3
client is created.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 523552f9-6313-416f-a2ba-61904fafd268
📒 Files selected for processing (4)
tests/trainer/kubeflow_sdk_test.gotests/trainer/resources/failure_scenarios.ipynbtests/trainer/resources/torchrun_failure.ipynbtests/trainer/sdk_tests/failure_traininghub_tests.go
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/trainer/sdk_tests/failure_traininghub_tests.go
|
@Fiona-Waters What failure propagation is tested in failure_scenarios.ipynb? The rest is checking Pods, which is IMHO unrelated to SDK. |
We want to check that specific training failures are being reported correctly, and that the success or failure of the job/test is based on that and not only on the completion of the training pod. client.get_job_logs is checking that the correct error is reported. We want to test that the underlying error is being caught. |
890cb91 to
7d1193d
Compare
|
All those different failure scenarios look to me like something what verifies Training hub internal logic, which could be covered by unit test or component test on Training hub level. |
Will discuss with the team. Thanks Karel |
|
I think we can leave the tests roughly as they are but try to:
|
e1f9d04 to
2e17f0f
Compare
Have updated as per these points. Please review @briangallagher @sutaakar |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/trainer/resources/failure_scenarios.ipynb`:
- Around line 75-105: Seed valid control inputs before injecting failures:
create a small known-good data fixture (e.g., write VALID_DATA_PATH with one
JSONL message) and set VALID_MODEL_PATH to a known-good model fixture, then
update the notebook scenarios that call run_failure_scenario to use
VALID_DATA_PATH and VALID_MODEL_PATH in every scenario except the ones
intentionally testing invalid model_path or data_path; ensure PVC_NAME remains
provisioned but the tests isolate failures to the single field under test rather
than relying on missing workspace files. Use run_failure_scenario,
VALID_DATA_PATH, VALID_MODEL_PATH, and PVC_NAME to locate and modify the
scenario calls.
- Around line 9-10: The notebook installs kubeflow-sdk directly from the GitHub
main branch ("%pip install
git+https://github.com/opendatahub-io/kubeflow-sdk.git@main"), which causes
flakiness; update failure_scenarios.ipynb (and torchrun_failure.ipynb) to stop
pulling `@main` by pinning to a released version (e.g., replace the install with
"%pip install kubeflow-sdk==<version>") or, if you need unreleased changes,
vendor the built wheel into the notebook assets and change the install to use
the local wheel path instead; ensure the replacement targets the same install
cell containing the git+ URL so CI/no-internet runs succeed reliably.
In `@tests/trainer/resources/torchrun_failure.ipynb`:
- Line 73: stream_download currently writes directly to dst and leaves partial
files on failure; change it to write to a temporary path (e.g., dst + ".part"),
ensure read/write errors remove the temp file, and atomically rename the temp to
dst only on full success; also update the S3 download loop (the for page ...
loop using paginator and decisions that set dst) to treat files in MODEL_DIR as
requiring post-download validation — after renaming, verify non-zero size (and
if available, a checksum or expected file list) and if validation fails, remove
the file and mark the download as failed so retries can occur; ensure any
existing check for os.path.exists(dst) treats .part files as incomplete and
re-downloads them.
In `@tests/trainer/sdk_tests/failure_traininghub_tests.go`:
- Around line 75-83: The shell command built in the variable shellCmd currently
only emits "NOTEBOOK_STATUS: FAILURE" after papermill runs, so earlier failures
(e.g., pip install) cause timeouts; update the command string in
tests/trainer/sdk_tests/failure_traininghub_tests.go (the shellCmd that builds
the pip install + papermill pipeline) to emit NOTEBOOK_STATUS: FAILURE on
earlier-step failures by appending an immediate failure handler (for example,
use "&&" chaining with "|| { echo 'NOTEBOOK_STATUS: FAILURE'; exit 1; }" or wrap
critical steps like the pip install in a conditional that echoes the failure
marker and exits on error), and apply the same pattern to the boto3 variant
later in the file so any bootstrap failure reports NOTEBOOK_STATUS: FAILURE
immediately.
- Around line 19-23: Replace the hardcoded "resources/..." reads with
filepath.Join("..", "resources", <name>) where os.ReadFile is used so the tests
load notebooks from tests/trainer/resources (update the os.ReadFile calls
referenced in this file). For the bootstrap script that runs pip install (the
block that sets set -e and prints NOTEBOOK_STATUS), ensure NOTEBOOK_STATUS is
always printed before exiting on failure: either move the NOTEBOOK_STATUS:
SUCCESS/FAILURE marker outside the conditional that can be skipped by set -e, or
remove set -e and instead capture pip install errors, print NOTEBOOK_STATUS:
FAILURE, then exit; this fixes the log polling that waits for NOTEBOOK_STATUS.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: f9a16afe-c860-439c-8000-a40b3b711481
📒 Files selected for processing (4)
tests/trainer/kubeflow_sdk_test.gotests/trainer/resources/failure_scenarios.ipynbtests/trainer/resources/torchrun_failure.ipynbtests/trainer/sdk_tests/failure_traininghub_tests.go
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/trainer/kubeflow_sdk_test.go
| "# pip Install kubeflow SDK from main branch for testing\n", | ||
| "%pip install git+https://github.com/opendatahub-io/kubeflow-sdk.git@main" |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Check if the notebook files exist and examine their content
fd -e ipynb "(failure_scenarios|torchrun_failure)" --type fRepository: opendatahub-io/distributed-workloads
Length of output: 174
🏁 Script executed:
# Once we locate the files, let's read them (they're JSON, so we can parse them)
# First find the exact paths
find . -name "failure_scenarios.ipynb" -o -name "torchrun_failure.ipynb" 2>/dev/nullRepository: opendatahub-io/distributed-workloads
Length of output: 178
🏁 Script executed:
# Search for references to these notebooks in test files to understand usage context
rg "(failure_scenarios|torchrun_failure)" --type go --type py -lRepository: opendatahub-io/distributed-workloads
Length of output: 132
🏁 Script executed:
# Read the notebook JSON and extract cell content for lines 9-10 area
# Jupyter notebooks have cells array, let's parse it
python3 << 'EOF'
import json
with open('./tests/trainer/resources/failure_scenarios.ipynb', 'r') as f:
nb = json.load(f)
# Print cells with their source to identify the pip install
for i, cell in enumerate(nb['cells']):
source = ''.join(cell.get('source', []))
if 'pip install' in source or 'kubeflow' in source:
print(f"Cell {i} (type: {cell['cell_type']}):")
print(source)
print("---")
EOFRepository: opendatahub-io/distributed-workloads
Length of output: 6082
🏁 Script executed:
# Read the other notebook
python3 << 'EOF'
import json
with open('./tests/trainer/resources/torchrun_failure.ipynb', 'r') as f:
nb = json.load(f)
# Print cells with pip install
for i, cell in enumerate(nb['cells']):
source = ''.join(cell.get('source', []))
if 'pip install' in source or 'kubeflow' in source:
print(f"Cell {i} (type: {cell['cell_type']}):")
print(source)
print("---")
EOFRepository: opendatahub-io/distributed-workloads
Length of output: 3071
🏁 Script executed:
# Read the test file to understand how these notebooks are used
cat -n ./tests/trainer/sdk_tests/failure_traininghub_tests.goRepository: opendatahub-io/distributed-workloads
Length of output: 9436
🏁 Script executed:
python3 << 'EOF'
import json
with open('./tests/trainer/resources/failure_scenarios.ipynb', 'r') as f:
nb = json.load(f)
# Count total cells and line numbers
total_lines = 0
for i, cell in enumerate(nb['cells']):
if cell['cell_type'] == 'code':
source = ''.join(cell.get('source', []))
lines_in_cell = len(source.split('\n'))
if i < 10: # Check first 10 cells
print(f"Cell {i}: {lines_in_cell} lines")
if 'kubeflow' in source or 'pip install' in source:
print(f" → Contains pip/kubeflow install")
EOFRepository: opendatahub-io/distributed-workloads
Length of output: 350
🏁 Script executed:
# Verify which line in the cell corresponds to the pip install line referenced in review
python3 << 'EOF'
import json
files_to_check = [
'./tests/trainer/resources/failure_scenarios.ipynb',
'./tests/trainer/resources/torchrun_failure.ipynb'
]
for filepath in files_to_check:
with open(filepath, 'r') as f:
nb = json.load(f)
print(f"\n{filepath}:")
# The review references "Lines: 9-10" which in Jupyter notebooks usually means cell line numbers
# Let's find which cell contains the main kubeflow install
for i, cell in enumerate(nb['cells']):
if cell['cell_type'] == 'code':
source = ''.join(cell.get('source', []))
if 'git+https://github.com/opendatahub-io/kubeflow-sdk.git@main' in source:
print(f" Found in Cell {i}:")
lines = source.split('\n')
for j, line in enumerate(lines, 1):
print(f" Line {j}: {line}")
EOFRepository: opendatahub-io/distributed-workloads
Length of output: 626
Stop pulling kubeflow-sdk from GitHub @main at runtime.
Both failure_scenarios.ipynb and torchrun_failure.ipynb depend on a moving branch head with no guaranteed egress to GitHub. This introduces unpredictable failures from unrelated upstream commits. Pin to a released version: kubeflow-sdk==<version>. If unreleased SDK changes are needed, vendor the wheel into notebook assets and install from a local path.
🔧 Proposed fix
- %pip install git+https://github.com/opendatahub-io/kubeflow-sdk.git@main
+ %pip install kubeflow-sdk==<version>🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/trainer/resources/failure_scenarios.ipynb` around lines 9 - 10, The
notebook installs kubeflow-sdk directly from the GitHub main branch ("%pip
install git+https://github.com/opendatahub-io/kubeflow-sdk.git@main"), which
causes flakiness; update failure_scenarios.ipynb (and torchrun_failure.ipynb) to
stop pulling `@main` by pinning to a released version (e.g., replace the install
with "%pip install kubeflow-sdk==<version>") or, if you need unreleased changes,
vendor the built wheel into the notebook assets and change the install to use
the local wheel path instead; ensure the replacement targets the same install
cell containing the git+ URL so CI/no-internet runs succeed reliably.
There was a problem hiding this comment.
we need to use the main branch for testing.
There was a problem hiding this comment.
@Fiona-Waters, understood — if the tests are validating against SDK changes not yet in a release, pulling from @main is intentional.
One residual risk worth being aware of: if an unrelated commit lands on kubeflow-sdk@main between runs, it can cause flaky failures that are hard to attribute. A mitigation would be to pin to a specific commit SHA rather than the branch tip — e.g.:
- %pip install git+https://github.com/opendatahub-io/kubeflow-sdk.git@main
+ %pip install git+https://github.com/opendatahub-io/kubeflow-sdk.git@<commit-sha>This still pulls from an unreleased state but gives you a stable, reproducible install. That said, I'll defer to your team's preference on this.
📝 Learnings were identified and not saved (knowledge base disabled). Enable
There was a problem hiding this comment.
@Fiona-Waters it would actually have sense to install SDK conditionally, verifying SDK from image by default. The question is what all SDKs we want to check with the image (ODH nightly, downstream, latest from AIPCC index...). Based on that proper conditional logic can be provided.
There was a problem hiding this comment.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
There was a problem hiding this comment.
Understood. I think maybe that could be addressed in another PR since it would affect more tests?
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/trainer/sdk_tests/failure_traininghub_tests.go (1)
175-176: Misleading comment: "GPU test" but Notebook CR requests 0 GPUs.The comment says "GPU test" while passing
0for the GPU count. The Notebook pod orchestrates the test via SDK calls; the TrainJob it creates requests GPUs. Clarify to avoid confusion:-// Create Notebook CR — GPU test, use ContainerSizeMedium +// Create Notebook CR — orchestrates GPU training job, notebook itself needs no GPUs common.CreateNotebook(test, namespace, userToken, command, cm.Name, torchrunFailureNotebookName, 0, rwxPvc, common.ContainerSizeMedium)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/trainer/sdk_tests/failure_traininghub_tests.go` around lines 175 - 176, The comment "GPU test" is misleading because CreateNotebook(...) is called with GPU count 0; update the comment near the CreateNotebook invocation (referencing CreateNotebook and torchrunFailureNotebookName) to clarify that the Notebook CR itself requests 0 GPUs but will orchestrate a TrainJob that requests GPUs via the SDK, e.g., "Create Notebook CR — Notebook requests 0 GPUs; TrainJob invoked by the Notebook will request GPUs (GPU test)"; ensure the comment mentions ContainerSizeMedium and the 0 GPU argument so readers aren't confused.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/trainer/sdk_tests/failure_traininghub_tests.go`:
- Around line 132-142: The test inconsistently ignores the OK flags from
support.GetStorageBucketAccessKeyId(), support.GetStorageBucketSecretKey(), and
support.GetStorageBucketSftDir(); update the code to read the returned booleans
(e.g. accessKeyOK, secretKeyOK, prefixOK) and handle them consistently with
endpointOK/bucketOK — either clear the corresponding variable to "" when not OK
or, if the test requires real S3 credentials, assert/fail fast (e.g. use
test.Expect(accessKeyOK).To(BeTrue(), "...") ) so the test fails with a clear
message; ensure you update handling for variables accessKey, secretKey, and
prefix alongside the existing endpoint and bucket logic.
---
Nitpick comments:
In `@tests/trainer/sdk_tests/failure_traininghub_tests.go`:
- Around line 175-176: The comment "GPU test" is misleading because
CreateNotebook(...) is called with GPU count 0; update the comment near the
CreateNotebook invocation (referencing CreateNotebook and
torchrunFailureNotebookName) to clarify that the Notebook CR itself requests 0
GPUs but will orchestrate a TrainJob that requests GPUs via the SDK, e.g.,
"Create Notebook CR — Notebook requests 0 GPUs; TrainJob invoked by the Notebook
will request GPUs (GPU test)"; ensure the comment mentions ContainerSizeMedium
and the 0 GPU argument so readers aren't confused.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 7d0a05be-780e-4c59-8620-f12c6f152b89
📒 Files selected for processing (2)
tests/trainer/resources/failure_scenarios.ipynbtests/trainer/sdk_tests/failure_traininghub_tests.go
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/trainer/resources/failure_scenarios.ipynb
Signed-off-by: Fiona-Waters <fiwaters6@gmail.com>
568fc6e to
fd33bd5
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
tests/trainer/resources/failure_scenarios.ipynb (1)
83-105:⚠️ Potential issue | 🟠 MajorSeed a real dataset fixture for the non-
data_pathscenarios.
data_pathis set to/opt/app-root/src/data.jsonl, but this notebook never creates that file. That means the model-path scenarios can still satisfy the current failure checks because the dataset lookup fails first, so the notebook can go green without proving model-path propagation.🔧 Suggested fix
VALID_DATA_PATH = "/opt/app-root/src/data.jsonl" with open(VALID_DATA_PATH, "w") as f: f.write('{"messages":[{"role":"user","content":"hi"},{"role":"assistant","content":"hello"}]}\n')- "data_path": "/opt/app-root/src/data.jsonl", + "data_path": VALID_DATA_PATH,As per coding guidelines,
**: "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/trainer/resources/failure_scenarios.ipynb` around lines 83 - 105, The notebook fails to create a real dataset file at /opt/app-root/src/data.jsonl so many scenarios abort on missing data rather than exercising model-path errors; add a seeded fixture before the failure scenarios by creating VALID_DATA_PATH (e.g., "/opt/app-root/src/data.jsonl") and writing a single JSONL example to disk so run_failure_scenario calls (used throughout the cells) can reach model-path and algorithm-specific checks; place this setup once near the top of the failure scenario cells (before the first call to run_failure_scenario) so scenarios like "Invalid Model Path", "OSFT Invalid Unfreeze Rank Ratio", and "LORA Invalid Model Path" run against the real dataset.
🧹 Nitpick comments (1)
tests/trainer/resources/torchrun_failure.ipynb (1)
142-147: Use the resolved artifact paths instead of duplicating literals.This cell ignores the
model_pathanddataset_filevalues prepared earlier and hardcodes the same paths again. If the download layout changes, the job will fail on missing inputs and stop exercising the intended OOM path.🔧 Suggested fix
-LOCAL_MODEL_PATH = "/opt/app-root/src/Qwen/Qwen2.5-1.5B-Instruct" - training_parameters = { - "model_path": LOCAL_MODEL_PATH, - "data_path": "/opt/app-root/src/table-gpt-data/train/train_All_100.jsonl", + "model_path": model_path, + "data_path": dataset_file,As per coding guidelines,
**: "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/trainer/resources/torchrun_failure.ipynb` around lines 142 - 147, The cell hardcodes paths into LOCAL_MODEL_PATH and training_parameters instead of using the resolved artifact variables prepared earlier; update the cell so training_parameters["model_path"] uses the existing resolved model variable (e.g., LOCAL_MODEL_PATH or the earlier model_path variable) and training_parameters["data_path"] uses the resolved dataset variable (e.g., dataset_file) rather than duplicating literals, ensuring model_path/data_path references the previously computed artifact paths when constructing training_parameters.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/trainer/resources/failure_scenarios.ipynb`:
- Around line 64-68: run_failure_scenario currently returns True even if cleanup
(client.delete_job) fails; update the finally block so any exception during
delete_job causes the scenario to be marked failed (set result = False or
re-raise) rather than silently logging and returning success. Specifically, in
run_failure_scenario's finally: when client.delete_job(job_name) raises, capture
that exception, log it, and set result = False (or raise) so the function does
not return True after leaking a job; reference the job_name variable and the
client.delete_job call to locate the code to change.
In `@tests/trainer/resources/torchrun_failure.ipynb`:
- Around line 218-224: The notebook prints "NOTEBOOK_STATUS: SUCCESS" regardless
of whether client.delete_job(job_name) succeeded, which can hide leaked GPU
jobs; modify the cleanup so that NOTEBOOK_STATUS: SUCCESS is emitted only when
delete_job succeeds — move the success print inside the try block (or on
successful completion call) and on exception either print "NOTEBOOK_STATUS:
FAILURE" (or re-raise the exception) after printing the failure message so the
test fails and the leaked job is not silently ignored; locate and update the
block around client.delete_job(job_name), the except Exception as e handler, and
the final print("NOTEBOOK_STATUS: SUCCESS") accordingly.
---
Duplicate comments:
In `@tests/trainer/resources/failure_scenarios.ipynb`:
- Around line 83-105: The notebook fails to create a real dataset file at
/opt/app-root/src/data.jsonl so many scenarios abort on missing data rather than
exercising model-path errors; add a seeded fixture before the failure scenarios
by creating VALID_DATA_PATH (e.g., "/opt/app-root/src/data.jsonl") and writing a
single JSONL example to disk so run_failure_scenario calls (used throughout the
cells) can reach model-path and algorithm-specific checks; place this setup once
near the top of the failure scenario cells (before the first call to
run_failure_scenario) so scenarios like "Invalid Model Path", "OSFT Invalid
Unfreeze Rank Ratio", and "LORA Invalid Model Path" run against the real
dataset.
---
Nitpick comments:
In `@tests/trainer/resources/torchrun_failure.ipynb`:
- Around line 142-147: The cell hardcodes paths into LOCAL_MODEL_PATH and
training_parameters instead of using the resolved artifact variables prepared
earlier; update the cell so training_parameters["model_path"] uses the existing
resolved model variable (e.g., LOCAL_MODEL_PATH or the earlier model_path
variable) and training_parameters["data_path"] uses the resolved dataset
variable (e.g., dataset_file) rather than duplicating literals, ensuring
model_path/data_path references the previously computed artifact paths when
constructing training_parameters.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 41b118ef-6ede-4a98-b527-de28289b691f
📒 Files selected for processing (4)
tests/trainer/kubeflow_sdk_test.gotests/trainer/resources/failure_scenarios.ipynbtests/trainer/resources/torchrun_failure.ipynbtests/trainer/sdk_tests/failure_traininghub_tests.go
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/trainer/sdk_tests/failure_traininghub_tests.go
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: sutaakar The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
RHOAIENG-48034
Description
This PR adds two new test notebooks for validating that training failures are properly propagated through Training Hub via Kubeflow SDK:
NOTE: The tests were added this way rather than adding individual scenarios to the existing tests, as these new tests run on CPU so it makes more sense to group the failure scenarios together.
Failure Scenarios Notebook (CPU-only, Tier1)
The helper function run_failure_scenario() accepts an algorithm parameter (defaults to SFT for backward compatibility) and handles job submission, log polling, crash detection, and cleanup.
Torchrun Failure Notebook (GPU, KftoCuda)
A single-scenario GPU test that:
This tests a fundamentally different failure mode than the validation scenarios — the model loads successfully, torchrun starts the training loop, and the crash happens during GPU memory allocation for the forward pass.
Verification Steps
The default training-hub ClusterTrainingRuntime uses an older transformers version that may not be compatible with the latest SDK. A training-hub-pr-test runtime has been created on the team IBM cluster with the updated transformers version. To run the failure scenarios test against it:
Step 1 — Temporarily update the runtime in the test. In tests/trainer/sdk_tests/failure_traininghub_tests.go, change:
trainerutils.DefaultTrainingHubRuntime, to:
training-hub-pr-testStep 2 — Temporarily switch to using your own token. In the same file, change:
userToken := common.GenerateNotebookUserToken(test)to:
userToken := common.GetNotebookUserToken(test)This reads the token from the NOTEBOOK_USER_TOKEN env var instead of generating one via oc login.
Step 3 — Run the failure scenarios test (CPU-only):
NOTEBOOK_USER_NAME=
NOTEBOOK_USER_TOKEN=$(oc whoami -t)
NOTEBOOK_IMAGE=
OPENSHIFT_API_URL=$(oc whoami --show-server)
TEST_TIER=Tier1
go test -v -run TestTrainingFailureScenarios ./tests/trainer/ -timeout 60m
Step 4 — Run the torchrun failure test (requires GPU):
NOTEBOOK_USER_NAME=
NOTEBOOK_USER_TOKEN=$(oc whoami -t)
NOTEBOOK_IMAGE=
OPENSHIFT_API_URL=$(oc whoami --show-server)
TEST_TIER=KftoCuda
go test -v -run TestTorchrunTrainingFailure ./tests/trainer/ -timeout 60m
How Has This Been Tested?
These have been tested by running manually on the team IBM cluster.
Merge criteria:
Summary by CodeRabbit