[canary] add logging for task pods, kind test #4101
Conversation
Claude finished @rjpower's task in 54s —— View job

Review: PR #4101. This PR fixes the GPU canary by requesting actual GPU resources.

Specification check (issue #3981):
Findings: None. The resource values (32 CPU, 256G memory, 256G disk) are reasonable for an 8xH100 node. LGTM.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: be41984123
kind test
@claude this test requires an inverted dependency:

==================================== ERRORS ====================================

let's move it to top-level tests/iris/test_kind_gpu_canary.py?
Claude finished @rjpower's task in 4m 55s —— View job

Moved test file to fix the import error.
Run .venv/bin/python tests/integration_test.py

_____ ERROR collecting tests/transform/test_huggingface_dataset_to_eval.py _____

run pre-commit and tests, fix
Claude finished @rjpower's task in 4m 2s —— View job

Fixed pre-commit issues.

Import errors: resolved. Test results: both tests in tests/iris/test_kind_gpu_canary.py pass. Pre-commit passes.
After PR #3806/#3882 migrated the canary from worker_provider to kubernetes_provider, the task pod is created directly as a k8s Pod. The job submission still requested only CPU resources (`--cpu=1 --memory=16G`), so the pod had no `nvidia.com/gpu` limits, landed on a CPU node, and crashed in ~30s. Request H100x8 plus matching CPU/memory so the pod gets GPU resources and GPU node tolerations.

Also adds task pod logs and describe output to failure diagnostics; the previous diagnostics only captured controller info, making the opaque "Error" message impossible to debug from CI logs.

Fixes #3981
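As a rough illustration of the fix described above: the launcher flags ultimately map onto a Kubernetes container `resources` block, and without an explicit `nvidia.com/gpu` limit the scheduler treats the pod as CPU-only. The helper below is a hypothetical sketch, not the actual launcher code; only the Kubernetes field names (`limits`, `requests`, `nvidia.com/gpu`) are standard.

```python
def build_resources(gpu_count: int = 0, cpu: str = "1", memory: str = "16G") -> dict:
    """Sketch a k8s container `resources` block (hypothetical helper)."""
    limits = {"cpu": cpu, "memory": memory}
    if gpu_count:
        # Without this key the scheduler may place the pod on a node with no
        # GPUs -- the failure mode this PR fixes.
        limits["nvidia.com/gpu"] = str(gpu_count)
    return {"limits": limits, "requests": dict(limits)}

# Old behavior (--cpu=1 --memory=16G): no GPU limit at all.
cpu_only = build_resources()
# New behavior (--gpu H100x8 --cpu=32 --memory=256G): GPU limit present.
gpu_job = build_resources(gpu_count=8, cpu="32", memory="256G")
```

The key point is that GPU scheduling is opt-in: requesting lots of CPU and memory alone never gives the pod GPU node tolerations.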
Verify that GPU resource requests (nvidia.com/gpu) are correctly set on worker pods in both in-memory K8s and KinD environments.
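A check in the spirit of that verification might look like the sketch below. The helper name and the pod-spec source are assumptions (the real test harness is not shown here); the dict shape matches what `kubectl get pod -o json` reports under `.spec`.

```python
def assert_gpu_requested(pod_spec: dict, expected: str = "8") -> None:
    """Fail unless some container in the pod spec requests nvidia.com/gpu (illustrative)."""
    gpus = [
        c.get("resources", {}).get("limits", {}).get("nvidia.com/gpu")
        for c in pod_spec.get("containers", [])
    ]
    assert expected in gpus, f"no container requests {expected} GPUs: {gpus}"

# Example pod spec fragment as the k8s API would report it:
spec = {
    "containers": [
        {"name": "task", "resources": {"limits": {"nvidia.com/gpu": "8"}}}
    ]
}
assert_gpu_requested(spec)  # passes: the task container requests 8 GPUs
```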
The test imports fray, which isn't available in the iris sub-package test environment. Moving to top-level tests/ where fray is on the path. Inlined the small iris test helpers (ServiceTestHarness, _HarnessController, _make_test_entrypoint) to avoid cross-package imports from lib/iris/tests/.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
- Add license header to tests/iris/__init__.py
- Fix blank line formatting in test_kind_gpu_canary.py

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Approving per the trusted-author review policy, but there are two regressions to fix before merge:
- The branch tip no longer contains the `--gpu H100x8 --cpu=32 --memory=256G --disk=256G` launcher change from `be419841`; `5ee28e4b3` reverted it, so this PR does not currently fix #3981.
- `tests/iris/__init__.py` shadows the real `iris` package in existing Marin test jobs (`PYTHONPATH=tests:.` and `python tests/integration_test.py`).
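The shadowing regression the review describes can be reproduced in miniature: once a bare `iris/__init__.py` sits on an earlier `sys.path` entry, the real package's submodules (such as `iris.logging`) become unimportable. This is a self-contained sketch using a temp directory in place of `tests/iris/`; nothing here touches the actual repo layout.

```python
import importlib
import pathlib
import sys
import tempfile

# Create an empty stand-in for tests/iris/__init__.py in a temp dir.
tmp = pathlib.Path(tempfile.mkdtemp())
(tmp / "iris").mkdir()
(tmp / "iris" / "__init__.py").write_text("")

# Simulate PYTHONPATH=tests:. putting the shadow package first.
sys.path.insert(0, str(tmp))

import iris  # resolves to the empty shadow package, not the real one

try:
    importlib.import_module("iris.logging")
    shadowed = False
except ModuleNotFoundError:
    # The failure mode from the validation run: "No module named 'iris.logging'".
    shadowed = True
```

This is why adding `tests/iris/__init__.py` (making `tests/iris` a package while `tests` is on the path) breaks any job that imports the real `iris`.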
Validation:
- `uv run --no-sync --package marin --group test pytest tests/iris/test_kind_gpu_canary.py -q` -> `2 passed`
- `uv run --no-sync --package marin --group test python tests/integration_test.py --help` -> `ModuleNotFoundError: No module named 'iris.logging'`
Generated with Codex.
Force-pushed ce071cd to 1f8bc2b
Adds task pod logs and describe output to failure diagnostics. Previously only controller info was captured, making the opaque "Error" message impossible to debug from CI logs. Also adds a `kind` based test to see if we can trigger the same allocation issue.