Add local runtime support for in-situ unit test bisection (#1501)
Introduces a new `local` mode to the triage utility that allows
developers to perform a commit-level bisection for unit tests without
requiring container orchestration.
This is designed for the case where a user is already inside a failing
container and wants to find a culprit commit directly.
**Changes:**
1. Adds a new option, **--container-runtime=local**.
2. When this mode is active, all `git` operations and build/test
commands are executed directly on the local machine using `subprocess`.
3. The `run_and_log` utility was updated to accept a `cwd` parameter.
This was necessary to allow the `LocalContainer` to execute commands
like `git checkout` and `bazel` builds within the correct local
repository directories.
4. The `local` runtime skips the container search, so it requires the
user to provide the bisection range explicitly using:
- `--passing-commits="jax:<hash>,xla:<hash>"`
- `--failing-commits="jax:<hash>,xla:<hash>"`
5. The argument parser has been updated to enforce these new rules and
prevent mixing `local` mode with container-specific arguments (e.g.,
`--container`, `--start-date`).
6. Adds unit tests for the new argument parsing logic.
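As a rough illustration of item 3, here is a minimal sketch of a `run_and_log`-style helper that forwards a `cwd` argument to `subprocess.run`. The signature and logging behaviour are assumptions made for illustration and may differ from the real utility in the triage tool.

```python
import logging
import subprocess
import sys
import tempfile


def run_and_log(command, logger, cwd=None):
    """Hypothetical, simplified stand-in for the tool's `run_and_log`.

    Runs `command` (a list of strings) in the directory `cwd`, logs its
    combined stdout/stderr line by line, and returns the CompletedProcess.
    """
    logger.debug("Running %s in %s", command, cwd or "the current directory")
    result = subprocess.run(
        command,
        cwd=cwd,  # e.g. the local checkout in which `git checkout` should run
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    for line in result.stdout.splitlines():
        logger.debug(line)
    return result


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    log = logging.getLogger("triage-sketch")
    # A temporary directory stands in for a repository checkout.
    with tempfile.TemporaryDirectory() as repo_dir:
        run_and_log(
            [sys.executable, "-c", "import os; print(os.getcwd())"],
            log,
            cwd=repo_dir,
        )
```

Passing `cwd` per call, rather than changing the process-wide working directory, keeps each `git` or build command tied to the repository it is meant to act on.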
Now, this command can be run inside a JAX development container:
```
jax-toolbox-triage \
--container-runtime=local \
--passing-commits="jax:<some_good_hash>,xla:<some_good_hash>" \
--failing-commits="jax:<some_bad_hash>,xla:<some_bad_hash>" \
/opt/jax-toolbox/.github/container/test-jax.sh "some_target"
```
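The argument rules described in items 4 and 5 could be sketched roughly as below. This is an assumption-laden illustration rather than the tool's actual parser: `parse_commits`, `parse_args`, the `docker` default, and the error messages are hypothetical, and only a few of the real options are modelled.

```python
import argparse


def parse_commits(value):
    """Parse 'jax:<hash>,xla:<hash>' into {'jax': '<hash>', 'xla': '<hash>'}."""
    commits = {}
    for item in value.split(","):
        package, sep, commit = item.partition(":")
        if not sep or not package.strip() or not commit.strip():
            raise argparse.ArgumentTypeError(
                f"expected package:hash, got {item!r}"
            )
        commits[package.strip()] = commit.strip()
    return commits


def parse_args(argv):
    """Hypothetical parser enforcing the `local` mode rules from the text."""
    parser = argparse.ArgumentParser(prog="triage-sketch")
    # The default value here is an assumption made for the sketch.
    parser.add_argument("--container-runtime", default="docker")
    parser.add_argument("--container")
    parser.add_argument("--start-date")
    parser.add_argument("--passing-commits", type=parse_commits)
    parser.add_argument("--failing-commits", type=parse_commits)
    parser.add_argument("test_command", nargs="+")
    args = parser.parse_args(argv)
    if args.container_runtime == "local":
        # No container search happens, so the range must be given explicitly.
        if args.passing_commits is None or args.failing_commits is None:
            parser.error(
                "local mode requires --passing-commits and --failing-commits"
            )
        # Container-specific options make no sense without orchestration.
        if args.container is not None or args.start_date is not None:
            parser.error(
                "local mode cannot be combined with --container or --start-date"
            )
    return args
```

With this sketch, `parse_args(["--container-runtime=local", "--passing-commits=jax:aaa,xla:bbb", "--failing-commits=jax:ccc,xla:ddd", "test.sh"])` yields dictionaries keyed by package name, while adding `--container=jax` to the same invocation exits with a usage error.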
docs/triage-tool.md: 11 additions & 12 deletions
@@ -3,21 +3,20 @@
 `jax-toolbox-triage` is a tool to automate the process of attributing regressions to an
 individual commit of JAX or XLA.
 It takes as input a command that returns an error (non-zero) code when run in "recent"
-containers, but which returns a success (zero) code when run in some "older" container.
-The command must be executable within the containers, *i.e.* it cannot refer to files
+environments, but which returns a success (zero) code when run in some "older" environment.
+The command must be executable within the triage environment, *i.e.* it cannot refer to files
 that only exist on the host system, unless those are explicitly mounted in using the
-`-v` (`--container-mount`) option.
+`-v` (`--container-mount`) option when using a container-based runtime.
 
-The tool follows a three-step process:
+The tool follows a process that can include up to three steps:
 1. A container-level search backwards from the "recent" container where the test is
 known to fail, which identifies an "older" container where the test passes. This
 search proceeds with an exponentially increasing step size and is based on the
 `YYYY-MM-DD` tags under `ghcr.io/nvidia/jax`.
 2. A container-level binary search to refine this to the **latest** available
 container where test passes and the **earliest** available container where it
 fails.
-3. A commit-level binary search, repeatedly building + testing inside the same
-container, to identify a single commit of a software package known to the tool
+3. A commit-level binary search, repeatedly building + testing, to identify a single commit of a software package known to the tool
 (JAX, XLA, Flax, optionally MaxText) that causes the test to start failing, and a
 set of reference commits of the other packages that can be used to reproduce the
 regression.
@@ -64,20 +63,20 @@ or more machines with appropriate GPUs, *e.g.* inside an `salloc` session.
 Appropriate arguments (number of nodes, number of tasks per node, *etc.*) should be
 passed to `salloc` or set via `SLURM_` environment variables so that a bare `srun` will
 correctly launch the test case.
+If `--container-runtime=local` is used, the tool assumes it is already inside a JAX container and will execute all build and test commands directly.
 
 ## Usage
 
 To use the tool, there are two compulsory inputs:
 * A test command to triage.
-* A specification of which containers to triage in. There are two choices here:
-* `--container`: which of the `ghcr.io/nvidia/jax:CONTAINER-YYYY-MM-DD` container
-families to execute the test command in. Example: `jax` for a JAX unit test
-failure, `maxtext` for a MaxText model execution failure. The `--start-date` and
+* A specification of the triage scope. There are three choices here:
+* **Container Search**: Usage `--container` to specify which of the `ghcr.io/nvidia/jax:CONTAINER-YYYY-MM-DD` container families to search through. Example: `jax` for a JAX unit test failure, `maxtext` for a MaxText model execution failure. The `--start-date` and
 `--end-date` options can be combined with `--container` to tune the search; see
 below for more details.
-* `--passing-container` and `--failing-container`: a pair of URLs to containers to
+* **Commit Search between Containers**: `--passing-container` and `--failing-container`: a pair of URLs to containers to
 use in the commit-level search; if these are passed then no container-level
 search is performed.
+* **Local Commit Search**: Use `--container-runtime=local` when you are already inside a JAX container. This mode skips all container orchestration and performs a commit-level search directly in the local container. It requires you to specify the commit range with `--passing-commits` and `--failing-commits`.
 
 The test command will be executed directly in the container, not inside a shell, so be
 sure not to add excessive quotation marks (*i.e.* run
@@ -88,7 +87,7 @@ as fast and targeted as possible.
 If you want to run multiple commands, you might want to use something like
 `jax-toolbox-triage --container=jax sh -c "command1 && command2"`.
 
-Alternatively, you can use `-v` (`--container-mount`) to mount a host directory
+Alternatively, when using a container runtime, you can use `-v` (`--container-mount`) to mount a host directory
 containing test scripts into the container and execute a script from there, *e.g.*