Releases: NVIDIA/cloudai
v0.9.beta16
Highlights
Use subcommands instead of --mode <value> by @amaslenn in #194
The new help message looks like this:

    > cloudai --help
    usage: cloudai [-h] [--log-file LOG_FILE] [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                   {uninstall,install,dry-run,run,generate-report,verify-systems,verify-tests,verify-test-scenarios} ...

    Cloud AI

    optional arguments:
      -h, --help            show this help message and exit
      --log-file LOG_FILE   The name of the log file (default: debug.log).
      --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                            Set the logging level (default: INFO).

    modes:
      {uninstall,install,dry-run,run,generate-report,verify-systems,verify-tests,verify-test-scenarios}
        uninstall           Remove the installed dependencies.
        install             Prepare execution by setting up env and dependencies for the tests to run.
        dry-run             Perform a dry-run of the test scenarios without executing them.
        run                 Execute the test scenarios.
        generate-report     Generate a report based on the test results.
        verify-systems      Verify the system configurations.
        verify-tests        Verify the test configurations.
        verify-test-scenarios
                            Verify the test scenario configurations.

- Each command (a.k.a. mode) has its own help message.
- Each command also has a unique set of required and optional arguments. Many commands share the same options, but others are quite different, for example:

    > cloudai run --help
    usage: cloudai run [-h] --system-config SYSTEM_CONFIG --tests-dir TESTS_DIR --test-scenario TEST_SCENARIO [--output-dir OUTPUT_DIR]

    optional arguments:
      -h, --help            show this help message and exit
      --system-config SYSTEM_CONFIG
                            Path to the system configuration file.
      --tests-dir TESTS_DIR
                            Path to the test configuration directory.
      --test-scenario TEST_SCENARIO
                            Path to the test scenario file.
      --output-dir OUTPUT_DIR
                            Path to the output directory.

    > cloudai verify-tests --help
    usage: cloudai verify-tests [-h] test_configs

    positional arguments:
      test_configs  Path to the test configuration file or directory.

    optional arguments:
      -h, --help    show this help message and exit
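For readers unfamiliar with the pattern, the subcommand layout above can be approximated with the standard library's argparse sub-parsers. This is a minimal sketch, not CloudAI's actual implementation: the function name build_parser is hypothetical, only two of the eight modes are shown, and the option names and help strings are copied from the help output above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI with per-mode subcommands (illustrative sketch only)."""
    parser = argparse.ArgumentParser(prog="cloudai", description="Cloud AI")
    parser.add_argument("--log-file", default="debug.log", help="The name of the log file.")
    parser.add_argument(
        "--log-level",
        default="INFO",
        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
        help="Set the logging level.",
    )

    # Each mode becomes a subcommand with its own arguments and help message.
    # required=True (Python 3.7+) makes argparse reject a call without a mode.
    subparsers = parser.add_subparsers(dest="mode", title="modes", required=True)

    run = subparsers.add_parser("run", help="Execute the test scenarios.")
    run.add_argument("--system-config", required=True, help="Path to the system configuration file.")
    run.add_argument("--tests-dir", required=True, help="Path to the test configuration directory.")
    run.add_argument("--test-scenario", required=True, help="Path to the test scenario file.")
    run.add_argument("--output-dir", help="Path to the output directory.")

    verify = subparsers.add_parser("verify-tests", help="Verify the test configurations.")
    verify.add_argument("test_configs", help="Path to the test configuration file or directory.")

    return parser
```

Calling build_parser().parse_args(["verify-tests", "conf/"]) yields a namespace whose mode attribute is "verify-tests", so the caller can dispatch on args.mode while each subcommand validates only its own arguments. The path "conf/" is a placeholder.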
What's Changed
- Refactor NeMoLauncherSlurmCommandGenStrategy unit tests by @TaekyungHeo in #252
- Refactor JaxToolboxSlurmCommandGenStrategy by @TaekyungHeo in #249
Full Changelog: v0.9.beta15...v0.9.beta16
v0.9.beta15
What's Changed
- Remove assigning null when the value is null (NeMo launcher) by @TaekyungHeo in #250
Full Changelog: v0.9.beta14...v0.9.beta15
v0.9.beta14
What's Changed
- Fix bug in violating Kubernetes naming rules by @TaekyungHeo in #244
- Add unit tests for SlurmCommandGenStrategy by @TaekyungHeo in #247
- Fix missing 'output_path' in cmd_args by @amaslenn in #251
Full Changelog: v0.9.beta13...v0.9.beta14
v0.9.beta13
What's Changed
- Update Sleep to ensure implementation consistency by @TaekyungHeo in #234
- Update USER_GUIDE.md and README.md by @TaekyungHeo in #235
- Remove duplicated _format_env_vars calls by @TaekyungHeo in #233
- Rename test definitions by @TaekyungHeo in #237
- Remove unnecessary arg from generate_test_command by @TaekyungHeo in #238
- Spin-off cmd_args validation logic for SlurmCommandGenStrategy by @TaekyungHeo in #236
- Expect SlurmSystem in respective cmd_gen and installer classes by @amaslenn in #239
- Move more fields from Test to TestRun by @amaslenn in #240
- Make TestDefinition a part of Test by @amaslenn in #241
- Minor refactoring on SlurmCommandGenStrategy by @TaekyungHeo in #246
- Break down test_slurm_command_gen_strategy into smaller tests by @TaekyungHeo in #245
- Resolve K8s Comments (Part 1) by @TaekyungHeo in #242
- Fix race condition during docker images caching by @amaslenn in #248
Full Changelog: v0.9.beta12...v0.9.beta13
v0.9.beta12
Highlights
We are working on schema improvements to simplify config management and make configs verifiable. This helps ensure that configs are correct before expensive runs on real hardware. Today we are enabling this for Test Scenario configs. This is a continuation of #145.
- Tests becomes an array. This helps make case names more expressive:

  before:

      [Tests.1]
      # ...

  now:

      [[Tests]]
      id = "any-name.you_want" # before it was just "1"

- The id field is mandatory, must be unique, and is used to specify dependencies:

      [[Tests]]
      id = "Tests.1"
      # ...
      [[Tests]]
      id = "Tests.2"
      # ...
      [[Tests.dependencies]]
      id = "Tests.1"
      # ...

- name (under the list of tests) is renamed to test_name to better reflect its meaning. It still references a test defined in a separate TOML file.
- Dependencies are converted to a list to support multiple dependencies of the same type.

  before:

      # ...
      [Tests.2]
      name = "ucc_test_alltoall"
      [Tests.2.dependencies]
      start_post_comp = { name = "Tests.1", time = 0 } # only one dependency of this type is allowed

  now:

      # ...
      [[Tests]]
      id = "Tests.3"
      test_name = "ucc_test_alltoall"
      # ...
      [[Tests.dependencies]]
      type = "start_post_comp"
      id = "Tests.1"
      [[Tests.dependencies]]
      type = "start_post_comp"
      id = "Tests.2"
What's Changed
- Cover wrong python bin path in exec script bug by @amaslenn in #232
- Pydantic for Test Scenario by @amaslenn in #205
Full Changelog: v0.9.beta11...v0.9.beta12
v0.9.beta11
What's Changed
- Pass TestRun to gen_exec_command and gen_json by @TaekyungHeo in #228
- Bug fix for incorrect py_bin in NeMoLauncher by @TaekyungHeo in #231
Full Changelog: v0.9.beta10...v0.9.beta11
v0.9.beta10
What's Changed
- Remove unnecessary indirections by @TaekyungHeo in #226
- Remove Installer class to reduce code indirection by @amaslenn in #227
- Refactor SlurmCommandGenStrategy by @TaekyungHeo in #229
- Use venv for nemo launcher by @amaslenn in #230
Full Changelog: v0.9.beta9...v0.9.beta10
v0.9.beta9
What's Changed
- Generate Bash script during Nemo Launcher by @srivatsankrishnan in #219
- Remove hardcoded value for data.index.mapping by @srivatsankrishnan in #225
Full Changelog: v0.9.beta8...v0.9.beta9
v0.9.beta8
What's Changed
- Unset argument in NeMo launcher when its value is ~ by @TaekyungHeo in #223
- Rename NeMo-related TOML configurations to reflect the exact model by @TaekyungHeo in #222
- Use mock data by default in NeMo launcher by @TaekyungHeo in #224
Full Changelog: v0.9.beta7...v0.9.beta8
v0.9.beta7
What's Changed
- Use absolute paths in NeMo launcher by @TaekyungHeo in #221
Full Changelog: v0.9.beta6...v0.9.beta7