Releases: NVIDIA/cloudai
v1.3.beta2
What's Changed
- Bump version to v1.3 by @amaslenn in #508
- Support hostfile generation and override distribution for explicit nodes by @TaekyungHeo in #509
Full Changelog: v1.3.beta1...v1.3.beta2
v1.3.beta1
What's Changed
- Nemo2.0 Perf Features (next set) by @srivatsankrishnan in #496
- Remove requirements files to rely on pyproject for dependencies by @amaslenn in #505
- Revert "Integrate refactored Chakra replay #469" by @TaekyungHeo in #504
Full Changelog: v1.2.rc2...v1.3.beta1
v1.2.0
CloudAI v1.2 (GA) release notes
Compatibility
CloudAI v1.2 has been tested with: PyTorch/JAX NGC Container 25.04, NCCL 2.25, and SPC-X 1.2.
Key Features and Enhancements:
- Full support for Nemo 2.0 models
- Custom report generation
- Support for Blackwell systems
- Support for Run.AI - limited to NCCL
- Support for MegatronLM workload, Nemotron15B
v1.2.rc2
What's Changed
- Prevent NeMoRunDataStoreReportGenerationStrategy from overriding metric logic by @TaekyungHeo in #494
- Add Bokeh report generation to NeMoRunReportGenerationStrategy class by @TaekyungHeo in #499
- Fix premature runner exit by looping until all jobs are complete by @TaekyungHeo in #493
Full Changelog: v1.2.rc1...v1.2.rc2
v1.2.rc1
What's Changed
- Fix tarball name to preserve full directory name with dots intact by @TaekyungHeo in #491
Full Changelog: v1.2.beta16...v1.2.rc1
v1.2.beta16
What's Changed
- Fix typo in src/cloudai/workloads/nemo_run/slurm_command_gen_strategy.py by @TaekyungHeo in #481
- Validate save/load paths for MegatronRun by @amaslenn in #483
- Nemo2.0 fixes for Llama70b by @srivatsankrishnan in #486
- Updates for slurm metadata by @amaslenn in #488
- Clean up Nemo2.0 Configs by @srivatsankrishnan in #484
- Add context manager to ensure kube config exists by @TaekyungHeo in #485
- Support data repository for NeMoRun LLAMA models by @TaekyungHeo in #464
- Add TarballReporter for archiving test results on failure by @TaekyungHeo in #460
Full Changelog: v1.2.beta15...v1.2.beta16
v1.2.beta15
What's Changed
- Refactor: Use test definition's cmd_args for sleep command generation by @TaekyungHeo in #478
- Nemo2.0 with Recipe for Complex CLI Features (Plan B) by @srivatsankrishnan in #466
- Update output path for DSE runs by @amaslenn in #473
- Collect more metadata for slurm jobs by @amaslenn in #479
- Set sequence length explicitly in NeMoRun config and data model by @TaekyungHeo in #480
Full Changelog: v1.2.beta14...v1.2.beta15
v1.2.beta14
What's Changed
- Restore dry-run on system without slurm by @amaslenn in #470
- Refactor: Use test definition's cmd_args for UCC command generation by @TaekyungHeo in #471
- Treat warnings as errors in pytest by @amaslenn in #474
- Refactor: Use test definition's cmd_args for sleep command generation by @TaekyungHeo in #472
- Updates and fixes for metadata collection by @amaslenn in #475
- Integrate refactored Chakra replay by @TaekyungHeo in #469
Full Changelog: v1.2.beta13...v1.2.beta14
v1.2.beta13
What's Changed
- [Feature Request/Bug]: Remove Workspace Mount by @srivatsankrishnan in #462
Full Changelog: v1.2.beta12...v1.2.beta13
v1.2.beta12
What's Changed
- Support NCCL_TOPO_FILE in all SlurmCommandGenStrategy classes by @TaekyungHeo in #458
- Fix: Include CANCELLED+ state in completed job status check by @TaekyungHeo in #461
- Fix is_running for slurm by @amaslenn in #467
Full Changelog: v1.2.beta11...v1.2.beta12