Releases: NVIDIA/cloudai
v0.8.rc1
v0.8.rc0
What's Changed
- update container version by @jeffnvidia in #122
- Update copyright format by @amaslenn in #155
- Enhance job status error message for clarity and user guidance by @TaekyungHeo in #156
- Upgrade pyright, fix warning by @amaslenn in #157
- Allow extra srun args set in system config by @amaslenn in #160
- Automatically define version from the git tags by @amaslenn in #162
- add support for NeMo Launcher in the reservation by @jeffnvidia in #161
- Add job_status_check flag to disable checks for specific tests by @TaekyungHeo in #146
- Defer docker image URL accessibility check to srun when not caching locally by @TaekyungHeo in #164
- Check TOMLs formatting with taplo by @amaslenn in #163
- Hierarchical Test template to support Grok/GPT via PAXML by @srivatsankrishnan in #141
Full Changelog: v0.7.14...v0.8.rc0
v0.7.14
What's Changed
- Make sure to dry-run all tests in a test scenario by @TaekyungHeo in #149
- Add warning for insufficient epochs in JaxToolbox report generation by @TaekyungHeo in #148
- Remove subtest name check in NcclTestSlurmCommandGenStrategy by @TaekyungHeo in #147
- Handle disk quota exceeded error in cache_docker_image method by @TaekyungHeo in #150
- Remove unused properties from TestTemplateStrategy by @amaslenn in #151
- Enhance error messages to provide guidance for missing schemas by @TaekyungHeo in #152
- Bump version to v0.7.14 by @TaekyungHeo in #153
Full Changelog: v0.7.13...v0.7.14
v0.7.13
What's Changed
- Pass all env vars to final command in NeMo launcher test template by @TaekyungHeo in #134
- Added JaxToolbox (Grok) troubleshooting steps by @TaekyungHeo in #142
- Improve tokenizer path handling in NeMo Launcher Slurm strategy by @TaekyungHeo in #136
- Remove identical if-else branches by @amaslenn in #143
- Move parts of srun CLI generation into base class by @amaslenn in #140
Full Changelog: v0.7.12...v0.7.13
v0.7.12
What's Changed
- Add how to download NeMo launcher tokenizer in the USER_GUIDE by @jeffnvidia in #115
- Update logging to use dynamic log file name from args.log_file in run mode by @TaekyungHeo in #131
- Update README: add mandatory args test-templates-dir and tests-dir by @TaekyungHeo in #132
- Update NeMo launcher troubleshooting guide for clarity and conciseness by @TaekyungHeo in #135
- Add support for Sleep on Slurm systems by @TaekyungHeo in #121
- Raise descriptive exceptions when strategies are missing by @TaekyungHeo in #137
- Add L40s test configs by @srinivas212 in #138
- Bump version to v0.7.12 by @TaekyungHeo in #139
Full Changelog: v0.7.11...v0.7.12
v0.7.11
What's Changed
- Enhance Quick Start guide for Docker repo access and API key by @TaekyungHeo in #99
- Add copyright headers for TOML, update its format by @amaslenn in #117
- Add pyxis mktemp error handling and test cases in JaxToolbox strategy by @TaekyungHeo in #118
- Add section on downloading NeMo datasets to USER_GUIDE.md by @TaekyungHeo in #116
- Enhance USER_GUIDE.md with system schema description and troubleshooting steps by @TaekyungHeo in #120
- Fix bug in generating NeMo launcher command by @TaekyungHeo in #124
- Add section on describing a test scenario to USER_GUIDE.md by @TaekyungHeo in #123
- Add cache_docker_images_locally field to system schema in USER_GUIDE.md by @TaekyungHeo in #125
- Allow local docker image caching for JaxToolbox by @TaekyungHeo in #126
- Update BaseRunner to include scenario name in output directory by @TaekyungHeo in #119
- Add mpi field to SlurmSystem and allow different MPI options in schema by @TaekyungHeo in #127
- Bump version to v0.7.11 by @TaekyungHeo in #128
Full Changelog: v0.7.10...v0.7.11
v0.7.10
What's Changed
- Reduce logs for stdout by @amaslenn in #108
- Update USER_GUIDE by @amaslenn in #109
- Fix incorrect argument name in README by @TaekyungHeo in #110
- Remove bisection tests from NCCL-test template by @TaekyungHeo in #113
- Fix bug in NeMo launcher report generation by @TaekyungHeo in #112
- Add pip install requirements.txt step to NeMoLauncherSlurmInstallStrategy by @TaekyungHeo in #111
- Bump version to v0.7.10 by @TaekyungHeo in #114
Full Changelog: v0.7.9...v0.7.10
v0.7.9
What's Changed
- Fix _check_docker_image_accessibility condition and add detailed logging by @TaekyungHeo in #107
- Refactor handle_install_and_uninstall to identify unique test templates by @TaekyungHeo in #105
- Bump version to v0.7.9 by @TaekyungHeo in #106
Full Changelog: v0.7.8...v0.7.9
v0.7.8
What's Changed
- Handle 401 Unauthorized error with detailed instructions for Docker image access by @TaekyungHeo in #103
- Bump version to v0.7.8 by @TaekyungHeo in #104
Full Changelog: v0.7.7...v0.7.8
v0.7.7
What's Changed
- Refactor _check_docker_image_accessibility to remove srun usage by @TaekyungHeo in #98
- Update NcclTestJobStatusRetrievalStrategy to improve error messages by @TaekyungHeo in #100
- Update JaxToolboxJobStatusRetrievalStrategy to improve error messages by @TaekyungHeo in #101
- Bump version to v0.7.7 by @TaekyungHeo in #102
Full Changelog: v0.7.6...v0.7.7