Releases: NVIDIA/cloudai
v1.3.rc2
What's Changed
- Pass extra_srun_args during install by @karya0 in #605
- Add NIXL perftest (kvbench for sequential-ct-perftest) support by @amaslenn in #604
- Docker cache fix by @karya0 in #606
- Support for fp8 Llama3_405b by @srivatsankrishnan in #593
- Numa control in Nemo2.0 by @srivatsankrishnan in #607
Full Changelog: v1.3.rc1...v1.3.rc2
v1.3.rc1
v1.3.beta30
What's Changed
- Small improvements by @amaslenn in #599
- Fix docker image cache CLI for gres support by @amaslenn in #600
- Update doc/ai_dynamo.md by @TaekyungHeo in #601
- Remove header when using sinfo by @amaslenn in #602
- Update AI Dynamo config to use vLLM_V1 API by @karya0 in #595
Full Changelog: v1.3.beta29...v1.3.beta30
v1.3.beta29
What's Changed
- Handles comma in env vars values for NemoLauncher by @amaslenn in #591
- Create CmdGenStrategy per usage by @amaslenn in #596
- Require docker image for NCCL tests to be explicitly set in config by @amaslenn in #597
- Rely on member test run object instead of args by @amaslenn in #598
Full Changelog: v1.3.beta28...v1.3.beta29
v1.3.beta28
What's Changed
- Avoid confusing post_test/pre_test folder structure by @amaslenn in #592
- Remove default_cmd_args field from TestTemplateStrategy by @amaslenn in #594
- Add AI Dynamo by @TaekyungHeo in #519
- Enable NCCL w/ K8S SPCx by @TaekyungHeo in #579
Full Changelog: v1.3.beta27...v1.3.beta28
v1.3.beta27
What's Changed
- Silently skip NIXL summary generation if no NIXL tests by @amaslenn in #587
- Llama31_405b by @srivatsankrishnan in #582
- Merge JobIdRetrieval functionality into respective runners by @amaslenn in #588
- Re-work job status fetching by @amaslenn in #589
- Update UCC configs by @amaslenn in #590
Full Changelog: v1.3.beta26...v1.3.beta27
v1.3.beta26
What's Changed
- Update regex to correctly extract full GPU type names including suffixes and variants by @TaekyungHeo in #578
- Fix missing k8s import by using lazy.k8s in MPIJob delete call by @TaekyungHeo in #580
- Align method with BaseRunner by renaming to on_job_completion and removing async by @TaekyungHeo in #581
- Add DockerImage support to Kubernetes installer methods by @TaekyungHeo in #583
- Match json_gen_strategy implementation to command_gen_strategy by @TaekyungHeo in #585
- Fix nodes allocation from the same group by @amaslenn in #586
- Guard on_job_submit with null check for _command_gen_strategy access by @TaekyungHeo in #584
Full Changelog: v1.3.beta25...v1.3.beta26
v1.3.beta25
What's Changed
- Add BashCmd workload by @amaslenn in #570
- Correctly load and save tdef as part of TestRunDetails by @amaslenn in #574
- Make NIXL work in single-sbatch mode by @amaslenn in #575
- Re-work slurm node status update by @amaslenn in #577
- Add NIXL summary report by @amaslenn in #576
Full Changelog: v1.3.beta24...v1.3.beta25
v1.3.beta24
v1.3.beta23
What's Changed
- Add configurable reward functions to CloudAIGym by @TaekyungHeo in #566
Full Changelog: v1.3.beta22...v1.3.beta23