Releases: NVIDIA/cloudai
v1.2.beta11
What's Changed
- Run copyright tests only in CI by @amaslenn in #448
- Move Slurm-specific methods to SlurmCommandGenStrategy by @TaekyungHeo in #454
- Bug fix in src/cloudai/cli/handlers.py by @TaekyungHeo in #456
Full Changelog: v1.2.beta10...v1.2.beta11
v1.2.beta10
What's Changed
- Register reports via Registry by @amaslenn in #445
- Remove duplicated comments in runners by @TaekyungHeo in #452
- Make sure error is printed for DSE + unknown metric by @amaslenn in #446
- Replace pyright ignores with explicit type handling by @TaekyungHeo in #453
- Return None when main test has no time limit in calculate_total_time_limit by @TaekyungHeo in #449
- Configurable scenario reporters by @amaslenn in #447
- Move Slurm-specific methods to SlurmCommandGenStrategy by @TaekyungHeo in #450
- Always use mock system for loading scenarios in verify-configs by @amaslenn in #455
- Check job completion before scheduling post-init dependent tests by @TaekyungHeo in #451
- Add RunAI scheduler support and enable NCCL tests submission by @TaekyungHeo in #436
Full Changelog: v1.2.beta9...v1.2.beta10
v1.2.beta9
What's Changed
- Nemo2.0 Lora + Null tokenizer by @srivatsankrishnan in #430
- Update USER_GUIDE.md by @srinivas212 in #435
- Fix metadata collection by @amaslenn in #433
- Add metrics support for UCC reporting by @amaslenn in #428
- Scenario reporter for DSE jobs by @amaslenn in #431
- Fix installation check for File by @amaslenn in #437
- Add fallback version to allow installation from tarballs by @amaslenn in #439
- Fix unknown state of the group nodes by @amaslenn in #440
- Fix UCC HTML report generation by @amaslenn in #441
- Refactor NCCL reports: Clean format, prepare for future enhancements by @TaekyungHeo in #432
- Allow multiple DSE cases in a scenario by @amaslenn in #438
- Nemo Dry-Run/Run Fix by @srivatsankrishnan in #444
Full Changelog: v1.2.beta8...v1.2.beta9
v1.2.beta8
What's Changed
- Fix typo in the name/description field for a test by @srivatsankrishnan in #405
- Support installing PythonExecutable in a subpath by @TaekyungHeo in #404
- Generate scenario-level report by @amaslenn in #400
- Bump jinja2 from 3.1.5 to 3.1.6 by @dependabot in #410
- Add Prediction Report to NCCL tests by @TaekyungHeo in #407
- Migrate NCCL performance data parsing to performance report generator by @TaekyungHeo in #414
- Use abs path to jinja2 template by @amaslenn in #417
- Extend DSE for Env Vars along with Cmd Args by @srivatsankrishnan in #408
- New constraint check for Nemo2.0 DSE by @srivatsankrishnan in #415
- Support 'unknown' cmd args in NCCL cmd generation by @amaslenn in #416
- New Constraint Check Nemo2.0 by @srivatsankrishnan in #418
- Reporting for encoded logs by @srivatsankrishnan in #412
- Fix system serialization by @amaslenn in #420
- Make is_dse_job a property of TestDefinition by @amaslenn in #425
- Collect metadata on Slurm systems by @amaslenn in #421
- Reporter-based metrics by @amaslenn in #426
- LSF System integration to CloudAI by @srivatsankrishnan in #423
- Consider only RUNNING state as running job by @amaslenn in #429
New Contributors
- @dependabot made their first contribution in #410
Full Changelog: v1.2.beta7...v1.2.beta8
v1.2.beta7
What's Changed
- Refactor NcclTestReportGenerationStrategy by @TaekyungHeo in #403
- Remove node list definition from slurm partition by @amaslenn in #385
- Fix copyright years by @amaslenn in #406
Full Changelog: v1.2.beta6...v1.2.beta7
v1.2.beta6
What's Changed
- Make sure copyright year is valid by @amaslenn in #396
- Close the loop on policy update for smarter agents by @srivatsankrishnan in #395
- Prepare for using multiple reporters per test definition by @amaslenn in #386
- Move signal handling from BaseRunner to Runner by @amaslenn in #398
- Allow skipping cache validation by @amaslenn in #394
Full Changelog: v1.2.beta5...v1.2.beta6
v1.2.beta5
What's Changed
- Fix megatron ref sbatch by @amaslenn in #384
- Refactor Callbacks on CloudAI Nemo Run Script by @srivatsankrishnan in #381
- Use non-abs bin names for NCCL tests by @amaslenn in #382
- Converged Configs for Nemotron15b 2-64 nodes for Nemo2.0 by @srivatsankrishnan in #389
- Update NeMo docker image to nvcr.io/nvidia/nemo:24.12.01 by @TaekyungHeo in #388
- Split nemo_launcher_nemotron_15b_*.toml scenarios by @TaekyungHeo in #387
- Refactor NeMoLauncher report generation to report avg, min, max, and median by @TaekyungHeo in #391
- More logs and tests for uninstall logic by @amaslenn in #392
- Ensure unique NeMo Launcher job names using timestamp to avoid conflicts by @TaekyungHeo in #390
- Modifications for adding more agents (other than Shmoo). by @srivatsankrishnan in #383
Full Changelog: v1.2.beta4...v1.2.beta5
v1.2.beta4
What's Changed
- Add nemo_launcher_nemotron_15b configurations by @TaekyungHeo in #377
- Create ranks mapping file for slurm jobs by @amaslenn in #368
- Nemotron15b (fp8/bf16) for Acceptance (Nemo2.0) by @srivatsankrishnan in #380
- POC version of Megatron Run workload by @amaslenn in #379
Full Changelog: v1.2.beta3...v1.2.beta4
v1.2.beta3
What's Changed
- Update NeMo dataset link in USER_GUIDE.md by @TaekyungHeo in #370
- Add .github/CODEOWNERS by @TaekyungHeo in #371
- Merge test definition and test templates under workloads by @amaslenn in #374
- Use custom script for NemoRun jobs by @amaslenn in #372
- Allow uint8 in NCCL ALLtoALL by @srivatsankrishnan in #376
- Allow unknown fields for cmd_args in test definitions by @amaslenn in #373
- Remove unused code from SlurmSystem by @amaslenn in #378
Full Changelog: v1.2.beta2...v1.2.beta3
v1.2.beta2
What's Changed
- Introduce 'cmd' field for SlurmContainer jobs by @amaslenn in #362
- Initial DSE parameters for llama_8b by @srivatsankrishnan in #359
- Small housekeeping updates by @amaslenn in #363
- Base Config for NemoRun LLama3-8b by @srivatsankrishnan in #366
- Rework reporting logic by @amaslenn in #360
- Llama and Nemotron Configs by @srivatsankrishnan in #365
- Enable and configure Nsys tracing via test config by @amaslenn in #364
Full Changelog: v1.2.beta1...v1.2.beta2