Releases: NVIDIA/cloudai
v1.4.beta7
What's Changed
- Handle multi-section CSV format in AI Dynamo report generation by @TaekyungHeo in #620
Full Changelog: v1.4.beta6...v1.4.beta7
v1.4.beta6
What's Changed
- Improve prepare_output_dir error handling for permissions and read-only fs (continued) by @TaekyungHeo in #631
- Add GitRepo support to KubernetesInstaller with install/uninstall logic by @TaekyungHeo in #634
- Use a shell script as the entry point for AI Dynamo by @TaekyungHeo in #615
Full Changelog: v1.4.beta5...v1.4.beta6
v1.4.beta5
What's Changed
- Improve prepare_output_dir error handling for permissions and read-only fs by @TaekyungHeo in #629
Full Changelog: v1.4.beta4...v1.4.beta5
v1.4.beta4
What's Changed
- Update doc/ai_dynamo.md by @TaekyungHeo in #628
- Replace Nemo image tag from 24.12.rc3 to 25.04.rc2 in all conf TOML files by @TaekyungHeo in #626
Full Changelog: v1.4.beta3...v1.4.beta4
v1.4.beta3
What's Changed
- Replace PyTorch image tag from 24.02-py3 to 25.06-py3 in all conf TOML files by @TaekyungHeo in #627
Full Changelog: v1.4.beta2...v1.4.beta3
v1.4.beta2
What's Changed
- Update USER_GUIDE.md by @TaekyungHeo in #623
- Update doc/ai_dynamo.md by @TaekyungHeo in #624
- Update conf/common/test/nemo_run_llama3_8b.toml by @TaekyungHeo in #625
Full Changelog: v1.4.beta1...v1.4.beta2
v1.4.beta1
What's Changed
- Support custom matgen args and set valid ppn by @amaslenn in #612
- Fix gres related directives for single sbatch mode by @amaslenn in #613
- Preserve the order of environment variables specified in the system schema by @TaekyungHeo in #616
- Update docker_image_url separator from colon to hash by @TaekyungHeo in #621
- Bump default version to v1.4 by @amaslenn in #622
Full Changelog: v1.3.0...v1.4.beta1
v1.3.0
What's Changed
- Nemo2.0 Perf Features (next set) by @srivatsankrishnan in #496
- Remove requirements files to rely on pyproject for dependencies by @amaslenn in #505
- Revert "Integrate refactored Chakra replay #469" by @TaekyungHeo in #504
- Bump version to v1.3 by @amaslenn in #508
- Support hostfile generation and override distribution for explicit nodes by @TaekyungHeo in #509
- Handle missing credentials when creating HttpDataRepository by @TaekyungHeo in #512
- Allow pre-test adding extra srun arguments by @amaslenn in #511
- Correctly handle values with spaces for env vars by @amaslenn in #514
- Change Scenario Report Map type from Set to List by @lilyw97 in #501
- Verify if input path exist on argparse level by @amaslenn in #515
- Apply style to scenario report by @amaslenn in #517
- Control workload settings from scenario by @amaslenn in #393
- Make installables() non-abstract by @amaslenn in #518
- Add DeepSeek-R1 Inference by @TaekyungHeo in #503
- Add NeMoRunJobStatusRetrievalStrategy and register it in the strategy registry by @TaekyungHeo in #490
- Update MegatronRun model dump logic by @amaslenn in #521
- Updated README by @amaslenn in #523
- Fix how test-in-scenario is merged with test-in-toml by @amaslenn in #525
- Remove venv folder if requirements installation failed by @amaslenn in #527
- Use lazy imports for slow modules by @amaslenn in #526
- Detect low thread environments and adjust task limits by @TaekyungHeo in #529
- Move sweeps logic to TestRun by @amaslenn in #513
- Optimize slurm updates by @amaslenn in #535
- Add extra_srun_args & scripts in SlurmContainerTestDef by @lilyw97 in #531
- Dump CloudAI version into generated sbatch script by @amaslenn in #537
- Remove venv folder if requirements installation failed by @amaslenn in #530
- Nemo2.0 Perf Recipes (Set 2) by @srivatsankrishnan in #500
- Reduce usage of slurm_args by @amaslenn in #538
- Address comments from #513 by @amaslenn in #534
- Get rid of _parse_slurm_args by @amaslenn in #539
- Set srun job name to "-CloudAI_install_docker_image.%Y%m%d_%H%M%S" by @TaekyungHeo in #544
- Added support for additional args in cmd_args in chakra replay workload by @Eli-Siegel-nvidia in #542
- Add GPU directive support check to SlurmSystem and use it in command gen by @TaekyungHeo in #541
- Per-rank env vars evaluation by @amaslenn in #536
- Store test details and best config for DSE by @amaslenn in #524
- Add NIXL bench workload by @amaslenn in #540
- Make sure install status is populated to all duplicates by @amaslenn in #545
- Use copies for venv creation + fix tests by @amaslenn in #546
- Control if home folder should be mounted into container for slurm by @amaslenn in #547
- Add LLAMA3 8b to NeMo acceptance by @TaekyungHeo in #532
- Return absolute path for cached Docker image in installed_path method by @TaekyungHeo in #549
- Allow val_check_interval to be int, float, or list of both by @amaslenn in #551
- Support single node configuration for NIXLBench by @amaslenn in #552
- Make sure mark_as_installed respects system config by @amaslenn in #548
- NIXL reporting by @amaslenn in #550
- Allow sweeps for number of nodes by @amaslenn in #487
- Fix invalid type for image when cache is disabled by @amaslenn in #554
- BaseRunner: rename callbacks and make them synchronous by @amaslenn in #553
- Refactor supports_gpu_directives to focus on GresTypes by @TaekyungHeo in #556
- Migrate to modern datetime interface by @emmanuel-ferdman in #561
- Add single sbatch runner for slurm systems by @amaslenn in #555
- Fix DeepSeekR1 inference report by @TaekyungHeo in #560
- Rework imports by @amaslenn in #559
- Fix path to jinja template by @amaslenn in #562
- Generate reports for DSE jobs by @TaekyungHeo in #563
- Do not use --copies for venv creation by @amaslenn in #565
- Add configuration for scenario reports by @amaslenn in #564
- Support for multiple metrics in reporter by @amaslenn in #558
- Expand slurm meta to have per-step information by @amaslenn in #567
- Add configurable reward functions to CloudAIGym by @TaekyungHeo in #566
- Remove JAX configs as used image is not available by @amaslenn in #568
- Fix for handling srun with multiline commands by @amaslenn in #573
- Cleanup configs in conf/common by @amaslenn in #571
- Add BashCmd workload by @amaslenn in #570
- Correctly load and save tdef as part of TestRunDetails by @amaslenn in #574
- Make NIXL work in single-sbatch mode by @amaslenn in #575
- Re-work slurm node status update by @amaslenn in #577
- Add NIXL summary report by @amaslenn in #576
- Update regex to correctly extract full GPU type names including suffixes and variants by @TaekyungHeo in #578
- Fix missing k8s import by using lazy.k8s in MPIJob delete call by @TaekyungHeo in #580
- Align method with BaseRunner by renaming to on_job_completion and removing async by @TaekyungHeo in #581
- Add DockerImage support to Kubernetes installer methods by @TaekyungHeo in #583
- Match json_gen_strategy implementation to command_gen_strategy by @TaekyungHeo in #585
- Fix nodes allocation from the same group by @amaslenn in #586
- Guard on_job_submit with null check for _command_gen_strategy access by @TaekyungHeo in #584
- Silently skip NIXL summary generation if no NIXL tests by @amaslenn in #587
- Llama31_405b by @srivatsankrishnan in #582
- Merge JobIdRetrieval functionality into respective runners by @amaslenn in #588
- Re-work job status fetching by @amaslenn in #589
- Update UCC configs by @amaslenn in #590
- Avoid confusing post_test/pre_test folder structure by @amaslenn in #592
- Remove default_cmd_args field from TestTemplateStrategy by @amaslenn in #594
- Add AI Dynamo by @TaekyungHeo in #519
- Enable NCCL w/ K8S SPCx by @TaekyungHeo in #579
- Handles comma in env vars values for NemoLauncher by @amaslenn in #591
- Create CmdGenStrategy per usage by @amaslenn in #596
- Require docker image for NCCL tests to be explicitly set in config by @amaslenn in #597
- Rely on member test run object instead of args by @amaslenn in #598
- Small improvements by @amaslenn in #599
- Fix docker image cache CLI for gres support by @amaslenn in #600
- Update doc/ai_dynamo.md by @TaekyungHeo in #601
- Remove header when using sinfo by @amaslenn in #602
- Update AI Dynamo config to use vLLM_V1 API. by @karya0 in #595
- Register BashCmd workload by @amaslenn in https://github...
v1.3.rc4
What's Changed
- Fix to Buggy Implementation of PR 589 by @srivatsankrishnan in #609
Full Changelog: v1.3.rc3...v1.3.rc4
v1.3.rc3
What's Changed
- Update conf/common/test_scenario/nemo_run_llama3_8b.toml by @TaekyungHeo in #610
Full Changelog: v1.3.rc2...v1.3.rc3