Release v1.3.0 · NVIDIA/cloudai

What's Changed

Nemo2.0 Perf Features (next set) by @srivatsankrishnan in #496
Remove requirements files to rely on pyproject for dependencies by @amaslenn in #505
Revert Integrate refactored Chakra replay #469 by @TaekyungHeo in #504
Bump version to v1.3 by @amaslenn in #508
Support hostfile generation and override distribution for explicit nodes by @TaekyungHeo in #509
Handle missing credentials when creating HttpDataRepository by @TaekyungHeo in #512
Allow pre-test adding extra srun arguments by @amaslenn in #511
Correctly handle values with spaces for env vars by @amaslenn in #514
Change Scenario Report Map type from Set to List by @lilyw97 in #501
Verify if input path exists on argparse level by @amaslenn in #515
Apply style to scenario report by @amaslenn in #517
Control workload settings from scenario by @amaslenn in #393
Make installables() non-abstract by @amaslenn in #518
Add DeepSeek-R1 Inference by @TaekyungHeo in #503
Add NeMoRunJobStatusRetrievalStrategy and register it in the strategy registry by @TaekyungHeo in #490
Update MegatronRun model dump logic by @amaslenn in #521
Updated README by @amaslenn in #523
Fix how test-in-scenario is merged with test-in-toml by @amaslenn in #525
Remove venv folder if requirements installation failed by @amaslenn in #527
Use lazy imports for slow modules by @amaslenn in #526
Detect low thread environments and adjust task limits by @TaekyungHeo in #529
Move sweeps logic to TestRun by @amaslenn in #513
Optimize slurm updates by @amaslenn in #535
Add extra_srun_args & scripts in SlurmContainerTestDef by @lilyw97 in #531
Dump CloudAI version into generated sbatch script by @amaslenn in #537
Remove venv folder if requirements installation failed by @amaslenn in #530
Nemo2.0 Perf Recipes (Set 2) by @srivatsankrishnan in #500
Reduce usage of slurm_args by @amaslenn in #538
Address comments from #513 by @amaslenn in #534
Get rid of _parse_slurm_args by @amaslenn in #539
Set srun job name to "-CloudAI_install_docker_image.%Y%m%d_%H%M%S" by @TaekyungHeo in #544
Added support for additional args in cmd_args in chakra replay workload by @Eli-Siegel-nvidia in #542
Add GPU directive support check to SlurmSystem and use it in command gen by @TaekyungHeo in #541
Per-rank env vars evaluation by @amaslenn in #536
Store test details and best config for DSE by @amaslenn in #524
Add NIXL bench workload by @amaslenn in #540
Make sure install status is populated to all duplicates by @amaslenn in #545
Use copies for venv creation + fix tests by @amaslenn in #546
Control if home folder should be mounted into container for slurm by @amaslenn in #547
Add LLAMA3 8b to NeMo acceptance by @TaekyungHeo in #532
Return absolute path for cached Docker image in installed_path method by @TaekyungHeo in #549
Allow val_check_interval to be int, float, or list of both by @amaslenn in #551
Support single node configuration for NIXLBench by @amaslenn in #552
Make sure mark_as_installed respects system config by @amaslenn in #548
NIXL reporting by @amaslenn in #550
Allow sweeps for number of nodes by @amaslenn in #487
Fix invalid type for image when cache is disabled by @amaslenn in #554
BaseRunner: rename callbacks and make them synchronous by @amaslenn in #553
Refactor supports_gpu_directives to focus on GresTypes by @TaekyungHeo in #556
Migrate to modern datetime interface by @emmanuel-ferdman in #561
Add single sbatch runner for slurm systems by @amaslenn in #555
Fix DeepSeekR1 inference report by @TaekyungHeo in #560
Rework imports by @amaslenn in #559
Fix path to jinja template by @amaslenn in #562
Generate reports for DSE jobs by @TaekyungHeo in #563
Do not use --copies for venv creation by @amaslenn in #565
Add configuration for scenario reports by @amaslenn in #564
Support for multiple metrics in reporter by @amaslenn in #558
Expand slurm meta to have per-step information by @amaslenn in #567
Add configurable reward functions to CloudAIGym by @TaekyungHeo in #566
Remove JAX configs as used image is not available by @amaslenn in #568
Fix for handling srun with multiline commands by @amaslenn in #573
Cleanup configs in conf/common by @amaslenn in #571
Add BashCmd workload by @amaslenn in #570
Correctly load and save tdef as part of TestRunDetails by @amaslenn in #574
Make NIXL work in single-sbatch mode by @amaslenn in #575
Re-work slurm node status update by @amaslenn in #577
Add NIXL summary report by @amaslenn in #576
Update regex to correctly extract full GPU type names including suffixes and variants by @TaekyungHeo in #578
Fix missing k8s import by using lazy.k8s in MPIJob delete call by @TaekyungHeo in #580
Align method with BaseRunner by renaming to on_job_completion and removing async by @TaekyungHeo in #581
Add DockerImage support to Kubernetes installer methods by @TaekyungHeo in #583
Match json_gen_strategy implementation to command_gen_strategy by @TaekyungHeo in #585
Fix nodes allocation from the same group by @amaslenn in #586
Guard on_job_submit with null check for _command_gen_strategy access by @TaekyungHeo in #584
Silently skip NIXL summary generation if no NIXL tests by @amaslenn in #587
Llama31_405b by @srivatsankrishnan in #582
Merge JobIdRetrieval functionality into respective runners by @amaslenn in #588
Re-work job status fetching by @amaslenn in #589
Update UCC configs by @amaslenn in #590
Avoid confusing post_test/pre_test folder structure by @amaslenn in #592
Remove default_cmd_args field from TestTemplateStrategy by @amaslenn in #594
Add AI Dynamo by @TaekyungHeo in #519
Enable NCCL w/ K8S SPCx by @TaekyungHeo in #579
Handles comma in env vars values for NemoLauncher by @amaslenn in #591
Create CmdGenStrategy per usage by @amaslenn in #596
Require docker image for NCCL tests to be explicitly set in config by @amaslenn in #597
Rely on member test run object instead of args by @amaslenn in #598
Small improvements by @amaslenn in #599
Fix docker image cache CLI for gres support by @amaslenn in #600
Update doc/ai_dynamo.md by @TaekyungHeo in #601
Remove header when using sinfo by @amaslenn in #602
Update AI Dynamo config to use vLLM_V1 API. by @karya0 in #595
Register BashCmd workload by @amaslenn in #603
Pass extra_srun_args during install. by @karya0 in #605
Add NIXL perftest (kvbench for sequential-ct-perftest) support by @amaslenn in #604
Docker cache fix by @karya0 in #606
Support for fp8 Llama3_405b by @srivatsankrishnan in #593
Numa control in Nemo2.0 by @srivatsankrishnan in #607
Update conf/common/test_scenario/nemo_run_llama3_8b.toml by @TaekyungHeo in #610
Fix to Buggy Implementation of PR 589 by @srivatsankrishnan in #609

New Contributors

@Eli-Siegel-nvidia made their first contribution in #542
@emmanuel-ferdman made their first contribution in #561
@karya0 made their first contribution in #595

Full Changelog: v1.2.0...v1.3.0
