v1.3.0
What's Changed
- Nemo2.0 Perf Features (next set) by @srivatsankrishnan in #496
- Remove requirements files to rely on pyproject for dependencies by @amaslenn in #505
- Revert "Integrate refactored Chakra replay #469" by @TaekyungHeo in #504
- Bump version to v1.3 by @amaslenn in #508
- Support hostfile generation and override distribution for explicit nodes by @TaekyungHeo in #509
- Handle missing credentials when creating HttpDataRepository by @TaekyungHeo in #512
- Allow pre-test adding extra srun arguments by @amaslenn in #511
- Correctly handle values with spaces for env vars by @amaslenn in #514
- Change Scenario Report Map type from Set to List by @lilyw97 in #501
- Verify if input path exists on argparse level by @amaslenn in #515
- Apply style to scenario report by @amaslenn in #517
- Control workload settings from scenario by @amaslenn in #393
- Make installables() non-abstract by @amaslenn in #518
- Add DeepSeek-R1 Inference by @TaekyungHeo in #503
- Add NeMoRunJobStatusRetrievalStrategy and register it in the strategy registry by @TaekyungHeo in #490
- Update MegatronRun model dump logic by @amaslenn in #521
- Updated README by @amaslenn in #523
- Fix how test-in-scenario is merged with test-in-toml by @amaslenn in #525
- Remove venv folder if requirements installation failed by @amaslenn in #527
- Use lazy imports for slow modules by @amaslenn in #526
- Detect low thread environments and adjust task limits by @TaekyungHeo in #529
- Move sweeps logic to TestRun by @amaslenn in #513
- Optimize slurm updates by @amaslenn in #535
- Add extra_srun_args & scripts in SlurmContainerTestDef by @lilyw97 in #531
- Dump CloudAI version into generated sbatch script by @amaslenn in #537
- Remove venv folder if requirements installation failed by @amaslenn in #530
- Nemo2.0 Perf Recipes (Set 2) by @srivatsankrishnan in #500
- Reduce usage of slurm_args by @amaslenn in #538
- Address comments from #513 by @amaslenn in #534
- Get rid of _parse_slurm_args by @amaslenn in #539
- Set srun job name to "-CloudAI_install_docker_image.%Y%m%d_%H%M%S" by @TaekyungHeo in #544
- Added support for additional args in cmd_args in chakra replay workload by @Eli-Siegel-nvidia in #542
- Add GPU directive support check to SlurmSystem and use it in command gen by @TaekyungHeo in #541
- Per-rank env vars evaluation by @amaslenn in #536
- Store test details and best config for DSE by @amaslenn in #524
- Add NIXL bench workload by @amaslenn in #540
- Make sure install status is populated to all duplicates by @amaslenn in #545
- Use copies for venv creation + fix tests by @amaslenn in #546
- Control if home folder should be mounted into container for slurm by @amaslenn in #547
- Add LLAMA3 8b to NeMo acceptance by @TaekyungHeo in #532
- Return absolute path for cached Docker image in installed_path method by @TaekyungHeo in #549
- Allow val_check_interval to be int, float, or list of both by @amaslenn in #551
- Support single node configuration for NIXLBench by @amaslenn in #552
- Make sure mark_as_installed respects system config by @amaslenn in #548
- NIXL reporting by @amaslenn in #550
- Allow sweeps for number of nodes by @amaslenn in #487
- Fix invalid type for image when cache is disabled by @amaslenn in #554
- BaseRunner: rename callbacks and make them synchronous by @amaslenn in #553
- Refactor supports_gpu_directives to focus on GresTypes by @TaekyungHeo in #556
- Migrate to modern datetime interface by @emmanuel-ferdman in #561
- Add single sbatch runner for slurm systems by @amaslenn in #555
- Fix DeepSeekR1 inference report by @TaekyungHeo in #560
- Rework imports by @amaslenn in #559
- Fix path to jinja template by @amaslenn in #562
- Generate reports for DSE jobs by @TaekyungHeo in #563
- Do not use --copies for venv creation by @amaslenn in #565
- Add configuration for scenario reports by @amaslenn in #564
- Support for multiple metrics in reporter by @amaslenn in #558
- Expand slurm meta to have per-step information by @amaslenn in #567
- Add configurable reward functions to CloudAIGym by @TaekyungHeo in #566
- Remove JAX configs as used image is not available by @amaslenn in #568
- Fix for handling srun with multiline commands by @amaslenn in #573
- Cleanup configs in conf/common by @amaslenn in #571
- Add BashCmd workload by @amaslenn in #570
- Correctly load and save tdef as part of TestRunDetails by @amaslenn in #574
- Make NIXL work in single-sbatch mode by @amaslenn in #575
- Re-work slurm node status update by @amaslenn in #577
- Add NIXL summary report by @amaslenn in #576
- Update regex to correctly extract full GPU type names including suffixes and variants by @TaekyungHeo in #578
- Fix missing k8s import by using lazy.k8s in MPIJob delete call by @TaekyungHeo in #580
- Align method with BaseRunner by renaming to on_job_completion and removing async by @TaekyungHeo in #581
- Add DockerImage support to Kubernetes installer methods by @TaekyungHeo in #583
- Match json_gen_strategy implementation to command_gen_strategy by @TaekyungHeo in #585
- Fix nodes allocation from the same group by @amaslenn in #586
- Guard on_job_submit with null check for _command_gen_strategy access by @TaekyungHeo in #584
- Silently skip NIXL summary generation if no NIXL tests by @amaslenn in #587
- Llama31_405b by @srivatsankrishnan in #582
- Merge JobIdRetrieval functionality into respective runners by @amaslenn in #588
- Re-work job status fetching by @amaslenn in #589
- Update UCC configs by @amaslenn in #590
- Avoid confusing post_test/pre_test folder structure by @amaslenn in #592
- Remove default_cmd_args field from TestTemplateStrategy by @amaslenn in #594
- Add AI Dynamo by @TaekyungHeo in #519
- Enable NCCL w/ K8S SPCx by @TaekyungHeo in #579
- Handles comma in env vars values for NemoLauncher by @amaslenn in #591
- Create CmdGenStrategy per usage by @amaslenn in #596
- Require docker image for NCCL tests to be explicitly set in config by @amaslenn in #597
- Rely on member test run object instead of args by @amaslenn in #598
- Small improvements by @amaslenn in #599
- Fix docker image cache CLI for gres support by @amaslenn in #600
- Update doc/ai_dynamo.md by @TaekyungHeo in #601
- Remove header when using sinfo by @amaslenn in #602
- Update AI Dynamo config to use vLLM_V1 API. by @karya0 in #595
- Register BashCmd workload by @amaslenn in #603
- Pass extra_srun_args during install. by @karya0 in #605
- Add NIXL perftest (kvbench for sequential-ct-perftest) support by @amaslenn in #604
- Docker cache fix by @karya0 in #606
- Support for fp8 Llama3_405b by @srivatsankrishnan in #593
- Numa control in Nemo2.0 by @srivatsankrishnan in #607
- Update conf/common/test_scenario/nemo_run_llama3_8b.toml by @TaekyungHeo in #610
- Fix to Buggy Implementation of PR 589 by @srivatsankrishnan in #609
New Contributors
- @Eli-Siegel-nvidia made their first contribution in #542
- @emmanuel-ferdman made their first contribution in #561
- @karya0 made their first contribution in #595
Full Changelog: v1.2.0...v1.3.0