Releases: NVIDIA/cloudai
v1.4.beta17
What's Changed
- Add error detection and retry mechanism for worker failures by @TaekyungHeo in #659
- Use single source of data for reporting and NIXL pass/fail by @amaslenn in #670
- Write trajectory file for DSE jobs in single-sbatch mode by @amaslenn in #671
Full Changelog: v1.4.beta16...v1.4.beta17
v1.4.beta16
What's Changed
Full Changelog: v1.4.beta15...v1.4.beta16
v1.4.beta15
What's Changed
- Re-use comparison report for NIXL by @amaslenn in #664
- Handle single-sbatch metadata layout in report by @amaslenn in #666
- Follow-up for PR647 (Support explicit node assignment for prefill and decode workers) by @TaekyungHeo in #665
Full Changelog: v1.4.beta14...v1.4.beta15
v1.4.beta14
What's Changed
- Small housekeeping updates by @amaslenn in #663
- nemo recipes refactor by @malay-nagda in #633
New Contributors
- @malay-nagda made their first contribution in #633
Full Changelog: v1.4.beta13...v1.4.beta14
v1.4.beta13
What's Changed
- Configure reports via scenario config by @amaslenn in #661
- Handle CancelledError gracefully during job cleanup by @TaekyungHeo in #662
Full Changelog: v1.4.beta12...v1.4.beta13
v1.4.beta12
What's Changed
- Comparison report for NCCL workloads by @amaslenn in #656
- Support explicit node assignment for prefill and decode workers by @TaekyungHeo in #647
Full Changelog: v1.4.beta11...v1.4.beta12
v1.4.beta11
What's Changed
- Support for DeepSeekR1 model with SGLang / AI Dynamo by @TaekyungHeo in #641
- Support mounting any JSON files for --dynamo-deepep-config by @TaekyungHeo in #650
- Set tp-size and dp-size from args if provided, else use total_gpus by @TaekyungHeo in #649
- Add environment validation to startup sequence by @TaekyungHeo in #651
- Follow-up for PR641 (Support for DeepSeekR1 model with SGLang / AI Dynamo) by @TaekyungHeo in #653
- Reorder the functions in ai_dynamo.sh for improved maintainability by @TaekyungHeo in #654
- Refactor GPU count to use _gpus_per_node in vllm and env validation by @TaekyungHeo in #657
- Mount huggingface_home_container_path unconditionally by @TaekyungHeo in #655
- Refactor nodelist validation to check DYNAMO_NODELIST only if both args empty by @TaekyungHeo in #658
Full Changelog: v1.4.beta10...v1.4.beta11
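For readers unfamiliar with the fallback described in #649 ("Set tp-size and dp-size from args if provided, else use total_gpus"), the rule can be sketched roughly as below. This is an illustration only, not the project's code: the function name `resolve_parallel_sizes`, the dictionary-based argument lookup, and the assumption that each value falls back to `total_gpus` independently are all hypothetical.

```python
from typing import Mapping, Optional, Tuple


def resolve_parallel_sizes(
    args: Mapping[str, Optional[int]], total_gpus: int
) -> Tuple[int, int]:
    """Return (tp_size, dp_size), preferring explicit args over total_gpus.

    Hypothetical helper: the key names mirror the "tp-size" and "dp-size"
    wording of the changelog entry, not a confirmed cloudai interface.
    """
    tp = args.get("tp-size")
    dp = args.get("dp-size")
    # Assumption: each size falls back to total_gpus on its own when unset.
    tp_size = tp if tp is not None else total_gpus
    dp_size = dp if dp is not None else total_gpus
    return tp_size, dp_size


if __name__ == "__main__":
    print(resolve_parallel_sizes({"tp-size": 4}, total_gpus=8))
    print(resolve_parallel_sizes({}, total_gpus=8))
```

Checking `is not None` rather than truthiness keeps an explicitly supplied value from being silently replaced by the default.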
v1.4.beta10
What's Changed
- Preserve installables' state during apply_params_set() by @amaslenn in #643
- Control which env vars dumped for per-rand evaluation by @amaslenn in #642
- Align extra_env_vars definition in test and scenario by @amaslenn in #644
- Update USER_GUIDE.md by @TaekyungHeo in #646
- Add latency metric reporting for NCCL by @amaslenn in #645
Full Changelog: v1.4.beta9...v1.4.beta10
v1.4.beta9
What's Changed
- Updates for SlurmContainer workload by @amaslenn in #638
- Handle missing tests gracefully by adding MissingTestError to avoid backtrace by @TaekyungHeo in #640
- Clean up src/cloudai/workloads/ai_dynamo/ai_dynamo.sh by @TaekyungHeo in #639
Full Changelog: v1.4.beta8...v1.4.beta9
v1.4.beta8
What's Changed
- Add multi-worker-per-node GPU slicing support with dynamic allocation by @TaekyungHeo in #636
- Log mapping between AI Dynamo nodes and roles by @TaekyungHeo in #617
Full Changelog: v1.4.beta7...v1.4.beta8