v0.6
Compatibility
CloudAI v0.6 has been tested with: PyTorch NGC Container 24.02, CUDA 12.4, NCCL 2.19, and SPC-X 1.0.1.
Key Features and Enhancements
- Designed and implemented an extensible software architecture with support for defining test templates, test scenarios, and system schemas.
- Defined test templates and test scenarios for NeMo Megatron, JAX Toolbox/PAXML, NCCL tests, UCC tests, and Chakra replay.
- Added support for launching jobs and checking their status, both through Slurm and directly (for testing purposes).
- Added the ability to install, uninstall, dry-run, and execute test scenarios, and to generate reports.
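
To make the relationship between test templates, test scenarios, and system schemas more concrete, here is a minimal conceptual sketch in Python. It is not CloudAI's actual API: the class names (`TestTemplate`, `SystemSchema`, `TestScenario`), their fields, and the `dry_run` method are all hypothetical and chosen only for illustration. The sketch assumes a template is a parameterized command, a scenario is an ordered list of templates with overrides, and a system schema decides how a command is launched (through Slurm's `srun` or directly).

```python
from dataclasses import dataclass, field


@dataclass
class TestTemplate:
    """Hypothetical: describes how to run one kind of workload."""
    name: str
    command: str                       # format string with {placeholders}
    defaults: dict = field(default_factory=dict)

    def render(self, overrides: dict) -> str:
        # Scenario-level overrides take precedence over template defaults.
        return self.command.format(**{**self.defaults, **overrides})


@dataclass
class SystemSchema:
    """Hypothetical: describes the system a scenario runs on."""
    name: str
    scheduler: str                     # "slurm" or "standalone"
    partition: str = ""

    def wrap(self, command: str, nodes: int) -> str:
        if self.scheduler == "slurm":
            return f"srun --partition={self.partition} --nodes={nodes} {command}"
        return command                 # direct launch, no scheduler


@dataclass
class TestScenario:
    """Hypothetical: an ordered list of (template, overrides, nodes) steps."""
    name: str
    steps: list

    def dry_run(self, system: SystemSchema) -> list:
        # Build the launch commands without executing anything.
        return [system.wrap(t.render(o), n) for t, o, n in self.steps]


# Illustrative values only; the partition and system names are made up.
nccl = TestTemplate(
    name="nccl_all_reduce",
    command="all_reduce_perf -b {min_bytes} -e {max_bytes}",
    defaults={"min_bytes": "8", "max_bytes": "1G"},
)
scenario = TestScenario("smoke", steps=[(nccl, {"max_bytes": "4G"}, 2)])
cluster = SystemSchema("example-cluster", scheduler="slurm", partition="batch")

for line in scenario.dry_run(cluster):
    print(line)
```

Keeping the three concepts separate means the same scenario can be rendered against a different system description (for example, one with `scheduler="standalone"`) without touching the templates.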
What’s next
- Use CloudAI for benchmarking upcoming systems.
- Improve engineering quality with a focus on user experience (e.g., handling job scheduling failures) and add new features (e.g., Kubernetes support).