v0.6

artemry-nv released this 09 May 00:03
· 2845 commits to main since this release

Compatibility

CloudAI v0.6 has been tested with: PyTorch NGC Container 24.02, CUDA 12.4, NCCL 2.19, and SPC-X 1.0.1.

Key Features and Enhancements:

  • Designed and implemented an extensible software architecture with support for defining test templates, test scenarios, and system schemas.
  • Defined test templates and test scenarios for NeMo Megatron, JAX Toolbox/PAXML, NCCL tests, UCC tests, and Chakra replay.
  • Added support for launching jobs and checking their status, both through Slurm and directly (for testing purposes).
  • Added the ability to install, uninstall, dry-run, and execute test scenarios, and to generate reports.
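
To illustrate how test templates, test scenarios, and system schemas can fit together, here is a minimal Python sketch. It is not CloudAI's actual API: the class names `Template`, `Scenario`, and `System`, their fields, and the `dry_run` method are all hypothetical, chosen only to show the separation of concerns. The only real-world details assumed are the `-b`/`-e` size flags of the NCCL tests `all_reduce_perf` binary and Slurm's `srun -N`.

```python
from dataclasses import dataclass, field


@dataclass
class Template:
    """Hypothetical test template: a reusable description of one kind of test."""
    name: str
    command: str                                  # command line with {placeholders}
    defaults: dict = field(default_factory=dict)  # default placeholder values

    def render(self, **overrides) -> str:
        # Scenario-level overrides take precedence over template defaults.
        return self.command.format(**{**self.defaults, **overrides})


@dataclass
class System:
    """Hypothetical system schema: how jobs are launched on a target system."""
    name: str
    scheduler: str                                # "slurm" or "standalone"

    def wrap(self, command: str, nodes: int) -> str:
        if self.scheduler == "slurm":
            return f"srun -N {nodes} {command}"
        return command                            # direct launch, no scheduler


@dataclass
class Scenario:
    """Hypothetical test scenario: an ordered list of (template, overrides)."""
    name: str
    steps: list

    def dry_run(self, system: System, nodes: int = 1) -> list:
        # Build the command lines without executing anything.
        return [system.wrap(t.render(**o), nodes) for t, o in self.steps]


# Usage: one template reused twice with different message sizes.
nccl = Template(
    "nccl_all_reduce",
    "all_reduce_perf -b {min_bytes} -e {max_bytes}",
    {"min_bytes": "8", "max_bytes": "1G"},
)
scenario = Scenario("smoke", [(nccl, {}), (nccl, {"max_bytes": "4G"})])
commands = scenario.dry_run(System("cluster", "slurm"), nodes=2)
for line in commands:
    print(line)
```

Keeping the system description separate from the templates means the same scenario can be dry-run against a Slurm system or a direct-launch system without editing the tests themselves.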

What’s next

  • Use CloudAI for benchmarking upcoming systems.
  • Continue engineering work to improve the user experience (e.g., handling job scheduling failures) and add new features (e.g., Kubernetes support).