Releases: NVIDIA/cloudai

v0.9.beta6

30 Sep 20:06
0aed180

Pre-release

What's Changed

  • Enhance Error Handling for Missing default_partition by @TaekyungHeo in #216
  • Remove Dead Code by Eliminating Unused install_path by @TaekyungHeo in #215
  • Simplify slurm system by @jeffnvidia in #167
  • Remove unnecessary permission checks and exception handling in NeMoLauncherSlurmInstallStrategy by @TaekyungHeo in #218

Full Changelog: v0.9.beta5...v0.9.beta6

v0.9.beta5

30 Sep 10:46
2b27be4

Pre-release

What's Changed

Full Changelog: v0.9.beta4...v0.9.beta5

v0.9.beta4

27 Sep 21:59
b305b86

Pre-release

What's Changed

  • Fix Bug in DockerImageCacheResult to Correctly Retrieve Absolute Paths by @TaekyungHeo in #212
  • Add Support for Cluster Account and gpus_per_node in Command Generation by @TaekyungHeo in #210

Full Changelog: v0.9.beta3...v0.9.beta4

v0.9.beta3

27 Sep 14:21
85d4276

Pre-release

What's Changed

  • Extend CI for high level use cases by @amaslenn in #204
  • Replace 'training.values' Key with 'training' in final_cmd_args by @TaekyungHeo in #209
  • Use System Install Path Instead of Local Member Variables by @TaekyungHeo in #206
  • Remove Unused env_vars From Initialization Code by @TaekyungHeo in #207
  • Remove Config File Handling and TOML Dependency from SlurmInstaller by @TaekyungHeo in #211

Full Changelog: v0.9.beta2...v0.9.beta3

v0.9.beta2

25 Sep 16:36
953f04b

Pre-release

Release notes

We are working on schema improvements to simplify config management and make configs verifiable. This helps ensure that configs are correct before expensive runs on real hardware. With this release, verification is enabled for Test configs. This is a continuation of #158.

  1. Test Template TOML files were replaced with Pydantic models. This enforces mandatory arguments and their types, and requires less code to maintain.
  2. The --test-templates-dir option was removed from all commands. All supported tests are now registered in code using Registry().add_test_definition(...) and Registry().add_test_template(...). Documentation was updated to reflect this change.
  3. Test TOML files now take advantage of the standard TOML format for all known arguments.
    Before:
    [cmd_args]
    "training" = "llama/llama2_70b"
    "training.trainer.max_steps" = "120"
    "training.model.global_batch_size" = "256"
    "training.model.pipeline_model_parallel_size" = "1"
    Now:
    [cmd_args]
      [cmd_args.training]
      values = "llama/llama2_70b"
        [cmd_args.training.trainer]
        max_steps = "120"
        [cmd_args.training.model]
        global_batch_size = "256"
        pipeline_model_parallel_size = "1"
  4. extra_cmd_args was converted from str to dict[str, str]:
    Before:
    extra_cmd_args = "--stepfactor 2"
    Now:
    [extra_cmd_args]
    "--stepfactor" = "2"
  5. Added a new mode to verify that Test TOML files are valid: cloudai --mode verify-tests --system-config conf/common/system/standalone_system.toml --tests-dir conf/common/test/chakra_replay.toml

Full Changelog: v0.9.beta1...v0.9.beta2

v0.9.beta1

24 Sep 15:13
c3542c7

Pre-release

What's Changed

Full Changelog: v0.9.dev1...v0.9.beta1

v0.9.dev1

16 Sep 08:19
11c5592

Pre-release

Highlights

We are working on schema improvements to simplify config management and make configs verifiable. This helps ensure that configs are correct before expensive runs on real hardware. With this release, verification is enabled for System configs.

Added a new command for verifying configs: cloudai --mode verify-systems. The --system-config argument can be a single file, or a directory, in which case all configs in that directory are verified.
The Slurm system config format was updated to take advantage of TOML features:

[partitions]
[partitions.partition_1]
name = "partition_1"
nodes = ["node-[001-100]"]

[partitions.partition_2]
name = "partition_2"
nodes = ["node-[101-200]"]

is now

[[partitions]]
name = "partition_1"
nodes = ["node-[001-100]"]

[[partitions]]
name = "partition_2"
nodes = ["node-[101-200]"]

The same applies to groups inside partitions.
System parser objects were removed; this functionality is now handled by Pydantic.

What's Changed

Full Changelog: v0.9.dev0...v0.9.dev1

v0.9.dev0

28 Aug 18:37
e8a959a

Pre-release

What's Changed

  • Refactor to Use pathlib.Path for Path-Related Variables by @TaekyungHeo in #183

Full Changelog: v0.8.1...v0.9.dev0

v0.8.1

27 Aug 06:12
35d1489

Minor enhancements to the v0.8.0 release. Improves NCCL test HTML report generation and Slurm reservation features.

v0.8.0

19 Aug 15:53
b13bafe

CloudAI v0.8 release notes

Compatibility

CloudAI v0.8 has been tested with: PyTorch/JAX NGC Container 24.05, NCCL 2.19/2.21, and SPC-X 1.1.

Key Features and Enhancements:

  • Applied the registry pattern to enhance the flexibility and scalability of CloudAI.
  • Added an extensive unit and integration testing framework using pytest.
  • Enhanced error messages and user guide to improve user experience, helping users troubleshoot issues swiftly.
  • Enhanced the installation feature, focusing on Slurm systems.
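
For readers unfamiliar with the registry pattern mentioned in the first item, here is a minimal generic sketch. SketchRegistry, its add and get methods, and HypotheticalNcclTest are all hypothetical names chosen for illustration; they do not reproduce CloudAI's own Registry class or its method signatures.

```python
class SketchRegistry:
    """Process-wide registry mapping names to classes (illustrative only)."""

    _instance = None

    def __new__(cls):
        # Always hand back the same instance so all callers share one table.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._entries = {}
        return cls._instance

    def add(self, name, value):
        """Register a value under a unique name."""
        if name in self._entries:
            raise ValueError(f"{name!r} is already registered")
        self._entries[name] = value

    def get(self, name):
        """Look up a previously registered value; raises KeyError if missing."""
        return self._entries[name]


class HypotheticalNcclTest:
    """Placeholder for a test definition class."""


SketchRegistry().add("nccl", HypotheticalNcclTest)
looked_up = SketchRegistry().get("nccl")
print(looked_up.__name__)
```

The point of the pattern is that new components are added by registering them in one place, so the code that looks them up by name does not need to change.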

What’s next

  • Improve schema for easier validation
  • Support automated grading mechanism
  • Support K8S scheduler
  • Support preflight and post-flight tests