[BUG]Zero++ training failed #6926

**Describe the bug**
I have 4 nodes, each with 8 A100 GPUs. To reduce communication between nodes, I enabled ZeRO++ (hpZ, with `zero_hpz_partition_size` set to 8, the number of GPUs per node), which did speed up training. However, throughout training the loss stays at 11.9321 and grad_norm stays at 0, so the model is not learning and the run fails.

My DeepSpeed configuration file is as follows:
{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "zero_allow_untested_optimizer": true,
    "fp16": {
      "enabled": "auto",
      "loss_scale": 0,
      "loss_scale_window": 1000,
      "initial_scale_power": 16,
      "hysteresis": 2,
      "min_loss_scale": 1
    },
    "bf16": {
      "enabled": "auto"
    },
    "zero_optimization": {
      "stage": 3,
      "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true
      },
      "offload_param": {
        "device": "cpu",
        "pin_memory": true
      },
      "zero_hpz_partition_size": 8,
      "zero_quantized_weights": false,
      "zero_quantized_gradients": false,
      "overlap_comm": true,
      "contiguous_gradients": true,
      "sub_group_size": 1e9,
      "reduce_bucket_size": "auto",
      "stage3_prefetch_bucket_size": "auto",
      "stage3_param_persistence_threshold": "auto",
      "stage3_max_live_parameters": 1e9,
      "stage3_max_reuse_distance": 1e9,
      "stage3_gather_16bit_weights_on_model_save": true
    }
  }
  
Where is the problem, and how can I solve it?
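
To help narrow this down, here is a minimal sketch of how the config above could be inspected and reduced to a plain ZeRO-3 baseline for an A/B comparison. It only manipulates the config dictionary and does not call DeepSpeed. The helper names `summarize_zeropp` and `make_baseline` are hypothetical, and the sketch assumes that a `zero_hpz_partition_size` of 1 or less (or a missing key) means hpZ is off, and that removing the three ZeRO++ keys falls back to regular ZeRO-3 behavior.

```python
import copy
import json

# The three ZeRO++ options that appear in the config above.
ZEROPP_KEYS = (
    "zero_hpz_partition_size",
    "zero_quantized_weights",
    "zero_quantized_gradients",
)


def summarize_zeropp(ds_config):
    """Report which ZeRO++ features look enabled (hypothetical helper).

    Assumption: a partition size <= 1, or a missing key, means hpZ is off.
    """
    zero = ds_config.get("zero_optimization", {})
    return {
        "hpz": zero.get("zero_hpz_partition_size", 1) > 1,
        "qwz": bool(zero.get("zero_quantized_weights", False)),
        "qgz": bool(zero.get("zero_quantized_gradients", False)),
    }


def make_baseline(ds_config):
    """Return a copy of the config with the ZeRO++ keys removed (hypothetical helper)."""
    baseline = copy.deepcopy(ds_config)
    zero = baseline.get("zero_optimization", {})
    for key in ZEROPP_KEYS:
        zero.pop(key, None)
    return baseline


# Trimmed version of the config from this issue.
config = {
    "zero_optimization": {
        "stage": 3,
        "zero_hpz_partition_size": 8,
        "zero_quantized_weights": False,
        "zero_quantized_gradients": False,
    }
}

print(summarize_zeropp(config))
print(json.dumps(make_baseline(config), indent=2))
```

If the baseline config trains normally (loss decreasing, non-zero grad_norm) and the original does not, that would point at the hpZ setting or its interaction with the CPU offload options rather than at the model or data.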
