generate.py not utilizing GPU in full #476

Closed
@frankxu2004

Description

I tried to run text generation with prompts using generate.py. I provided a large list of prompts (approximately 20K) and ran the generation on 10 RTX 8000 GPUs. However, nvidia-smi shows that GPU utilization during generation averages only about 50-60%, which is not ideal. Thank you!
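
As a side note, the utilization average can be reproduced programmatically; here is a minimal sketch using the pynvml bindings (the same counters nvidia-smi reads), with an arbitrary one-minute sampling window chosen for illustration:

import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

samples = []
for _ in range(60):  # sample once per second for a minute
    samples.append([pynvml.nvmlDeviceGetUtilizationRates(h).gpu
                    for h in handles])
    time.sleep(1)

for i, per_gpu in enumerate(zip(*samples)):
    print(f"GPU {i}: average utilization {sum(per_gpu) / len(per_gpu):.0f}%")

pynvml.nvmlShutdown()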

My configuration is:

{
  # Text gen type: `input-file`, `unconditional` or `interactive`
  "text-gen-type": "input-file", #"input-file",
 
  # Params for all
  "maximum_tokens": 256,
  "temperature": 0.2,
  "top_p": 0.95,
  "top_k": 0,
  "recompute": false,
  
  # `unconditional`/`input-file`: samples
  "num-samples": 100,

  # input/output file
  "sample-input-file": "0",
  
  "data-path": "data/code/code_text_document",
  
  # or for weighted datasets: 
  # "train-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],
  # "test-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],
  # "valid-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],
  # "train-data-weights": [1., 2.],
  # "test-data-weights": [2., 1.],
  # "valid-data-weights": [0.5, 0.4],

  # If weight_by_num_documents is true, builds dataset weights from a multinomial
  # distribution over groups of data, according to the number of documents in each group.
  # WARNING: setting this to true will override any user-provided weights
  # "weight_by_num_documents": false,
  # "weighted_sampler_alpha": 0.3,

  "vocab-file": "data/code-vocab.json",
  "merge-file": "data/code-merges.txt",

  "save": "checkpoints",
  "load": "checkpoints",
  "checkpoint_validation_with_forward_pass": False,
  
  "tensorboard-dir": "tensorboard",
  "log-dir": "logs",
  "use_wandb": True,
  "wandb_host": "https://api.wandb.ai",
  "wandb_project": "neox",
}
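
For context on the sampling parameters above: temperature rescales the logits before sampling, top_k keeps only the k highest-scoring tokens (0 disables the filter), and top_p keeps the smallest set of tokens whose cumulative probability exceeds p. A minimal PyTorch sketch of that filtering logic, as an illustration of the semantics rather than the actual generate.py implementation:

import torch

def filter_and_sample(logits, temperature=0.2, top_k=0, top_p=0.95):
    # logits: 1-D tensor of next-token scores over the vocabulary.
    logits = logits / temperature  # temperature < 1 sharpens the distribution

    if top_k > 0:
        # Top-k: keep only the k highest-scoring tokens (top_k=0 disables this).
        kth_best = torch.topk(logits, top_k).values[-1]
        logits[logits < kth_best] = float("-inf")

    if top_p < 1.0:
        # Top-p (nucleus): keep the smallest prefix of sorted tokens whose
        # cumulative probability exceeds top_p; mask out everything after it.
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.cumsum(torch.softmax(sorted_logits, dim=-1), dim=-1)
        remove = cum_probs > top_p
        remove[1:] = remove[:-1].clone()  # shift right so the boundary token stays
        remove[0] = False                 # always keep the most likely token
        logits[sorted_idx[remove]] = float("-inf")

    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# e.g. next_token = filter_and_sample(torch.randn(50257))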

And the model config:

# GPT-2 pretraining setup
{
   # parallelism settings (you will want to change these based on your cluster setup,
   # ideally scheduling pipeline stages across node boundaries)
   "pipe-parallel-size": 1,
   "model-parallel-size": 1,

   # model settings
   "num-layers": 32,
   "hidden-size": 2560,
   "num-attention-heads": 32,
   "seq-length": 2048,
   "max-position-embeddings": 2048,
   "norm": "layernorm",
   "pos-emb": "rotary",
   "no-weight-tying": true,

   # these should provide some speedup but take a while to build; set to true if desired
   "scaled-upper-triang-masked-softmax-fusion": true,
   "bias-gelu-fusion": true,

   # optimizer settings
   "zero_allow_untested_optimizer": true,
   "optimizer": {
     "type": "adam",
     "params": {
       "lr": 0.00016,
       "betas": [0.9, 0.999],
       "eps": 1.0e-8,
     }
   },
   "zero_optimization": {
    "stage": 1,
    "allgather_partitions": True,
    "allgather_bucket_size": 500000000,
    "overlap_comm": True,
    "reduce_scatter": True,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": True,
    "cpu_offload": False
  },

   # batch / data settings
   "train_micro_batch_size_per_gpu": 16,
   "gradient_accumulation_steps": 1,
   "data-impl": "mmap",
   "split": "989,10,1",

   # activation checkpointing
   "checkpoint-activations": true,
   "checkpoint-num-layers": 1,
   "partition-activations": true,
   "synchronize-each-layer": true,

   # regularization
   "gradient_clipping": 1.0,
   "weight-decay": 0,
   "hidden-dropout": 0,
   "attention-dropout": 0,

   # precision settings
   "fp16": { 
     "fp16": true,
     "enabled": true,
     "loss_scale": 0,
     "initial_scale_power": 16,
     "loss_scale_window": 1000,
     "hysteresis": 2,
     "min_loss_scale": 1
   },

   # misc. training settings
   "train-iters": 160000,
   "lr-decay-iters": 160000,
   "distributed-backend": "nccl",
   "lr-decay-style": "cosine",
   "warmup": 0.01,
   "save-interval": 1000,
   "eval-interval": 1000,
   "eval-iters": 10,

   # logging
   "log-interval": 100,
   "steps_per_print": 10,
   "keep-last-n-checkpoints": 1,
   "wall_clock_breakdown": true,
}
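
For scale, the settings above (num-layers 32, hidden-size 2560) correspond to a roughly 2.7B-parameter model, which can be sanity-checked with the usual transformer estimate; a back-of-the-envelope sketch, where the 50257 vocab size is an assumption taken from the GPT-2 BPE vocab/merge files referenced in the text-gen config:

# Each transformer layer holds ~12*h^2 weights (4*h^2 attention + 8*h^2 MLP).
L, h, vocab = 32, 2560, 50257  # vocab size assumed: GPT-2 BPE

block_params = 12 * L * h * h   # ~2.52B in the transformer stack
embed_params = vocab * h        # ~0.13B; "no-weight-tying" adds a second copy
total = block_params + 2 * embed_params
print(f"~{total / 1e9:.2f}B parameters")  # -> ~2.77B, i.e. the 2.7B config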
