
Scheduling multiple GPUs on GCP via skypilot versus Vertex AI #2239

Description

@fozziethebeat

Hi All! I found SkyPilot last week and it's been a huge improvement for deploying single-GPU workflows (training and serving).

But I've found a situation where it's not working so well: scheduling multi-GPU setups on GCP.

Right now I'm trying to run a simple trainer on GCP with a small HuggingFace-based trainer library I've got. I want an n1-highmem-16 or n1-standard-x type machine attached to two V100s. When I launch with SkyPilot, it does its best to find quota but consistently fails. When I make a similar request via GCP's Vertex AI command-line tools, the job gets quota much faster and schedules.
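
For context, this is roughly how I'm launching it (the cluster name and YAML filename here are just placeholders, not my real ones):

# Provision a cluster that satisfies the task's resources and run the task on it.
# SkyPilot then searches zones/regions for V100:2 capacity before reporting failure.
sky launch -c pythia-trainer task.yaml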

Is there some kind of bias in GCP where it prefers Vertex AI-based scheduling over the tools SkyPilot has access to?

For reference, here's a copy of my two configs:

SkyPilot:

name: unified-pythia160m-peft
resources:
  accelerators: V100:2

workdir: .

file_mounts:
  /gcs-data:
    source: gs://my-bucket
    mode: MOUNT

setup: |
  sudo apt-get install -y git-lfs
  conda create -n cubrio-trainer python=3.9 -y
  conda activate cubrio-trainer
  pip install .

run: |
  conda activate cubrio-trainer
  python -m torch.distributed.launch --nproc_per_node=${NUM_GPUS=2} \
    -m cubrio_ml_training.train_peft \
    --model_name_or_path=EleutherAI/pythia-160m \
    --project_name=unified_pythia160m \
    --chat_dataset_path=/gcs-data//data_path.jsonl \
    --output_dir /gcs-data/checkpoints/unified_pythia160m \
    --save_model \
    --save_merged_model
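
A quick note on the run block above: ${NUM_GPUS=2} is just bash parameter expansion that defaults NUM_GPUS to 2 when it isn't already set. If I understand correctly, SkyPilot also exports the per-node GPU count at run time, so the launcher could read that instead; a sketch, assuming the SKYPILOT_NUM_GPUS_PER_NODE variable is available in the run section:

run: |
  conda activate cubrio-trainer
  # Same launcher as above, but reading the GPU count from SkyPilot's runtime
  # environment variable (falling back to 2); the training flags are unchanged.
  python -m torch.distributed.launch \
    --nproc_per_node=${SKYPILOT_NUM_GPUS_PER_NODE:-2} \
    -m cubrio_ml_training.train_peft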

Vertex AI:

workerPoolSpecs:
  machineSpec:
    machineType: n1-standard-8
    acceleratorType: NVIDIA_TESLA_V100
    acceleratorCount: 2
  replicaCount: 1
  containerSpec:
    imageUri: gcr.io/my_project_id/pytorch_gpu_train_hf_peft_creator:latest
    command:
      - python3.10
    args:
      - -m
      - torch.distributed.launch
      - --nproc_per_node=2
      - -m
      - cubrio_ml_training.train_peft
      - --model_name_or_path=EleutherAI/pythia-160m
      - --project_name=unified_pythia160m
      - --chat_dataset_path=/gcs/data_path.jsonl
      - --output_dir=checkpoints/unified_pythia160m
      - --save_model
      - --save_merged_model

The only substantial difference between the two is the machine type: SkyPilot picks an n1-highmem-16 VM, but I doubt that's what's causing the issue.
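
If it would help isolate things, I can pin the machine type on the SkyPilot side so both requests ask for exactly the same shape; a sketch, assuming instance_type can be set alongside accelerators in the resources block (which I believe it can):

resources:
  # Match the Vertex AI machineSpec exactly instead of letting SkyPilot pick.
  instance_type: n1-standard-8
  accelerators: V100:2

That would at least rule out the machine-type difference.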
