
Scheduling multiple GPUs on GCP via skypilot versus Vertex AI #2239

Description

@fozziethebeat

Hi All! I found SkyPilot last week and it's been a huge improvement for deploying single-GPU workflows (training and serving).

But I've found a situation where it's not working so well: scheduling multi-GPU setups on GCP.

Right now I'm trying to run a simple trainer on GCP with a small HuggingFace-based trainer library I've got. I want an n1-highmem-16 or n1-standard-x type machine attached to two V100s. When I launch with SkyPilot, it does its best to find quota but consistently fails. When I make a similar request via GCP's Vertex AI command-line tools, the job gets quota much faster and schedules.
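
For context, this is roughly how I'm launching it (the cluster name and YAML filename here are just placeholders, not my real ones):

# Provision a cluster that satisfies the task's resources and run the task on it.
# SkyPilot then searches zones/regions for V100:2 capacity before reporting failure.
sky launch -c pythia-trainer task.yaml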

Is there some kind of bias in GCP where it prefers Vertex AI-based scheduling over the tools SkyPilot has access to?

For reference, here's a copy of my two configs:

SkyPilot:

name: unified-pythia160m-peft
resources:
  accelerators: V100:2

workdir: .

file_mounts:
  /gcs-data:
    source: gs://my-bucket
    mode: MOUNT

setup: |
  sudo apt-get install -y git-lfs
  conda create -n cubrio-trainer python=3.9 -y
  conda activate cubrio-trainer
  pip install .

run: |
  conda activate cubrio-trainer
  python -m torch.distributed.launch --nproc_per_node=${NUM_GPUS=2} \
    -m cubrio_ml_training.train_peft \
    --model_name_or_path=EleutherAI/pythia-160m \
    --project_name=unified_pythia160m \
    --chat_dataset_path=/gcs-data//data_path.jsonl \
    --output_dir /gcs-data/checkpoints/unified_pythia160m \
    --save_model \
    --save_merged_model
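
A quick note on the run block above: ${NUM_GPUS=2} is just bash parameter expansion that defaults NUM_GPUS to 2 when it isn't already set. If I understand correctly, SkyPilot also exports the per-node GPU count at run time, so the launcher could read that instead; a sketch, assuming the SKYPILOT_NUM_GPUS_PER_NODE variable is available in the run section:

run: |
  conda activate cubrio-trainer
  # Same launcher as above, but reading the GPU count from SkyPilot's runtime
  # environment variable (falling back to 2); the training flags are unchanged.
  python -m torch.distributed.launch \
    --nproc_per_node=${SKYPILOT_NUM_GPUS_PER_NODE:-2} \
    -m cubrio_ml_training.train_peft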

Vertex AI:

workerPoolSpecs:
  machineSpec:
    machineType: n1-standard-8
    acceleratorType: NVIDIA_TESLA_V100
    acceleratorCount: 2
  replicaCount: 1
  containerSpec:
    imageUri: gcr.io/my_project_id/pytorch_gpu_train_hf_peft_creator:latest
    command:
      - python3.10
    args:
      - -m
      - torch.distributed.launch
      - --nproc_per_node=2
      - -m
      - cubrio_ml_training.train_peft
      - --model_name_or_path=EleutherAI/pythia-160m
      - --project_name=unified_pythia160m
      - --chat_dataset_path=/gcs/data_path.jsonl
      - --output_dir=checkpoints/unified_pythia160m
      - --save_model
      - --save_merged_model

The only substantial difference between the two is the machine type: SkyPilot picks an n1-highmem-16 VM, but I doubt that's what's causing the issue.
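
If it would help isolate things, I can pin the machine type on the SkyPilot side so both requests ask for exactly the same shape; a sketch, assuming instance_type can be set alongside accelerators in the resources block (which I believe it can):

resources:
  # Match the Vertex AI machineSpec exactly instead of letting SkyPilot pick.
  instance_type: n1-standard-8
  accelerators: V100:2

That would at least rule out the machine-type difference.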
