Description
Hi all! I found SkyPilot last week and it's been a huge improvement for deploying single-GPU workflows (training and serving).
But I've hit a situation where it doesn't work so well when scheduling multi-GPU setups on GCP.
Right now I'm trying to run a simple trainer on GCP with a small HuggingFace-based trainer library I've got. I want an n1-highmem-16
or n1-standard-x
type machine attached to two V100s. When I launch with SkyPilot, it does its best to find quota but consistently fails. When I make a similar request via GCP's Vertex AI command-line tools, the job gets quota much faster and schedules.
Is there some kind of bias in GCP where Vertex AI-based scheduling is preferred over the tools SkyPilot has access to?
For reference, here's a copy of my two configs:
SkyPilot:
name: unified-pythia160m-peft
resources:
  accelerators: V100:2
workdir: .
file_mounts:
  /gcs-data:
    source: gs://my-bucket
    mode: MOUNT
setup: |
  sudo apt-get install -y git-lfs
  conda create -n cubrio-trainer python=3.9 -y
  conda activate cubrio-trainer
  pip install .
run: |
  conda activate cubrio-trainer
  python -m torch.distributed.launch --nproc_per_node=${NUM_GPUS=2} \
    -m cubrio_ml_training.train_peft \
    --model_name_or_path=EleutherAI/pythia-160m \
    --project_name=unified_pythia160m \
    --chat_dataset_path=/gcs-data/data_path.jsonl \
    --output_dir /gcs-data/checkpoints/unified_pythia160m \
    --save_model \
    --save_merged_model
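(Minor aside: I believe SkyPilot exports SKYPILOT_NUM_GPUS_PER_NODE into the run environment, so the nproc_per_node count could come from that instead of my NUM_GPUS placeholder. Untested sketch of the run block:)
run: |
  conda activate cubrio-trainer
  # Assumption: SKYPILOT_NUM_GPUS_PER_NODE is set by SkyPilot at runtime;
  # fall back to 2 if it isn't.
  python -m torch.distributed.launch \
    --nproc_per_node=${SKYPILOT_NUM_GPUS_PER_NODE:-2} \
    -m cubrio_ml_training.train_peft \
    --model_name_or_path=EleutherAI/pythia-160m \
    --project_name=unified_pythia160m \
    --chat_dataset_path=/gcs-data/data_path.jsonl \
    --output_dir /gcs-data/checkpoints/unified_pythia160m \
    --save_model \
    --save_merged_model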
Vertex AI:
workerPoolSpecs:
  - machineSpec:
      machineType: n1-standard-8
      acceleratorType: NVIDIA_TESLA_V100
      acceleratorCount: 2
    replicaCount: 1
    containerSpec:
      imageUri: gcr.io/my_project_id/pytorch_gpu_train_hf_peft_creator:latest
      command:
        - python3.10
      args:
        - -m
        - torch.distributed.launch
        - --nproc_per_node=2
        - -m
        - cubrio_ml_training.train_peft
        - --model_name_or_path=EleutherAI/pythia-160m
        - --project_name=unified_pythia160m
        - --chat_dataset_path=/gcs/data_path.jsonl
        - --output_dir=checkpoints/unified_pythia160m
        - --save_model
        - --save_merged_model
The only substantial difference between the two is the machine type: SkyPilot picks an n1-highmem-16
VM, but I doubt that's causing the issue.
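(If it would help rule that out, I believe the VM shape can be pinned explicitly in the SkyPilot resources block so both systems request exactly the same machine; untested sketch, assuming instance_type can be combined with accelerators on GCP:)
resources:
  cloud: gcp                    # restrict to GCP so the instance type applies
  instance_type: n1-standard-8  # match the Vertex AI machineSpec
  accelerators: V100:2
If it still fails with the identical shape, that would at least confirm the machine type isn't the variable.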