These examples show how to use SkyPilot to deploy development nodes or clusters.
All examples assume Lambda as the cloud provider, but many others are possible: https://docs.skypilot.co/en/latest/overview.html#cloud-vms.
- Deploying a development node
- Configure a development node
- Run training on a single node
- Miscellaneous

Deploying a development node
This is the simplest use case: you just want a node with a single GPU. Run the following from any directory, typically the root of a GitHub repo:
# Configure Lambda access: https://docs.skypilot.co/en/latest/getting-started/installation.html#lambda-cloud
# Go here https://cloud.lambda.ai/api-keys/cloud-api and create an API key
mkdir -p ~/.lambda_cloud
echo "api_key = <your_api_key_here>" > ~/.lambda_cloud/lambda_keys
# Pin uvicorn version bound to avoid: https://github.com/skypilot-org/skypilot/issues/7287
# Quote the extra so shells such as zsh do not treat the brackets as a glob
pip install "skypilot[lambda]"
# Deploy a single A100 node
sky launch --infra lambda --cluster dllm --gpus A100 --disk-size 100
# To sync code in your current directory (i.e. this repo) to the node, specify a workdir:
sky launch --infra lambda --cluster dllm --gpus A100 --disk-size 100 --workdir .
Then you can SSH to the node with ssh dllm, or whatever cluster name you chose. If your cluster has more than one node, the remaining nodes appear as hosts in your SSH config named dllm-worker1, dllm-worker2, etc.
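The host naming described above can be sketched as a small helper, which is handy when you want to run the same command on every node. `cluster_hosts` is a hypothetical name, and the pattern is taken only from the description above, so check it against the hosts that actually appear in your SSH config.

```shell
# Hypothetical helper: list the SSH host aliases for a cluster of N nodes,
# assuming the head node is named after the cluster and the workers are
# named <name>-worker1, <name>-worker2, ...
cluster_hosts() {
  name="$1"
  nodes="$2"
  echo "$name"
  i=1
  while [ "$i" -lt "$nodes" ]; do
    echo "$name-worker$i"
    i=$((i + 1))
  done
}

# Example: list the GPUs on every node of a 3-node cluster
# for host in $(cluster_hosts dllm 3); do ssh "$host" nvidia-smi -L; done
```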

Configure a development node
The example above works, but it doesn't create an environment, set env vars, etc. Here's an example that at least creates an environment:
cat << 'EOF' > cluster.sky.yaml
resources:
  cloud: lambda
  # Use cheaper of A100s or H100s, whatever is available
  accelerators: ["A100:8", "H100:8"]
  disk_size: 100 # GB
setup: |
  # Create dllm-dev conda environment
  conda env create -f requirements.yaml
EOF
Then you can launch the cluster with:
sky launch --cluster dllm cluster.sky.yaml
# Once that's done:
ssh dllm
conda deactivate # deactivate base conda env skypilot creates
conda activate dllm-dev # activate dllm-dev env

Run training on a single node
Building on the example in Configure a development node above, you can now run training with some extra initial cluster setup:
cat << 'EOF' > cluster.sky.yaml
# Use these env vars from your LOCAL machine so you only ever
# have to set them in one place
envs:
  HUGGING_FACE_HUB_TOKEN: null
  WANDB_API_KEY: null
# This determines the local dir for /home/ubuntu/sky_workdir,
# which is where `setup` runs below
workdir: .
setup: |
  # Create dllm-dev conda environment
  conda env create -f requirements.yaml
  # Create local env for future jobs and ssh sessions
  > ~/.env # Clear first
  # Add tokens/secrets from client env
  for var in HUGGING_FACE_HUB_TOKEN WANDB_API_KEY; do
    declare -n ref=$var
    echo "$var=$ref" >> ~/.env
  done
  # Add other project-specific configuration
  cat << EOF >> ~/.env
  WANDB__SERVICE_WAIT=600
  _WANDB_STARTUP_DEBUG=true
  WANDB_ENTITY=kuleshov-group
  HF_HOME=${PWD}/.hf_cache
  PYTHONPATH=${PWD}:${PWD}/.hf_cache/modules
  HYDRA_FULL_ERROR=1
  NCCL_P2P_LEVEL=NVL
  EOF
  # Source ~/.env on login
  if ! grep -q "set -a; source ~/.env; set +a" ~/.bashrc; then
    echo "set -a; source ~/.env; set +a" >> ~/.bashrc
  fi
EOF
Notably, this takes your W&B and Hugging Face tokens from your local machine and sets them on the cluster, so you never have to copy them into repo-specific files by hand (usually what you want). They are also set when you ssh in to a cluster node. Training can then be run like this:
# Launch the single-node cluster
sky launch --cluster dllm cluster.sky.yaml \
--env HUGGING_FACE_HUB_TOKEN --env WANDB_API_KEY
# Define the task to run
cat << 'EOF' > task.sky.yaml
workdir: .
run: |
  set -exo pipefail
  conda deactivate
  conda activate dllm-dev
  python -c "import torch; assert torch.cuda.is_available()"
  export RUN_DIR=outputs
  export DATA_DIR=data
  export NUM_VISIBLE_DEVICES=8
  bash bash_scripts/run_train_e2d2_wmt_lambda.sh
EOF
# Run the task
sky exec -c dllm task.sky.yaml
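The `set -a; source ~/.env; set +a` line that the setup step adds to ~/.bashrc is what makes the variables visible to child processes such as python. The sketch below illustrates the difference using a temporary file and made-up variable names (`DEMO_TOKEN`, `DEMO_LEVEL`), not the real ~/.env. It uses `.`, the POSIX spelling of `source`.

```shell
# Hypothetical demo file; the setup above writes to ~/.env instead.
unset DEMO_TOKEN DEMO_LEVEL
env_file="$(mktemp)"
printf '%s\n' 'DEMO_TOKEN=abc123' 'DEMO_LEVEL=NVL' > "$env_file"

# Plain sourcing sets shell variables but does not export them,
# so a child process does not see them.
. "$env_file"
plain="$(sh -c 'printf "%s" "${DEMO_TOKEN:-unset}"')"

# With allexport (set -a) on, every assignment made while sourcing
# is exported, so the child process sees the value.
set -a; . "$env_file"; set +a
exported="$(sh -c 'printf "%s" "${DEMO_TOKEN:-unset}"')"

echo "plain source: $plain"
echo "set -a source: $exported"
rm -f "$env_file"
```

The first line printed reports `unset` and the second reports `abc123`, which is why the setup wraps the `source` call in `set -a` and `set +a`.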

Miscellaneous
How do I shut a cluster down?
sky down dllm

How do I reset SkyPilot state?
Resetting matters when you switch between projects or between locally installed versions of SkyPilot. It is safe to do often, unless your cloud provider needs global SkyPilot configuration in ~/.sky/config.yaml (e.g. Kubernetes), which the reset deletes.
sky api stop; [ -d ~/.sky ] && rm -rf ~/.sky

How do I check if Lambda is set up correctly?
sky check lambda

How do I cancel a task that is running?
Pressing Ctrl-C normally only disconnects the streaming logs; it does not cancel the task. To cancel a task manually, use:
# Show all running jobs
> sky queue
Fetching and parsing job queue...
Fetching job queue for: dllm
Job queue of current user on cluster dllm
ID  NAME  USER    SUBMITTED    STARTED      DURATION  RESOURCES   STATUS     LOG                                        GIT COMMIT
10  -     eczech  11 mins ago  11 mins ago  11m 19s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2025-09-22-19-51-11-954378  e610162171a5eaaff76d7c6b31074f4ac9fbadf8
9   -     eczech  18 mins ago  18 mins ago  4m 36s    1x[CPU:1+]  FAILED     ~/sky_logs/sky-2025-09-22-19-44-00-948488  e610162171a5eaaff76d7c6b31074f4ac9fbadf8
8   -     eczech  2 hrs ago    2 hrs ago    9m 59s    1x[CPU:1+]  FAILED     ~/sky_logs/sky-2025-09-22-17-12-50-492436  e610162171a5eaaff76d7c6b31074f4ac9fbadf8
7   -     eczech  2 hrs ago    2 hrs ago    < 1s      1x[CPU:1+]  SUCCEEDED  ~/sky_logs/sky-2025-09-22-17-11-42-124874  e610162171a5eaaff76d7c6b31074f4ac9fbadf8
# Cancel the job you want to stop
> sky cancel dllm 10
Confusingly, SkyPilot has both Cluster Jobs and Managed Jobs. The example above cancels a Cluster Job. Managed Jobs are not purely user-defined like Cluster Jobs: SkyPilot adds features around them for pipelining jobs together and handling preemptions. Managed Jobs have their own sky jobs command.
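If you want to pick out the running Cluster Jobs without reading the table by eye, a filter along these lines may help. `running_job_ids` is a hypothetical helper, and it assumes the `sky queue` table keeps the layout shown above (numeric ID in the first column, status as a separate word), which may differ between SkyPilot versions.

```shell
# Hypothetical helper: read `sky queue` output on stdin and print the IDs
# of the rows whose status is RUNNING.
running_job_ids() {
  awk '$1 ~ /^[0-9]+$/ && / RUNNING / { print $1 }'
}

# Usage (assumes the cluster is named dllm):
# for id in $(sky queue dllm | running_job_ids); do sky cancel dllm "$id"; done
```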
How do I show all GPUs of a certain type available on Lambda?
sky show-gpus A100 --infra lambda
GPU   QTY  CLOUD   INSTANCE_TYPE     DEVICE_MEM  vCPUs  HOST_MEM  HOURLY_PRICE  HOURLY_SPOT_PRICE  REGION
A100  1.0  Lambda  gpu_1x_a100       40GB        30     200GB     $ 1.290       -                  europe-central-1
A100  1.0  Lambda  gpu_1x_a100_sxm4  40GB        30     200GB     $ 1.290       -                  europe-central-1
A100  2.0  Lambda  gpu_2x_a100       40GB        60     400GB     $ 2.580       -                  europe-central-1
A100  4.0  Lambda  gpu_4x_a100       40GB        120    800GB     $ 5.160       -                  europe-central-1
A100  8.0  Lambda  gpu_8x_a100       40GB        124    1800GB    $ 10.320      -                  europe-central-1

GPU        QTY  CLOUD   INSTANCE_TYPE          DEVICE_MEM  vCPUs  HOST_MEM  HOURLY_PRICE  HOURLY_SPOT_PRICE  REGION
A100-80GB  8.0  Lambda  gpu_8x_a100_80gb_sxm4  80GB        240    1800GB    $ 14.320      -                  europe-central-1
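
To compare instance types, it can help to convert the hourly price into a price per GPU. The sketch below does that with awk. `per_gpu_price` is a hypothetical helper, and it assumes the column layout shown above (quantity in column 2, instance type in column 4, price printed as `$` followed by a number), which may change between SkyPilot versions.

```shell
# Hypothetical helper: read `sky show-gpus` output on stdin and print
# "<instance_type> <hourly price per GPU>" for each data row.
per_gpu_price() {
  awk '$1 != "GPU" && NF > 0 {
    price = ""
    # The price is the field that follows the standalone "$" field.
    for (i = 1; i < NF; i++) if ($i == "$") { price = $(i + 1); break }
    if (price != "" && $2 > 0) printf "%s %.3f\n", $4, price / $2
  }'
}

# Usage:
# sky show-gpus A100 --infra lambda | per_gpu_price
```

For the rows above, this works out to 1.290 per GPU for every 40GB A100 instance type and 1.790 per GPU for gpu_8x_a100_80gb_sxm4.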