Merged
Changes from 5 commits
4 changes: 3 additions & 1 deletion dask_cloudprovider/cloudprovider.yaml
@@ -101,7 +101,9 @@ cloudprovider:
network_projectid: null # GCP project id where the network exists
projectid: "" # name of the google cloud project
on_host_maintenance: "TERMINATE"
machine_type: "n1-standard-1" # size of the machine type to use
machine_type: "n1-standard-1" # size of the machine type to use for the scheduler and all workers
scheduler_machine_type: "n1-standard-1" # size of the machine type to use for the scheduler
worker_machine_type: "n1-standard-1" # size of the machine type to use for all workers
filesystem_size: 50 # amount in GBs of hard drive space to allocate
ngpus: "" # number of GPUs to use
gpu_type: "" # type of gpus to use: nvidia-tesla-k80, nvidia-tesla-p100, nvidia-tesla-t4
42 changes: 38 additions & 4 deletions dask_cloudprovider/gcp/instances.py
@@ -417,7 +417,15 @@ class GCPCluster(VMCluster):
be cases (i.e. Shared VPC) when network configurations from a different GCP project are used.
machine_type: str
The VM machine_type. You can get a full list with ``gcloud compute machine-types list``.
The default is ``n1-standard-1`` which is 3.75GB RAM and 1 vCPU
The default is ``n1-standard-1`` which is 3.75GB RAM and 1 vCPU.
This will determine the resources available to both the scheduler and all workers.
If supplied, you may not specify ``scheduler_machine_type`` or ``worker_machine_type``.
scheduler_machine_type: str
The VM machine_type. This will determine the resources available to the scheduler.
The default is ``n1-standard-1`` which is 3.75GB RAM and 1 vCPU.
worker_machine_type: str
The VM machine_type. This will determine the resources available to all workers.
The default is ``n1-standard-1`` which is 3.75GB RAM and 1 vCPU.
source_image: str
The OS image to use for the VM. Dask Cloudprovider will bootstrap Ubuntu based images automatically.
Other images require Docker and for GPUs the NVIDIA Drivers and NVIDIA Docker.
@@ -445,10 +453,11 @@ class GCPCluster(VMCluster):
extra_bootstrap: list[str] (optional)
Extra commands to be run during the bootstrap phase.
ngpus: int (optional)
The number of GPUs to attach to the instance.
The number of GPUs to attach to the worker instance. No work is expected to be done on the scheduler, so no
GPU there.

Member:

This isn't true. Due to the way that Dask uses pickle to move things around, there are cases where the scheduler might deserialize a meta object which may try to allocate a small amount of GPU memory. It's always recommended to have a small GPU available on the scheduler.

https://docs.rapids.ai/deployment/stable/guides/scheduler-gpu-requirements/

Contributor Author:

Thank you for the feedback!

While it makes sense to have a GPU on the scheduler to avoid these issues, I think it would be beneficial to allow some flexibility in the configuration. Some users might want different GPU configurations (e.g., a smaller/cheaper GPU on the scheduler vs. more powerful ones on workers), or in some cases might want to explicitly disable scheduler GPUs for cost reasons despite the potential pickle issues.

I've updated the PR to support both approaches:

  • Unified configuration (existing behavior): ngpus and gpu_type apply to both the scheduler and workers
  • Separate configuration (new): scheduler_ngpus/scheduler_gpu_type and worker_ngpus/worker_gpu_type for fine-grained control

The default behavior remains the same (same GPU config for both), but now users have the flexibility to choose different configurations when needed. I've also updated the documentation to mention the scheduler GPU requirements you referenced.

Member:

Yup, totally agree with all of that!

Default is ``0``.
gpu_type: str (optional)
The name of the GPU to use. This must be set if ``ngpus>0``.
The name of the GPU to use on workers. This must be set if ``ngpus>0``.
You can see a list of GPUs available in each zone with ``gcloud compute accelerator-types list``.
filesystem_size: int (optional)
The VM filesystem size in GB. Defaults to ``50``.
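The unified-vs-separate precedence described in the thread above could be sketched as a small standalone helper (hypothetical, not the library's API; the `scheduler_ngpus`/`worker_ngpus` names come from the contributor's comment):

```python
def resolve_gpu_config(ngpus=None, gpu_type=None,
                       scheduler_ngpus=None, scheduler_gpu_type=None,
                       worker_ngpus=None, worker_gpu_type=None):
    """Resolve unified vs. per-role GPU settings (illustrative precedence).

    Unified ``ngpus``/``gpu_type`` apply to both roles unless a role-specific
    value is given, matching the "default behavior remains the same" note.
    """
    scheduler = {
        "ngpus": scheduler_ngpus if scheduler_ngpus is not None else (ngpus or 0),
        "gpu_type": scheduler_gpu_type or gpu_type,
    }
    worker = {
        "ngpus": worker_ngpus if worker_ngpus is not None else (ngpus or 0),
        "gpu_type": worker_gpu_type or gpu_type,
    }
    return scheduler, worker
```

For example, a unified `ngpus=2` with a role-specific `scheduler_ngpus=1` gives the scheduler one GPU of the shared type while workers keep two.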
@@ -573,6 +582,8 @@ def __init__(
network=None,
network_projectid=None,
machine_type=None,
scheduler_machine_type=None,
worker_machine_type=None,
on_host_maintenance=None,
source_image=None,
docker_image=None,
@@ -603,7 +614,16 @@ def __init__(
bootstrap if bootstrap is not None else self.config.get("bootstrap")
)
self.machine_type = machine_type or self.config.get("machine_type")
self.gpu_instance = "gpu" in self.machine_type or bool(ngpus)
if machine_type is None:
self.scheduler_machine_type = scheduler_machine_type or self.config.get("scheduler_machine_type")
self.worker_machine_type = worker_machine_type or self.config.get("worker_machine_type")
if self.scheduler_machine_type is None or self.worker_machine_type is None:
raise ValueError("scheduler_machine_type and worker_machine_type must both be set when machine_type is not set")
else:
if scheduler_machine_type is not None or worker_machine_type is not None:
raise ValueError("If you specify machine_type, you may not specify scheduler_machine_type or worker_machine_type")
self.scheduler_machine_type = machine_type
self.worker_machine_type = machine_type

Collaborator:

@gmiasnychenko it would be great if we could check that machine_type is set XOR scheduler/worker_machine_type; otherwise, we should throw an error. It should be a BC safe check.
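The backward-compatible XOR check requested above can be isolated as a pure function for illustration (a sketch, not the class's actual method; `config` stands in for `self.config`):

```python
def resolve_machine_types(machine_type=None,
                          scheduler_machine_type=None,
                          worker_machine_type=None,
                          config=None):
    """Return (scheduler, worker) machine types with a BC-safe XOR check.

    Either the unified machine_type is given, or both role-specific types
    are resolvable (from arguments or config) -- never a mix.
    """
    config = config or {}
    if machine_type is None:
        scheduler = scheduler_machine_type or config.get("scheduler_machine_type")
        worker = worker_machine_type or config.get("worker_machine_type")
        if scheduler is None or worker is None:
            raise ValueError(
                "scheduler_machine_type and worker_machine_type must both "
                "be set when machine_type is not set"
            )
        return scheduler, worker
    if scheduler_machine_type is not None or worker_machine_type is not None:
        raise ValueError(
            "If you specify machine_type, you may not specify "
            "scheduler_machine_type or worker_machine_type"
        )
    # Unified path: both roles get the same machine type (existing behavior).
    return machine_type, machine_type
```

Existing callers that pass only `machine_type` keep the old behavior, which is what makes the check backward compatible.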
self.debug = debug
self.options = {
"cluster": self,
@@ -617,6 +637,8 @@ def __init__(
or self.config.get("on_host_maintenance"),
"zone": zone or self.config.get("zone"),
"machine_type": self.machine_type,
"scheduler_machine_type": self.scheduler_machine_type,
"worker_machine_type": self.worker_machine_type,
"ngpus": ngpus or self.config.get("ngpus"),
"network": network or self.config.get("network"),
"network_projectid": network_projectid
@@ -635,6 +657,18 @@ def __init__(
}
self.scheduler_options = {**self.options}
self.worker_options = {**self.options}
self.scheduler_options["machine_type"] = self.scheduler_machine_type
self.worker_options["machine_type"] = self.worker_machine_type

# The scheduler never gets GPUs here, as no work is expected to be done on it
self.scheduler_options["ngpus"] = 0
self.scheduler_options["gpu_type"] = None
self.scheduler_options["gpu_instance"] = False

if ngpus or self.config.get("ngpus"):
self.worker_options["ngpus"] = ngpus or self.config.get("ngpus")
self.worker_options["gpu_type"] = gpu_type or self.config.get("gpu_type")
self.worker_options["gpu_instance"] = True

if "extra_bootstrap" not in kwargs:
kwargs["extra_bootstrap"] = self.config.get("extra_bootstrap")