Fix issue 390 support different machine types on gcp #451
Merged: jacobtomlinson merged 7 commits into dask:main from gmiasnychenko:fix-issue-390-support-different-machine-types-on-gcp on Jun 3, 2025
Commits
aa8aeb3 Adding support for different machine types on GCP (gmiasnychenko)
38ae998 Making GPUs present only on worker instances (gmiasnychenko)
e6c95d4 Fixing the GPU instance on scheduler (gmiasnychenko)
5c19ceb Cleanup (gmiasnychenko)
087bda8 Adjusting to feedback: (gmiasnychenko)
d7efeb8 Adjusting to feedback: (gmiasnychenko)
323ccb9 Adjusting to feedback: (gmiasnychenko)
```diff
@@ -417,7 +417,15 @@ class GCPCluster(VMCluster):
         be cases (i.e. Shared VPC) when network configurations from a different GCP project are used.
     machine_type: str
         The VM machine_type. You can get a full list with ``gcloud compute machine-types list``.
-        The default is ``n1-standard-1`` which is 3.75GB RAM and 1 vCPU
+        The default is ``n1-standard-1`` which is 3.75GB RAM and 1 vCPU.
+        This will determine the resources available to both the scheduler and all workers.
+        If supplied, you may not specify ``scheduler_machine_type`` or ``worker_machine_type``.
+    scheduler_machine_type: str
+        The VM machine_type. This will determine the resources available to the scheduler.
+        The default is ``n1-standard-1`` which is 3.75GB RAM and 1 vCPU.
+    worker_machine_type: str
+        The VM machine_type. This will determine the resources available to all workers.
+        The default is ``n1-standard-1`` which is 3.75GB RAM and 1 vCPU.
     source_image: str
         The OS image to use for the VM. Dask Cloudprovider will bootstrap Ubuntu based images automatically.
         Other images require Docker and for GPUs the NVIDIA Drivers and NVIDIA Docker.
```
```diff
@@ -445,10 +453,11 @@ class GCPCluster(VMCluster):
     extra_bootstrap: list[str] (optional)
         Extra commands to be run during the bootstrap phase.
     ngpus: int (optional)
-        The number of GPUs to attach to the instance.
+        The number of GPUs to attach to the worker instance. No work is expected to be done on scheduler, so no
+        GPU there.
         Default is ``0``.
     gpu_type: str (optional)
-        The name of the GPU to use. This must be set if ``ngpus>0``.
+        The name of the GPU to use on worker. This must be set if ``ngpus>0``.
         You can see a list of GPUs available in each zone with ``gcloud compute accelerator-types list``.
     filesystem_size: int (optional)
         The VM filesystem size in GB. Defaults to ``50``.
```
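To make the documented parameters concrete, here is a minimal usage sketch for this revision of the PR, where `machine_type` is left unset and the scheduler and workers get different machine types. The project ID, zone, machine types, GPU type and worker count are placeholder assumptions, not values from the PR, and `make_cluster` is a hypothetical helper that is only defined here, never called, because creating a cluster needs GCP credentials and incurs cost.

```python
# Keyword arguments for a cluster with a small scheduler and larger GPU workers.
# All values are illustrative placeholders.
cluster_kwargs = {
    "projectid": "my-gcp-project",              # placeholder project ID
    "zone": "us-east1-c",                       # placeholder zone
    "scheduler_machine_type": "n1-standard-2",  # small VM for the scheduler
    "worker_machine_type": "n1-standard-8",     # larger VM for each worker
    "ngpus": 1,                                 # in this revision, attached to workers only
    "gpu_type": "nvidia-tesla-t4",              # must be set because ngpus > 0
    "n_workers": 2,
}


def make_cluster():
    """Create the cluster (hypothetical helper; requires GCP credentials)."""
    # Imported lazily so the sketch can be read without dask-cloudprovider installed.
    from dask_cloudprovider.gcp import GCPCluster

    # machine_type is deliberately omitted: per the docstring above it may not be
    # combined with scheduler_machine_type or worker_machine_type.
    return GCPCluster(**cluster_kwargs)
```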
```diff
@@ -573,6 +582,8 @@ def __init__(
         network=None,
         network_projectid=None,
         machine_type=None,
+        scheduler_machine_type=None,
+        worker_machine_type=None,
         on_host_maintenance=None,
         source_image=None,
         docker_image=None,
```
```diff
@@ -603,7 +614,16 @@ def __init__(
             bootstrap if bootstrap is not None else self.config.get("bootstrap")
         )
         self.machine_type = machine_type or self.config.get("machine_type")
-        self.gpu_instance = "gpu" in self.machine_type or bool(ngpus)
+        if machine_type is None:
```

Collaborator (review comment on the line above): @gmiasnychenko it would be great if we could check that

```diff
+            self.scheduler_machine_type = scheduler_machine_type or self.config.get("scheduler_machine_type")
+            self.worker_machine_type = worker_machine_type or self.config.get("worker_machine_type")
+            if self.scheduler_machine_type is None or self.worker_machine_type is None:
+                raise ValueError("machine_type and scheduler_machine_type must be set")
+        else:
+            if scheduler_machine_type is not None or worker_machine_type is not None:
+                raise ValueError("If you specify machine_type, you may not specify scheduler_machine_type or worker_machine_type")
+            self.scheduler_machine_type = machine_type
+            self.worker_machine_type = machine_type
         self.debug = debug
         self.options = {
             "cluster": self,
```
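The branching in this hunk is easier to follow in isolation. The sketch below pulls it into a standalone function. `resolve_machine_types` and its plain-dict `config` argument are hypothetical names for illustration and are not part of dask-cloudprovider. The error message wording is also mine. As in the hunk, the branch is chosen by the `machine_type` argument itself, not by any configured default.

```python
def resolve_machine_types(
    machine_type=None,
    scheduler_machine_type=None,
    worker_machine_type=None,
    config=None,
):
    """Return a (scheduler_machine_type, worker_machine_type) tuple.

    Hypothetical helper that mirrors the validation shown in the diff.
    """
    config = config or {}

    if machine_type is None:
        # No shared type given: fall back to per-role arguments, then config.
        scheduler = scheduler_machine_type or config.get("scheduler_machine_type")
        worker = worker_machine_type or config.get("worker_machine_type")
        if scheduler is None or worker is None:
            raise ValueError(
                "scheduler_machine_type and worker_machine_type must both be set "
                "when machine_type is not given"
            )
        return scheduler, worker

    # A shared type was given: per-role arguments are not allowed alongside it.
    if scheduler_machine_type is not None or worker_machine_type is not None:
        raise ValueError(
            "If you specify machine_type, you may not specify "
            "scheduler_machine_type or worker_machine_type"
        )
    return machine_type, machine_type
```

Calling it with only `machine_type` yields the same type for both roles, while mixing `machine_type` with either per-role argument raises `ValueError`.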
```diff
@@ -617,6 +637,8 @@ def __init__(
             or self.config.get("on_host_maintenance"),
             "zone": zone or self.config.get("zone"),
             "machine_type": self.machine_type,
+            "scheduler_machine_type": self.scheduler_machine_type,
+            "worker_machine_type": self.worker_machine_type,
             "ngpus": ngpus or self.config.get("ngpus"),
             "network": network or self.config.get("network"),
             "network_projectid": network_projectid
```
```diff
@@ -635,6 +657,18 @@ def __init__(
         }
         self.scheduler_options = {**self.options}
         self.worker_options = {**self.options}
+        self.scheduler_options["machine_type"] = self.scheduler_machine_type
+        self.worker_options["machine_type"] = self.worker_machine_type
+
+        # Scheduler always does not have GPUs as no work is expected to be done there
+        self.scheduler_options["ngpus"] = 0
+        self.scheduler_options["gpu_type"] = None
+        self.scheduler_options["gpu_instance"] = False
+
+        if ngpus or self.config.get("ngpus"):
+            self.worker_options["ngpus"] = ngpus or self.config.get("ngpus")
+            self.worker_options["gpu_type"] = gpu_type or self.config.get("gpu_type")
+            self.worker_options["gpu_instance"] = True

         if "extra_bootstrap" not in kwargs:
             kwargs["extra_bootstrap"] = self.config.get("extra_bootstrap")
```
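The per-role overrides in this hunk rely on `{**self.options}` producing independent top-level copies. The short sketch below illustrates that behaviour with a plain dictionary; the keys echo the diff but the values are made-up placeholders.

```python
# Shared options, standing in for self.options (values are placeholders).
options = {
    "machine_type": "n1-standard-1",
    "ngpus": 2,
    "gpu_type": "nvidia-tesla-t4",
    "gpu_instance": True,
}

# {**d} builds a new dict, so assigning keys on one copy does not affect the others.
scheduler_options = {**options}
worker_options = {**options}

# Scheduler overrides, as in this revision of the diff.
scheduler_options["machine_type"] = "n1-standard-2"
scheduler_options["ngpus"] = 0
scheduler_options["gpu_type"] = None
scheduler_options["gpu_instance"] = False

# Worker override.
worker_options["machine_type"] = "n1-standard-8"
```

The copies are shallow: values that are themselves mutable objects (such as the `"cluster": self` entry in the real options) are shared between the dictionaries, and only top-level key assignments are isolated.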
This isn't true. Due to the way that Dask uses pickle to move things around there are cases where the scheduler might deserialize a meta object which may try and allocate a small amount of GPU memory. It's always recommended to have a small GPU available on the scheduler.
https://docs.rapids.ai/deployment/stable/guides/scheduler-gpu-requirements/
Thank you for the feedback!
While it makes sense to have a GPU on the scheduler to avoid these issues, I think it would be beneficial to allow some flexibility in the configuration. Some users might want different GPU configurations (e.g., a smaller/cheaper GPU on the scheduler vs. more powerful ones on workers), or in some cases might want to explicitly disable scheduler GPUs for cost reasons despite the potential pickle issues.
I've updated the PR to support both approaches:
- `ngpus` and `gpu_type` apply to both scheduler and workers
- `scheduler_ngpus`/`scheduler_gpu_type` and `worker_ngpus`/`worker_gpu_type` for fine-grained control

The default behavior remains the same (same GPU config for both), but now users have the flexibility to choose different configurations when needed. I've also updated the documentation to mention the scheduler GPU requirements you referenced.
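One way the two approaches in this comment could fit together is sketched below. The updated code is not shown in this view, so this is an assumption-laden illustration and not the merged implementation: `resolve_gpu_options` is a hypothetical helper, and the precedence rule (a role-specific value overrides the shared one) is my assumption.

```python
def resolve_gpu_options(
    ngpus=None,
    gpu_type=None,
    scheduler_ngpus=None,
    scheduler_gpu_type=None,
    worker_ngpus=None,
    worker_gpu_type=None,
):
    """Return (scheduler_gpu_options, worker_gpu_options) dictionaries.

    Hypothetical sketch: role-specific settings win over the shared
    ngpus/gpu_type, which in turn win over the defaults (0 GPUs, no type).
    """

    def pick(specific, shared, default):
        # "is not None" lets an explicit 0 disable GPUs for one role.
        if specific is not None:
            return specific
        return shared if shared is not None else default

    scheduler = {
        "ngpus": pick(scheduler_ngpus, ngpus, 0),
        "gpu_type": pick(scheduler_gpu_type, gpu_type, None),
    }
    worker = {
        "ngpus": pick(worker_ngpus, ngpus, 0),
        "gpu_type": pick(worker_gpu_type, gpu_type, None),
    }

    for role, opts in (("scheduler", scheduler), ("worker", worker)):
        if opts["ngpus"] > 0 and opts["gpu_type"] is None:
            raise ValueError(f"{role}: gpu_type must be set if ngpus>0")
        opts["gpu_instance"] = opts["ngpus"] > 0

    return scheduler, worker
```

With only `ngpus` and `gpu_type` supplied, both roles get the same GPU configuration, matching the default behaviour described in the comment. Passing `scheduler_ngpus=0` would explicitly opt the scheduler out.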
Yup totally agree with all of that!