Fix issue 390: support different machine types on GCP #451
jacobtomlinson merged 7 commits into dask:main from gmiasnychenko:fix-issue-390-support-different-machine-types-on-gcp
Conversation
dask_cloudprovider/gcp/instances.py
Outdated
if ngpus is not None:
    self.scheduler_options["ngpus"] = 0
    self.scheduler_options["gpu_type"] = None
    self.scheduler_options["gpu_instance"] = False
@gmiasnychenko should we always set the scheduler GPU settings, regardless of the number of GPUs?
Also, please leave a comment explaining that we don't run tasks on the scheduler, so we don't need a GPU there.
As for setting the GPU settings, I believe the answer is yes. All the settings go into self.options, which is the base for the later self.scheduler_options and self.worker_options. If we don't override the scheduler GPU settings, they will keep the values set above, and we will end up with the same configuration for both scheduler and worker.
I can move the override outside the if statement, if that's what you mean. It provides more clarity, but should be functionally the same; a sketch is below.
I agree with providing more documentation. I will add it to the ngpus and gpu_type argument descriptions.
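For illustration, a minimal sketch of what moving the override outside the if statement could look like (split_gpu_options is a hypothetical standalone helper, not the actual GCPCluster code; the key names mirror the diff above):

```python
def split_gpu_options(options: dict) -> tuple[dict, dict]:
    """Derive scheduler and worker options from the shared options."""
    worker_options = dict(options)  # workers keep the requested GPU configuration
    scheduler_options = dict(options)
    # The scheduler does not run tasks, so always strip its GPU settings,
    # regardless of whether ngpus was explicitly provided.
    scheduler_options["ngpus"] = 0
    scheduler_options["gpu_type"] = None
    scheduler_options["gpu_instance"] = False
    return scheduler_options, worker_options


scheduler_opts, worker_opts = split_gpu_options(
    {"ngpus": 2, "gpu_type": "nvidia-tesla-t4", "gpu_instance": True}
)
```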
)
self.machine_type = machine_type or self.config.get("machine_type")
self.gpu_instance = "gpu" in self.machine_type or bool(ngpus)
if machine_type is None:
@gmiasnychenko it would be great if we could check that machine_type is set XOR scheduler/worker_machine_type; otherwise, we should throw an error. It should be a backwards-compatible (BC-safe) check.
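For illustration, one possible shape of that check (a sketch only; resolve_machine_types is a hypothetical helper, and the parameter names are assumed from the comment above):

```python
def resolve_machine_types(machine_type, scheduler_machine_type, worker_machine_type, default):
    if machine_type is not None and (scheduler_machine_type or worker_machine_type):
        # Mixing the unified and per-role options is ambiguous, so reject it.
        raise ValueError(
            "Specify either machine_type or "
            "scheduler_machine_type/worker_machine_type, not both."
        )
    if machine_type is not None:
        # Unified option: both roles use the same machine type (existing behaviour).
        return machine_type, machine_type
    # Per-role options, falling back to the config default so existing
    # configurations keep working (backwards compatible).
    return (
        scheduler_machine_type or default,
        worker_machine_type or default,
    )
```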
- added info on GPU logic in docs
- adjusted scheduler GPU logic
- fixed the machine type checker
jacobtomlinson
left a comment
Thanks for this, seems like a great improvement!
dask_cloudprovider/gcp/instances.py
Outdated
Extra commands to be run during the bootstrap phase.
ngpus: int (optional)
    The number of GPUs to attach to the instance.
    The number of GPUs to attach to the worker instance. No work is expected to be done on the scheduler, so no
This isn't true. Due to the way that Dask uses pickle to move things around, there are cases where the scheduler might deserialize a meta object which may try to allocate a small amount of GPU memory. It's always recommended to have a small GPU available on the scheduler.
https://docs.rapids.ai/deployment/stable/guides/scheduler-gpu-requirements/
Thank you for the feedback!
While it makes sense to have a GPU on the scheduler to avoid these issues, I think it would be beneficial to allow some flexibility in the configuration. Some users might want different GPU configurations (e.g., a smaller/cheaper GPU on the scheduler vs. more powerful ones on workers), or in some cases might want to explicitly disable scheduler GPUs for cost reasons despite the potential pickle issues.
I've updated the PR to support both approaches:
- Unified configuration (existing behavior): ngpus and gpu_type apply to both scheduler and workers
- Separate configuration (new): scheduler_ngpus/scheduler_gpu_type and worker_ngpus/worker_gpu_type for fine-grained control

The default behavior remains the same (same GPU config for both), but now users have the flexibility to choose different configurations when needed. I've also updated the documentation to mention the scheduler GPU requirements you referenced.
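For example, usage under the two approaches might look roughly like this (a sketch only; the new scheduler_*/worker_* keyword names follow the list above and are assumed to be accepted by GCPCluster):

```python
from dask_cloudprovider.gcp import GCPCluster

# Unified configuration (existing behaviour): the same GPU settings
# apply to both the scheduler and the workers.
cluster = GCPCluster(ngpus=2, gpu_type="nvidia-tesla-t4")

# Separate configuration (new): a small GPU on the scheduler and
# larger GPUs on the workers.
cluster = GCPCluster(
    scheduler_ngpus=1,
    scheduler_gpu_type="nvidia-tesla-t4",
    worker_ngpus=2,
    worker_gpu_type="nvidia-tesla-a100",
)
```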
Yup totally agree with all of that!
- Maintain existing GPU logic
- Add ability to specify different GPU configurations for workers and scheduler
- Update the documentation to get rid of old GPU references
As per #390, there is a feature request to allow choosing different machine types on GCP. Here I tried to implement that and make only the workers use a GPU.
I used #369 as a reference.
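For context, the intended usage might look roughly like the following sketch (assuming the per-role parameters end up named scheduler_machine_type and worker_machine_type, as discussed in the review):

```python
from dask_cloudprovider.gcp import GCPCluster

# Existing behaviour: one machine type shared by the scheduler and the workers.
cluster = GCPCluster(machine_type="n1-standard-8")

# Proposed behaviour: different machine types per role, e.g. a small
# CPU-only scheduler and larger workers that carry the GPUs.
cluster = GCPCluster(
    scheduler_machine_type="n1-standard-2",
    worker_machine_type="n1-standard-16",
)
```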