@danielgafni danielgafni commented Oct 8, 2025

This PR implements cluster sharing for the `KubeRayCluster` Dagster resource. Cluster sharing allows a `RayCluster` created by a previously executed Dagster step to be reused by subsequent Dagster steps. This can dramatically speed up step setup, making it effectively instant when a matching cluster already exists.

Because KubeRay doesn't currently provide a TTL mechanism for `RayCluster`, this PR does two things:

  1. Shared cluster discovery: the `KubeRayCluster.cluster_sharing` config is responsible for matching existing `RayCluster` resources based on Kubernetes labels, both system labels (generated by Dagster and dagster-ray) and user-provided ones.
  2. A custom TTL mechanism based on Kubernetes annotations and a Dagster sensor:
    1. `KubeRayCluster` places `dagster/lock-<run-id>-<step-key>` annotations on the `RayCluster` resources targeted by the current Dagster step. The annotation value is a serialized `ClusterSharingLock` object with its creation time and TTL set.
    2. The `dagster_ray.kuberay.cleanup_expired_rayclusters` sensor monitors `RayCluster` resources and submits run requests for expired clusters using the `dagster_ray.kuberay.delete_kuberay_rayclusters` job.
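The two mechanisms above can be sketched in plain Python. This is an illustration only, not the code from this PR: the `Lock` dataclass is a hypothetical stand-in for `ClusterSharingLock` (its field names are assumed), the helper functions are invented for the example, and the rule that a cluster counts as expired only once every lock on it has expired is an assumption about how the sensor decides.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from datetime import datetime, timedelta, timezone

# Annotation key prefix, taken from the PR description.
LOCK_PREFIX = "dagster/lock-"


@dataclass(frozen=True)
class Lock:
    """Hypothetical stand-in for ClusterSharingLock (field names are assumed)."""

    created_at: str  # ISO 8601 timestamp with a UTC offset
    ttl_seconds: float

    def expires_at(self) -> datetime:
        created = datetime.fromisoformat(self.created_at)
        return created + timedelta(seconds=self.ttl_seconds)


def lock_key(run_id: str, step_key: str) -> str:
    """Build the annotation key dagster/lock-<run-id>-<step-key>."""
    return f"{LOCK_PREFIX}{run_id}-{step_key}"


def labels_match(cluster_labels: dict[str, str], required: dict[str, str]) -> bool:
    """Discovery sketch: every required label must be present with the same value."""
    return all(cluster_labels.get(k) == v for k, v in required.items())


def is_expired(annotations: dict[str, str], now: datetime) -> bool:
    """TTL sketch: expired when the cluster has locks and all of them are past their TTL.

    Assumption: a cluster without any lock annotations is left alone here.
    """
    locks = [
        Lock(**json.loads(value))
        for key, value in annotations.items()
        if key.startswith(LOCK_PREFIX)
    ]
    return bool(locks) and all(lock.expires_at() <= now for lock in locks)


# Example data: one stale lock and one lock that is still within its TTL.
now = datetime(2025, 10, 9, 12, 0, tzinfo=timezone.utc)
stale = Lock(created_at="2025-10-09T11:00:00+00:00", ttl_seconds=1800)
fresh = Lock(created_at="2025-10-09T11:50:00+00:00", ttl_seconds=1800)

only_stale = {lock_key("run1", "step_a"): json.dumps(asdict(stale))}
mixed = {**only_stale, lock_key("run2", "step_b"): json.dumps(asdict(fresh))}
```

With this data, `is_expired(only_stale, now)` is true because the single lock ran out at 11:30, while `is_expired(mixed, now)` is false because the second lock is valid until 12:20. Storing one annotation per step means each step can add or refresh its own lock without rewriting the locks of other steps.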

Resolve #131

danielgafni commented Oct 8, 2025

This stack of pull requests is managed by Graphite.

@danielgafni danielgafni force-pushed the 10-08-_sparkles_implement_cluster_sharing_for_kuberay_s_raycluster branch 7 times, most recently from 7e6224e to 2ccfb8d Compare October 8, 2025 21:05
@danielgafni danielgafni changed the title ✨ implement cluster sharing for KubeRay's RayCluster ✨ implement cluster sharing for KubeRayCluster Oct 8, 2025
@danielgafni danielgafni force-pushed the 10-08-_sparkles_implement_cluster_sharing_for_kuberay_s_raycluster branch 4 times, most recently from 2b6fffb to 64fe3a4 Compare October 9, 2025 07:19
@danielgafni danielgafni force-pushed the 10-08-_sparkles_implement_cluster_sharing_for_kuberay_s_raycluster branch 8 times, most recently from 4765c04 to a2e2f9b Compare October 9, 2025 12:48
@danielgafni danielgafni force-pushed the 10-08-_sparkles_implement_cluster_sharing_for_kuberay_s_raycluster branch from a2e2f9b to 5872d8f Compare October 9, 2025 13:04
@danielgafni danielgafni force-pushed the 10-08-_sparkles_implement_cluster_sharing_for_kuberay_s_raycluster branch from 5872d8f to d504e74 Compare October 9, 2025 13:32
@danielgafni danielgafni merged commit 41769a1 into master Oct 9, 2025
33 checks passed

Development

Successfully merging this pull request may close these issues.

Run RayCluster cleanup with a sensor
