diff --git a/CHANGELOG.md b/CHANGELOG.md
index fd3a0d0..ab7d54d 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -13,11 +13,11 @@ Learn more in [Cluster Sharing docs](tutorial/kuberay.md/#cluster-sharing).
 ### Added

 - `KubeRayCluster.cluster_sharing` parameter that controls cluster sharing behavior.
-- `dagster_ray.kuberay.sensors.cleanup_expired_kuberay_clusters` sensor that cleans up expired clusters (both shared and non-shared). Learn mode in [docs](api/kuberay.md#dagster_ray.kuberay.sensors.cleanup_expired_kuberay_clusters)
-- `dagster-ray` entry now appears in the Dagster libraries list in the web UI
+- `dagster_ray.kuberay.sensors.cleanup_expired_kuberay_clusters` sensor that cleans up expired clusters (both shared and non-shared). Learn more in [docs](api/kuberay.md#dagster_ray.kuberay.sensors.cleanup_expired_kuberay_clusters).
+- `dagster-ray` entry now appears in the Dagster libraries list in the web UI.

 ### Changed

-- [:bomb: breaking] - removed `cleanup_kuberay_clusters_op` and other associated definitions in favor of `dagster_ray.kuberay.sensors.cleanup_expired_kuberay_clusters` sensor that is more flexible
+- [:bomb: breaking] removed `cleanup_kuberay_clusters_op` and other associated definitions in favor of the more flexible `dagster_ray.kuberay.sensors.cleanup_expired_kuberay_clusters` sensor.

 ## 0.3.1
@@ -25,33 +25,33 @@
 - `failure_tolerance_timeout` configuration parameter for `KubeRayInteractiveJob` and `KubeRayCluster`. It can be set to a positive value to give the cluster some time to transition out of `failed` state (which can be transient in some scenarios) before raising an error.

 ### Fixes

-- ensure both `.head.serviceIP` and `.head.serviceName` are set on the `RayCluster` while waiting for cluster readiness
+- ensure both `.head.serviceIP` and `.head.serviceName` are set on the `RayCluster` while waiting for cluster readiness.

 ## 0.3.0

-This release includes massive docs improvements and drops support for Python 3.9
+This release includes massive docs improvements and drops support for Python 3.9.

 ### Changes

-- [:bomb: breaking] dropped Python 3.9 support (EOL October 2025)
-- [internal] most of the general, backend-agnostic code has been moved to `dagster_ray.core` (top-level imports still work)
+- [:bomb: breaking] dropped Python 3.9 support (EOL October 2025).
+- [internal] most of the general, backend-agnostic code has been moved to `dagster_ray.core` (top-level imports still work).

 ## 0.2.1

 ### Fixes

-- Fixed broken wheel on PyPI
+- Fixed broken wheel on PyPI.

 ## 0.2.0

 ### Changed

 - `KubeRayInteractiveJob.deletion_strategy` now defaults to `DeleteCluster` for both successful and failed executions. This is a reasonable default for the use case.
 - `KubeRayInteractiveJob.ttl_seconds_after_finished` now defaults to `600` seconds.
-- `KubeRayCluster.lifecycle.cleanup` now defaults to `always`
+- `KubeRayCluster.lifecycle.cleanup` now defaults to `always`.
 - [:bomb: breaking] `RayJob` and `RayCluster` clients and resources Kubernetes init parameters have been renamed to `kube_config` and `kube_context`.

 ### Added

-- `enable_legacy_debugger` configuration parameter to subclasses of `RayResource`
+- `enable_legacy_debugger` configuration parameter to subclasses of `RayResource`.
 - `on_exception` option for `lifecycle.cleanup` policy. It's triggered during resource setup/cleanup (including `KeyboardInterrupt`), but not by user `@op`/`@asset` code.
 - `KubeRayInteractiveJob` now respects `lifecycle.cleanup`. It defaults to `on_exception`. Users are advised to rely on built-in `RayJob` cleanup mechanisms, such as `ttlSecondsAfterFinished` and `deletionStrategy`.
@@ -64,7 +64,7 @@ This release includes massive docs improvements and drops support for Python 3.9
 - [:bomb: breaking] `RayResource`: top-level `skip_init` and `skip_setup` configuration parameters have been removed. The `lifecycle` field is the new way of configuring steps performed during resource initialization. `KubeRayCluster`'s `skip_cleanup` has been moved to `lifecycle` as well.
 - [:bomb: breaking] injected `dagster.io/run_id` Kubernetes label has been renamed to `dagster/run-id`. Keys starting with `dagster.io/` have been converted to `dagster/` to match how `dagster-k8s` does it.
 - [:bomb: breaking] `dagster_ray.kuberay` Configurations have been unified with KubeRay APIs.
-- `dagster-ray` now populates Kubernetes labels with more values (including some useful Dagster Cloud values such as `git-sha`)
+- `dagster-ray` now populates Kubernetes labels with more values (including some useful Dagster Cloud values such as `git-sha`).

 ### Added

 - `KubeRayInteractiveJob` -- a resource that utilizes the new `InteractiveMode` for `RayJob`. It can be used to connect to Ray in Client mode -- like `KubeRayCluster` -- but gives access to `RayJob` features, such as automatic cleanup (`ttlSecondsAfterFinished`), retries (`backoffLimit`) and timeouts (`activeDeadlineSeconds`).
diff --git a/docs/api/kuberay.md b/docs/api/kuberay.md
index 6dbd8ae..44b0a63 100644
--- a/docs/api/kuberay.md
+++ b/docs/api/kuberay.md
@@ -132,7 +132,7 @@ These resources initialize Ray client connection with a remote cluster.
 A Dagster sensor that monitors shared `RayCluster` resources created by the current Dagster [code location](https://docs.dagster.io/deployment/code-locations/managing-code-locations-with-definitions) (with a `dagster/code-location=` label selector) and submits jobs to delete clusters that either:

 - use [Cluster Sharing](../tutorial/kuberay.md#cluster-sharing) (`dagster/cluster-sharing=true`) and have expired
-- are older than `DAGSTER_RAY_CLUSTER_EXPIRATION_SECONDS` (defaults to 24 hours)
+- are older than `DAGSTER_RAY_CLUSTER_EXPIRATION_SECONDS` (defaults to 4 hours)

 By default it monitors the `ray` namespace. This can be configured by setting `DAGSTER_RAY_NAMESPACES` (accepts a comma-separated list of namespaces).
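Not part of the patch, but for reviewers: a minimal sketch of how the new sensor could be registered in a code location, assuming `cleanup_expired_kuberay_clusters` is an ordinary Dagster sensor definition (as its docstring states). The `Definitions` wiring and the environment values in the comments are illustrative, not taken from this diff:

```python
import dagster as dg

# Module path comes from this diff (src/dagster_ray/kuberay/sensors.py).
from dagster_ray.kuberay.sensors import cleanup_expired_kuberay_clusters

# Optional environment overrides; names come from src/dagster_ray/configs.py,
# values here are examples:
#   DAGSTER_RAY_NAMESPACES="ray,ray-staging"        -> namespaces to monitor
#   DAGSTER_RAY_CLUSTER_EXPIRATION_SECONDS="14400"  -> 4 * 60 * 60, the new default

# Register the sensor with the code location so it can find and delete expired
# RayClusters carrying this location's dagster/code-location= label.
defs = dg.Definitions(sensors=[cleanup_expired_kuberay_clusters])
```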
diff --git a/docs/tutorial/kuberay.md b/docs/tutorial/kuberay.md
index 874d6dc..9af1d63 100644
--- a/docs/tutorial/kuberay.md
+++ b/docs/tutorial/kuberay.md
@@ -144,7 +144,7 @@ from dagster_ray.kuberay.configs import RayClusterConfig, ClusterSharing

 ray_cluster = KubeRayCluster(
     ray_cluster=RayClusterConfig(
-        cluster_sharing=ClusterSharing(enabled=True, ttl_seconds=3600)
+        cluster_sharing=ClusterSharing(enabled=True, ttl_seconds=30 * 60)
     )
 )
 ```
diff --git a/src/dagster_ray/configs.py b/src/dagster_ray/configs.py
index c5c7b6b..a1366a6 100644
--- a/src/dagster_ray/configs.py
+++ b/src/dagster_ray/configs.py
@@ -10,7 +10,7 @@ DAGSTER_RAY_NAMESPACES_ENV_VAR = "DAGSTER_RAY_NAMESPACES"
 DAGSTER_RAY_NAMESPACES_DEFAULT_VALUE = "ray"

 DAGSTER_RAY_CLUSTER_EXPIRATION_SECONDS_ENV_VAR = "DAGSTER_RAY_CLUSTER_EXPIRATION_SECONDS"
-DAGSTER_RAY_CLUSTER_EXPIRATION_SECONDS_DEFAULT_VALUE = str(24 * 60 * 60)
+DAGSTER_RAY_CLUSTER_EXPIRATION_SECONDS_DEFAULT_VALUE = str(4 * 60 * 60)


 class Lifecycle(dg.Config):
diff --git a/src/dagster_ray/kuberay/configs.py b/src/dagster_ray/kuberay/configs.py
index 6d59f63..5f454bc 100644
--- a/src/dagster_ray/kuberay/configs.py
+++ b/src/dagster_ray/kuberay/configs.py
@@ -309,9 +309,6 @@ class MatchDagsterLabels(dg.Config):
     )


-DEFAULT_CLUSTER_SHARING_TTL_SECONDS = 60 * 60.0
-
-
 class ClusterSharing(dg.Config):
     """Defines the strategy for sharing `RayCluster` resources with other Dagster steps.
@@ -325,6 +322,5 @@
         default=None, description="Additional user-provided labels to match on."
     )
     ttl_seconds: float = Field(
-        default=DEFAULT_CLUSTER_SHARING_TTL_SECONDS,
         description="Time to live for the lock placed on the `RayCluster` resource, marking it as in use by the current Dagster step.",
     )
diff --git a/src/dagster_ray/kuberay/resources/raycluster.py b/src/dagster_ray/kuberay/resources/raycluster.py
index 5898d23..3ccfb69 100644
--- a/src/dagster_ray/kuberay/resources/raycluster.py
+++ b/src/dagster_ray/kuberay/resources/raycluster.py
@@ -45,7 +45,7 @@ class KubeRayCluster(BaseKubeRayResource):
     )

     cluster_sharing: ClusterSharing = Field(
-        default_factory=ClusterSharing,
+        default_factory=lambda: ClusterSharing(enabled=False, ttl_seconds=15 * 60),
         description="Configuration for sharing the `RayCluster` across Dagster steps. Existing clusters matching this configuration will be reused without recreating them. A `dagster/sharing=true` label will be applied to the `RayCluster`, and a `dagster/lock--=` annotation will be placed on the `RayCluster` to mark it as being used by this step. Cleanup will only proceed if the `RayCluster` is not being used by any other steps; cluster sharing should therefore be used in conjunction with the [dagster_ray.kuberay.sensors.cleanup_expired_kuberay_clusters][] sensor.",
     )
diff --git a/src/dagster_ray/kuberay/sensors.py b/src/dagster_ray/kuberay/sensors.py
index 139ac4f..b32b2e6 100644
--- a/src/dagster_ray/kuberay/sensors.py
+++ b/src/dagster_ray/kuberay/sensors.py
@@ -23,7 +23,7 @@ def cleanup_expired_kuberay_clusters(
 ) -> Generator[dg.RunRequest | dg.SkipReason, None, None]:
     f"""A Dagster sensor that monitors shared `RayCluster` resources created by the current code location and submits jobs to delete clusters that either:

     - use [Cluster Sharing](../tutorial/#cluster-sharing) (`dagster/cluster-sharing=true`) and have expired
-    - are older than `DAGSTER_RAY_CLUSTER_EXPIRATION_SECONDS` (defaults to 24 hours)
+    - are older than `DAGSTER_RAY_CLUSTER_EXPIRATION_SECONDS` (defaults to 4 hours)

     By default it monitors the `ray` namespace. This can be configured by setting `{DAGSTER_RAY_NAMESPACES_ENV_VAR}` (accepts a comma-separated list of namespaces)."""
     assert context.code_location_origin is not None
diff --git a/tests/kuberay/test_raycluster.py b/tests/kuberay/test_raycluster.py
index ccda169..dfa45b0 100644
--- a/tests/kuberay/test_raycluster.py
+++ b/tests/kuberay/test_raycluster.py
@@ -483,7 +483,7 @@ def test_cluster_sharing(
             spec=RayClusterSpec(head_group_spec=head_group_spec, worker_group_specs=worker_group_specs),
         ),
         redis_port=get_random_free_port(),
-        cluster_sharing=ClusterSharing(enabled=True),
+        cluster_sharing=ClusterSharing(enabled=True, ttl_seconds=10 * 60),
     )

     @dg.asset
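Putting the pieces together, end-to-end usage could look like the sketch below. It mirrors the test above (note that the tutorial hunk nests `cluster_sharing` inside `RayClusterConfig`, while the resource definition and the test pass it directly to `KubeRayCluster`; the sketch follows the latter). The `KubeRayCluster`/`RayResource` import paths, asset bodies, and `Definitions` wiring are assumptions, not taken from this diff:

```python
import dagster as dg

from dagster_ray import RayResource  # assumed top-level export
from dagster_ray.kuberay import KubeRayCluster  # assumed import path
from dagster_ray.kuberay.configs import ClusterSharing

# ttl_seconds is now required, since this diff removes its default.
ray_cluster = KubeRayCluster(
    cluster_sharing=ClusterSharing(enabled=True, ttl_seconds=10 * 60),
)

@dg.asset
def first_asset(ray_cluster: RayResource) -> None:
    # First consumer: creates the shared RayCluster and locks it for this step.
    ...

@dg.asset(deps=[first_asset])
def second_asset(ray_cluster: RayResource) -> None:
    # Later consumer: matches the existing cluster and reuses it.
    ...

defs = dg.Definitions(
    assets=[first_asset, second_asset],
    resources={"ray_cluster": ray_cluster},
    # Shared clusters are only deleted once unlocked, so pair this with the
    # cleanup_expired_kuberay_clusters sensor (see the sensors.py diff above).
)
```

Because `ClusterSharing.ttl_seconds` loses its default, every instantiation must now pass it explicitly; that is why both the tutorial snippet and the test in this diff gain a `ttl_seconds=...` argument.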