22 changes: 11 additions & 11 deletions CHANGELOG.md
@@ -13,45 +13,45 @@ Learn more in [Cluster Sharing docs](tutorial/kuberay.md/#cluster-sharing).

### Added
- `KubeRayCluster.cluster_sharing` parameter that controls cluster sharing behavior.
-- `dagster_ray.kuberay.sensors.cleanup_expired_kuberay_clusters` sensor that cleans up expired clusters (both shared and non-shared). Learn mode in [docs](api/kuberay.md#dagster_ray.kuberay.sensors.cleanup_expired_kuberay_clusters)
-- `dagster-ray` entry now appears in the Dagster libraries list in the web UI
+- `dagster_ray.kuberay.sensors.cleanup_expired_kuberay_clusters` sensor that cleans up expired clusters (both shared and non-shared). Learn more in [docs](api/kuberay.md#dagster_ray.kuberay.sensors.cleanup_expired_kuberay_clusters).
+- `dagster-ray` entry now appears in the Dagster libraries list in the web UI.

### Changed
-- [:bomb: breaking] - removed `cleanup_kuberay_clusters_op` and other associated definitions in favor of `dagster_ray.kuberay.sensors.cleanup_expired_kuberay_clusters` sensor that is more flexible
+- [:bomb: breaking] - removed `cleanup_kuberay_clusters_op` and other associated definitions in favor of `dagster_ray.kuberay.sensors.cleanup_expired_kuberay_clusters` sensor that is more flexible.

## 0.3.1

### Added
- `failure_tolerance_timeout` configuration parameter for `KubeRayInteractiveJob` and `KubeRayCluster`. It can be set to a positive value to give the cluster some time to transition out of `failed` state (which can be transient in some scenarios) before raising an error.

### Fixes
-- ensure both `.head.serviceIP` and `.head.serviceName` are set on the `RayCluster` while waiting for cluster readiness
+- ensure both `.head.serviceIP` and `.head.serviceName` are set on the `RayCluster` while waiting for cluster readiness.

## 0.3.0

-This release includes massive docs improvements and drops support for Python 3.9
+This release includes massive docs improvements and drops support for Python 3.9.

### Changes

-- [:bomb: breaking] dropped Python 3.9 support (EOL October 2025)
-- [internal] most of the general, backend-agnostic code has been moved to `dagster_ray.core` (top-level imports still work)
+- [:bomb: breaking] dropped Python 3.9 support (EOL October 2025).
+- [internal] most of the general, backend-agnostic code has been moved to `dagster_ray.core` (top-level imports still work).

## 0.2.1

### Fixes

-- Fixed broken wheel on PyPI
+- Fixed broken wheel on PyPI.

## 0.2.0

### Changed
- `KubeRayInteractiveJob.deletion_strategy` now defaults to `DeleteCluster` for both successful and failed executions. This is a reasonable default for the use case.
- `KubeRayInteractiveJob.ttl_seconds_after_finished` now defaults to `600` seconds.
-- `KubeRayCluster.lifecycle.cleanup` now defaults to `always`
+- `KubeRayCluster.lifecycle.cleanup` now defaults to `always`.
- [:bomb: breaking] `RayJob` and `RayCluster` clients and resources Kubernetes init parameters have been renamed to `kube_config` and `kube_context`.

### Added
-- `enable_legacy_debugger` configuration parameter to subclasses of `RayResource`
+- `enable_legacy_debugger` configuration parameter to subclasses of `RayResource`.
- `on_exception` option for `lifecycle.cleanup` policy. It's triggered during resource setup/cleanup (including `KeyboardInterrupt`), but not by user `@op`/`@asset` code.
- `KubeRayInteractiveJob` now respects `lifecycle.cleanup`. It defaults to `on_exception`. Users are advised to rely on built-in `RayJob` cleanup mechanisms, such as `ttlSecondsAfterFinished` and `deletionStrategy`.

@@ -64,7 +64,7 @@ This release includes massive docs improvements and drops support for Python 3.9
- [:bomb: breaking] `RayResource`: top-level `skip_init` and `skip_setup` configuration parameters have been removed. The `lifecycle` field is the new way of configuring steps performed during resource initialization. `KubeRayCluster`'s `skip_cleanup` has been moved to `lifecycle` as well.
- [:bomb: breaking] injected `dagster.io/run_id` Kubernetes label has been renamed to `dagster/run-id`. Keys starting with `dagster.io/` have been converted to `dagster/` to match how `dagster-k8s` does it.
- [:bomb: breaking] `dagster_ray.kuberay` Configurations have been unified with KubeRay APIs.
-- `dagster-ray` now populates Kubernetes labels with more values (including some useful Dagster Cloud values such as `git-sha`)
+- `dagster-ray` now populates Kubernetes labels with more values (including some useful Dagster Cloud values such as `git-sha`).

### Added
- `KubeRayInteractiveJob` -- a resource that utilizes the new `InteractiveMode` for `RayJob`. It can be used to connect to Ray in Client mode -- like `KubeRayCluster` -- but gives access to `RayJob` features, such as automatic cleanup (`ttlSecondsAfterFinished`), retries (`backoffLimit`) and timeouts (`activeDeadlineSeconds`).
2 changes: 1 addition & 1 deletion docs/api/kuberay.md
@@ -132,7 +132,7 @@ These resources initialize Ray client connection with a remote cluster.

A Dagster sensor that monitors shared `RayCluster` resources created by the current Dagster [code location](https://docs.dagster.io/deployment/code-locations/managing-code-locations-with-definitions) (with a `dagster/code-location=<current-code-location>` label selector) and submits jobs to delete clusters that either:
- use [Cluster Sharing](../tutorial/kuberay.md#cluster-sharing) (`dagster/cluster-sharing=true`) and have expired
-- are older than `DAGSTER_RAY_CLUSTER_EXPIRATION_SECONDS` (defaults to 24 hours)
+- are older than `DAGSTER_RAY_CLUSTER_EXPIRATION_SECONDS` (defaults to 4 hours)

By default it monitors the `ray` namespace. This can be configured by setting `DAGSTER_RAY_NAMESPACES` (accepts a comma-separated list of namespaces).
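
A minimal sketch of how the two environment variables described above might be interpreted, assuming the new 4-hour default from this change. The helper names (`monitored_namespaces`, `expiration`, `is_older_than_expiration`) are hypothetical and not part of `dagster-ray`; the sensor's actual implementation is not shown in this diff.

```python
import os
from datetime import datetime, timedelta, timezone

# Variable names and defaults mirror the constants in src/dagster_ray/configs.py.
NAMESPACES_ENV_VAR = "DAGSTER_RAY_NAMESPACES"
NAMESPACES_DEFAULT = "ray"
EXPIRATION_ENV_VAR = "DAGSTER_RAY_CLUSTER_EXPIRATION_SECONDS"
EXPIRATION_DEFAULT = str(4 * 60 * 60)


def monitored_namespaces(env=None):
    """Hypothetical helper: split the comma-separated namespace list."""
    env = os.environ if env is None else env
    raw = env.get(NAMESPACES_ENV_VAR, NAMESPACES_DEFAULT)
    return [ns.strip() for ns in raw.split(",") if ns.strip()]


def expiration(env=None):
    """Hypothetical helper: read the expiration threshold as a timedelta."""
    env = os.environ if env is None else env
    return timedelta(seconds=float(env.get(EXPIRATION_ENV_VAR, EXPIRATION_DEFAULT)))


def is_older_than_expiration(created_at, now, env=None):
    """Hypothetical helper: True if the cluster's age exceeds the threshold."""
    return now - created_at > expiration(env)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(monitored_namespaces({NAMESPACES_ENV_VAR: "ray, batch"}))
    print(is_older_than_expiration(now - timedelta(hours=5), now, {}))
```

Passing `env` explicitly keeps the helpers easy to test without mutating the process environment.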

2 changes: 1 addition & 1 deletion docs/tutorial/kuberay.md
@@ -144,7 +144,7 @@ from dagster_ray.kuberay.configs import RayClusterConfig, ClusterSharing

ray_cluster = KubeRayCluster(
    ray_cluster=RayClusterConfig(
-        cluster_sharing=ClusterSharing(enabled=True, ttl_seconds=3600)
+        cluster_sharing=ClusterSharing(enabled=True, ttl_seconds=30 * 60)
    )
)
```
2 changes: 1 addition & 1 deletion src/dagster_ray/configs.py
@@ -10,7 +10,7 @@
DAGSTER_RAY_NAMESPACES_ENV_VAR = "DAGSTER_RAY_NAMESPACES"
DAGSTER_RAY_NAMESPACES_DEFAULT_VALUE = "ray"
DAGSTER_RAY_CLUSTER_EXPIRATION_SECONDS_ENV_VAR = "DAGSTER_RAY_CLUSTER_EXPIRATION_SECONDS"
-DAGSTER_RAY_CLUSTER_EXPIRATION_SECONDS_DEFAULT_VALUE = str(24 * 60 * 60)
+DAGSTER_RAY_CLUSTER_EXPIRATION_SECONDS_DEFAULT_VALUE = str(4 * 60 * 60)


class Lifecycle(dg.Config):
4 changes: 0 additions & 4 deletions src/dagster_ray/kuberay/configs.py
@@ -309,9 +309,6 @@ class MatchDagsterLabels(dg.Config):
)


-DEFAULT_CLUSTER_SHARING_TTL_SECONDS = 60 * 60.0


class ClusterSharing(dg.Config):
"""Defines the strategy for sharing `RayCluster` resources with other Dagster steps.

@@ -325,6 +322,5 @@ class ClusterSharing(dg.Config):
default=None, description="Additional user-provided labels to match on."
)
ttl_seconds: float = Field(
-default=DEFAULT_CLUSTER_SHARING_TTL_SECONDS,
description="Time to live for the lock placed on the `RayCluster` resource, marking it as in use by the current Dagster step.",
)
2 changes: 1 addition & 1 deletion src/dagster_ray/kuberay/resources/raycluster.py
@@ -45,7 +45,7 @@ class KubeRayCluster(BaseKubeRayResource):
)

cluster_sharing: ClusterSharing = Field(
-default_factory=ClusterSharing,
+default_factory=lambda: ClusterSharing(enabled=False, ttl_seconds=15 * 60),
description="Configuration for sharing the `RayCluster` across Dagster steps. Existing clusters matching this configuration will be reused without recreating them. A `dagster/sharing=true` label will be applied to the `RayCluster`, and a `dagster/lock-<run-id>-<step-id>=<lock>` annotation will be placed on the `RayCluster` to mark it as being used by this step. Cleanup will only proceed if the `RayCluster` is not being used by any other steps, therefore cluster sharing should be used in conjunction with [dagster_ray.kuberay.sensors.cleanup_expired_kuberay_clusters][] sensor.",
)
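
With `default=` removed from `ClusterSharing.ttl_seconds` in this change, the field presumably becomes required, so a no-argument `default_factory=ClusterSharing` would no longer be able to build a value; that appears to be why the factory becomes a lambda. The sketch below illustrates the same pattern with standard-library dataclasses as stand-ins. Both classes are simplified assumptions for illustration, not the real Pydantic-based `dg.Config` models.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterSharing:
    # Stand-in for the real config: ttl_seconds has no default, so it is required.
    ttl_seconds: float
    enabled: bool = False


@dataclass
class KubeRayCluster:
    # The factory must supply ttl_seconds explicitly, since calling
    # ClusterSharing() with no arguments would raise TypeError.
    cluster_sharing: ClusterSharing = field(
        default_factory=lambda: ClusterSharing(enabled=False, ttl_seconds=15 * 60)
    )


if __name__ == "__main__":
    # Each instance gets its own freshly built default.
    print(KubeRayCluster().cluster_sharing)
    # An explicit value overrides the factory.
    print(KubeRayCluster(cluster_sharing=ClusterSharing(ttl_seconds=600, enabled=True)))
```

Using a factory rather than a shared default instance also means each resource gets its own `ClusterSharing` object.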

2 changes: 1 addition & 1 deletion src/dagster_ray/kuberay/sensors.py
@@ -23,7 +23,7 @@ def cleanup_expired_kuberay_clusters(
) -> Generator[dg.RunRequest | dg.SkipReason, None, None]:
f"""A Dagster sensor that monitors shared `RayCluster` resources created by the current code location and submits jobs to delete clusters that either:
- use [Cluster Sharing](../tutorial/#cluster-sharing) (`dagster/cluster-sharing=true`) and have expired
-- are older than `DAGSTER_RAY_CLUSTER_EXPIRATION_SECONDS` (defaults to 24 hours)
+- are older than `DAGSTER_RAY_CLUSTER_EXPIRATION_SECONDS` (defaults to 4 hours)

By default it monitors the `ray` namespace. This can be configured by setting `{DAGSTER_RAY_NAMESPACES_ENV_VAR}` (accepts a comma-separated list of namespaces)."""
assert context.code_location_origin is not None
2 changes: 1 addition & 1 deletion tests/kuberay/test_raycluster.py
@@ -483,7 +483,7 @@ def test_cluster_sharing(
spec=RayClusterSpec(head_group_spec=head_group_spec, worker_group_specs=worker_group_specs),
),
redis_port=get_random_free_port(),
-cluster_sharing=ClusterSharing(enabled=True),
+cluster_sharing=ClusterSharing(enabled=True, ttl_seconds=10 * 60),
)

@dg.asset