Merged
15 changes: 2 additions & 13 deletions CHANGELOG.md
@@ -7,20 +7,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## Unreleased

This release introduces a new feature that is very useful in dev and staging environments: **Cluster Sharing**. Cluster sharing allows reusing existing `RayCluster` resources created by previous Dagster steps. It is implemented for the `KubeRayCluster` Dagster resource. This feature enables faster iteration and reduces infrastructure costs (at the expense of job isolation), so `KubeRayCluster` is now recommended over `KubeRayInteractiveJob` for **dev** environments. By default it selects existing Ray clusters based on the following labels:

- `dagster/code-location`
- `dagster/git-sha`
- `dagster/resource-key`

This feature is opt-in and can be enabled with:


```py
from dagster_ray.kuberay import KubeRayCluster, ClusterSharing

ray_cluster = KubeRayCluster(cluster_sharing=ClusterSharing(enabled=True))
```
Learn more in the [Cluster Sharing docs](tutorial/kuberay.md#cluster-sharing).

### Added
- `KubeRayCluster.cluster_sharing` parameter that controls cluster sharing behavior.
1 change: 1 addition & 0 deletions README.md
@@ -26,6 +26,7 @@ Learn more in the [docs](https://danielgafni.github.io/dagster-ray)
- **📡 Dagster Pipes Integration**: Submit external scripts as Ray jobs, stream back logs and rich Dagster metadata
- **☸️ KubeRay Support**: Utilize `RayJob` and `RayCluster` custom resources in client or job submission mode ([tutorial](tutorial/kuberay.md))
- **🏭 Production Ready**: Tested against a matrix of core dependencies, integrated with Dagster+
- **⚡ Instant Startup**: Leverage `RayCluster` with cluster sharing for lightning-fast development cycles with zero cold start times

## Installation

25 changes: 25 additions & 0 deletions docs/api/kuberay.md
@@ -23,6 +23,7 @@ These resources initialize Ray client connection with a remote cluster.
options:
members:
- "__init__"
- "cluster_sharing"
- "lifecycle"
- "ray_cluster"
- "client"
@@ -90,6 +91,16 @@
inherited_members: true
members: true

::: dagster_ray.kuberay.configs.MatchDagsterLabels
options:
inherited_members: true
members: true

::: dagster_ray.kuberay.configs.ClusterSharing
options:
inherited_members: true
members: true

---

::: dagster_ray.kuberay.resources.base.BaseKubeRayResourceConfig
@@ -115,6 +126,20 @@ These resources initialize Ray client connection with a remote cluster.

---

## Sensors

::: dagster_ray.kuberay.sensors.cleanup_expired_kuberay_clusters

A Dagster sensor that monitors shared `RayCluster` resources created by the current code location and submits jobs to delete clusters that have expired.

Selects clusters based on the following labels:
- `dagster/cluster-sharing=true`
- `dagster/code-location=<current-code-location>`
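For illustration (the `my-location` value is hypothetical), these labels correspond to a Kubernetes label selector such as:

```python
# Build the equivalent Kubernetes label selector string for the
# clusters this sensor watches ("my-location" is a hypothetical value).
labels = {
    "dagster/cluster-sharing": "true",
    "dagster/code-location": "my-location",
}
selector = ",".join(f"{key}={value}" for key, value in labels.items())
print(selector)  # dagster/cluster-sharing=true,dagster/code-location=my-location
```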

By default, it monitors the `ray` namespace. This can be configured by setting the `DAGSTER_RAY_NAMESPACES` environment variable, which accepts a comma-separated list of namespaces.
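As an illustration (the parsing details here are an assumption, not taken from the `dagster-ray` source), a comma-separated value like `ray,ray-dev` would be split into individual namespaces:

```python
import os

# Hypothetical example value; DAGSTER_RAY_NAMESPACES accepts a
# comma-separated list of Kubernetes namespaces to monitor.
os.environ["DAGSTER_RAY_NAMESPACES"] = "ray,ray-dev"

namespaces = [
    ns.strip()
    for ns in os.environ["DAGSTER_RAY_NAMESPACES"].split(",")
    if ns.strip()
]
print(namespaces)  # ['ray', 'ray-dev']
```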

---

## Kubernetes API Clients

::: dagster_ray.kuberay.client.RayClusterClient
4 changes: 3 additions & 1 deletion docs/index.md
@@ -26,6 +26,7 @@
- **📡 Dagster Pipes Integration**: Submit external scripts as Ray jobs, stream back logs and rich Dagster metadata
- **☸️ KubeRay Support**: Utilize `RayJob` and `RayCluster` custom resources in client or job submission mode ([tutorial](tutorial/kuberay.md))
- **🏭 Production Ready**: Tested against a matrix of core dependencies, integrated with Dagster+
- **⚡ Instant Startup**: Leverage `RayCluster` with cluster sharing for lightning-fast development cycles with zero cold start times

## ⚡ Quick Start

@@ -142,7 +143,8 @@ Learn more by reading the [tutorials](tutorial/index.md).

`dagster-ray` supports running Ray on Kubernetes with [KubeRay](https://docs.ray.io/en/latest/cluster/kubernetes/index.html).

- Use [`KubeRayInteractiveJob`](api/kuberay.md#dagster_ray.kuberay.KubeRayInteractiveJob) to create a `RayJob` and connect in client mode
- Use [`KubeRayInteractiveJob`](api/kuberay.md#dagster_ray.kuberay.KubeRayInteractiveJob) to create a `RayJob` and connect in client mode (recommended for production)
- Use [`KubeRayCluster`](api/kuberay.md#dagster_ray.kuberay.KubeRayCluster) with cluster sharing enabled to create a new `RayCluster` or immediately connect to an existing one (recommended for dev environments)
- Use [`PipesKubeRayJobClient`](api/kuberay.md#dagster_ray.kuberay.PipesKubeRayJobClient) to submit external scripts as `RayJob`

!!! tip
49 changes: 39 additions & 10 deletions docs/tutorial/kuberay.md
@@ -124,25 +124,54 @@ ray_cluster = KubeRayInteractiveJob(
)
```

## KubeRayCluster

While [`KubeRayInteractiveJob`](../api/kuberay.md#dagster_ray.kuberay.KubeRayInteractiveJob) is recommended for production environments, [`KubeRayCluster`](../api/kuberay.md#dagster_ray.kuberay.KubeRayCluster) might be a better alternative for dev environments.

Unlike `KubeRayInteractiveJob`, which can outsource garbage collection to the KubeRay controller, `KubeRayCluster` is entirely responsible for cluster management. This is a drawback in production environments (it may leave dangling `RayCluster` instances if the Dagster step pod fails unexpectedly), but an advantage in dev environments, because it allows `dagster-ray` to implement **cluster sharing**.

### Cluster Sharing

With cluster sharing, `dagster-ray` can reuse existing `RayCluster` instances left from previous Dagster steps, making `KubeRayCluster` startup immediate.

Therefore, `KubeRayCluster` is a good choice for dev environments: it speeds up iteration cycles and reduces infrastructure costs, at the expense of job isolation and stability.

Cluster sharing has to be enabled explicitly.

```python
from dagster_ray.kuberay import KubeRayCluster
from dagster_ray.kuberay.configs import RayClusterConfig, ClusterSharing

ray_cluster = KubeRayCluster(
    ray_cluster=RayClusterConfig(...),
    cluster_sharing=ClusterSharing(enabled=True, ttl_seconds=3600),
)
```

When enabled, `dagster-ray` uses user-provided and Dagster-generated labels to select a suitable cluster from the existing ones. By default, `dagster-ray` matches on the following labels:

- `dagster/cluster-sharing`
- `dagster/code-location`
- `dagster/git-sha`
- `dagster/resource-key`

Configuration options for cluster sharing can be found [here](../api/kuberay.md#dagster_ray.kuberay.KubeRayCluster.cluster_sharing).
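To illustrate the mechanism (a conceptual sketch, not `dagster-ray`'s actual implementation; the label values are hypothetical), selection boils down to finding an existing cluster whose labels contain the required set:

```python
# Conceptual sketch of label-based cluster selection. A cluster is
# reusable only if every required label matches exactly.
from typing import Optional

required_labels = {
    "dagster/cluster-sharing": "true",
    "dagster/code-location": "my-location",  # hypothetical values
    "dagster/git-sha": "abc1234",
    "dagster/resource-key": "ray_cluster",
}

def select_shared_cluster(clusters: list[dict]) -> Optional[dict]:
    """Return the first cluster whose labels satisfy all requirements."""
    for cluster in clusters:
        labels = cluster.get("metadata", {}).get("labels", {})
        if all(labels.get(key) == value for key, value in required_labels.items()):
            return cluster
    return None
```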

### Shared Clusters Garbage Collection

If a `RayCluster` has active locks from other Dagster steps, the `KubeRayCluster` resource that manages it won't clean it up, so such clusters will remain after the Dagster step completes. `dagster-ray` provides a sensor to perform garbage collection on these clusters once the locks on them finally expire:

```py
import dagster as dg
from dagster_ray.kuberay import cleanup_expired_kuberay_clusters

defs = dg.Definitions(
    sensors=[cleanup_expired_kuberay_clusters],
```

The sensor monitors clusters with the `dagster/cluster-sharing=true` label and a matching `dagster/code-location` label. It's recommended to deploy this sensor when cluster sharing is enabled.
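The expiry logic can be sketched roughly as follows (an illustration under assumptions; the real sensor's bookkeeping lives in `dagster-ray` and may differ):

```python
# Illustrative TTL-based expiry check for a shared cluster.
from datetime import datetime, timedelta, timezone
from typing import Optional

def is_expired(created_at: datetime, ttl_seconds: int, now: Optional[datetime] = None) -> bool:
    """A cluster expires once ttl_seconds have elapsed since creation."""
    now = now or datetime.now(timezone.utc)
    return now >= created_at + timedelta(seconds=ttl_seconds)

created = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
check = datetime(2024, 1, 1, 14, 0, tzinfo=timezone.utc)  # two hours after creation
print(is_expired(created, ttl_seconds=3600, now=check))   # True: 1h TTL has elapsed
print(is_expired(created, ttl_seconds=10800, now=check))  # False: 3h TTL has not
```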

## PipesKubeRayJobClient
