Geneva automatically packages and deploys your Python execution environment to its worker nodes. This ensures that distributed execution runs with the same environment and dependencies as your prototype.

We currently support one processing backend: **Ray**. There are three ways to connect to a Ray cluster:

1. Local Ray: auto-create a Ray cluster on your machine.
2. KubeRay: create a cluster on demand in your Kubernetes cluster.
3. Existing Ray cluster: connect to a Ray cluster you already run.

## Ray Clusters

### Local Ray

To execute jobs without an external Ray cluster, just trigger the `Table.backfill` method. This auto-creates a Ray cluster on your machine. Because it runs on your laptop or desktop, it is only suitable for prototyping on small datasets, but it is the easiest way to get started. Simply define the UDF, add a column, and trigger the job.
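Here is a minimal sketch (the original example is elided in this diff; the `udf` decorator, the `add_columns` registration, and the `xy_product` column are assumptions based on other Geneva examples):

<CodeGroup>
```python Python icon="python"
import geneva
from geneva import udf  # assumed import path for Geneva's UDF decorator

db = geneva.connect(my_db_uri)
tbl = db.get_table("my_table")

# a simple scalar UDF; types come from the annotations
@udf
def xy_product(x: int, y: int) -> int:
    return x * y

# register the UDF as a virtual column, then backfill it on a local Ray cluster
tbl.add_columns({"xy_product": xy_product})
tbl.backfill("xy_product")
```
</CodeGroup>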
Geneva will package up your local environment and send it to each worker node, so they'll have access to all the same dependencies as if you ran a simple Python script yourself.

### KubeRay

If you have a Kubernetes cluster with the kuberay-operator installed, Geneva can automatically provision RayClusters for you. To do so, define a Geneva cluster describing the resource needs, Docker images, and other Ray configuration necessary to run your job. Make sure your Kubernetes cluster has adequate compute resources to provision the RayCluster. Here is an example Geneva cluster definition:

<CodeGroup>
```python Python icon="python"
import sys
import ray
import geneva
from geneva.cluster.builder import GenevaClusterBuilder
from geneva.cluster import K8sConfigMethod
from geneva.runners.ray.raycluster import get_ray_image

# the connection and version setup are elided in the source diff; this is an
# assumed reconstruction based on the imports and the later define_cluster call
db = geneva.connect(my_db_uri)
ray_version = ray.__version__
python_version = f"{sys.version_info.major}.{sys.version_info.minor}"

cluster_name = "my-geneva-cluster"  # lowercase, numbers, hyphens only
service_account = "my_k8s_service_account"  # k8s service account Geneva runs as
k8s_namespace = "geneva"  # k8s namespace

cluster = (
    GenevaClusterBuilder()
    .name(cluster_name)
    .namespace(k8s_namespace)
    .portforwarding(True)  # required for kuberay to expose ray ports
    .aws_config(region="us-east-1")  # only required if using AWS
    .config_method(K8sConfigMethod.LOCAL)  # load k8s config from `~/.kube/config`
    # (other options include EKS_AUTH to load from AWS EKS, or IN_CLUSTER to load the
    # config when running inside a pod in the cluster)
    .head_group(
        service_account=service_account,
        cpus=2,
        node_selector={"geneva.lancedb.com/ray-head": ""},  # k8s label required for head in your cluster
    )
    .add_cpu_worker_group(
        cpus=4,
        memory="8Gi",
        service_account=service_account,
        node_selector={"geneva.lancedb.com/ray-worker-cpu": ""},  # k8s label for cpu worker in your cluster
    )
    .add_gpu_worker_group(
        cpus=2,
        memory="8Gi",
        gpus=1,
        service_account=service_account,
        image=get_ray_image(ray_version, python_version, gpu=True),  # note the GPU image here
        node_selector={"geneva.lancedb.com/ray-worker-gpu": ""},  # k8s label for gpu worker in your cluster
    )
    .build()
)

# define_cluster stores the cluster metadata in persistent storage. The cluster can
# then be referenced by name and provisioned when creating an execution context.
db.define_cluster(cluster_name, cluster)

table = db.get_table("my_table")
with db.context(cluster=cluster_name):
    table.backfill("my_udf")
```
</CodeGroup>

See [the API docs](https://lancedb.github.io/geneva/api/cluster/) for all the parameters `GenevaClusterBuilder` can use.

### External Ray cluster
If you already have a Ray cluster, Geneva can execute jobs against it too. To do so, define a Geneva cluster that records the address of your cluster. Here's an example:

<CodeGroup>
```python Python icon="python"
import geneva
from geneva.cluster.builder import GenevaClusterBuilder
from geneva.cluster import GenevaClusterType

db = geneva.connect(my_db_uri)
cluster_name = "my-geneva-external-cluster"

cluster = (
    GenevaClusterBuilder()
    .name(cluster_name)
    .cluster_type(GenevaClusterType.EXTERNAL_RAY)
    .ray_address("ray://my_ip:my_port")
    .portforwarding(False)  # this must be False when using an external Ray cluster
    .build()
)
db.define_cluster(cluster_name, cluster)
```
</CodeGroup>
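You can then run jobs against this cluster by referencing it by name (a sketch, reusing the `db.context` pattern shown in the KubeRay example; the table and column names are assumed):

<CodeGroup>
```python Python icon="python"
tbl = db.get_table("my_table")

# no cluster is provisioned here; Geneva connects to the external Ray address
with db.context(cluster=cluster_name):
    tbl.backfill("my_udf")
```
</CodeGroup>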
## Dependencies

Most UDFs require some dependencies: helper libraries like `pillow` for image processing, pre-trained models like `open-clip` to calculate embeddings, or even small config files. We have two ways to get them to workers:
1. Use defaults
2. Define a manifest
### Use Defaults
By default, LanceDB packages your local environment and sends it to Ray workers. This includes your local Python `site-packages` (defined by `site.getsitepackages()`) and either the current workspace root (if you're in a python repo) or the current working directory (if you're not). If you don't explicitly define a manifest, this is what will happen.
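For example, you can print the locations the defaults will package (a sketch; `site.getsitepackages()` is the source named above, and `os.getcwd()` stands in for the working-directory case):

<CodeGroup>
```python Python icon="python"
import os
import site

print(site.getsitepackages())  # python packages shipped to workers by default
print(os.getcwd())  # shipped when you're not inside a python repo
```
</CodeGroup>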
### Define a Manifest

Sometimes you need more control over what the workers get access to. For example:
- you might need to include files from another directory, or another python package
- you might not want to send all your local dependencies (if your repo has lots of dependencies but your UDF will only need a few)
- you might need packages to be built separately for the worker's architecture (for example, you can't build `pyarrow` on a Mac and run it on a Linux Ray worker).
- you might want to reuse dependencies between two backfill jobs, so you know they are running with the same environment.

For these use cases, you can define a Manifest. Calling `define_manifest()` packages files in the local environment and stores the Manifest metadata and files in persistent storage. The Manifest can then be referenced by name, shared, and reused.

<CodeGroup>
```python Python icon="python"
# the start of this example (imports, manifest_name, and earlier builder options)
# is elided in the source diff; GenevaManifestBuilder is assumed from context
manifest = (
    GenevaManifestBuilder()
    # ... other builder options elided ...
    .pip(["lancedb", "numpy"])
    .py_modules(["my_module"])
).build()

db.define_manifest(manifest_name, manifest)
```
</CodeGroup>
What goes in a manifest, and how do you define it? (All of these methods are on `GenevaManifestBuilder`.)

| Contents | How you can define it |
|---|---|
| Local python packages | Uploaded automatically, unless you set `.skip_site_packages(True)`. |
| Local working directory (or workspace root, if in a python repo) | Uploaded automatically. |
| Python packages to be installed with `pip` | Use `.pip(packages: list[str])` or `.add_pip(package: str)`. See [Ray's RuntimeEnv docs](https://docs.ray.io/en/latest/ray-core/api/doc/ray.runtime_env.RuntimeEnv.html) for details. |
| Local python packages outside of `site_packages` | Use `.py_modules(modules: list[str])` or `.add_py_module(module: str)`. See [Ray's RuntimeEnv docs](https://docs.ray.io/en/latest/ray-core/api/doc/ray.runtime_env.RuntimeEnv.html) for details. |
| Container image for the head node | Use `.head_image(head_image: str)`, or `default_head_image()` to use the default. Note that if the image is also defined in the GenevaCluster, the image set here in the Manifest takes priority. |
| Container image for worker nodes | Use `.worker_image(worker_image: str)`, or `default_worker_image()` to use the default for the current platform. As with the head image, this takes priority over any images set in the Cluster. |

If you want to see exactly what is being uploaded to the cluster, set `.delete_local_zips(False)` and `.local_zip_output_dir(path)`, then examine the zip files in `path`.
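For example, a manifest that skips local `site-packages` and keeps the staged zips around for inspection might look like this (a sketch; the builder methods are the ones listed above, while the package list and output path are assumptions):

<CodeGroup>
```python Python icon="python"
manifest = (
    GenevaManifestBuilder()
    .skip_site_packages(True)  # don't ship local site-packages
    .pip(["pillow"])  # have workers install packages with pip instead
    .delete_local_zips(False)  # keep the staged zips on disk...
    .local_zip_output_dir("/tmp/geneva-zips")  # ...here, for inspection
).build()
```
</CodeGroup>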
## Putting it all together: Execution Contexts
An execution context represents the concrete execution environment (Cluster and Manifest) used to execute a distributed job.
Calling `context` enters a context manager that provisions an execution cluster and executes the job using the Cluster and Manifest definitions provided. Because you've already defined the cluster and manifest, you can reference them by name. Note that providing a manifest is optional. Once the block completes, the context manager automatically de-provisions the cluster.

<CodeGroup>
```python Python icon="python"
db = geneva.connect(my_db_uri)
tbl = db.get_table("my_table")

# Providing a manifest is optional; if not provided, it will work as described in "Use Defaults" above.
with db.context(cluster=cluster_name, manifest=manifest_name):
    tbl.backfill("my_udf")
```
</CodeGroup>