File: docs/geneva/jobs/contexts.mdx
Lines changed: 89 additions & 26 deletions
@@ -17,20 +17,25 @@ We currently support one processing backend: **Ray**. There are 3 ways to connec

 ### Local Ray

-To execute jobs without an external Ray cluster, you can just trigger the `Table.backfill` method. This will auto-create a Ray cluster on your machine. Because it's on your laptop/desktop, this is only suitable for prototyping on small datasets. But it is the easiest way to get started. Simply define the UDF, add a column, and trigger the job:
+To execute jobs without an external Ray cluster, you can use `LocalRayContext`. This will auto-create a Ray cluster on your machine. Because it's on your laptop/desktop, this is only suitable for prototyping on small datasets. But it is the easiest way to get started. Simply define the UDF, add a column, call `Connection.local_ray_context()`, and trigger the job:

 <CodeGroup>
 ```python Python icon="python"
+from geneva import udf
+from geneva.db import Connection
+
 @udf
 def filename_len(filename: str) -> int:
     return len(filename)

 tbl.add_columns({"filename_len": filename_len})
-tbl.backfill("filename_len")
+
+with Connection.local_ray_context():
+    tbl.backfill("filename_len")
 ```
 </CodeGroup>

-Geneva will package up your local environment and send it to each worker node, so they'll have access to all the same dependencies as if you ran a simple Python script yourself.
+Geneva will package up your local environment and send it to each worker process, so they'll have access to all the same dependencies as if you ran a simple Python script yourself.
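
Because the UDF in the Local Ray example is an ordinary Python function underneath the `@udf` decorator, its logic can be checked locally before any Ray cluster is involved. The sketch below is illustrative only: it defines an undecorated copy of `filename_len` and runs it over a few made-up filenames. It does not call Geneva, and the sample inputs are hypothetical.

```python
def filename_len(filename: str) -> int:
    """Undecorated copy of the UDF body from the example above."""
    return len(filename)


# Hypothetical sample inputs, made up for this local check.
sample_filenames = ["cat.jpg", "dog.png", ""]

# Apply the function row by row, as a backfill would do per record.
lengths = [filename_len(name) for name in sample_filenames]
print(lengths)
```

Checking the function this way catches type or logic mistakes cheaply, before the environment is packaged and shipped to workers.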

 ### KubeRay

@@ -50,7 +55,7 @@ db = geneva.connect("s3://my-bucket/my-db")
Most UDFs require some dependencies: helper libraries like `pillow` for image processing, pre-trained models like `open-clip` to calculate embeddings, or even small config files. We have two ways to get them to workers:
-
-1. Use defaults
-2. Define a manifest
+## Dependencies and Manifests

-### Use Defaults
-By default, LanceDB packages your local environment and sends it to Ray workers. This includes your local Python `site-packages` (defined by `site.getsitepackages()`) and either the current workspace root (if you're in a python repo) or the current working directory (if you're not). If you don't explicitly define a manifest, this is what will happen.
+Most UDFs require some dependencies: helper libraries like `pillow` for image processing, pre-trained models like `open-clip` to calculate embeddings, or even small config files. We have three ways to get them to workers:

-### Define a Manifest
+1. Define dependencies explicitly in a manifest
+2. Bake dependencies into an image
+3. Auto-upload local dependencies

-Sometimes you need more control over what the workers get access to. For example:
-- you might need to include files from another directory, or another python package
-- you might not want to send all your local dependencies (if your repo has lots of dependencies but your UDF will only need a few)
-- you might need packages to be built separately for the worker's architecture (for example, you can't build `pyarrow` on a Mac and run it on a Linux Ray worker).
-- you might want to reuse dependencies between two backfill jobs, so you know they are running with the same environment.
+### Define dependencies explicitly in a manifest

-For these use cases, you can define a Manifest. Calling `define_manifest()` packages files in the local environment and stores the Manifest metadata and files in persistent storage. The Manifest can then be referenced by name, shared, and reused.
+We recommend defining dependencies explicitly: it's the easiest way to understand exactly what's running, and the least error-prone. To do so, define a Manifest, like so:
What's in a manifest and how can you define it? (methods are all on `GenevaManifestBuilder`)
+After workers start up, this will run `pip install lancedb numpy` on them. Geneva also supports defining `conda` dependencies, or sending a path to a `requirements.txt` or `environment.yml` file, via the following methods on `GenevaManifestBuilder`:
+
+```
+.pip(deps)  # list of pip dependencies, like "numpy" or "numpy==2.3.5"
+.conda(deps)  # list of conda dependencies, like "numpy" or "numpy=2.3.5"
+.requirements_path(path)  # path to local requirements.txt file, like "./requirements.txt"
+.conda_environment_path(path)  # path to local conda environment.yml file, like "./environment.yml"
+# Note that file paths are relative to the execution directory.
+```
+Note that attempting to use both `pip` and `requirements_path` will raise an exception. Similarly, you can't use both `conda` and `conda_environment_path`.
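
The mutual-exclusion rule above can be pictured with a small stand-in. The sketch below is not Geneva's code: `ToyManifestBuilder` is a hypothetical class written for illustration, and both the exception type (`ValueError`) and the point at which it is raised (at `build()`) are assumptions, since the diff only says that an exception is raised.

```python
class ToyManifestBuilder:
    """Hypothetical stand-in for illustration; NOT GenevaManifestBuilder."""

    def __init__(self) -> None:
        self._pip = None
        self._requirements_path = None

    def pip(self, deps: list) -> "ToyManifestBuilder":
        # Store an inline list of pip requirement strings.
        self._pip = list(deps)
        return self

    def requirements_path(self, path: str) -> "ToyManifestBuilder":
        # Store a path to a requirements.txt file instead.
        self._requirements_path = path
        return self

    def build(self) -> dict:
        # Assumption: ValueError at build time. The real library may raise
        # a different exception type, or raise earlier.
        if self._pip is not None and self._requirements_path is not None:
            raise ValueError("use either pip() or requirements_path(), not both")
        return {"pip": self._pip, "requirements_path": self._requirements_path}


# One source of pip dependencies is accepted.
ok = ToyManifestBuilder().pip(["lancedb", "numpy"]).build()

# Two sources for the same package manager are rejected.
try:
    ToyManifestBuilder().pip(["numpy"]).requirements_path("./requirements.txt").build()
    conflict_raised = False
except ValueError:
    conflict_raised = True

print(ok, conflict_raised)
```

The same pattern would apply to the `conda` and `conda_environment_path` pair.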
+
+### Bake dependencies into an image
+
+Because the `pip` or `conda` methods involve installing packages, they will incur some startup costs. When your jobs are stable in production, therefore, it will be faster to build all your dependencies into the workers' images, then specify them like so:
+
+<CodeGroup>
+```python Python icon="python"
+from geneva.manifest.builder import GenevaManifestBuilder
However, if an image is defined in both a Cluster and a Manifest, the definition in the Manifest will take priority.
+</Tip>
+
+### Auto-upload local dependencies
+
+Geneva can package your local environment and send it to Ray workers. This includes the current workspace root (if you're in a python repo) or the current working directory (if you're not). However, if you set `.upload_site_packages()`, your Python site-packages (defined by `site.getsitepackages()`) will be uploaded to workers as well. This is not recommended for production use, as it is prone to issues like architecture mismatches of built dependencies, but it can be a good way to iterate quickly during development.
+
+To upload site packages:
+
+```python Python icon="python"
+from geneva.manifest.builder import GenevaManifestBuilder
+db = geneva.connect(my_db_uri)
+manifest_name = "dev-manifest"
+manifest = (
+    GenevaManifestBuilder()
+    .name(manifest_name)
+    .upload_site_packages()
+).build()
+
+db.define_manifest(manifest_name, manifest)
+```
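
To get a rough sense of what `.upload_site_packages()` would cover, you can list the directories that `site.getsitepackages()` reports for the current interpreter. This sketch uses only the standard library. How Geneva actually walks or packages these directories is not shown in this diff, and counting top-level entries is an illustrative addition rather than anything Geneva does.

```python
import os
import site

# Directories the current interpreter treats as site-packages.
site_dirs = site.getsitepackages()

# Count top-level entries in each directory that exists on disk.
entry_counts = {}
for directory in site_dirs:
    if os.path.isdir(directory):
        entry_counts[directory] = len(os.listdir(directory))

for directory, count in entry_counts.items():
    print(f"{directory}: {count} top-level entries")
```

A large count here is a hint that uploading site-packages will be slow, and that an explicit `pip` list or a baked image may be a better fit.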
+
+### What's in a manifest?
+Here's a summary of what's in a manifest and how you can define it. (methods are all on `GenevaManifestBuilder`)

 |Contents|How you can define it|
 |---|---|
-|Local python packages|Will be uploaded automatically, unless you set `.skip_site_packages(True)`.|
 |Local working directory (or workspace root, if in a python repo)|Will be uploaded automatically.|
-|Python packages to be installed with `pip`|Use `.pip(packages: list[str])` or `.add_pip(package: str)`. See [Ray's RuntimeEnv docs](https://docs.ray.io/en/latest/ray-core/api/doc/ray.runtime_env.RuntimeEnv.html) for details.|
+|Local python packages|Will be uploaded if you set `.upload_site_packages()`.|
+|Python packages to be installed|Use `.pip(packages: list[str])` or `.conda(packages: dict[str, Any])`. See [Ray's RuntimeEnv docs](https://docs.ray.io/en/latest/ray-core/api/doc/ray.runtime_env.RuntimeEnv.html) for details.|
+|Python dependency lists|Use `.requirements_path(path: str)` or `.conda_environment_path(path: str)`|
 |Local python packages outside of `site_packages`|Use `.py_modules(modules: list[str])` or `.add_py_module(module: str)`. See [Ray's RuntimeEnv docs](https://docs.ray.io/en/latest/ray-core/api/doc/ray.runtime_env.RuntimeEnv.html) for details.|
 |Container image for head node|Use `.head_image(head_image: str)` or `default_head_image()` to use the default. Note that, if the image is also defined in the GenevaCluster, the image set here in the Manifest will take priority.|
 |Container image for worker nodes|Use `.worker_image(worker_image: str)` or `default_worker_image()` to use the default for the current platform. As with the head image, this takes priority over any images set in the Cluster.|
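
The image-priority rule in the last two table rows can be written down as a tiny function. This is a sketch of the rule as the table states it, not Geneva's implementation: `resolve_image` and the image names are hypothetical, and falling back to the Cluster's image when the Manifest sets none is an assumption.

```python
from typing import Optional


def resolve_image(manifest_image: Optional[str], cluster_image: Optional[str]) -> Optional[str]:
    """Hypothetical helper: pick the image a node would run.

    The Manifest's image wins when it is set. Falling back to the
    Cluster's image otherwise is an assumption made for this sketch.
    """
    if manifest_image is not None:
        return manifest_image
    return cluster_image


# Hypothetical image names, for illustration only.
both_set = resolve_image("example.com/worker:manifest", "example.com/worker:cluster")
cluster_only = resolve_image(None, "example.com/worker:cluster")
print(both_set, cluster_only)
```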
@@ -183,7 +247,6 @@ Calling `context` will enter a context manager that will provision an execution
 db = geneva.connect(my_db_uri)
 tbl = db.get_table("my_table")

-# Providing a manifest is optional; if not provided, it will work as described in "Use defaults" above.
 with db.context(cluster=cluster_name, manifest=manifest_name):
     tbl.backfill("embedding")
 ```
@@ -194,7 +257,7 @@ In a notebook environment, you can manually enter and exit the context manager i