
Commit 8fd5708

Update Execution Contexts to recommend explicit dependencies (#112)
1 parent 4a2fb9b commit 8fd5708

1 file changed

Lines changed: 89 additions & 26 deletions

docs/geneva/jobs/contexts.mdx

Lines changed: 89 additions & 26 deletions
Original file line number | Diff line number | Diff line change
@@ -17,20 +17,25 @@ We currently support one processing backend: **Ray**. There are 3 ways to connec
1717

1818
### Local Ray
1919

20-
To execute jobs without an external Ray cluster, you can just trigger the `Table.backfill` method. This will auto-create a Ray cluster on your machine. Because it's on your laptop/desktop, this is only suitable for prototyping on small datasets. But it is the easiest way to get started. Simply define the UDF, add a column, and trigger the job:
20+
To execute jobs without an external Ray cluster, you can use `LocalRayContext`. This will auto-create a Ray cluster on your machine. Because it's on your laptop/desktop, this is only suitable for prototyping on small datasets. But it is the easiest way to get started. Simply define the UDF, add a column, call `Connection.local_ray_context()`, and trigger the job:
2121

2222
<CodeGroup>
2323
```python Python icon="python"
24+
from geneva import udf
25+
from geneva.db import Connection
26+
2427
@udf
2528
def filename_len(filename: str) -> int:
2629
    return len(filename)
2730

2831
tbl.add_columns({"filename_len": filename_len})
29-
tbl.backfill("filename_len")
32+
33+
with Connection.local_ray_context():
34+
    tbl.backfill("filename_len")
3035
```
3136
</CodeGroup>
3237

33-
Geneva will package up your local environment and send it to each worker node, so they'll have access to all the same dependencies as if you ran a simple Python script yourself.
38+
Geneva will package up your local environment and send it to each worker process, so they'll have access to all the same dependencies as if you ran a simple Python script yourself.
3439

3540
### KubeRay
3641

@@ -50,7 +55,7 @@ db = geneva.connect("s3://my-bucket/my-db")
5055
ray_version = ray.__version__
5156
python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
5257
cluster_name = "my-geneva-cluster" # lowercase, numbers, hyphens only
53-
service_account = "my_k8s_service_account" # k8s service account bound geneva runs as
58+
service_account = "my_k8s_service_account" # k8s service account that Geneva runs as
5459
k8s_namespace = "geneva" # k8s namespace
5560

5661
cluster = (
@@ -120,25 +125,17 @@ db.define_cluster(cluster_name, cluster)
120125
```
121126
</CodeGroup>
122127

123-
## Dependencies
124-
125-
Most UDFs require some dependencies: helper libraries like `pillow` for image processing, pre-trained models like `open-clip` to calculate embeddings, or even small config files. We have two ways to get them to workers:
126-
127-
1. Use defaults
128-
2. Define a manifest
128+
## Dependencies and Manifests
129129

130-
### Use Defaults
131-
By default, LanceDB packages your local environment and sends it to Ray workers. This includes your local Python `site-packages` (defined by `site.getsitepackages()`) and either the current workspace root (if you're in a python repo) or the current working directory (if you're not). If you don't explicitly define a manifest, this is what will happen.
130+
Most UDFs require some dependencies: helper libraries like `pillow` for image processing, pre-trained models like `open-clip` to calculate embeddings, or even small config files. We have three ways to get them to workers:
132131

133-
### Define a Manifest
132+
1. Define dependencies explicitly in a manifest
133+
2. Bake dependencies into an image
134+
3. Auto-upload local dependencies
134135

135-
Sometimes you need more control over what the workers get access to. For example:
136-
- you might need to include files from another directory, or another python package
137-
- you might not want to send all your local dependencies (if your repo has lots of dependencies but your UDF will only need a few)
138-
- you might need packages to be built separately for the worker's architecture (for example, you can't build `pyarrow` on a Mac and run it on a Linux Ray worker).
139-
- you might want to reuse dependencies between two backfill jobs, so you know they are running with the same environment.
136+
### Define dependencies explicitly in a manifest
140137

141-
For these use cases, you can define a Manifest. Calling `define_manifest()` packages files in the local environment and stores the Manifest metadata and files in persistent storage. The Manifest can then be referenced by name, shared, and reused.
138+
We recommend defining dependencies explicitly: it's the easiest way to understand exactly what's running, and the least error-prone. To do so, define a Manifest, like so:
142139

143140
<CodeGroup>
144141
```python Python icon="python"
@@ -150,22 +147,89 @@ manifest_name="dev-manifest"
150147
manifest = (
151148
GenevaManifestBuilder()
152149
.name(manifest_name)
153-
.skip_site_packages(False)
154150
.pip(["lancedb", "numpy"])
155-
.py_modules(["my_module"])
156151
).build()
157152

158153
db.define_manifest(manifest_name, manifest)
159154
```
160155
</CodeGroup>
161156

162-
What's in a manifest and how can you define it? (methods are all on `GenevaManifestBuilder`)
157+
After workers start up, this will run `pip install lancedb numpy` on them. Geneva also supports defining `conda` dependencies, or sending a path to a `requirements.txt` or `environment.yml` file, via the following methods on `GenevaManifestBuilder`:
158+
159+
```
160+
.pip(deps) # list of pip dependencies, like "numpy" or "numpy==2.3.5"
161+
.conda(deps) # list of conda dependencies, like "numpy" or "numpy=2.3.5"
162+
.requirements_path(path) # path to local requirements.txt file, like "./requirements.txt"
163+
.conda_environment_path(path) # path to local conda environment.yml file, like "./environment.yml"
164+
# Note that file paths are relative to the execution directory.
165+
```
166+
Note that attempting to use both `pip` and `requirements_path` will raise an exception. Similarly, you can't use both `conda` and `conda_environment_path`.
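The mutual-exclusion rule above can be illustrated with a minimal sketch. `ManifestSketch` is a hypothetical stand-in, not Geneva's `GenevaManifestBuilder`. The exception type (`ValueError`) and the point at which it is raised (at `build()`) are assumptions made for illustration only; the real builder may differ on both.

```python
from typing import Optional


class ManifestSketch:
    """Hypothetical stand-in for a manifest builder; not Geneva's real class."""

    def __init__(self) -> None:
        self._pip: Optional[list[str]] = None
        self._requirements_path: Optional[str] = None

    def pip(self, deps: list[str]) -> "ManifestSketch":
        # Record an explicit list of pip requirement strings.
        self._pip = list(deps)
        return self

    def requirements_path(self, path: str) -> "ManifestSketch":
        # Record a path to a requirements.txt file instead of a list.
        self._requirements_path = path
        return self

    def build(self) -> dict:
        # Assumption: the conflict is reported at build time as a ValueError.
        if self._pip is not None and self._requirements_path is not None:
            raise ValueError("use either pip() or requirements_path(), not both")
        return {"pip": self._pip, "requirements_path": self._requirements_path}
```

With this sketch, `ManifestSketch().pip(["numpy"]).build()` succeeds, while chaining both `.pip(...)` and `.requirements_path(...)` before `.build()` raises.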
167+
168+
### Bake dependencies into an image
169+
170+
Because the `pip` and `conda` methods install packages on the workers, they incur some startup cost. Once your jobs are stable in production, it will be faster to build all your dependencies into the workers' image, then specify that image like so:
171+
172+
<CodeGroup>
173+
```python Python icon="python"
174+
from geneva.manifest.builder import GenevaManifestBuilder
175+
176+
db = geneva.connect(my_db_uri)
177+
178+
manifest_name = "prod-manifest"
179+
manifest = (
180+
GenevaManifestBuilder()
181+
.name(manifest_name)
182+
.worker_image("myregistry.example.com/my-custom-worker-image:latest")
183+
).build()
184+
185+
db.define_manifest(manifest_name, manifest)
186+
```
187+
</CodeGroup>
188+
189+
<Tip>
190+
You can also define images in a GenevaCluster, e.g.:
191+
```python Python icon="python"
192+
cluster = (
193+
GenevaClusterBuilder()
194+
.name(cluster_name)
195+
.add_cpu_worker_group(
196+
WorkerGroupBuilder()
197+
.image("myregistry.example.com/my-custom-worker-image:latest")
198+
)
199+
.build()
200+
)
201+
```
202+
However, if an image is defined in both a Cluster and a Manifest, the definition in the Manifest will take priority.
203+
</Tip>
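The precedence rule in the tip can be written down as a small sketch. `resolve_worker_image` is a hypothetical helper, not part of Geneva. It only restates the rule that an image set in the Manifest wins over one set in the Cluster, and it assumes an unset image is represented as `None`.

```python
from typing import Optional


def resolve_worker_image(
    manifest_image: Optional[str], cluster_image: Optional[str]
) -> Optional[str]:
    """Hypothetical helper: pick the image a worker would run with."""
    # The Manifest's image takes priority whenever it is set.
    if manifest_image is not None:
        return manifest_image
    # Otherwise fall back to the Cluster's image (which may also be unset).
    return cluster_image
```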
204+
205+
### Auto-upload local dependencies
206+
207+
Geneva can package your local environment and send it to Ray workers. This includes the current workspace root (if you're in a Python repo) or the current working directory (if you're not). If you also call `.upload_site_packages()`, your Python site-packages (defined by `site.getsitepackages()`) will be uploaded to workers as well. This is not recommended for production use, as it is prone to issues like architecture mismatches of built dependencies, but it can be a good way to iterate quickly during development.
208+
209+
To upload site packages:
210+
211+
```python Python icon="python"
212+
from geneva.manifest.builder import GenevaManifestBuilder
213+
db = geneva.connect(my_db_uri)
214+
manifest_name = "dev-manifest"
215+
manifest = (
216+
GenevaManifestBuilder()
217+
.name(manifest_name)
218+
.upload_site_packages()
219+
).build()
220+
221+
db.define_manifest(manifest_name, manifest)
222+
```
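Before opting in to uploading site-packages, it can help to see which directories `site.getsitepackages()` reports on your machine. The sketch below uses only the Python standard library. `site_packages_summary` is a hypothetical helper name, and the entry count is only a rough indicator of how much content is in each directory, not a statement of what Geneva actually uploads.

```python
import site
from pathlib import Path


def site_packages_summary() -> dict[str, int]:
    """Map each site-packages directory to its number of top-level entries."""
    summary: dict[str, int] = {}
    for directory in site.getsitepackages():
        path = Path(directory)
        try:
            # Count top-level entries (packages, .dist-info folders, etc.).
            summary[directory] = sum(1 for _ in path.iterdir())
        except OSError:
            # The directory may be missing or unreadable; report it as empty.
            summary[directory] = 0
    return summary


if __name__ == "__main__":
    for directory, count in site_packages_summary().items():
        print(f"{directory}: {count} top-level entries")
```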
223+
224+
### What's in a manifest?
225+
Here's a summary of what's in a manifest and how you can define each part. (All methods are on `GenevaManifestBuilder`.)
163226

164227
|Contents|How you can define it|
165228
|---|---|
166-
|Local python packages|Will be uploaded automatically, unless you set `.skip_site_packages(True)`.|
167229
|Local working directory (or workspace root, if in a python repo)|Will be uploaded automatically.|
168-
|Python packages to be installed with `pip`|Use `.pip(packages: list[str])` or `.add_pip(package: str)`. See [Ray's RuntimeEnv docs](https://docs.ray.io/en/latest/ray-core/api/doc/ray.runtime_env.RuntimeEnv.html) for details.|
230+
|Local python packages|Will be uploaded if you set `.upload_site_packages()`.|
231+
|Python packages to be installed|Use `.pip(packages: list[str])` or `.conda(packages: dict[str, Any])`. See [Ray's RuntimeEnv docs](https://docs.ray.io/en/latest/ray-core/api/doc/ray.runtime_env.RuntimeEnv.html) for details.|
232+
|Python dependency lists|Use `.requirements_path(path: str)` or `.conda_environment_path(path: str)`|
169233
|Local python packages outside of `site_packages`|Use `.py_modules(modules: list[str])` or `.add_py_module(module: str)`. See [Ray's RuntimeEnv docs](https://docs.ray.io/en/latest/ray-core/api/doc/ray.runtime_env.RuntimeEnv.html) for details.|
170234
|Container image for head node|Use `.head_image(head_image: str)` or `default_head_image()` to use the default. Note that, if the image is also defined in the GenevaCluster, the image set here in the Manifest will take priority.|
171235
|Container image for worker nodes|Use `.worker_image(worker_image: str)` or `default_worker_image()` to use the default for the current platform. As with the head image, this takes priority over any images set in the Cluster.|
@@ -183,7 +247,6 @@ Calling `context` will enter a context manager that will provision an execution
183247
db = geneva.connect(my_db_uri)
184248
tbl = db.get_table("my_table")
185249

186-
# Providing a manifest is optional; if not provided, it will work as described in "Use defaults" above.
187250
with db.context(cluster=cluster_name, manifest=manifest_name):
188251
    tbl.backfill("embedding")
189252
```
@@ -194,7 +257,7 @@ In a notebook environment, you can manually enter and exit the context manager i
194257
<CodeGroup>
195258
```python Python icon="python"
196259
ctx = db.context(cluster=cluster_name, manifest=manifest_name)
197-
ctx.__enter()__
260+
ctx.__enter__()
198261

199262
# ... do stuff
200263
