Update Execution Contexts to recommend explicit dependencies (#112)

dantasse · web-flow · commit 8fd5708f0e39 · 2026-02-05T17:22:54.000-05:00
diff --git a/docs/geneva/jobs/contexts.mdx b/docs/geneva/jobs/contexts.mdx
@@ -17,20 +17,25 @@ We currently support one processing backend: **Ray**. There are 3 ways to connec
 
 ### Local Ray
 
-To execute jobs without an external Ray cluster, you can just trigger the `Table.backfill` method. This will auto-create a Ray cluster on your machine. Because it's on your laptop/desktop, this is only suitable for prototyping on small datasets. But it is the easiest way to get started. Simply define the UDF, add a column, and trigger the job:
+To execute jobs without an external Ray cluster, you can use `LocalRayContext`. This will auto-create a Ray cluster on your machine. Because the cluster runs on your laptop or desktop, this is only suitable for prototyping on small datasets, but it is the easiest way to get started. Simply define the UDF, add a column, call `Connection.local_ray_context()`, and trigger the job:
 
 <CodeGroup>
 ```python Python icon="python"
+from geneva import udf
+from geneva.db import Connection
+
 @udf
 def filename_len(filename: str) -> int:
     return len(filename)
 
 tbl.add_columns({"filename_len": filename_len})
-tbl.backfill("filename_len")
+
+with Connection.local_ray_context():
+    tbl.backfill("filename_len")
 ```
 </CodeGroup>
 
-Geneva will package up your local environment and send it to each worker node, so they'll have access to all the same dependencies as if you ran a simple Python script yourself.
+Geneva will package up your local environment and send it to each worker process, so every worker has access to the same dependencies as if you had run a simple Python script yourself.
 
 ### KubeRay
 
@@ -50,7 +55,7 @@ db = geneva.connect("s3://my-bucket/my-db")
 ray_version = ray.__version__
 python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
 cluster_name = "my-geneva-cluster" # lowercase, numbers, hyphens only
-service_account = "my_k8s_service_account" # k8s service account bound geneva runs as
+service_account = "my_k8s_service_account" # k8s service account that Geneva runs as
 k8s_namespace = "geneva"  # k8s namespace
 
 cluster = (
@@ -120,25 +125,17 @@ db.define_cluster(cluster_name, cluster)
 ```
 </CodeGroup>
 
-## Dependencies
-
-Most UDFs require some dependencies: helper libraries like `pillow` for image processing, pre-trained models like `open-clip` to calculate embeddings, or even small config files. We have two ways to get them to workers:
-
-1. Use defaults
-2. Define a manifest
+## Dependencies and Manifests
 
-### Use Defaults
-By default, LanceDB packages your local environment and sends it to Ray workers. This includes your local Python `site-packages` (defined by `site.getsitepackages()`) and either the current workspace root (if you're in a python repo) or the current working directory (if you're not). If you don't explicitly define a manifest, this is what will happen.
+Most UDFs require some dependencies: helper libraries like `pillow` for image processing, pre-trained models like `open-clip` to calculate embeddings, or even small config files. We have three ways to get them to workers:
 
-### Define a Manifest
+1. Define dependencies explicitly in a manifest
+2. Bake dependencies into an image
+3. Auto-upload local dependencies
 
-Sometimes you need more control over what the workers get access to. For example:
-- you might need to include files from another directory, or another python package
-- you might not want to send all your local dependencies (if your repo has lots of dependencies but your UDF will only need a few)
-- you might need packages to be built separately for the worker's architecture (for example, you can't build `pyarrow` on a Mac and run it on a Linux Ray worker).
-- you might want to reuse dependencies between two backfill jobs, so you know they are running with the same environment.
+### Define dependencies explicitly in a manifest
 
-For these use cases, you can define a Manifest. Calling `define_manifest()` packages files in the local environment and stores the Manifest metadata and files in persistent storage. The Manifest can then be referenced by name, shared, and reused.
+We recommend defining dependencies explicitly: it's the easiest way to understand exactly what's running, and the least error-prone. To do so, define a Manifest, like so:
 
 <CodeGroup>
 ```python Python icon="python"
@@ -150,22 +147,89 @@ manifest_name="dev-manifest"
 manifest = (
     GenevaManifestBuilder()
         .name(manifest_name)
-        .skip_site_packages(False)
         .pip(["lancedb", "numpy"])
-        .py_modules(["my_module"])
     ).build()
 
 db.define_manifest(manifest_name, manifest)
 ```
 </CodeGroup>
 
-What's in a manifest and how can you define it? (methods are all on `GenevaManifestBuilder`)
+After workers start up, this will run `pip install lancedb numpy` on them. Geneva also supports defining `conda` dependencies, or sending a path to a `requirements.txt` or `environment.yml` file, via the following methods on `GenevaManifestBuilder`:
+
+```
+.pip(deps) # list of pip dependencies, like "numpy" or "numpy==2.3.5"
+.conda(deps) # list of conda dependencies, like "numpy" or "numpy=2.3.5"
+.requirements_path(path) # path to local requirements.txt file, like "./requirements.txt"
+.conda_environment_path(path) # path to local conda environment.yml file, like "./environment.yml"
+# Note that file paths are relative to the execution directory.
+```
+Note that attempting to use both `pip` and `requirements_path` will raise an exception. Similarly, you can't use both `conda` and `conda_environment_path`.
+
+### Bake dependencies into an image
+
+Because the `pip` and `conda` methods install packages when workers start up, they incur some startup cost. Once your jobs are stable in production, it is faster to build all your dependencies into the workers' image, then reference that image in the manifest like so:
+
+<CodeGroup>
+```python Python icon="python"
+from geneva.manifest.builder import GenevaManifestBuilder
+
+db = geneva.connect(my_db_uri)
+
+manifest_name = "prod-manifest"
+manifest = (
+    GenevaManifestBuilder()
+        .name(manifest_name)
+        .worker_image("myregistry.example.com/my-custom-worker-image:latest")
+    ).build()
+
+db.define_manifest(manifest_name, manifest)
+```
+</CodeGroup>
+
+<Tip>
+You can also define images in a GenevaCluster, e.g.:
+```python Python icon="python"
+cluster = (
+    GenevaClusterBuilder()
+        .name(cluster_name)
+        .add_cpu_worker_group(
+            WorkerGroupBuilder()
+            .image("myregistry.example.com/my-custom-worker-image:latest")
+        )
+    .build()
+)
+```
+However, if an image is defined in both a Cluster and a Manifest, the definition in the Manifest will take priority.
+</Tip>
+
+### Auto-upload local dependencies
+
+Geneva can package your local environment and send it to Ray workers. This includes the current workspace root (if you're in a Python repo) or the current working directory (if you're not). If you also set `.upload_site_packages()`, your Python site-packages (defined by `site.getsitepackages()`) will be uploaded to workers as well. Uploading site-packages is not recommended for production use, because it is prone to issues such as architecture mismatches in built dependencies, but it can be a good way to iterate quickly during development.
+
+To upload site packages:
+
+```python Python icon="python"
+from geneva.manifest.builder import GenevaManifestBuilder
+db = geneva.connect(my_db_uri)
+manifest_name = "dev-manifest"
+manifest = (
+    GenevaManifestBuilder()
+        .name(manifest_name)
+        .upload_site_packages()
+    ).build()
+
+db.define_manifest(manifest_name, manifest)
+```
+
+### What's in a manifest?
+Here's a summary of what's in a manifest and how you can define each part. All methods are on `GenevaManifestBuilder`.
 
 |Contents|How you can define it|
 |---|---|
-|Local python packages|Will be uploaded automatically, unless you set `.skip_site_packages(True)`.|
 |Local working directory (or workspace root, if in a python repo)|Will be uploaded automatically.|
-|Python packages to be installed with `pip`|Use `.pip(packages: list[str])` or `.add_pip(package: str)`. See [Ray's RuntimeEnv docs](https://docs.ray.io/en/latest/ray-core/api/doc/ray.runtime_env.RuntimeEnv.html) for details.|
+|Local python packages|Will be uploaded if you set `.upload_site_packages()`.|
+|Python packages to be installed|Use `.pip(packages: list[str])` or `.conda(packages: dict[str, Any])`. See [Ray's RuntimeEnv docs](https://docs.ray.io/en/latest/ray-core/api/doc/ray.runtime_env.RuntimeEnv.html) for details.|
+|Python dependency lists|Use `.requirements_path(path: str)` or `.conda_environment_path(path: str)`.|
 |Local python packages outside of `site_packages`|Use `.py_modules(modules: list[str])` or `.add_py_module(module: str)`. See [Ray's RuntimeEnv docs](https://docs.ray.io/en/latest/ray-core/api/doc/ray.runtime_env.RuntimeEnv.html) for details.|
 |Container image for head node|Use `.head_image(head_image: str)` or `default_head_image()` to use the default. Note that, if the image is also defined in the GenevaCluster, the image set here in the Manifest will take priority.|
 |Container image for worker nodes|Use `.worker_image(worker_image: str)` or `default_worker_image()` to use the default for the current platform. As with the head image, this takes priority over any images set in the Cluster.|
@@ -183,7 +247,6 @@ Calling `context` will enter a context manager that will provision an execution
 db = geneva.connect(my_db_uri)
 tbl = db.get_table("my_table")
 
-# Providing a manifest is optional; if not provided, it will work as described in "Use defaults" above.
 with db.context(cluster=cluster_name, manifest=manifest_name):
     tbl.backfill("embedding")
 ```
@@ -194,7 +257,7 @@ In a notebook environment, you can manually enter and exit the context manager i
 <CodeGroup>
 ```python Python icon="python"
 ctx = db.context(cluster=cluster_name, manifest=manifest_name)
-ctx.__enter()__
+ctx.__enter__()
 
 # ... do stuff