Add examples on create index and training (#2194)

prrao87 · lhoestq · julien-c · web-flow · commit 72c928930474 · 2026-02-02T15:00:59.000+01:00
* Add examples on create index and training

* Update openvid path and minor fixes

* Update docs/hub/datasets-lance.md

Co-authored-by: Julien Chaumond <julien@huggingface.co>

---------

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: Julien Chaumond <julien@huggingface.co>
diff --git a/docs/hub/datasets-lance.md b/docs/hub/datasets-lance.md
@@ -16,7 +16,7 @@ pip install pylance pyarrow
 
 ## Why Lance?
 
-- Optimized for ML/AI workloads: Lance is a modern columnar format designed for fast random access without compromising scan performance.
+- Optimized for ML/AI workloads: Lance is a modern columnar format designed for fast random access without compromising scan performance, which makes it useful for search, analytics, training, feature engineering, and more.
 - Multimodal assets are stored as bytes, or binary objects ("[blobs as files](https://lance.org/guide/blob/)") in Lance alongside embeddings, and traditional scalar data -- this makes it easier to govern, share, and distribute your large datasets via the Hub.
 - Indexing is a first-class citizen (native to the format itself): Lance comes with fast, on-disk, scalable [vector](https://lance.org/quickstart/vector-search) and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them.
 - Flexible schema and [data evolution](https://lance.org/guide/data_evolution) let you incrementally add new features/columns (moderation tags, embeddings, etc.) **without** needing to rewrite the entire table.
@@ -108,11 +108,29 @@ subset = scanner.to_table()
 lance.write_dataset(subset, "./laion_subset")
 ```
 
+## Create index
+
+If your dataset doesn't already have an index associated with it, you can create one after downloading it locally.
+
+```python
+# ds is a local Lance dataset
+ds.create_index(
+    "img_emb",
+    index_type="IVF_PQ",
+    num_partitions=256,
+    num_sub_vectors=96,
+    replace=True,
+)
+```
+
+See the [Lance docs](https://lance.org/quickstart/vector-search/) on vector index creation for a more detailed example. Once a vector index exists, you can run similarity search over the embeddings.
+
 ## Vector search
 
-Because indexes are first-class citizens in Lance, you can store not only your data but also your embeddings and indexes together in a dataset and query them directly on the Hub. Simply use the `describe_indices()` method to list the index information for the dataset. If an index doesn't exist in the dataset, you can use `lance.write_dataset()` to write a local version of the dataset and use [LanceDataset.create_index](https://lance-format.github.io/lance-python-doc/all-modules.html#lance.dataset.LanceDataset.create_index) to create an index for your needs.
+Because indexes are first-class citizens in Lance, you can store not only your data but also your embeddings and indexes together and query them **directly on the Hub**. Simply use the `describe_indices()` method to list the index information for the dataset. If an index doesn't exist in the dataset, you can use `lance.write_dataset()` to write a local version of the dataset and use [LanceDataset.create_index](https://lance-format.github.io/lance-python-doc/all-modules.html#lance.dataset.LanceDataset.create_index) to create an index for your needs.
+
+The example below shows a dataset for which we already have a vector index on the `img_emb` field:
 
-The example below shows a dataset for which we have a vector index on the `img_emb` field, as well as its index statistics.
 ```python
 import lance
 
@@ -133,7 +151,7 @@ print(ds.list_indices())
 # ]
 ```
 
-You can run vector search queries directly on the remote dataset without downloading it. The example below shows how to run a nearest neighbor search on a vector index using an image embedding as the query vector.
+You can run vector search queries directly on the remote dataset without downloading it (or, if you prefer, download the dataset locally and create a new index). The example below shows how to run a nearest neighbor search on a vector index using an image embedding as the query vector.
 
 ```python
 import lance
@@ -193,7 +211,7 @@ Lance tables also support large inline video blobs. The `OpenVid-1M` dataset (fr
 ```python
 import lance
 
-lance_ds = lance.dataset("hf://datasets/lance-format/openvid-lance/data/train.lance")
+lance_ds = lance.dataset("hf://datasets/lance-format/Openvid-1M/data/train.lance")
 blob_file = lance_ds.take_blobs("video_blob", ids=[0])[0]
 video_bytes = blob_file.read()
 ```
@@ -203,7 +221,7 @@ Unlike other data formats, large multimodal binary objects (blobs) are first-cla
 ```python
 import lance
 
-ds = lance.dataset("hf://datasets/lance-format/openvid-lance/data/train.lance")
+ds = lance.dataset("hf://datasets/lance-format/Openvid-1M/data/train.lance")
 
 # 1. Browse metadata without loading video blobs.
 metadata = ds.scanner(
@@ -219,6 +237,22 @@ with open("video_0.mp4", "wb") as f:
     f.write(blob_file.read())
 ```
 
+## Prepare data for training
+
+Training is another area where Lance's fast random access and scan performance can be useful. You can use Lance datasets as the storage layer for your training data, shuffling it and loading it in batches as part of your training pipeline.
+
+The blob API in Lance is compatible with `torchcodec`, so you can easily decode video blobs as `torch` tensors:
+
+```python
+from torchcodec.decoders import VideoDecoder
+decoder = VideoDecoder(blob_file)  # blob_file: the file-like blob returned by take_blobs() in the example above
+tensor = decoder[0]  # uint8 tensor of shape [C, H, W]
+```
+
+See the [torchcodec docs](https://docs.pytorch.org/torchcodec/stable/generated/torchcodec.decoders.VideoDecoder.html) for more ways to decode videos efficiently.
+
+You can also check out the [Lance documentation](https://lance.org/examples/python/clip_training/) for more examples of loading image data with `torchvision` to train your own image models.
+
 ## Explore more Lance datasets
 
 Lance is an open format with native support for multimodal blobs alongside your traditional tabular data.
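
The "Prepare data for training" section added in this commit mentions shuffling data and loading it in batches but does not show what that might look like. Below is a minimal sketch of one way to do it, not necessarily the approach the Lance authors recommend: shuffle the row indices once per epoch and yield them in fixed-size batches, which can then be fetched by random access. The helper name `shuffled_index_batches` and the `caption` column in the usage comment are hypothetical and do not come from the docs being patched.

```python
import random
from typing import Iterator, List


def shuffled_index_batches(
    num_rows: int, batch_size: int, seed: int = 0
) -> Iterator[List[int]]:
    """Yield batches of row indices that cover range(num_rows) in shuffled order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    indices = list(range(num_rows))
    # A seeded generator makes the shuffle reproducible for a given epoch.
    random.Random(seed).shuffle(indices)
    for start in range(0, num_rows, batch_size):
        # The last batch may be shorter than batch_size.
        yield indices[start:start + batch_size]


# Hypothetical usage with a Lance dataset `ds` (not executed here;
# the "caption" column is an assumption about the dataset schema):
#   for batch in shuffled_index_batches(ds.count_rows(), batch_size=32, seed=epoch):
#       table = ds.take(batch, columns=["caption"])  # returns a pyarrow.Table
```

Passing a different `seed` each epoch gives a different order while keeping any single epoch reproducible. Only the index list is held in memory, so the rows themselves are read on demand.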