Commit 72c9289

prrao87, lhoestq, and julien-c authored
Add examples on create index and training (#2194)
* Add examples on create index and training
* Update openvid path and minor fixes
* Update docs/hub/datasets-lance.md

Co-authored-by: Julien Chaumond <julien@huggingface.co>

---------

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: Julien Chaumond <julien@huggingface.co>
1 parent b61c71c commit 72c9289

1 file changed

+40
-6
lines changed

docs/hub/datasets-lance.md

Lines changed: 40 additions & 6 deletions
@@ -16,7 +16,7 @@ pip install pylance pyarrow
 
 ## Why Lance?
 
-- Optimized for ML/AI workloads: Lance is a modern columnar format designed for fast random access without compromising scan performance.
+- Optimized for ML/AI workloads: Lance is a modern columnar format designed for fast random access without compromising scan performance, making it useful for search, analytics, training, feature engineering and many more use cases.
 - Multimodal assets are stored as bytes, or binary objects ("[blobs as files](https://lance.org/guide/blob/)") in Lance alongside embeddings, and traditional scalar data -- this makes it easier to govern, share, and distribute your large datasets via the Hub.
 - Indexing is a first-class citizen (native to the format itself): Lance comes with fast, on-disk, scalable [vector](https://lance.org/quickstart/vector-search) and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them.
 - Flexible schema and [data evolution](https://lance.org/guide/data_evolution) let you incrementally add new features/columns (moderation tags, embeddings, etc.) **without** needing to rewrite the entire table.
@@ -108,11 +108,29 @@ subset = scanner.to_table()
 lance.write_dataset(subset, "./laion_subset")
 ```
 
+## Create index
+
+If your dataset doesn't already have an index associated with it, you can create one after downloading it locally.
+
+```python
+# ds is a local Lance dataset
+ds.create_index(
+    "img_emb",
+    index_type="IVF_PQ",
+    num_partitions=256,
+    num_sub_vectors=96,
+    replace=True,
+)
+```
+
+See the [Lance docs](https://lance.org/quickstart/vector-search/) on vector index creation for a more detailed example. Once you have a vector index created, you can run similarity search on the data via embeddings.
+
 ## Vector search
 
-Because indexes are first-class citizens in Lance, you can store not only your data but also your embeddings and indexes together in a dataset and query them directly on the Hub. Simply use the `describe_indices()` method to list the index information for the dataset. If an index doesn't exist in the dataset, you can use `lance.write_dataset()` to write a local version of the dataset and use [LanceDataset.create_index](https://lance-format.github.io/lance-python-doc/all-modules.html#lance.dataset.LanceDataset.create_index) to create an index for your needs.
+Because indexes are first-class citizens in Lance, you can store not only your data but also your embeddings and indexes together and query them **directly on the Hub**. Simply use the `describe_indices()` method to list the index information for the dataset. If an index doesn't exist in the dataset, you can use `lance.write_dataset()` to write a local version of the dataset and use [LanceDataset.create_index](https://lance-format.github.io/lance-python-doc/all-modules.html#lance.dataset.LanceDataset.create_index) to create an index for your needs.
+
+The example below shows a dataset for which we already have a vector index on the `img_emb` field:
 
-The example below shows a dataset for which we have a vector index on the `img_emb` field, as well as its index statistics.
 ```python
 import lance
 
@@ -133,7 +151,7 @@ print(ds.list_indices())
 # ]
 ```
 
-You can run vector search queries directly on the remote dataset without downloading it. The example below shows how to run a nearest neighbor search on a vector index using an image embedding as the query vector.
+You can run vector search queries directly on the remote dataset without downloading it (or, if you prefer, download the dataset locally and create a new index). The example below shows how to run a nearest neighbor search on a vector index using an image embedding as the query vector.
 
 ```python
 import lance
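
Note on the `create_index` call added earlier in this diff: it requests an `IVF_PQ` index with `num_sub_vectors=96`. Product quantization splits each vector into equally sized sub-vectors, so the embedding dimension should be divisible by that value. The sketch below is a small pre-flight check under stated assumptions: `check_pq_params` is a hypothetical helper (not part of Lance), and the 768-dimensional embedding is an assumed value, since the diff does not state the dimension of `img_emb`.

```python
def check_pq_params(dim: int, num_sub_vectors: int) -> int:
    """Return the width of each PQ sub-vector, or raise if the split is uneven.

    Hypothetical helper for illustration only; it is not a Lance API.
    """
    if dim <= 0 or num_sub_vectors <= 0:
        raise ValueError("dim and num_sub_vectors must be positive")
    if dim % num_sub_vectors != 0:
        raise ValueError(
            f"dimension {dim} is not divisible by num_sub_vectors={num_sub_vectors}"
        )
    return dim // num_sub_vectors


# Assumed embedding dimension (not given in the diff): 768.
sub_vector_width = check_pq_params(dim=768, num_sub_vectors=96)
print(sub_vector_width)
```

Running a check like this before indexing a locally downloaded dataset could surface a parameter mismatch early, before any index training starts.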
@@ -193,7 +211,7 @@ Lance tables also support large inline video blobs. The `OpenVid-1M` dataset (fr
 ```python
 import lance
 
-lance_ds = lance.dataset("hf://datasets/lance-format/openvid-lance/data/train.lance")
+lance_ds = lance.dataset("hf://datasets/lance-format/Openvid-1M/data/train.lance")
 blob_file = lance_ds.take_blobs("video_blob", ids=[0])[0]
 video_bytes = blob_file.read()
 ```
@@ -203,7 +221,7 @@ Unlike other data formats, large multimodal binary objects (blobs) are first-cla
 ```python
 import lance
 
-ds = lance.dataset("hf://datasets/lance-format/openvid-lance/data/train.lance")
+ds = lance.dataset("hf://datasets/lance-format/Openvid-1M/data/train.lance")
 
 # 1. Browse metadata without loading video blobs.
 metadata = ds.scanner(
@@ -219,6 +237,22 @@ with open("video_0.mp4", "wb") as f:
     f.write(blob_file.read())
 ```
 
+## Prepare data for training
+
+Training is another area where Lance's fast random access and scan performance can be useful. You can use Lance datasets as the storage mechanism for your training data, shuffling it and loading into batches as part of your training pipelines.
+
+The blob API in Lance is compatible with `torchcodec`, so you can easily decode video blobs as `torch` tensors:
+
+```python
+from torchcodec.decoders import VideoDecoder
+decoder = VideoDecoder(blob_file)
+tensor = decoder[0]  # uint8 tensor of shape [C, H, W]
+```
+
+See the [torchcodec docs](https://docs.pytorch.org/torchcodec/stable/generated/torchcodec.decoders.VideoDecoder.html) for more functions for efficiently decoding videos.
+
+In addition, you can also check out the [Lance documentation](https://lance.org/examples/python/clip_training/) for more examples on loading image data into `torchvision` for training your own image models.
+
 ## Explore more Lance datasets
 
 Lance is an open format with native support for multimodal blobs alongside your traditional tabular data.
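
Note on the "Prepare data for training" section added in this commit: it mentions shuffling the data and loading it in batches. The sketch below shows one way that idea could look in plain Python: shuffle the row indices once per epoch, then cut them into batches that a random-access reader can fetch. `shuffled_batches` and its parameters are hypothetical names, not part of Lance, and the `ds.take(...)` call appears only in a comment as an assumed way to load the rows.

```python
import random
from typing import Iterator, List


def shuffled_batches(
    num_rows: int, batch_size: int, seed: int = 0, drop_last: bool = False
) -> Iterator[List[int]]:
    """Yield batches of row indices covering a shuffled permutation of the rows.

    Hypothetical helper for illustration only; it is not a Lance API.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    indices = list(range(num_rows))
    # A seeded generator keeps the shuffle reproducible for a given epoch.
    random.Random(seed).shuffle(indices)
    for start in range(0, num_rows, batch_size):
        batch = indices[start:start + batch_size]
        if drop_last and len(batch) < batch_size:
            break
        # Sort inside the batch so the rows are requested in ascending order.
        yield sorted(batch)


for batch in shuffled_batches(num_rows=10, batch_size=4, seed=42):
    # With a Lance dataset `ds`, the rows could be loaded by random access,
    # e.g. ds.take(batch, columns=[...]) (assumption, not shown in this diff).
    print(batch)
```

Using a different seed per epoch gives a fresh ordering each time while keeping every run reproducible.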
