docs/hub/datasets-lance.md
+40 -6 (40 additions & 6 deletions)
@@ -16,7 +16,7 @@ pip install pylance pyarrow

 ## Why Lance?

-- Optimized for ML/AI workloads: Lance is a modern columnar format designed for fast random access without compromising scan performance.
+- Optimized for ML/AI workloads: Lance is a modern columnar format designed for fast random access without compromising scan performance, making it useful for search, analytics, training, feature engineering and many more use cases.
 - Multimodal assets are stored as bytes, or binary objects ("[blobs as files](https://lance.org/guide/blob/)") in Lance alongside embeddings, and traditional scalar data -- this makes it easier to govern, share, and distribute your large datasets via the Hub.
 - Indexing is a first-class citizen (native to the format itself): Lance comes with fast, on-disk, scalable [vector](https://lance.org/quickstart/vector-search) and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them.
 - Flexible schema and [data evolution](https://lance.org/guide/data_evolution) let you incrementally add new features/columns (moderation tags, embeddings, etc.) **without** needing to rewrite the entire table.
@@ -108,11 +108,29 @@ subset = scanner.to_table()
 lance.write_dataset(subset, "./laion_subset")
 ```

+## Create index
+
+If your dataset doesn't already have an index associated with it, you can create one after downloading it locally.
+
+```python
+# ds is a local Lance dataset
+ds.create_index(
+    "img_emb",
+    index_type="IVF_PQ",
+    num_partitions=256,
+    num_sub_vectors=96,
+    replace=True,
+)
+```
+
+See the [Lance docs](https://lance.org/quickstart/vector-search/) on vector index creation for a more detailed example. Once you have a vector index created, you can run similarity search on the data via embeddings.
+
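A note on the `create_index` parameters in the added block: product quantization splits each vector into `num_sub_vectors` equal chunks, so the embedding dimension has to be divisible by that value. The sketch below is a plain-Python sanity check, not part of the Lance API. The helper name `pq_subvector_dim` and the 768-dimensional embedding width are assumptions made for illustration.

```python
def pq_subvector_dim(dim: int, num_sub_vectors: int) -> int:
    """Return the width of each PQ sub-vector, or raise if the split is uneven."""
    if dim <= 0 or num_sub_vectors <= 0:
        raise ValueError("dim and num_sub_vectors must be positive")
    if dim % num_sub_vectors != 0:
        raise ValueError(
            f"dim={dim} is not divisible by num_sub_vectors={num_sub_vectors}"
        )
    return dim // num_sub_vectors


# Assumed (hypothetical) embedding width: 768 dimensions.
# With num_sub_vectors=96, each sub-vector spans 8 dimensions.
print(pq_subvector_dim(768, 96))  # 8
```

Running a check like this before building the index may surface a mismatched `num_sub_vectors` early, rather than partway through index creation.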
 ## Vector search

-Because indexes are first-class citizens in Lance, you can store not only your data but also your embeddings and indexes together in a dataset and query them directly on the Hub. Simply use the `describe_indices()` method to list the index information for the dataset. If an index doesn't exist in the dataset, you can use `lance.write_dataset()` to write a local version of the dataset and use [LanceDataset.create_index](https://lance-format.github.io/lance-python-doc/all-modules.html#lance.dataset.LanceDataset.create_index) to create an index for your needs.
+Because indexes are first-class citizens in Lance, you can store not only your data but also your embeddings and indexes together and query them **directly on the Hub**. Simply use the `describe_indices()` method to list the index information for the dataset. If an index doesn't exist in the dataset, you can use `lance.write_dataset()` to write a local version of the dataset and use [LanceDataset.create_index](https://lance-format.github.io/lance-python-doc/all-modules.html#lance.dataset.LanceDataset.create_index) to create an index for your needs.
+
+The example below shows a dataset for which we already have a vector index on the `img_emb` field:

-The example below shows a dataset for which we have a vector index on the `img_emb` field, as well as its index statistics.
 ```python
 import lance

@@ -133,7 +151,7 @@ print(ds.list_indices())
 # ]
 ```

-You can run vector search queries directly on the remote dataset without downloading it. The example below shows how to run a nearest neighbor search on a vector index using an image embedding as the query vector.
+You can run vector search queries directly on the remote dataset without downloading it (or, if you prefer, download the dataset locally and create a new index). The example below shows how to run a nearest neighbor search on a vector index using an image embedding as the query vector.

 ```python
 import lance
@@ -193,7 +211,7 @@ Lance tables also support large inline video blobs. The `OpenVid-1M` dataset (fr
@@ -219,6 +237,22 @@ with open("video_0.mp4", "wb") as f:
     f.write(blob_file.read())
 ```

+## Prepare data for training
+
+Training is another area where Lance's fast random access and scan performance can be useful. You can use Lance datasets as the storage mechanism for your training data, shuffling it and loading into batches as part of your training pipelines.
+
+The blob API in Lance is compatible with `torchcodec`, so you can easily decode video blobs as `torch` tensors:
+See the [torchcodec docs](https://docs.pytorch.org/torchcodec/stable/generated/torchcodec.decoders.VideoDecoder.html) for more functions for efficiently decoding videos.
+
+In addition, you can also check out the [Lance documentation](https://lance.org/examples/python/clip_training/) for more examples on loading image data into `torchvision` for training your own image models.
+
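The shuffling and batching mentioned under "Prepare data for training" can be sketched in plain Python as below. This is only an illustration of the idea, not the loader used by Lance or by the PR. The helper name `shuffled_index_batches` is hypothetical, and the commented-out `ds.take(...)` call assumes `ds` is an already opened Lance dataset with an `img_emb` column.

```python
import random
from typing import Iterator, List


def shuffled_index_batches(
    num_rows: int, batch_size: int, seed: int = 0
) -> Iterator[List[int]]:
    """Yield batches of row indices covering every row once, in shuffled order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # seeded so a run can be reproduced
    for start in range(0, num_rows, batch_size):
        # Sorting inside a batch keeps batch membership random while
        # making the row positions within each read ascending.
        yield sorted(indices[start : start + batch_size])


batches = list(shuffled_index_batches(num_rows=10, batch_size=4, seed=42))
print([len(b) for b in batches])  # [4, 4, 2]

# With a real dataset, each batch could be fetched by position, e.g.
# (assumption: `ds` is an opened Lance dataset):
#   table = ds.take(batch, columns=["img_emb"])
```

Fetching rows by position like this leans on the random-access property the paragraph describes. A full training pipeline would more likely wrap the same idea in a `torch` data loader.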
 ## Explore more Lance datasets

 Lance is an open format with native support for multimodal blobs alongside your traditional tabular data.