DOC-738 | Vector index reference docs #700

Draft
wants to merge 1 commit into
base: main
@@ -149,7 +149,7 @@ see [arangodb.com/community-server/](https://www.arangodb.com/community-server/)
{{% /comment %}}

{{% comment %}} Experimental feature
- [**Vector search**](#TODO):
- [**Vector search**](../../index-and-search/indexing/working-with-indexes/vector-indexes.md):
Find items with similar properties by comparing vector embeddings generated by
machine learning models.
{{% /comment %}}
95 changes: 95 additions & 0 deletions site/content/3.13/aql/functions/vector.md
@@ -0,0 +1,95 @@
---
title: Vector search functions in AQL
menuTitle: Vector
weight: 60
description: >-
The functions for vector search let you utilize indexed vector embeddings to
quickly find semantically similar documents
---
To use vector search, you need to have vector embeddings stored in documents
and the attribute that stores them needs to be indexed by a
[vector index](../../index-and-search/indexing/working-with-indexes/vector-indexes.md).

{{< warning >}}
The vector index is an experimental feature that you need to enable for the
ArangoDB server with the `--experimental-vector-index` startup option.
Once enabled for a deployment, it cannot be disabled anymore because it
permanently changes how the data is managed by the RocksDB storage engine
(it adds an additional column family).
{{< /warning >}}

{{< comment >}}TODO: Add DSS docs or already mention because of ArangoGraph with ML?
You can calculate vector embeddings using ArangoDB's GraphML capabilities or
external tools.
{{< /comment >}}

## Distance functions

In order to utilize a vector index, you need to use one of the following
vector distance functions in a query, sort by this distance, and specify the
maximum number of similar documents to retrieve with a `LIMIT` operation.
Example:

```aql
FOR doc IN coll
SORT APPROX_NEAR_L2(doc.vector, @q)
LIMIT 5
RETURN doc
```

The `@q` bind variable needs to be a vector (array of numbers) with the dimension
as specified in the vector index. It defines the point at which to look for
neighbors. The `LIMIT` operation sets how many neighbors to retrieve (`5` in this case). <!-- TODO how many results depends on the data and nProbe value! -->

The sorting order needs to be **ascending for the L2 metric** (shown above) and
**descending for the cosine metric**:

```aql
FOR doc IN coll
SORT APPROX_NEAR_COSINE(doc.vector, @q) DESC
LIMIT 5
RETURN doc
```
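The difference in sort order follows from what each metric measures. The following is a minimal sketch in plain Python, not ArangoDB's implementation: it computes exact (brute-force) values for a few hypothetical three-dimensional vectors, and it assumes that the cosine function yields a similarity where larger values mean closer vectors, which is consistent with the descending sort shown above.

```python
import math

def l2_distance(a, b):
    """Exact Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Exact cosine similarity: larger means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical documents, for illustration only.
docs = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
}
query = [1.0, 0.0, 0.0]

# L2: ascending order puts the nearest documents first.
by_l2 = sorted(docs, key=lambda k: l2_distance(docs[k], query))

# Cosine: descending order puts the most similar documents first.
by_cosine = sorted(
    docs, key=lambda k: cosine_similarity(docs[k], query), reverse=True
)

print(by_l2[:2], by_cosine[:2])
```

Both orderings rank the same documents first here, but only because each metric is sorted in the direction that matches its meaning.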

### APPROX_NEAR_COSINE()

`APPROX_NEAR_COSINE(vector1, vector2, options) → dist`

Retrieve the approximate distance using the cosine metric, accelerated by a
matching vector index.

- **vector1** (array of numbers): The first vector. Either this parameter or
`vector2` needs to reference the attribute of a stored document that holds the
vector embedding, like `doc.vector`.
- **vector2** (array of numbers): The second vector. Either this parameter or
`vector1` needs to reference a stored attribute holding the vector embedding.
- **options** (object, _optional_):
- **nProbe** (number, _optional_): How many neighboring centroids to consider
for the search results. The larger the number, the slower the search but the
better the search results. If not specified, the `defaultNProbe` value of
the vector index is used.
- returns **dist** (number): The approximate cosine distance between both vectors.
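
The optional `options` argument can be passed as a third parameter, following the signature above. The following sketch builds such a query in Python. The `build_cosine_query()` helper is hypothetical, and `coll` and `doc.vector` are the placeholder names used in the examples on this page. Sending the query to a server is left out; with a driver such as python-arango, it would presumably be something like `db.aql.execute(query, bind_vars=bind_vars)`.

```python
def build_cosine_query(limit, n_probe=None):
    """Return an AQL string that sorts by APPROX_NEAR_COSINE().

    Hypothetical helper for illustration only.
    """
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    options = ""
    if n_probe is not None:
        if not isinstance(n_probe, int) or n_probe < 1:
            raise ValueError("n_probe must be a positive integer")
        # Without this argument, the defaultNProbe value of the index is used.
        options = f", {{ nProbe: {n_probe} }}"
    return (
        "FOR doc IN coll\n"
        f"  SORT APPROX_NEAR_COSINE(doc.vector, @q{options}) DESC\n"
        f"  LIMIT {limit}\n"
        "  RETURN doc"
    )

query = build_cosine_query(limit=5, n_probe=10)
# The query vector is an assumption; it must match the index dimension.
bind_vars = {"q": [0.1, 0.2, 0.3]}
print(query)
```

The numeric values are validated and inlined instead of being passed as bind variables, so that only the query vector needs to be bound.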

<!-- TODO: generated examples possible? -->

### APPROX_NEAR_L2()

`APPROX_NEAR_L2(vector1, vector2, options) → dist`

Retrieve the approximate distance using the L2 (Euclidean) metric, accelerated
by a matching vector index.

- **vector1** (array of numbers): The first vector. Either this parameter or
`vector2` needs to reference the attribute of a stored document that holds the
vector embedding, like `doc.vector`.
- **vector2** (array of numbers): The second vector. Either this parameter or
`vector1` needs to reference a stored attribute holding the vector embedding.
- **options** (object, _optional_):
- **nProbe** (number, _optional_): How many neighboring centroids to consider
for the search results. The larger the number, the slower the search but the
better the search results. If not specified, the `defaultNProbe` value of
the vector index is used.
- returns **dist** (number): The approximate L2 (Euclidean) distance between
both vectors.

<!-- TODO: generated examples possible? -->
10 changes: 10 additions & 0 deletions site/content/3.13/index-and-search/indexing/basics.md
@@ -369,6 +369,16 @@ the `GEO_DISTANCE()` function, or if `FILTER` conditions with `GEO_CONTAINS()`
or `GEO_INTERSECTS()` are used. It will not be used for other types of queries
or conditions.

## Vector Index

Vector indexes let you index vector embeddings stored in documents. Such
vectors are arrays of numbers that represent the meaning and relationships of
data numerically. You can quickly find a given number of semantically
similar documents by searching for close neighbors in a high-dimensional
vector space.

See [Vector Indexes](working-with-indexes/vector-indexes.md) for details.

## Fulltext Index

{{< warning >}}
@@ -106,6 +106,18 @@ different usage scenarios:
of the Earth. It supports points, lines, and polygons.
See [Geo-Spatial Indexes](working-with-indexes/geo-spatial-indexes.md).

- **Vector index**: You can find semantically similar documents quickly with
vector indexes. It is required to calculate and store vector embeddings first,
and you may need to update the embeddings when adding new documents.
Vector indexes cannot be used for other types of searches, like equality and
range queries or full-text search.

Vector indexes are utilized via special distance functions, in combination with
a `SORT` operation to sort by the distance, and a `LIMIT` operation to define
how many similar documents to retrieve.

See [Vector indexes](working-with-indexes/vector-indexes.md) for details.

- **fulltext index**: a fulltext index can be used to index all words contained in
a specific attribute of all documents in a collection. Only words with a
(specifiable) minimum length are indexed. Word tokenization is done using
@@ -0,0 +1,156 @@
---
title: Vector indexes
menuTitle: Vector Indexes
weight: 40
description: >-
You can index vector embeddings to allow queries to quickly find semantically
similar documents
---
Vector indexes let you index vector embeddings stored in documents. Such
vectors are arrays of numbers that represent the meaning and relationships of
data numerically. You can quickly find a given number of semantically
similar documents by searching for close neighbors in a high-dimensional
vector space.

The vector index implementation uses the [Faiss library](https://github.com/facebookresearch/faiss/)
to support L2 and cosine metrics. The index used is `IndexIVFFlat`. The quantizer
for L2 is `IndexFlatL2` and the quantizer for cosine is `IndexFlatIP`, where vectors are
normalized before insertion and search.
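
Normalizing the vectors is what lets an inner-product index serve the cosine metric: the inner product of two unit-length vectors equals their cosine similarity. The following is a minimal sketch in plain Python with hypothetical two-dimensional vectors. It illustrates the identity only and does not call Faiss.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return inner_product(a, b) / (norm_a * norm_b)

# Hypothetical vectors, for illustration only.
a = [3.0, 4.0]
b = [4.0, 3.0]

ip_normalized = inner_product(normalize(a), normalize(b))
cos = cosine_similarity(a, b)

# Both values agree up to floating-point rounding.
print(math.isclose(ip_normalized, cos))
```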

If no relevant data is found in the searched list, Faiss might not produce all
of the top K requested results. In this case, only the results that were found are returned. <!-- TODO -->
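
The effect can be reproduced with a toy inverted-file (IVF) search. The following is a simplified sketch in plain Python, not the Faiss implementation: the two centroids are fixed by hand instead of being trained, and all names and data are hypothetical. Every vector is stored in the list of its nearest centroid, and a search only scans the `n_probe` lists closest to the query.

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_lists(vectors, centroids):
    """Assign every vector to the inverted list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for key, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: l2(vec, centroids[i]))
        lists[nearest].append(key)
    return lists

def search(query, k, n_probe, vectors, centroids, lists):
    """Scan only the n_probe lists whose centroids are closest to the query."""
    by_distance = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))
    candidates = [key for i in by_distance[:n_probe] for key in lists[i]]
    candidates.sort(key=lambda key: l2(vectors[key], query))
    return candidates[:k]

# Hypothetical data: two hand-picked centroids stand in for trained ones.
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = {
    "a": [0.0, 1.0],
    "b": [9.0, 9.0],
    "c": [10.0, 11.0],
    "d": [12.0, 10.0],
}
lists = build_lists(vectors, centroids)

query = [1.0, 1.0]
few = search(query, k=3, n_probe=1, vectors=vectors, centroids=centroids, lists=lists)
more = search(query, k=3, n_probe=2, vectors=vectors, centroids=centroids, lists=lists)
print(few, more)
```

With `n_probe=1`, only the list that contains `a` is scanned, so fewer than the three requested results come back. Probing both lists yields three results at the cost of scanning more vectors.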

{{< warning >}}
The vector index is an experimental feature that you need to enable for the
ArangoDB server with the `--experimental-vector-index` startup option.
Once enabled for a deployment, it cannot be disabled anymore because it
permanently changes how the data is managed by the RocksDB storage engine
(it adds an additional column family).
{{< /warning >}}

## How to use vector indexes

Creating a vector index triggers training of the index on the data that is already stored. As a limitation, the vector attribute on which you create the index therefore needs to be populated beforehand.
The number of training points depends on the `nLists` parameter. A larger `nLists` value produces more accurate results but increases the training time necessary to build the index.


## Vector index properties

- **name** (_optional_): A user-defined name for the index for easier
identification. If not specified, a name is automatically generated.
- **type**: The index type. Needs to be `"vector"`.
- **fields** (array of strings): A list with a single attribute path to specify
where the vector embedding is stored in each document. The vector data needs
to be populated before creating the index.

If you want to index another vector embedding attribute, you need to create a
separate vector index.
- **params**: The parameters as used by the Faiss library.
- **metric** (string): Whether to use `cosine` or `l2` (Euclidean) distance calculation.
- **dimension** (number): The vector dimension. The attribute to index needs to
have this many elements in the array that stores the vector embedding.
- **nLists** (number): The number of centroids in the index. What value to choose
depends on the data distribution and chosen metric. According to
[The Faiss library paper](https://arxiv.org/abs/2401.08281), it should be
around `15 * sqrt(N)` where `N` is the number of documents in the collection,
respectively the number of documents in the shard for cluster deployments.
- **defaultNProbe** (number, _optional_): How many neighboring centroids to
consider for the search results by default. The larger the number, the slower
the search but the better the search results. The default is `1`. <!-- TODO: recommend higher -->
- **trainingIterations** (number, _optional_): The number of iterations in the
training process. The default is `25`. Smaller values lead to a faster index
creation but may yield worse search results.
- **factory** (string, _optional_): You can specify a factory string to pass
through to the underlying Faiss library, allowing you to combine different
options, for example:
- `"IVF100_HNSW10,Flat"`
- `"IVF100,SQ4"`
- `"IVF10_HNSW5,Flat"`
- `"IVF100_HNSW5,PQ256x16"`
The base index must be an IVF to work with ArangoDB. For more information on
how to create these custom indexes, see the
[Faiss Wiki](https://github.com/facebookresearch/faiss/wiki/The-index-factory).
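
Because the vector data needs to be populated before creating the index and every vector needs to have `dimension` elements, it can help to check the documents first. The following is a minimal sketch in plain Python. The `find_invalid_vectors()` helper and the sample documents are hypothetical, and the helper only handles a top-level attribute, not a nested attribute path. How the server treats such documents is not covered here.

```python
def find_invalid_vectors(docs, field, dimension):
    """Return the keys of documents whose vector is missing or malformed."""
    invalid = []
    for doc in docs:
        vec = doc.get(field)
        ok = (
            isinstance(vec, list)
            and len(vec) == dimension
            # bool is a subclass of int in Python, so exclude it explicitly.
            and all(
                isinstance(x, (int, float)) and not isinstance(x, bool)
                for x in vec
            )
        )
        if not ok:
            invalid.append(doc["_key"])
    return invalid

# Hypothetical documents, for illustration only.
docs = [
    {"_key": "1", "embedding": [0.1, 0.2, 0.3]},
    {"_key": "2", "embedding": [0.1, 0.2]},  # wrong dimension
    {"_key": "3"},                           # attribute missing
]
print(find_invalid_vectors(docs, "embedding", 3))
```

An empty result indicates that every checked document carries a numeric array of the expected length.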

## Interfaces

### Create a vector index

{{< tabs "interfaces" >}}

{{< tab "Web interface" >}}
1. In the **Collections** section, click the name or row of the desired collection.
2. Go to the **Indexes** tab.
3. Click **Add index**.
4. Select **Vector** as the **Type**.
5. Enter the name of the attribute that holds the vector embeddings into **Fields**.
6. Set the parameters for the vector index, see [Vector index properties](#vector-index-properties).
7. Optionally give the index a user-defined name.
8. Click **Create**.
{{< /tab >}}

{{< tab "arangosh" >}}
```js
db.coll.ensureIndex({
name: "vector_l2",
type: "vector",
fields: ["embedding"],
params: {
metric: "l2",
dimension: 544,
nLists: 100,
defaultNProbe: 1,
trainingIterations: 25
}
});
```
{{< /tab >}}

{{< tab "cURL" >}}
```sh
curl -d '{"name":"vector_l2","type":"vector","fields":["embedding"],"params":{"metric":"l2","dimension":544,"nLists":100,"defaultNProbe":1,"trainingIterations":25}}' "http://localhost:8529/_db/mydb/_api/index?collection=coll"
```
{{< /tab >}}

{{< tab "JavaScript" >}}
```js
const info = await coll.ensureIndex({
name: "vector_l2",
type: "vector",
fields: ["embedding"],
params: {
metric: "l2",
dimension: 544,
nLists: 100,
defaultNProbe: 1,
trainingIterations: 25
}
});
```
{{< /tab >}}

{{< tab "Go" >}}
The Go driver does not support vector indexes yet.
{{< /tab >}}

{{< tab "Java" >}}
The Java driver does not support vector indexes yet.
{{< /tab >}}

{{< tab "Python" >}}
```py
info = coll.add_index({
"name": "vector_l2",
"type": "vector",
"fields": ["embedding"],
"params": {
"metric": "l2",
"dimension": 544,
"nLists": 100,
"defaultNProbe": 1,
"trainingIterations": 25
}
})
```
{{< /tab >}}

{{< /tabs >}}