Commit 6184418

Jeadie and lukekim authored
Add chunking documentation (#424)
* add chunking documentation
* Update datasets.md
* search feature docs
* linking
* Apply suggestions from code review
* Improvements

---------

Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com>
1 parent cbd9e60 commit 6184418

3 files changed

Lines changed: 179 additions & 3 deletions

spiceaidocs/docs/api/http/search.md

Lines changed: 5 additions & 3 deletions
@@ -7,15 +7,15 @@ pagination_prev: null
 pagination_next: null
 ---

-Performs a basic vector similarity search from one or more dataset(s).
+Performs a basic vector similarity search across one or more datasets.

 Request Body
-- `datasets` (array of strings): Dataset component names to perform similarity search against. Each dataset is expected to have one and only one column augmented with an embedding.
+- `datasets` (array of strings): Names of the dataset components to perform the similarity search on. Each dataset must have exactly one column augmented with an embedding.
 - `text` (string): Query plaintext used to retrieve similar rows from the underlying datasets listed in the `from` request key.
 - `limit` (integer): The number of rows to return, per `from` dataset. Default: 3.
 - `where` (string): An SQL filter predicate to apply within the search.
 - `additional_columns` (array of strings): Additional columns, from the datasets, to return in the response (under `.matches[*].metadata`).
-
+
 #### Example

 Spicepod
@@ -67,3 +67,5 @@ Response
 "duration_ms": 42,
 }
 ```
+
+The `v1/search` endpoint supports [chunked](/features/search/index.md#chunking) embedding columns.
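
The request body fields listed for `v1/search` can be assembled on the client side roughly as follows. This is a minimal sketch, not an official client: the `SearchRequest` class and its `to_json` method are hypothetical names, and treating `where` and `additional_columns` as omittable is an assumption (the documentation above only states a default for `limit`).

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SearchRequest:
    """Hypothetical helper mirroring the documented /v1/search request body."""

    datasets: List[str]
    text: str
    limit: int = 3  # documented default: 3 rows per dataset
    where: Optional[str] = None  # assumption: optional SQL filter predicate
    additional_columns: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the request, leaving out optional fields that were not set."""
        body = {"datasets": self.datasets, "text": self.text, "limit": self.limit}
        if self.where is not None:
            body["where"] = self.where
        if self.additional_columns:
            body["additional_columns"] = self.additional_columns
        return json.dumps(body)


request = SearchRequest(
    datasets=["spiceai.issues"],
    text="cutting edge AI",
    additional_columns=["title", "state"],
)
print(request.to_json())
```

The resulting string can be passed as the `-d` payload of a `curl` call or as the body of any HTTP client request.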
Lines changed: 138 additions & 0 deletions
@@ -0,0 +1,138 @@
---
title: 'Search Functionality'
sidebar_label: 'Search'
description: 'Learn how Spice can search across datasets using database-native and vector-search methods.'
sidebar_position: 8
pagination_prev: null
pagination_next: null
---

Spice provides search capabilities beyond standard SQL queries, offering both traditional SQL search patterns and vector-based search.

## SQL-Based Search

Spice supports basic search patterns directly in SQL. For example, you can perform a text search within a table using SQL's `LIKE` operator:

```sql
SELECT id, text_column
FROM my_table
WHERE
    LOWER(text_column) LIKE '%search_term%'
  AND
    date_published > '2021-01-01'
```
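
To see how the `LIKE` filter above behaves on concrete rows, the sketch below runs the same query against an in-memory SQLite database. SQLite is only a stand-in here, chosen because it ships with Python; the table contents are made up for illustration and Spice's own SQL engine is not involved.

```python
import sqlite3

# In-memory stand-in database with made-up rows (not a Spice dataset).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (id INTEGER, text_column TEXT, date_published TEXT)"
)
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [
        (1, "A Search_Term in mixed case", "2022-03-01"),
        (2, "search_term but published too early", "2020-06-15"),
        (3, "unrelated text", "2023-01-10"),
    ],
)

# Same query shape as the documentation example.
query = """
SELECT id, text_column
FROM my_table
WHERE
    LOWER(text_column) LIKE '%search_term%'
  AND
    date_published > '2021-01-01'
"""
rows = conn.execute(query).fetchall()
print(rows)  # only row 1 passes both the text and the date filter
```

Lowercasing the column makes the match case-insensitive, and ISO-formatted dates compare correctly as strings. Note that `_` inside a `LIKE` pattern is itself a single-character wildcard, so the pattern also matches text such as `search-term`.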

## Vector Search

In addition to SQL, Spice provides vector-based search, which matches rows by semantic similarity rather than exact text. The runtime supports both:

1. Local embedding models, e.g. [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).
2. Remote embedding providers, e.g. [OpenAI](https://platform.openai.com/docs/api-reference/embeddings/create).

Embedding models are defined in the `spicepod.yaml` file as top-level components.

```yaml
embeddings:
  - from: openai
    name: remote_service
    params:
      openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }

  - name: local_embedding_model
    from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
```

Datasets can be augmented with embeddings on specific columns to enable similarity search.

```yaml
datasets:
  - from: github:github.com/spiceai/spiceai/issues
    name: spiceai.issues
    acceleration:
      enabled: true
    embeddings:
      - column: body # The text column in the `spiceai.issues` dataset
        use: local_embedding_model # Embedding model used for this column
```

With embeddings defined on the `body` column, Spice can run similarity searches on the dataset.

```shell
curl -XPOST http://localhost:8090/v1/search \
  -H 'Content-Type: application/json' \
  -d '{
    "datasets": ["spiceai.issues"],
    "text": "cutting edge AI",
    "where": "author=\"jeadie\"",
    "additional_columns": ["title", "state"],
    "limit": 2
  }'
```

For more details, see the [API reference for /v1/search](/api/http/search).

### Chunking Support

Spice also supports chunking content before embedding, which is useful for large text columns such as those found in [Document Tables](/components/data-connectors/index.md#document-support). With chunking, a search returns only the most relevant portions of the text. Chunking is configured as part of the embedding configuration.

```yaml
datasets:
  - from: github:github.com/spiceai/spiceai/issues
    name: spiceai.issues
    acceleration:
      enabled: true
    embeddings:
      - column: body
        use: local_embedding_model
        chunking:
          enabled: true
          target_chunk_size: 512
```

The `body` column will be divided into chunks of approximately 512 tokens, while maintaining structural and semantic integrity (e.g. not splitting sentences).
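
The idea of splitting text into chunks of roughly a target size without breaking sentences can be illustrated as below. This is a simplified sketch and not Spice's actual chunker: it counts whitespace-separated words instead of model tokens, detects sentence ends with a naive regular expression, and the function name `chunk_text` is hypothetical.

```python
import re
from typing import List


def chunk_text(text: str, target_chunk_size: int) -> List[str]:
    """Greedily pack whole sentences into chunks of at most target_chunk_size words.

    A single sentence longer than the target becomes its own chunk
    rather than being split mid-sentence.
    """
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    current_size = 0
    for sentence in sentences:
        size = len(sentence.split())  # words as a stand-in for tokens
        if current and current_size + size > target_chunk_size:
            chunks.append(" ".join(current))
            current, current_size = [], 0
        current.append(sentence)
        current_size += size
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "Spice embeds text. Long columns are chunked first. "
    "Each chunk is embedded separately."
)
for chunk in chunk_text(text, target_chunk_size=8):
    print(repr(chunk))
```

With a target of 8 words, the first two sentences (3 and 5 words) fit into one chunk and the third starts a new one, which is why chunk sizes are only approximately equal to the target.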

### Document Retrieval

When searching datasets with chunking enabled, Spice returns the most relevant chunk for each match. To also retrieve the full content of the column, include the embedding column in the `additional_columns` list.

For example:

```shell
curl -XPOST http://localhost:8090/v1/search \
  -H 'Content-Type: application/json' \
  -d '{
    "datasets": ["spiceai.issues"],
    "text": "cutting edge AI",
    "where": "array_has(assignees, \"jeadie\")",
    "additional_columns": ["title", "state", "body"],
    "limit": 2
  }'
```

Response:

```json
{
  "matches": [
    {
      "value": "implements a scalar UDF `array_distance`:\n```\narray_distance(FixedSizeList[Float32], FixedSizeList[Float32])",
      "dataset": "spiceai.issues",
      "metadata": {
        "title": "Improve scalar UDF array_distance",
        "state": "Closed",
        "body": "## Overview\n- Previous PR https://github.com/spiceai/spiceai/pull/1601 implements a scalar UDF `array_distance`:\n```\narray_distance(FixedSizeList[Float32], FixedSizeList[Float32])\narray_distance(FixedSizeList[Float32], List[Float64])\n```\n\n### Changes\n - Improve using Native arrow function, e.g. `arrow_cast`, [`sub_checked`](https://arrow.apache.org/rust/arrow/array/trait.ArrowNativeTypeOp.html#tymethod.sub_checked)\n - Support a greater range of array types and numeric types\n - Possibly create a sub operator and UDF, e.g.\n\t- `FixedSizeList[Float32] - FixedSizeList[Float32]`\n\t- `Norm(FixedSizeList[Float32])`"
      }
    },
    {
      "value": "est external tools being returned for toolusing models",
      "dataset": "spiceai.issues",
      "metadata": {
        "title": "Automatic NSQL retries in /v1/nsql ",
        "state": "Open",
        "body": "To mimic our ability for LLMs to repeatedly retry tools based on errors, the `/v1/nsql`, which does not use this same paradigm, should retry internally.\n\nIf possible, improve the structured output to increase the likelihood of valid SQL in the response. Currently we just inforce JSON like this\n```json\n{\n \"sql\": \"SELECT ...\"\n}\n```"
      }
    }
  ],
  "duration_ms": 45
}
```
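
A client can pair each returned chunk with its full document by reading `value` and the embedding column under `metadata`, as in the response shape shown above. The sketch below is illustrative only: `SAMPLE_RESPONSE` is made-up data with that shape (not real Spice output), and `chunks_with_documents` is a hypothetical helper name.

```python
import json
from typing import List, Tuple

# Made-up sample with the documented response shape (not real Spice output).
SAMPLE_RESPONSE = """
{
  "matches": [
    {
      "value": "chunk about vector distance",
      "dataset": "spiceai.issues",
      "metadata": {
        "title": "Example issue",
        "state": "Open",
        "body": "Intro text. A chunk about vector distance. Closing text."
      }
    }
  ],
  "duration_ms": 45
}
"""


def chunks_with_documents(response: dict, embedding_column: str) -> List[Tuple[str, str]]:
    """Pair each match's best chunk with the full column text, if it was requested."""
    pairs = []
    for match in response["matches"]:
        chunk = match["value"]
        # The full text is present only when the embedding column was listed
        # in `additional_columns`; otherwise fall back to the chunk itself.
        document = match.get("metadata", {}).get(embedding_column, chunk)
        pairs.append((chunk, document))
    return pairs


response = json.loads(SAMPLE_RESPONSE)
for chunk, document in chunks_with_documents(response, "body"):
    print(f"chunk: {chunk!r}")
    print(f"document length: {len(document)} characters")
```

Falling back to the chunk keeps the helper usable for requests that did not list the embedding column in `additional_columns`.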

spiceaidocs/docs/reference/spicepod/datasets.md

Lines changed: 36 additions & 0 deletions
@@ -358,3 +358,39 @@ The embedding model to use, specific the component name `embeddings[*].name`.
 ## `embeddings[*].column_pk`

 Optional. For datasets without a primary key, explicitly specify column(s) that uniquely identify a row.
+
+## `embeddings[*].chunking`
+
+Optional. The configuration to enable and define the chunking strategy for the embedding column.
+
+```yaml
+datasets:
+  - from: spice.ai/eth.recent_blocks
+    name: eth.recent_blocks
+    embeddings:
+      - column: extra_data
+        use: hf_minilm
+        chunking:
+          enabled: true
+          target_chunk_size: 512
+          overlap_size: 128
+          trim_whitespace: false
+```
+
+## `embeddings[*].chunking.enabled`
+
+Optional. Enable or disable chunking for the embedding column. Defaults to `false`.
+
+## `embeddings[*].chunking.target_chunk_size`
+
+The desired size of each chunk, in tokens.
+
+If the desired chunk size is larger than the maximum size of the embedding model, the maximum size will be used.
+
+## `embeddings[*].chunking.overlap_size`
+
+Optional. The number of tokens to overlap between chunks. Defaults to `0`.
+
+## `embeddings[*].chunking.trim_whitespace`
+
+Optional. If enabled, the content of each chunk will be trimmed to remove leading and trailing whitespace. Defaults to `true`.
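
How `target_chunk_size`, `overlap_size`, and `trim_whitespace` might interact can be sketched with a simple sliding window. This is an assumption-laden illustration rather than Spice's implementation: words with their trailing whitespace stand in for model tokens, `model_max_size` is a hypothetical parameter representing the embedding model's maximum size, and `chunk_tokens` is a made-up function name.

```python
import re
from typing import List, Optional


def chunk_tokens(
    tokens: List[str],
    target_chunk_size: int,
    overlap_size: int = 0,
    trim_whitespace: bool = True,
    model_max_size: Optional[int] = None,  # hypothetical: the model's token limit
) -> List[str]:
    """Split tokens into fixed-size windows that share overlap_size tokens."""
    size = target_chunk_size
    if model_max_size is not None:
        # A target larger than the model maximum falls back to the maximum.
        size = min(size, model_max_size)
    if size <= 0:
        raise ValueError("chunk size must be positive")
    if not 0 <= overlap_size < size:
        raise ValueError("overlap_size must be in [0, chunk size)")

    step = size - overlap_size  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunk = "".join(tokens[start:start + size])
        if trim_whitespace:
            chunk = chunk.strip()
        chunks.append(chunk)
        if start + size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Stand-in tokenizer: each word keeps its trailing whitespace.
tokens = re.findall(r"\S+\s*", "one two three four five six")
print(chunk_tokens(tokens, target_chunk_size=4, overlap_size=2))
print(chunk_tokens(tokens, target_chunk_size=4, overlap_size=2, trim_whitespace=False))
```

With a window of 4 and an overlap of 2, consecutive chunks share their two boundary tokens, so context that straddles a chunk boundary appears in both chunks.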
