Add chunking documentation (#424)

Jeadie · lukekim · web-flow · commit 618441873462 · 2024-10-01T14:04:58.000-07:00
* add chunking documentation

* Update datasets.md

* search feature docs

* linking

* Apply suggestions from code review

* Improvements

---------

Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com>
diff --git a/spiceaidocs/docs/api/http/search.md b/spiceaidocs/docs/api/http/search.md
@@ -7,15 +7,15 @@ pagination_prev: null
 pagination_next: null
 ---
 
-Performs a basic vector similarity search from one or more dataset(s). 
+Performs a basic vector similarity search across one or more datasets.
 
 Request Body
- - `datasets` (array of strings): Dataset component names to perform similarity search against. Each dataset is expected to have one and only one column augmented with an embedding. 
+ - `datasets` (array of strings): Names of the dataset components to perform the similarity search on. Each dataset must have exactly one column augmented with an embedding.
  - `text` (string): Query plaintext used to retrieve similar rows from the underlying datasets listed in the `datasets` request key.
  - `limit` (integer): The number of rows to return, per dataset listed in `datasets`. Default: 3.
  - `where` (string): An SQL filter predicate to apply within the search.
  - `additional_columns` (array of strings): Additional columns, from the datasets, to return in the response (under `.matches[*].metadata`).
- 
+
 #### Example
 
 Spicepod
@@ -67,3 +67,5 @@ Response
   "duration_ms": 42
 }
 ```
+
+The `v1/search` endpoint supports [chunked](/features/search/index.md#chunking) embedding columns.
diff --git a/spiceaidocs/docs/features/search/index.md b/spiceaidocs/docs/features/search/index.md
@@ -0,0 +1,138 @@
+---
+title: 'Search Functionality'
+sidebar_label: 'Search'
+description: 'Learn how Spice can search across datasets using database-native and vector-search methods.'
+sidebar_position: 8
+pagination_prev: null
+pagination_next: null
+---
+
+Spice provides advanced search capabilities that go beyond standard SQL queries, offering both traditional SQL search patterns and vector-based search functionality.
+
+## SQL-Based Search
+
+Spice supports basic search patterns directly in SQL. For example, you can perform a text search within a table using SQL's `LIKE` operator:
+
+```sql
+SELECT id, text_column
+FROM my_table
+WHERE
+    LOWER(text_column) LIKE '%search_term%'
+  AND
+    date_published > '2021-01-01'
+```
+
+## Vector Search
+
+In addition to SQL, Spice provides vector-based search, which matches rows by semantic similarity rather than by exact text. The runtime supports both:
+
+1. Local embedding models, e.g. [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).
+2. Remote embedding providers, e.g. [OpenAI](https://platform.openai.com/docs/api-reference/embeddings/create).
+
+Embedding models are defined in the `spicepod.yaml` file as top-level components.
+
+```yaml
+embeddings:
+  - from: openai
+    name: remote_service
+    params:
+      openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
+
+  - name: local_embedding_model
+    from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
+```
+
+Datasets can be augmented with embeddings on specific columns to enable similarity search.
+
+```yaml
+datasets:
+  - from: github:github.com/spiceai/spiceai/issues
+    name: spiceai.issues
+    acceleration:
+      enabled: true
+    embeddings:
+      - column: body # The text column in the `spiceai.issues` dataset
+        use: local_embedding_model # Embedding model used for this column
+```
+
+With embeddings defined on the `body` column, Spice can run similarity searches on the dataset.
+
+```shell
+curl -XPOST http://localhost:8090/v1/search \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "datasets": ["spiceai.issues"],
+    "text": "cutting edge AI",
+    "where": "author=\"jeadie\"",
+    "additional_columns": ["title", "state"],
+    "limit": 2
+  }'
+```
+
+For more details, see the [API reference for /v1/search](/api/http/search).
+
+### Chunking Support
+
+Spice also supports chunking of content before embedding, which is useful for large text columns such as those found in [Document Tables](/components/data-connectors/index.md#document-support). Chunking ensures that only the most relevant portions of text are returned during search queries. Chunking is configured as part of the embedding configuration.
+
+```yaml
+datasets:
+  - from: github:github.com/spiceai/spiceai/issues
+    name: spiceai.issues
+    acceleration:
+      enabled: true
+    embeddings:
+      - column: body
+        use: local_embedding_model
+        chunking:
+          enabled: true
+          target_chunk_size: 512
+```
+
+The `body` column will be divided into chunks of approximately 512 tokens, while maintaining structural and semantic integrity (e.g. not splitting sentences).
+
+### Document Retrieval
+
+When performing searches on datasets with chunking enabled, Spice returns the most relevant chunk for each match. To also retrieve the full content of the embedded column, include that column in the `additional_columns` list.
+
+For example:
+
+```shell
+curl -XPOST http://localhost:8090/v1/search \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "datasets": ["spiceai.issues"],
+    "text": "cutting edge AI",
+    "where": "array_has(assignees, \"jeadie\")",
+    "additional_columns": ["title", "state", "body"],
+    "limit": 2
+  }'
+```
+
+Response:
+
+```json
+{
+  "matches": [
+    {
+      "value": "implements a scalar UDF `array_distance`:\n```\narray_distance(FixedSizeList[Float32], FixedSizeList[Float32])",
+      "dataset": "spiceai.issues",
+      "metadata": {
+        "title": "Improve scalar UDF array_distance",
+        "state": "Closed",
+        "body": "## Overview\n- Previous PR https://github.com/spiceai/spiceai/pull/1601 implements a scalar UDF `array_distance`:\n```\narray_distance(FixedSizeList[Float32], FixedSizeList[Float32])\narray_distance(FixedSizeList[Float32], List[Float64])\n```\n\n### Changes\n - Improve using Native arrow function, e.g. `arrow_cast`, [`sub_checked`](https://arrow.apache.org/rust/arrow/array/trait.ArrowNativeTypeOp.html#tymethod.sub_checked)\n - Support a greater range of array types and numeric types\n - Possibly create a sub operator and UDF, e.g.\n\t- `FixedSizeList[Float32] - FixedSizeList[Float32]`\n\t- `Norm(FixedSizeList[Float32])`"
+      }
+    },
+    {
+      "value": "est external tools being returned for toolusing models",
+      "dataset": "spiceai.issues",
+      "metadata": {
+        "title": "Automatic NSQL retries in /v1/nsql ",
+        "state": "Open",
+        "body": "To mimic our ability for LLMs to repeatedly retry tools based on errors, the `/v1/nsql`, which does not use this same paradigm, should retry internally.\n\nIf possible, improve the structured output to increase the likelihood of valid SQL in the response. Currently we just inforce JSON like this\n```json\n{\n  \"sql\": \"SELECT ...\"\n}\n```"
+      }
+    }
+  ],
+  "duration_ms": 45
+}
+```
diff --git a/spiceaidocs/docs/reference/spicepod/datasets.md b/spiceaidocs/docs/reference/spicepod/datasets.md
@@ -358,3 +358,39 @@ The embedding model to use, specific the component name `embeddings[*].name`.
 ## `embeddings[*].column_pk`
 
 Optional. For datasets without a primary key, explicitly specify column(s) that uniquely identify a row.
+
+## `embeddings[*].chunking`
+
+Optional. The configuration to enable and define the chunking strategy for the embedding column.
+
+```yaml
+datasets:
+  - from: spice.ai/eth.recent_blocks
+    name: eth.recent_blocks
+    embeddings:
+      - column: extra_data
+        use: hf_minilm
+        chunking:
+          enabled: true
+          target_chunk_size: 512
+          overlap_size: 128
+          trim_whitespace: false
+```
+
+## `embeddings[*].chunking.enabled`
+
+Optional. Enable or disable chunking for the embedding column. Defaults to `false`.
+
+## `embeddings[*].chunking.target_chunk_size`
+
+The desired size of each chunk, in tokens.
+
+If the desired chunk size exceeds the maximum input size of the embedding model, the model's maximum size is used instead.
+
+## `embeddings[*].chunking.overlap_size`
+
+Optional. The number of tokens to overlap between chunks. Defaults to `0`.
+
+## `embeddings[*].chunking.trim_whitespace`
+
+Optional. If enabled, the content of each chunk will be trimmed to remove leading and trailing whitespace. Defaults to `true`.