Add primary key/indexes documentation (#283)

phillipleblanc · digadeesh · web-flow · commit dd6c09ffd2b2 · 2024-06-18T08:11:31.000+09:00
* Add primary key/indexes documentation

* Fix typo and add links

* Update spiceaidocs/docs/features/local-acceleration/index.md

Co-authored-by: Aurash Behbahani <aurashb@gmail.com>

---------

Co-authored-by: Aurash Behbahani <aurashb@gmail.com>
diff --git a/spiceaidocs/docs/features/federated-queries/index.md b/spiceaidocs/docs/features/federated-queries/index.md
@@ -181,7 +181,6 @@ While the query in step 8 successfully returned results from federated remote da
 To improve query performance, step 9 demonstrates the same query executed against locally materialized and accelerated datasets using [Data Accelerators](/data-accelerators/index.md), resulting in significant performance gains.
 
 :::warning[Limitations]
-- **Query Optimization:** Filter/Join/Aggregation pushdown is not supported, potentially leading to suboptimal query plan.
 - **Query Performance:** Without acceleration, federated queries will be slower than local queries due to network latency and data transfer.
 - **Query Capabilities:** Not all SQL features and data types are supported across all data sources. More complex data type queries may not work as expected.
 :::
diff --git a/spiceaidocs/docs/features/local-acceleration/constraints.md b/spiceaidocs/docs/features/local-acceleration/constraints.md
@@ -0,0 +1,144 @@
+---
+title: 'Constraints'
+sidebar_label: 'Constraints'
+sidebar_position: 2
+description: 'Learn how to add/configure constraints on local acceleration tables in Spice.'
+---
+
+Constraints are rules that enforce data integrity in a database. Spice supports constraints on locally accelerated tables to ensure data quality, and lets you configure how incoming data updates that violate a constraint are handled.
+
+Constraints are specified in the Spicepod via the `primary_key` field in the acceleration configuration. Additional unique constraints are specified via the [`indexes`](./indexes.md) field with the value `unique`.
+
+The behavior of inserting data that violates the constraint can be configured via the `on_conflict` field to either `drop` the data that violates the constraint or `upsert` that data into the accelerated table (i.e. update all values other than the columns that are part of the constraint to match the incoming data).
+
+If there are multiple rows in the incoming data that violate any constraint, the entire incoming batch of data will be dropped.
+
+Example Spicepod:
+
+```yaml
+datasets:
+  - from: spice.ai/eth.recent_blocks
+    name: eth.recent_blocks
+    acceleration:
+      enabled: true
+      engine: sqlite
+      primary_key: hash # Define a primary key on the `hash` column
+      indexes:
+        "(number, timestamp)": unique # Add a unique index with a multicolumn key comprised of the `number` and `timestamp` columns
+      on_conflict:
+        # Upsert the incoming data when the primary key constraint on "hash" is violated.
+        # Alternatively, "drop" can be used instead of "upsert" to drop the data update.
+        hash: upsert
+```
+
+## Column References
+
+Column references can be used to specify which columns are part of the constraint. The column reference can be a single column name or a multicolumn key. The column reference must be enclosed in parentheses if it is a multicolumn key.
+
+Examples
+
+- `number`: Reference a constraint on the `number` column
+- `(hash, timestamp)`: Reference a constraint on the `hash` and `timestamp` columns
+
+## Limitations
+
+- **Not supported for in-memory Arrow:** The default in-memory Arrow acceleration engine does not support constraints. Use [DuckDB](../../data-accelerators/duckdb.md), [SQLite](../../data-accelerators/sqlite.md), or [PostgreSQL](../../data-accelerators/postgres/index.md) as the acceleration engine to enable constraint checking.
+- **Single `on_conflict` target supported:** Only a single `on_conflict` target can be specified, unless all `on_conflict` targets are specified with `drop`.
+
+  - <details>
+      <summary>Examples for valid/invalid `on_conflict` targets</summary>
+      <div>
+        The following Spicepod is invalid because it specifies multiple `on_conflict` targets with `upsert`:
+
+    :::danger[Invalid]
+        ```yaml
+        datasets:
+          - from: spice.ai/eth.recent_blocks
+            name: eth.recent_blocks
+            acceleration:
+              enabled: true
+              engine: sqlite
+              primary_key: hash
+              indexes:
+                "(number, timestamp)": unique
+              on_conflict:
+                hash: upsert
+                "(number, timestamp)": upsert
+        ```
+    :::
+
+        The following Spicepod is valid because it specifies multiple `on_conflict` targets with `drop`, which is allowed:
+
+    :::tip[Valid]
+        ```yaml
+        datasets:
+          - from: spice.ai/eth.recent_blocks
+            name: eth.recent_blocks
+            acceleration:
+              enabled: true
+              engine: sqlite
+              primary_key: hash
+              indexes:
+                "(number, timestamp)": unique
+              on_conflict:
+                hash: drop
+                "(number, timestamp)": drop
+        ```
+    :::
+
+
+        The following Spicepod is invalid because it mixes `upsert` and `drop` across multiple `on_conflict` targets:
+
+    :::danger[Invalid]
+        ```yaml
+        datasets:
+          - from: spice.ai/eth.recent_blocks
+            name: eth.recent_blocks
+            acceleration:
+              enabled: true
+              engine: sqlite
+              primary_key: hash
+              indexes:
+                "(number, timestamp)": unique
+              on_conflict:
+                hash: upsert
+                "(number, timestamp)": drop
+        ```
+    :::
+
+      </div>
+    </details>
+
+- **DuckDB Limitations:**
+  - DuckDB does not support `upsert` for datasets with List or Map types.
+  - Standard indexes unexpectedly act like unique indexes and block updates when `upsert` is configured.
+    - <details>
+        <summary>Standard indexes blocking updates</summary>
+        <div>
+          The following Spicepod specifies a standard index on the `number` column, which blocks updates when `upsert` is configured for the `hash` column:
+
+          ```yaml
+          datasets:
+            - from: spice.ai/eth.recent_blocks
+              name: eth.recent_blocks
+              acceleration:
+                enabled: true
+                engine: duckdb
+                primary_key: hash
+                indexes:
+                  number: enabled
+                on_conflict:
+                  hash: upsert
+          ```
+
+          The following error is returned when attempting to upsert data into the `eth.recent_blocks` table:
+
+          ```bash
+          ERROR runtime::accelerated_table::refresh: Error adding data for eth.recent_blocks: External error: 
+          Unable to insert into duckdb table: Binder Error: Can not assign to column 'number' because 
+          it has a UNIQUE/PRIMARY KEY constraint
+          ```
+
+          This is a limitation in DuckDB.
+        </div>
+      </details>
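As a rough illustration of the `drop` and `upsert` behaviors described in `constraints.md` above, the sketch below reproduces them with Python's built-in `sqlite3` module. It is not Spice's implementation: the `recent_blocks` table, its columns, and the `apply_rows` and `demo` helpers are hypothetical, and the mapping of `drop` to `DO NOTHING` and `upsert` to `DO UPDATE` is an assumption made for illustration. It also assumes a bundled SQLite version that supports the `ON CONFLICT` clause (3.24 or later).

```python
import sqlite3


def apply_rows(conn, rows, on_conflict):
    """Insert rows, resolving conflicts on the `hash` primary key.

    `on_conflict` is either "drop" (keep the existing row) or "upsert"
    (overwrite every column that is not part of the constraint).
    """
    if on_conflict == "drop":
        action = "DO NOTHING"
    elif on_conflict == "upsert":
        action = (
            "DO UPDATE SET number = excluded.number, "
            "timestamp = excluded.timestamp"
        )
    else:
        raise ValueError(f"unknown on_conflict strategy: {on_conflict!r}")
    conn.executemany(
        "INSERT INTO recent_blocks (hash, number, timestamp) "
        f"VALUES (?, ?, ?) ON CONFLICT(hash) {action}",
        rows,
    )


def demo(on_conflict):
    """Insert two rows with the same hash and return the final table contents."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE recent_blocks ("
        "hash TEXT PRIMARY KEY, number INTEGER, timestamp INTEGER)"
    )
    apply_rows(conn, [("0xabc", 1, 100)], on_conflict)
    apply_rows(conn, [("0xabc", 2, 200)], on_conflict)  # violates the primary key
    rows = conn.execute(
        "SELECT hash, number, timestamp FROM recent_blocks"
    ).fetchall()
    conn.close()
    return rows
```

With `drop`, the second row is discarded and the first one survives. With `upsert`, the non-key columns take the values of the incoming row.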
diff --git a/spiceaidocs/docs/features/local-acceleration/index.md b/spiceaidocs/docs/features/local-acceleration/index.md
@@ -4,14 +4,15 @@ sidebar_label: 'Local Acceleration'
 description: 'Learn how to use local acceleration in Spice.'
 sidebar_position: 3
 pagination_prev: null
-pagination_next: null
 ---
 
 Datasets can be locally accelerated by the Spice runtime, pulling data from any [Data Connector](/data-connectors) and storing it locally in a [Data Accelerator](/data-accelerators) for faster access. Additionally, the data is kept up to date in realtime, so you always have the latest data locally for querying.
 
 ## Benefits
 
-When a dataset is locally accelerated by the Spice runtime, the data is stored alongside your application, providing much faster query times by cutting out network latency to make the request. This benefit is accentuated when the result of a query is large because the data does not need to be transferred over the network. Depending on the [Acceleration Engine](/data-accelerators) chosen, the locally accelerated data can also be stored in-memory, further reducing query times.
+When a dataset is locally accelerated by the Spice runtime, the data is stored alongside your application, providing much faster query times by cutting out network latency to make the request. This benefit is accentuated when the result of a query is large because the data does not need to be transferred over the network. Depending on the [Acceleration Engine](/data-accelerators) chosen, the locally accelerated data can also be stored in-memory, further reducing query times. [Indexes](./indexes.md) can also be applied, further speeding up certain types of queries.
+
+Locally accelerated datasets can also have [primary key constraints](./constraints.md) applied. When a constraint is violated, the configured behavior either drops the row that violates the constraint or upserts that row into the accelerated table.
 
 ## Example Use Case
 
diff --git a/spiceaidocs/docs/features/local-acceleration/indexes.md b/spiceaidocs/docs/features/local-acceleration/indexes.md
@@ -0,0 +1,44 @@
+---
+title: 'Indexes'
+sidebar_label: 'Indexes'
+sidebar_position: 1
+description: 'Learn how to add indexes to local acceleration tables in Spice.'
+---
+
+Database indexes are an essential tool for optimizing the performance of queries. Learn how to add indexes to the tables that Spice creates to accelerate data locally.
+
+Example Spicepod:
+
+```yaml
+datasets:
+  - from: spice.ai/eth.recent_blocks
+    name: eth.recent_blocks
+    acceleration:
+      enabled: true
+      engine: sqlite
+      indexes:
+        number: enabled # Index the `number` column
+        "(hash, timestamp)": unique # Add a unique index with a multicolumn key comprised of the `hash` and `timestamp` columns
+```
+
+## Column References
+
+Column references can be used to specify which columns to index. The column reference can be a single column name or a multicolumn key. The column reference must be enclosed in parentheses if it is a multicolumn key.
+
+Examples
+
+- `number`: Index the `number` column
+- `(hash, timestamp)`: Index the `hash` and `timestamp` columns
+
+## Index Types
+
+There are two types of indexes that can be specified in a Spicepod:
+
+- `enabled`: Creates a standard index on the specified column(s). 
+  - Similar to specifying `CREATE INDEX my_index ON my_table (my_column)`.
+- `unique`: Creates a unique index on the specified column(s). See [Constraints](./constraints.md) for more information on working with unique constraints on locally accelerated tables.
+  - Similar to specifying `CREATE UNIQUE INDEX my_index ON my_table (my_column)`.
+
+:::warning[Limitations]
+- **Not supported for in-memory Arrow:** The default in-memory Arrow acceleration engine does not support indexes. Use [DuckDB](../../data-accelerators/duckdb.md), [SQLite](../../data-accelerators/sqlite.md), or [PostgreSQL](../../data-accelerators/postgres/index.md) as the acceleration engine to enable indexing.
+:::
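The column reference syntax and the two index types described in `indexes.md` above can be sketched as follows. This is a hypothetical illustration rather than Spice's parser: the function names, the `idx_<table>_<columns>` naming scheme, and the `recent_blocks` table are assumptions, and the generated statements follow the `CREATE INDEX` and `CREATE UNIQUE INDEX` forms that the docs describe as similar.

```python
import sqlite3


def parse_column_reference(ref):
    """Split "number" or "(hash, timestamp)" into a list of column names."""
    ref = ref.strip()
    if ref.startswith("(") and ref.endswith(")"):
        columns = [c.strip() for c in ref[1:-1].split(",")]
    else:
        columns = [ref]
    if not all(columns):
        raise ValueError(f"invalid column reference: {ref!r}")
    return columns


def create_index_sql(table, ref, index_type):
    """Build a CREATE INDEX statement for an `enabled` or `unique` index."""
    columns = parse_column_reference(ref)
    keyword = {"enabled": "", "unique": "UNIQUE "}[index_type]
    name = "idx_" + table + "_" + "_".join(columns)  # hypothetical naming scheme
    return f"CREATE {keyword}INDEX {name} ON {table} ({', '.join(columns)})"


def unique_index_rejects_duplicates():
    """Return True if a unique multicolumn index blocks a duplicate key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE recent_blocks (hash TEXT, number INTEGER, timestamp INTEGER)"
    )
    conn.execute(create_index_sql("recent_blocks", "number", "enabled"))
    conn.execute(create_index_sql("recent_blocks", "(hash, timestamp)", "unique"))
    conn.execute("INSERT INTO recent_blocks VALUES ('0xabc', 1, 100)")
    try:
        # Same (hash, timestamp) pair: the unique index should reject this row.
        conn.execute("INSERT INTO recent_blocks VALUES ('0xabc', 1, 100)")
    except sqlite3.IntegrityError:
        return True
    finally:
        conn.close()
    return False
```

The standard index on `number` allows repeated values, while the unique index on `(hash, timestamp)` causes the duplicate insert to fail.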
diff --git a/spiceaidocs/docs/reference/spicepod/datasets.md b/spiceaidocs/docs/reference/spicepod/datasets.md
@@ -200,3 +200,75 @@ Optional. How often the retention policy should be checked.
 Required when `acceleration.retention_check_enabled` is `true`.
 
 See [Duration](../duration/index.md)
+
+## `acceleration.indexes`
+
+Optional. Specify which indexes should be applied to the locally accelerated table. Not supported for the in-memory Arrow acceleration engine.
+
+The `indexes` field is a map where the key is the column reference and the value is the index type.
+
+A column reference can be a single column name or a multicolumn key. The column reference must be enclosed in parentheses if it is a multicolumn key.
+
+See [Indexes](../../features/local-acceleration/indexes.md)
+
+```yaml
+datasets:
+  - from: spice.ai/eth.recent_blocks
+    name: eth.recent_blocks
+    acceleration:
+      enabled: true
+      engine: sqlite
+      indexes:
+        number: enabled # Index the `number` column
+        "(hash, timestamp)": unique # Add a unique index with a multicolumn key comprised of the `hash` and `timestamp` columns
+```
+
+## `acceleration.primary_key`
+
+Optional. Specify the primary key constraint on the locally accelerated table. Not supported for the in-memory Arrow acceleration engine.
+
+The `primary_key` field is a string that represents the column reference that should be used as the primary key. The column reference can be a single column name or a multicolumn key. The column reference must be enclosed in parentheses if it is a multicolumn key.
+
+See [Constraints](../../features/local-acceleration/constraints.md)
+
+```yaml
+datasets:
+  - from: spice.ai/eth.recent_blocks
+    name: eth.recent_blocks
+    acceleration:
+      enabled: true
+      engine: sqlite
+      primary_key: hash # Define a primary key on the `hash` column
+```
+
+## `acceleration.on_conflict`
+
+Optional. Specify what should happen when a constraint is violated. Not supported for the in-memory Arrow acceleration engine.
+
+The `on_conflict` field is a map where the key is the column reference and the value is the conflict resolution strategy.
+
+A column reference can be a single column name or a multicolumn key. The column reference must be enclosed in parentheses if it is a multicolumn key.
+
+Only a single `on_conflict` target can be specified, unless all `on_conflict` targets are specified with `drop`.
+
+The possible conflict resolution strategies are:
+- `upsert` - Upsert the incoming data when the constraint on the target column(s) is violated.
+- `drop` - Drop the incoming data when the constraint on the target column(s) is violated.
+
+See [Constraints](../../features/local-acceleration/constraints.md)
+
+```yaml
+datasets:
+  - from: spice.ai/eth.recent_blocks
+    name: eth.recent_blocks
+    acceleration:
+      enabled: true
+      engine: sqlite
+      primary_key: hash
+      indexes:
+        "(number, timestamp)": unique
+      on_conflict:
+        # Upsert the incoming data when the primary key constraint on "hash" is violated.
+        # Alternatively, "drop" can be used instead of "upsert" to drop the data update.
+        hash: upsert
+```
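The rule stated for `acceleration.on_conflict` above (a single target, unless every target uses `drop`) can be expressed as a small check. This is a minimal sketch under that reading of the docs, not the validation Spice performs. The `validate_on_conflict` function and its error messages are hypothetical, and the input is assumed to be the `on_conflict` map already loaded into a Python dictionary.

```python
def validate_on_conflict(on_conflict):
    """Raise ValueError if an `on_conflict` map breaks the documented rule.

    Keys are column references such as "hash" or "(number, timestamp)";
    values are the conflict resolution strategies "drop" or "upsert".
    """
    allowed = {"drop", "upsert"}
    for target, strategy in on_conflict.items():
        if strategy not in allowed:
            raise ValueError(
                f"unknown strategy {strategy!r} for target {target!r}"
            )
    # Multiple targets are only accepted when every one of them is `drop`.
    if len(on_conflict) > 1 and any(s != "drop" for s in on_conflict.values()):
        raise ValueError(
            "multiple on_conflict targets are only allowed when all use `drop`"
        )


def is_valid(on_conflict):
    """Convenience wrapper returning True or False instead of raising."""
    try:
        validate_on_conflict(on_conflict)
    except ValueError:
        return False
    return True
```

Applied to the three Spicepod examples in `constraints.md`, the check accepts the all-`drop` configuration and rejects both the double-`upsert` and the mixed `upsert`/`drop` configurations.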