
Commit dd6c09f

Add primary key/indexes documentation (#283)
* Add primary key/indexes documentation
* Fix typo and add links
* Update spiceaidocs/docs/features/local-acceleration/index.md

Co-authored-by: Aurash Behbahani <aurashb@gmail.com>

---------

Co-authored-by: Aurash Behbahani <aurashb@gmail.com>
1 parent e4a6401 commit dd6c09f

5 files changed

Lines changed: 263 additions & 3 deletions

File tree

spiceaidocs/docs/features/federated-queries/index.md

Lines changed: 0 additions & 1 deletion
@@ -181,7 +181,6 @@ While the query in step 8 successfully returned results from federated remote da
181181
To improve query performance, step 9 demonstrates the same query executed against locally materialized and accelerated datasets using [Data Accelerators](/data-accelerators/index.md), resulting in significant performance gains.
182182

183183
:::warning[Limitations]
184-
- **Query Optimization:** Filter/Join/Aggregation pushdown is not supported, potentially leading to suboptimal query plan.
185184
- **Query Performance:** Without acceleration, federated queries will be slower than local queries due to network latency and data transfer.
186185
- **Query Capabilities:** Not all SQL features and data types are supported across all data sources. More complex data type queries may not work as expected.
187186
:::
Lines changed: 144 additions & 0 deletions
@@ -0,0 +1,144 @@
1+
---
2+
title: 'Constraints'
3+
sidebar_label: 'Constraints'
4+
sidebar_position: 2
5+
description: 'Learn how to add/configure constraints on local acceleration tables in Spice.'
6+
---
7+
8+
Constraints are rules that enforce data integrity in a database. Spice supports constraints on locally accelerated tables to ensure data quality, and lets you configure how incoming data that violates a constraint is handled.
9+
10+
Constraints are specified in the Spicepod via the `primary_key` field in the acceleration configuration. Additional unique constraints are specified via the [`indexes`](./indexes.md) field with the value `unique`.
11+
12+
The behavior of inserting data that violates the constraint can be configured via the `on_conflict` field to either `drop` the data that violates the constraint or `upsert` that data into the accelerated table (i.e. update all values other than the columns that are part of the constraint to match the incoming data).
13+
14+
If there are multiple rows in the incoming data that violate any constraint, the entire incoming batch of data will be dropped.
15+
16+
Example Spicepod:
17+
18+
```yaml
19+
datasets:
20+
  - from: spice.ai/eth.recent_blocks
21+
    name: eth.recent_blocks
22+
    acceleration:
23+
      enabled: true
24+
      engine: sqlite
25+
      primary_key: hash # Define a primary key on the `hash` column
26+
      indexes:
27+
        "(number, timestamp)": unique # Add a unique index with a multicolumn key comprised of the `number` and `timestamp` columns
28+
      on_conflict:
29+
        # Upsert the incoming data when the primary key constraint on "hash" is violated.
30+
        # Alternatively, use "drop" instead of "upsert" to drop the data update.
31+
        hash: upsert
32+
```
33+
34+
## Column References
35+
36+
Column references can be used to specify which columns are part of the constraint. The column reference can be a single column name or a multicolumn key. The column reference must be enclosed in parentheses if it is a multicolumn key.
37+
38+
Examples
39+
40+
- `number`: Reference a constraint on the `number` column
41+
- `(hash, timestamp)`: Reference a constraint on the `hash` and `timestamp` columns
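
As a minimal sketch of a multicolumn column reference used as a constraint, the following Spicepod assumes a composite primary key over the `hash` and `timestamp` columns and reuses the same parenthesized reference as the single `on_conflict` target. The choice of columns is illustrative only and is not taken from the examples in this page.

```yaml
datasets:
  - from: spice.ai/eth.recent_blocks
    name: eth.recent_blocks
    acceleration:
      enabled: true
      engine: sqlite
      # Assumption: a composite primary key, written as a parenthesized
      # multicolumn column reference and quoted so it is read as one string.
      primary_key: "(hash, timestamp)"
      on_conflict:
        # The single on_conflict target uses the same column reference
        # as the primary key it refers to.
        "(hash, timestamp)": upsert
```

Quoting the reference mirrors how multicolumn keys are written under `indexes` in the example Spicepod.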
42+
43+
## Limitations
44+
45+
- **Not supported for in-memory Arrow:** The default in-memory Arrow acceleration engine does not support constraints. Use [DuckDB](../../data-accelerators/duckdb.md), [SQLite](../../data-accelerators/sqlite.md), or [PostgreSQL](../../data-accelerators/postgres/index.md) as the acceleration engine to enable constraint checking.
46+
- **Single `on_conflict` target supported:** Only a single `on_conflict` target can be specified, unless all `on_conflict` targets are specified with `drop`.
47+
48+
- <details>
49+
<summary>Examples for valid/invalid `on_conflict` targets</summary>
50+
<div>
51+
The following Spicepod is invalid because it specifies multiple `on_conflict` targets with `upsert`:
52+
53+
:::danger[Invalid]
54+
```yaml
55+
datasets:
56+
  - from: spice.ai/eth.recent_blocks
57+
    name: eth.recent_blocks
58+
    acceleration:
59+
      enabled: true
60+
      engine: sqlite
61+
      primary_key: hash
62+
      indexes:
63+
        "(number, timestamp)": unique
64+
      on_conflict:
65+
        hash: upsert
66+
        "(number, timestamp)": upsert
67+
```
68+
:::
69+
70+
The following Spicepod is valid because it specifies multiple `on_conflict` targets with `drop`, which is allowed:
71+
72+
:::tip[Valid]
73+
```yaml
74+
datasets:
75+
  - from: spice.ai/eth.recent_blocks
76+
    name: eth.recent_blocks
77+
    acceleration:
78+
      enabled: true
79+
      engine: sqlite
80+
      primary_key: hash
81+
      indexes:
82+
        "(number, timestamp)": unique
83+
      on_conflict:
84+
        hash: drop
85+
        "(number, timestamp)": drop
86+
```
87+
:::
88+
89+
90+
The following Spicepod is invalid because it specifies multiple `on_conflict` targets that mix `upsert` and `drop`:
91+
92+
:::danger[Invalid]
93+
```yaml
94+
datasets:
95+
  - from: spice.ai/eth.recent_blocks
96+
    name: eth.recent_blocks
97+
    acceleration:
98+
      enabled: true
99+
      engine: sqlite
100+
      primary_key: hash
101+
      indexes:
102+
        "(number, timestamp)": unique
103+
      on_conflict:
104+
        hash: upsert
105+
        "(number, timestamp)": drop
106+
```
107+
:::
108+
109+
</div>
110+
</details>
111+
112+
- **DuckDB Limitations:**
113+
  - DuckDB does not support `upsert` for datasets with List or Map types.
114+
  - Standard indexes unexpectedly act like unique indexes and block updates when `upsert` is configured.
115+
    - <details>
116+
<summary>Standard indexes blocking updates</summary>
117+
<div>
118+
The following Spicepod specifies a standard index on the `number` column, which blocks updates when `upsert` is configured for the `hash` column:
119+
120+
```yaml
121+
datasets:
122+
  - from: spice.ai/eth.recent_blocks
123+
    name: eth.recent_blocks
124+
    acceleration:
125+
      enabled: true
126+
      engine: duckdb
127+
      primary_key: hash
128+
      indexes:
129+
        number: enabled
130+
      on_conflict:
131+
        hash: upsert
132+
```
133+
134+
The following error is returned when attempting to upsert data into the `eth.recent_blocks` table:
135+
136+
```bash
137+
ERROR runtime::accelerated_table::refresh: Error adding data for eth.recent_blocks: External error:
138+
Unable to insert into duckdb table: Binder Error: Can not assign to column 'number' because
139+
it has a UNIQUE/PRIMARY KEY constraint
140+
```
141+
142+
This is a limitation in DuckDB.
143+
</div>
144+
</details>

spiceaidocs/docs/features/local-acceleration/index.md

Lines changed: 3 additions & 2 deletions
@@ -4,14 +4,15 @@ sidebar_label: 'Local Acceleration'
44
description: 'Learn how to use local acceleration in Spice.'
55
sidebar_position: 3
66
pagination_prev: null
7-
pagination_next: null
87
---
98

109
Datasets can be locally accelerated by the Spice runtime, pulling data from any [Data Connector](/data-connectors) and storing it locally in a [Data Accelerator](/data-accelerators) for faster access. Additionally, the data is kept up to date in realtime, so you always have the latest data locally for querying.
1110

1211
## Benefits
1312

14-
When a dataset is locally accelerated by the Spice runtime, the data is stored alongside your application, providing much faster query times by cutting out network latency to make the request. This benefit is accentuated when the result of a query is large because the data does not need to be transferred over the network. Depending on the [Acceleration Engine](/data-accelerators) chosen, the locally accelerated data can also be stored in-memory, further reducing query times.
13+
When a dataset is locally accelerated by the Spice runtime, the data is stored alongside your application, providing much faster query times by cutting out network latency to make the request. This benefit is accentuated when the result of a query is large because the data does not need to be transferred over the network. Depending on the [Acceleration Engine](/data-accelerators) chosen, the locally accelerated data can also be stored in-memory, further reducing query times. [Indexes](./indexes.md) can also be applied, further speeding up certain types of queries.
14+
15+
Locally accelerated datasets can also have [primary key constraints](./constraints.md) applied. This feature includes the ability to specify what happens when a constraint is violated: either drop the specific row that violates the constraint or upsert that row into the accelerated table.
1516

1617
## Example Use Case
1718

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
1+
---
2+
title: 'Indexes'
3+
sidebar_label: 'Indexes'
4+
sidebar_position: 1
5+
description: 'Learn how to add indexes to local acceleration tables in Spice.'
6+
---
7+
8+
Database indexes are an essential tool for optimizing the performance of queries. Learn how to add indexes to the tables that Spice creates to accelerate data locally.
9+
10+
Example Spicepod:
11+
12+
```yaml
13+
datasets:
14+
  - from: spice.ai/eth.recent_blocks
15+
    name: eth.recent_blocks
16+
    acceleration:
17+
      enabled: true
18+
      engine: sqlite
19+
      indexes:
20+
        number: enabled # Index the `number` column
21+
        "(hash, timestamp)": unique # Add a unique index with a multicolumn key comprised of the `hash` and `timestamp` columns
22+
```
23+
24+
## Column References
25+
26+
Column references can be used to specify which columns to index. The column reference can be a single column name or a multicolumn key. The column reference must be enclosed in parentheses if it is a multicolumn key.
27+
28+
Examples
29+
30+
- `number`: Index the `number` column
31+
- `(hash, timestamp)`: Index the `hash` and `timestamp` columns
32+
33+
## Index Types
34+
35+
There are two types of indexes that can be specified in a Spicepod:
36+
37+
- `enabled`: Creates a standard index on the specified column(s).
38+
  - Similar to specifying `CREATE INDEX my_index ON my_table (my_column)`.
39+
- `unique`: Creates a unique index on the specified column(s). See [Constraints](./constraints.md) for more information on working with unique constraints on locally accelerated tables.
40+
  - Similar to specifying `CREATE UNIQUE INDEX my_index ON my_table (my_column)`.
41+
42+
:::warning[Limitations]
43+
- **Not supported for in-memory Arrow:** The default in-memory Arrow acceleration engine does not support indexes. Use [DuckDB](../../data-accelerators/duckdb.md), [SQLite](../../data-accelerators/sqlite.md), or [PostgreSQL](../../data-accelerators/postgres/index.md) as the acceleration engine to enable indexing.
44+
:::
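
To illustrate the limitation above, here is a minimal sketch that selects DuckDB as the acceleration engine so that the configured index applies. The multicolumn standard index on `number` and `timestamp` is an assumption chosen for illustration; it relies only on the statement that `enabled` creates a standard index on the specified column(s).

```yaml
datasets:
  - from: spice.ai/eth.recent_blocks
    name: eth.recent_blocks
    acceleration:
      enabled: true
      # An engine other than the default in-memory Arrow engine is needed
      # for indexes to be supported.
      engine: duckdb
      indexes:
        # Assumption: a standard (non-unique) index over two columns,
        # using the parenthesized multicolumn column reference.
        "(number, timestamp)": enabled
```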

spiceaidocs/docs/reference/spicepod/datasets.md

Lines changed: 72 additions & 0 deletions
@@ -200,3 +200,75 @@ Optional. How often the retention policy should be checked.
200200
Required when `acceleration.retention_check_enabled` is `true`.
201201

202202
See [Duration](../duration/index.md)
203+
204+
## `acceleration.indexes`
205+
206+
Optional. Specify which indexes should be applied to the locally accelerated table. Not supported by the in-memory Arrow acceleration engine.
207+
208+
The `indexes` field is a map where the key is the column reference and the value is the index type.
209+
210+
A column reference can be a single column name or a multicolumn key. The column reference must be enclosed in parentheses if it is a multicolumn key.
211+
212+
See [Indexes](../../features/local-acceleration/indexes.md)
213+
214+
```yaml
215+
datasets:
216+
  - from: spice.ai/eth.recent_blocks
217+
    name: eth.recent_blocks
218+
    acceleration:
219+
      enabled: true
220+
      engine: sqlite
221+
      indexes:
222+
        number: enabled # Index the `number` column
223+
        "(hash, timestamp)": unique # Add a unique index with a multicolumn key comprised of the `hash` and `timestamp` columns
224+
```
225+
226+
## `acceleration.primary_key`
227+
228+
Optional. Specify the primary key constraint on the locally accelerated table. Not supported by the in-memory Arrow acceleration engine.
229+
230+
The `primary_key` field is a string that represents the column reference that should be used as the primary key. The column reference can be a single column name or a multicolumn key. The column reference must be enclosed in parentheses if it is a multicolumn key.
231+
232+
See [Constraints](../../features/local-acceleration/constraints.md)
233+
234+
```yaml
235+
datasets:
236+
  - from: spice.ai/eth.recent_blocks
237+
    name: eth.recent_blocks
238+
    acceleration:
239+
      enabled: true
240+
      engine: sqlite
241+
      primary_key: hash # Define a primary key on the `hash` column
242+
```
243+
244+
## `acceleration.on_conflict`
245+
246+
Optional. Specify what should happen when a constraint is violated. Not supported by the in-memory Arrow acceleration engine.
247+
248+
The `on_conflict` field is a map where the key is the column reference and the value is the conflict resolution strategy.
249+
250+
A column reference can be a single column name or a multicolumn key. The column reference must be enclosed in parentheses if it is a multicolumn key.
251+
252+
Only a single `on_conflict` target can be specified, unless all `on_conflict` targets are specified with `drop`.
253+
254+
The possible conflict resolution strategies are:
255+
- `upsert` - Upsert the incoming data when the targeted constraint is violated.
256+
- `drop` - Drop the data when the targeted constraint is violated.
257+
258+
See [Constraints](../../features/local-acceleration/constraints.md)
259+
260+
```yaml
261+
datasets:
262+
  - from: spice.ai/eth.recent_blocks
263+
    name: eth.recent_blocks
264+
    acceleration:
265+
      enabled: true
266+
      engine: sqlite
267+
      primary_key: hash
268+
      indexes:
269+
        "(number, timestamp)": unique
270+
      on_conflict:
271+
        # Upsert the incoming data when the primary key constraint on "hash" is violated.
272+
        # Alternatively, use "drop" instead of "upsert" to drop the data update.
273+
        hash: upsert
274+
```
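
The example above targets the primary key. As a hedged sketch, assuming the unique index on `(number, timestamp)` is the constraint whose conflicts should be resolved, the single `on_conflict` target could instead be that multicolumn column reference. This stays within the rule that only one target may be specified unless all targets use `drop`.

```yaml
datasets:
  - from: spice.ai/eth.recent_blocks
    name: eth.recent_blocks
    acceleration:
      enabled: true
      engine: sqlite
      primary_key: hash
      indexes:
        "(number, timestamp)": unique
      on_conflict:
        # Assumption: resolve conflicts on the unique index rather than
        # on the primary key; still only one on_conflict target is given.
        "(number, timestamp)": upsert
```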
