Commit 03b2230

docs: add per-mode pages under data-acceleration/refresh-modes (#1654)
1 parent a5916c2 commit 03b2230

6 files changed

Lines changed: 260 additions & 0 deletions

website/docs/features/data-acceleration/data-refresh.md

Lines changed: 4 additions & 0 deletions
@@ -33,7 +33,11 @@ Spice supports five modes to refresh/update local data from a connected data sou
 Learn more about each mode:

+- [Full Mode](./refresh-modes/full)
+- [Append Mode](./refresh-modes/append)
+- [Changes Mode](./refresh-modes/changes)
 - [Caching Mode](./refresh-modes/caching)
+- [Snapshot Mode](./refresh-modes/snapshot)

 Example:

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
---
title: 'Append Refresh Mode'
sidebar_label: 'Append'
description: 'Incrementally append new rows to an accelerated dataset.'
sidebar_position: 2
pagination_prev: null
pagination_next: null
---

The `append` refresh mode incrementally adds new rows to the acceleration on each refresh. It is designed for append-only or immutable datasets such as time-series, event, and log data.

Use `append` when:

- New rows are continuously added to the source and existing rows are not modified or deleted.
- A monotonic time or sequence column is available to identify new rows.
- The full dataset is too large to refresh in `full` mode on each interval.

## Configuration

`append` mode requires a [`time_column`](../../../reference/spicepod/datasets#time_column) that identifies new rows by comparing the local maximum value to the source. Data is incrementally refreshed where `time_column` in the source is greater than `max(time_column)` in the acceleration.

```yaml
datasets:
  - from: databricks:my_dataset
    name: accelerated_dataset
    time_column: created_at
    acceleration:
      enabled: true
      refresh_mode: append
      refresh_check_interval: 10m
```

## Late-Arriving Data

To account for clock skew or late-arriving rows, configure an overlap window with [`acceleration.refresh_append_overlap`](../../../reference/spicepod/datasets#accelerationrefresh_append_overlap). Rows within the overlap are re-read on each refresh.
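
A minimal sketch of an overlap window, building on the configuration example above. The `5m` value is an assumption chosen for illustration, and the duration syntax is assumed to match `refresh_check_interval`; the linked reference defines the accepted format.

```yaml
datasets:
  - from: databricks:my_dataset
    name: accelerated_dataset
    time_column: created_at
    acceleration:
      enabled: true
      refresh_mode: append
      refresh_check_interval: 10m
      # Assumed value: re-read rows from the 5 minutes before
      # max(created_at) on each refresh to pick up late arrivals.
      refresh_append_overlap: 5m
```

Because rows inside the window are re-read on every refresh, a larger window tolerates later arrivals at the cost of reading more rows per refresh.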

## Partition Pruning with `time_partition_column`

Datasets partitioned by a less-granular time column (day, month, year) can specify [`time_partition_column`](../../../reference/spicepod/datasets#time_partition_column) in addition to `time_column` for efficient partition pruning at the source.

```yaml
datasets:
  - from: databricks:my_dataset
    name: accelerated_dataset
    time_column: created_at
    time_format: iso8601
    time_partition_column: created_at_day
    time_partition_format: date
```

## Append Only Modified Files

For object-store sources, set `time_column` or `time_partition_column` to the special value `last_modified` to append only newly created or updated files. Spice uses file metadata to determine which files are new, dramatically reducing scan time for large datasets.

```yaml
datasets:
  - from: s3://my_bucket/my_dataset
    name: accelerated_dataset
    time_column: last_modified
    params:
      file_format: parquet
    acceleration:
      refresh_mode: append
      refresh_check_interval: 10m
```

If `last_modified` exists as a column in the data, the column value takes precedence over file metadata.

This is supported for connectors that accept the [file format parameter](../../../reference/file_format), such as `s3://`, `abfs://`, and `file://`.

## Readiness with Snapshots

Append-mode accelerations that define a `time_column` wait to report ready until the first append refresh completes after [snapshot bootstrap](../snapshots). This keeps the dataset out of rotation until the freshest data is available while still benefiting from snapshot-assisted startup.
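
As an illustration of the setup this section describes, the sketch below combines an append-mode acceleration with snapshots. The bucket location and the engine choice are placeholder assumptions; the `snapshots` keys mirror the ones used in the Snapshot mode configuration elsewhere in this change.

```yaml
snapshots:
  enabled: true
  location: s3://my-bucket/snapshots/ # placeholder location

datasets:
  - from: databricks:my_dataset
    name: accelerated_dataset
    time_column: created_at
    acceleration:
      enabled: true
      engine: duckdb # assumed: a file-based engine that supports snapshots
      mode: file
      refresh_mode: append
      refresh_check_interval: 10m
      snapshots: enabled
```

Under this configuration, the dataset would be expected to bootstrap from the latest snapshot and report ready only after the first append refresh finishes.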

## Combining with Upserts

Pair `refresh_mode: append` with a `primary_key` and `on_conflict: upsert` to handle source rows that are occasionally updated. See [End-to-End Incremental Ingestion Example](../data-refresh#end-to-end-incremental-ingestion-example).
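
A hedged sketch of the pairing described above. It assumes `primary_key` and `on_conflict` sit under `acceleration` and that `on_conflict` maps a key column to a conflict behavior; the `order_id` column and the engine choice are hypothetical. The linked end-to-end example is the authoritative form.

```yaml
datasets:
  - from: databricks:my_dataset
    name: accelerated_dataset
    time_column: created_at
    acceleration:
      enabled: true
      engine: duckdb # assumed: an engine that supports primary keys
      refresh_mode: append
      refresh_check_interval: 10m
      primary_key: order_id # hypothetical key column
      on_conflict:
        order_id: upsert # assumed syntax: replace the row when the key already exists
```

Since append mode only reads rows whose `time_column` exceeds the local maximum, an updated source row is picked up only if its `time_column` value advances or falls inside a configured overlap window.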

## Related Topics

- [Refresh Interval](../data-refresh#refresh-interval)
- [Refresh on Startup](../data-refresh#refresh-on-startup)
- [Refresh Retries](../data-refresh#refresh-retries)
- [Retention Policy](../data-refresh#retention-policy)
- [Refresh Data Window](../data-refresh#refresh-data-window)
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
---
title: 'Changes Refresh Mode'
sidebar_label: 'Changes'
description: 'Apply incremental inserts, updates, and deletes via Change Data Capture.'
sidebar_position: 3
pagination_prev: null
pagination_next: null
---

The `changes` refresh mode applies incremental inserts, updates, and deletes from a [Change Data Capture (CDC)](../../cdc) source. Unlike `append`, `changes` mode reflects modifications and deletions in the acceleration, keeping it consistent with sources where rows mutate over time.

Use `changes` when:

- The source supports CDC (e.g., a database with a transaction log).
- Rows in the source are updated or deleted, not just inserted.
- The acceleration must reflect the current state of the source row-for-row.

## Configuration

`refresh_mode: changes` requires a CDC-capable data connector. Spice supports CDC via [PostgreSQL Logical Replication](../../cdc/postgres-replication), [DynamoDB Streams](../../../components/data-connectors/dynamodb#streams), [Apache Kafka](../../../components/data-connectors/kafka), and [Debezium](../../../components/data-connectors/debezium). See [Supported Data Connectors](../../cdc#supported-data-connectors) for details.

```yaml
datasets:
  - from: debezium:cdc.public.customer_orders
    name: customer_orders
    acceleration:
      enabled: true
      refresh_mode: changes
      engine: duckdb
      mode: file
```

The Debezium connector streams change events from a Kafka topic produced by Debezium. Each event is applied to the acceleration in order, preserving inserts, updates, and deletes from the source.

## Behavior

- The acceleration is bootstrapped from the source snapshot, then continuously updated from the change stream.
- `refresh_check_interval`, `refresh_cron`, on-demand refresh, `refresh_data_window`, and `retention_period` do not apply; updates are driven by the change stream rather than periodic polling.
- [`refresh_sql`](../data-refresh#refresh-sql) can only modify selected columns in `changes` mode and cannot apply row filters.
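
To illustrate the `refresh_sql` restriction in the last bullet, the sketch below projects a subset of columns without a `WHERE` clause. The column names are hypothetical, and placing `refresh_sql` under `acceleration` is assumed from the linked Data Refresh page.

```yaml
datasets:
  - from: debezium:cdc.public.customer_orders
    name: customer_orders
    acceleration:
      enabled: true
      refresh_mode: changes
      engine: duckdb
      mode: file
      # Column projection only. A WHERE clause would be a row filter,
      # which changes mode does not apply.
      refresh_sql: SELECT order_id, status, updated_at FROM customer_orders
```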

## Related Topics

- [Change Data Capture](../../cdc)
- [Debezium Data Connector](../../../components/data-connectors/debezium)
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
---
title: 'Full Refresh Mode'
sidebar_label: 'Full'
description: 'Replace the entire accelerated dataset on each refresh.'
sidebar_position: 1
pagination_prev: null
pagination_next: null
---

The `full` refresh mode replaces the entire accelerated dataset on every refresh. It is the default refresh mode and the simplest way to keep an acceleration in sync with its source.

Use `full` when:

- The dataset is small enough to be re-read on each refresh.
- Source rows can be inserted, updated, or deleted, and incremental tracking is not available.
- Strong consistency with the source is preferred over minimizing source load.

## Configuration

```yaml
datasets:
  - from: databricks:my_dataset
    name: accelerated_dataset
    acceleration:
      enabled: true
      refresh_mode: full
      refresh_check_interval: 10m
```

On each refresh, the runtime issues a single `SELECT` against the source, materializes the result into the acceleration engine, and atomically swaps the new data in.

## Behavior

- Each refresh fully scans the source. Any [`refresh_sql`](../data-refresh#refresh-sql) and [`refresh_data_window`](../data-refresh#refresh-data-window) filters are pushed down to limit data transferred.
- Queries continue to be served from the previous result set until the new refresh completes.
- Supported with all data connectors and all acceleration engines.
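
A sketch of narrowing a full refresh with the two filters named in the first bullet. The `WHERE` predicate, the `1d` window, and the `city` and `created_at` column names are assumptions for illustration. It also assumes that `refresh_data_window` relies on a `time_column` and that both filters can be set on the same dataset; the linked Data Refresh sections define the actual rules.

```yaml
datasets:
  - from: databricks:my_dataset
    name: accelerated_dataset
    time_column: created_at # assumed to be needed by refresh_data_window
    acceleration:
      enabled: true
      refresh_mode: full
      refresh_check_interval: 10m
      # Hypothetical predicate: limits which rows each full refresh reads.
      refresh_sql: SELECT * FROM accelerated_dataset WHERE city = 'Seattle'
      # Assumed value: restricts each refresh to the most recent day of data.
      refresh_data_window: 1d
```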

## Related Topics

For cross-cutting refresh behavior that applies to `full` mode, see:

- [Refresh Interval](../data-refresh#refresh-interval)
- [Refresh on Startup](../data-refresh#refresh-on-startup)
- [Refresh Retries](../data-refresh#refresh-retries)
- [Retention Policy](../data-refresh#retention-policy)
- [Behavior on Zero Results](../data-refresh#behavior-on-zero-results)
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
---
title: 'Refresh Modes'
sidebar_label: 'Refresh Modes'
description: 'Refresh modes for accelerated datasets in Spice.'
sidebar_position: 2
pagination_prev: null
pagination_next: null
---

Spice supports five modes to refresh accelerated datasets. `full` is the default.

| Mode                        | Description                                          | Example                                                          |
| --------------------------- | ---------------------------------------------------- | ---------------------------------------------------------------- |
| [`full`](./full.md)         | Replace/overwrite the entire dataset on each refresh | A table of users                                                 |
| [`append`](./append.md)     | Append/add data to the dataset on each refresh       | Append-only, immutable datasets, such as time-series or log data |
| [`changes`](./changes.md)   | Apply incremental inserts, updates, and deletes      | Customer order lifecycle table                                   |
| [`caching`](./caching.md)   | Read-through caching for HTTP-based datasets         | API search results or dynamic content endpoints                  |
| [`snapshot`](./snapshot.md) | Reload exclusively from the snapshot store           | Read-only replicas bootstrapped from centralized snapshots       |

For cross-cutting refresh behavior (refresh intervals, on-demand refresh, retries, retention, and behavior on zero results), see [Data Refresh](../data-refresh.md).
Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
---
title: 'Snapshot Refresh Mode'
sidebar_label: 'Snapshot'
description: 'Reload acceleration data exclusively from the snapshot store.'
sidebar_position: 5
pagination_prev: null
pagination_next: null
---

The `snapshot` refresh mode creates a read-only acceleration that reloads exclusively from the [snapshot store](../snapshots). The federated source is never queried for refreshes; instead, the runtime polls the snapshot store on a configurable interval and atomically swaps in newer snapshots when available.

Use `snapshot` when:

- A separate writer publishes acceleration snapshots to object storage.
- Read replicas need fast, source-independent startup and refresh.
- The federated source should not be queried by the replica (e.g., edge nodes, security boundaries, or to reduce source load).

## Configuration

```yaml
snapshots:
  enabled: true
  location: s3://my-bucket/snapshots/
  params:
    s3_auth: iam_role

datasets:
  - from: postgres:public.my_table
    name: my_table
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      refresh_mode: snapshot
      refresh_check_interval: 30s # Poll interval; defaults to 1m
      snapshots: enabled
      params:
        duckdb_file: /nvme/my_table.db
```

## Requirements

- `acceleration.snapshots` must be `enabled` or `bootstrap_only`.
- The acceleration engine must be a snapshot-capable file-based engine: **DuckDB**, **SQLite**, or **Cayenne**.
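
A variant sketch for a replica, using the `bootstrap_only` value that the first requirement permits. It assumes the top-level `snapshots` block from the configuration above is also present, and it omits `refresh_check_interval` to rely on the default poll interval.

```yaml
datasets:
  - from: postgres:public.my_table
    name: my_table
    acceleration:
      enabled: true
      engine: duckdb # one of the snapshot-capable engines listed above
      mode: file
      refresh_mode: snapshot
      snapshots: bootstrap_only # the other value the requirements allow
      params:
        duckdb_file: /nvme/my_table.db
```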

## Behavior

- On startup, the runtime bootstraps from the most recent snapshot, identical to other snapshot-enabled modes.
- After bootstrap, the runtime polls the snapshot store at `refresh_check_interval` (default: 60s) for newer snapshots.
- When a newer snapshot is found, its schema is validated against the current acceleration schema before downloading.
- The accelerator file is swapped atomically; queries continue to be served from the previous snapshot until the swap completes.
- `INSERT INTO` statements are rejected with an error since the acceleration is driven exclusively from snapshots.

:::tip
Use `refresh_mode: snapshot` for read-only replicas that should not access the federated source, such as edge nodes that receive snapshots from a centralized writer.
:::

## Related Topics

- [Acceleration Snapshots](../snapshots)
- [Refresh Interval](../data-refresh#refresh-interval)
