docs: add per-mode pages under data-acceleration/refresh-modes (#1654)

lukekim · web-flow · commit 03b2230ad5e7 · 2026-05-06T16:47:34.000Z
diff --git a/website/docs/features/data-acceleration/data-refresh.md b/website/docs/features/data-acceleration/data-refresh.md
@@ -33,7 +33,11 @@ Spice supports five modes to refresh/update local data from a connected data sou
 
 Learn more about each mode:
 
+- [Full Mode](./refresh-modes/full)
+- [Append Mode](./refresh-modes/append)
+- [Changes Mode](./refresh-modes/changes)
 - [Caching Mode](./refresh-modes/caching)
+- [Snapshot Mode](./refresh-modes/snapshot)
 
 Example:
 
diff --git a/website/docs/features/data-acceleration/refresh-modes/append.md b/website/docs/features/data-acceleration/refresh-modes/append.md
@@ -0,0 +1,85 @@
+---
+title: 'Append Refresh Mode'
+sidebar_label: 'Append'
+description: 'Incrementally append new rows to an accelerated dataset.'
+sidebar_position: 2
+pagination_prev: null
+pagination_next: null
+---
+
+The `append` refresh mode incrementally adds new rows to the acceleration on each refresh. It is designed for append-only or immutable datasets such as time-series, event, and log data.
+
+Use `append` when:
+
+- New rows are continuously added to the source and existing rows are not modified or deleted.
+- A monotonic time or sequence column is available to identify new rows.
+- The full dataset is too large to refresh in `full` mode on each interval.
+
+## Configuration
+
+`append` mode requires a [`time_column`](../../../reference/spicepod/datasets#time_column) to identify new rows. On each refresh, Spice fetches only the source rows whose `time_column` value is greater than `max(time_column)` in the acceleration.
+
+```yaml
+datasets:
+  - from: databricks:my_dataset
+    name: accelerated_dataset
+    time_column: created_at
+    acceleration:
+      enabled: true
+      refresh_mode: append
+      refresh_check_interval: 10m
+```
+
+## Late-Arriving Data
+
+To account for clock skew or late-arriving rows, configure an overlap window with [`acceleration.refresh_append_overlap`](../../../reference/spicepod/datasets#accelerationrefresh_append_overlap). Rows within the overlap are re-read on each refresh.
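+
+As a minimal sketch, the overlap can be set alongside the other acceleration settings. The `5m` value is an illustrative assumption, not a recommendation; choose a window that covers the expected lateness of the source.
+
+```yaml
+datasets:
+  - from: databricks:my_dataset
+    name: accelerated_dataset
+    time_column: created_at
+    acceleration:
+      enabled: true
+      refresh_mode: append
+      refresh_check_interval: 10m
+      # Assumed value: re-read rows up to 5 minutes older than max(created_at)
+      refresh_append_overlap: 5m
+```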
+
+## Partition Pruning with `time_partition_column`
+
+Datasets partitioned by a less-granular time column (day, month, year) can specify [`time_partition_column`](../../../reference/spicepod/datasets#time_partition_column) in addition to `time_column` for efficient partition pruning at the source.
+
+```yaml
+datasets:
+  - from: databricks:my_dataset
+    name: accelerated_dataset
+    time_column: created_at
+    time_format: iso8601
+    time_partition_column: created_at_day
+    time_partition_format: date
+```
+
+## Append Only Modified Files
+
+For object-store sources, set `time_column` or `time_partition_column` to the special value `last_modified` to append only newly created or updated files. Spice uses file metadata to determine which files are new, so unchanged files are not re-read, which reduces scan time for large datasets.
+
+```yaml
+datasets:
+  - from: s3://my_bucket/my_dataset
+    name: accelerated_dataset
+    time_column: last_modified
+    params:
+      file_format: parquet
+    acceleration:
+      refresh_mode: append
+      refresh_check_interval: 10m
+```
+
+If `last_modified` exists as a column in the data, the column value takes precedence over file metadata.
+
+This is supported for connectors that accept the [file format parameter](../../../reference/file_format), such as `s3://`, `abfs://`, and `file://`.
+
+## Readiness with Snapshots
+
+Append-mode accelerations that define a `time_column` wait to report ready until the first append refresh completes after [snapshot bootstrap](../snapshots). This keeps the dataset out of rotation until the freshest data is available while still benefiting from snapshot-assisted startup.
+
+## Combining with Upserts
+
+Pair `refresh_mode: append` with a `primary_key` and `on_conflict: upsert` to handle source rows that are occasionally updated. See [End-to-End Incremental Ingestion Example](../data-refresh#end-to-end-incremental-ingestion-example).
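+
+A minimal sketch of that pairing follows. The `id` and `updated_at` columns, the `duckdb` engine choice, and the per-column form of `on_conflict` are assumptions for illustration; check the dataset reference for the exact syntax supported by the chosen engine. Using a last-updated column as `time_column` is also an assumption, made so that updated rows sort after the local maximum and are fetched again.
+
+```yaml
+datasets:
+  - from: databricks:my_dataset
+    name: accelerated_dataset
+    time_column: updated_at # Hypothetical column that advances on every update
+    acceleration:
+      enabled: true
+      engine: duckdb # Assumed engine
+      refresh_mode: append
+      refresh_check_interval: 10m
+      primary_key: id # Hypothetical key column
+      on_conflict:
+        id: upsert # Replace the existing row when the key already exists
+```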
+
+## Related Topics
+
+- [Refresh Interval](../data-refresh#refresh-interval)
+- [Refresh on Startup](../data-refresh#refresh-on-startup)
+- [Refresh Retries](../data-refresh#refresh-retries)
+- [Retention Policy](../data-refresh#retention-policy)
+- [Refresh Data Window](../data-refresh#refresh-data-window)
diff --git a/website/docs/features/data-acceleration/refresh-modes/changes.md b/website/docs/features/data-acceleration/refresh-modes/changes.md
@@ -0,0 +1,44 @@
+---
+title: 'Changes Refresh Mode'
+sidebar_label: 'Changes'
+description: 'Apply incremental inserts, updates, and deletes via Change Data Capture.'
+sidebar_position: 3
+pagination_prev: null
+pagination_next: null
+---
+
+The `changes` refresh mode applies incremental inserts, updates, and deletes from a [Change Data Capture (CDC)](../../cdc) source. Unlike `append`, `changes` mode reflects modifications and deletions in the acceleration, keeping it consistent with sources where rows mutate over time.
+
+Use `changes` when:
+
+- The source supports CDC (e.g., a database with a transaction log).
+- Rows in the source are updated or deleted, not just inserted.
+- The acceleration must reflect the current state of the source row-for-row.
+
+## Configuration
+
+`refresh_mode: changes` requires a CDC-capable data connector. Spice supports CDC via [PostgreSQL Logical Replication](../../cdc/postgres-replication), [DynamoDB Streams](../../../components/data-connectors/dynamodb#streams), [Apache Kafka](../../../components/data-connectors/kafka), and [Debezium](../../../components/data-connectors/debezium). See [Supported Data Connectors](../../cdc#supported-data-connectors) for details.
+
+```yaml
+datasets:
+  - from: debezium:cdc.public.customer_orders
+    name: customer_orders
+    acceleration:
+      enabled: true
+      refresh_mode: changes
+      engine: duckdb
+      mode: file
+```
+
+The Debezium connector consumes change events from a Kafka topic populated by Debezium. Events are applied to the acceleration in order, so inserts, updates, and deletes from the source are preserved.
+
+## Behavior
+
+- The acceleration is bootstrapped from the source snapshot, then continuously updated from the change stream.
+- `refresh_check_interval`, `refresh_cron`, on-demand refresh, `refresh_data_window`, and `retention_period` do not apply — updates are driven by the change stream rather than periodic polling.
+- [`refresh_sql`](../data-refresh#refresh-sql) can only modify selected columns in `changes` mode and cannot apply row filters.
+
+## Related Topics
+
+- [Change Data Capture](../../cdc)
+- [Debezium Data Connector](../../../components/data-connectors/debezium)
diff --git a/website/docs/features/data-acceleration/refresh-modes/full.md b/website/docs/features/data-acceleration/refresh-modes/full.md
@@ -0,0 +1,46 @@
+---
+title: 'Full Refresh Mode'
+sidebar_label: 'Full'
+description: 'Replace the entire accelerated dataset on each refresh.'
+sidebar_position: 1
+pagination_prev: null
+pagination_next: null
+---
+
+The `full` refresh mode replaces the entire accelerated dataset on every refresh. It is the default refresh mode and the simplest way to keep an acceleration in sync with its source.
+
+Use `full` when:
+
+- The dataset is small enough to be re-read on each refresh.
+- Source rows can be inserted, updated, or deleted, and incremental tracking is not available.
+- Strong consistency with the source is preferred over minimizing source load.
+
+## Configuration
+
+```yaml
+datasets:
+  - from: databricks:my_dataset
+    name: accelerated_dataset
+    acceleration:
+      enabled: true
+      refresh_mode: full
+      refresh_check_interval: 10m
+```
+
+On each refresh, the runtime issues a single `SELECT` against the source, materializes the result into the acceleration engine, and atomically swaps the new data in.
+
+## Behavior
+
+- Each refresh re-reads the source rather than tracking changes. Any [`refresh_sql`](../data-refresh#refresh-sql) and [`refresh_data_window`](../data-refresh#refresh-data-window) filters are pushed down to limit the data transferred.
+- Queries continue to be served from the previous result set until the new refresh completes.
+- Supported with all data connectors and all acceleration engines.
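+
+As a sketch of limiting what a full refresh reads, the configuration below combines both filters. The `status` column, the filter value, and the `1d` window are hypothetical, and `refresh_data_window` is assumed here to rely on `time_column`.
+
+```yaml
+datasets:
+  - from: databricks:my_dataset
+    name: accelerated_dataset
+    time_column: created_at
+    acceleration:
+      enabled: true
+      refresh_mode: full
+      refresh_check_interval: 10m
+      # Hypothetical filter: only rows returned by this query are accelerated
+      refresh_sql: |
+        SELECT * FROM accelerated_dataset WHERE status = 'active'
+      # Assumed window: only rows from the last day, based on time_column
+      refresh_data_window: 1d
+```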
+
+## Related Topics
+
+For cross-cutting refresh behavior that applies to `full` mode, see:
+
+- [Refresh Interval](../data-refresh#refresh-interval)
+- [Refresh on Startup](../data-refresh#refresh-on-startup)
+- [Refresh Retries](../data-refresh#refresh-retries)
+- [Retention Policy](../data-refresh#retention-policy)
+- [Behavior on Zero Results](../data-refresh#behavior-on-zero-results)
diff --git a/website/docs/features/data-acceleration/refresh-modes/index.md b/website/docs/features/data-acceleration/refresh-modes/index.md
@@ -0,0 +1,20 @@
+---
+title: 'Refresh Modes'
+sidebar_label: 'Refresh Modes'
+description: 'Refresh modes for accelerated datasets in Spice.'
+sidebar_position: 2
+pagination_prev: null
+pagination_next: null
+---
+
+Spice supports five modes to refresh accelerated datasets. `full` is the default.
+
+| Mode                          | Description                                          | Example                                                          |
+| ----------------------------- | ---------------------------------------------------- | ---------------------------------------------------------------- |
+| [`full`](./full.md)           | Replace/overwrite the entire dataset on each refresh | A table of users                                                 |
+| [`append`](./append.md)       | Append/add data to the dataset on each refresh       | Append-only, immutable datasets, such as time-series or log data |
+| [`changes`](./changes.md)     | Apply incremental inserts, updates, and deletes      | Customer order lifecycle table                                   |
+| [`caching`](./caching.md)     | Read-through caching for HTTP-based datasets         | API search results or dynamic content endpoints                  |
+| [`snapshot`](./snapshot.md)   | Reload exclusively from the snapshot store           | Read-only replicas bootstrapped from centralized snapshots       |
+
+For cross-cutting refresh behavior — refresh intervals, on-demand refresh, retries, retention, and behavior on zero results — see [Data Refresh](../data-refresh.md).
diff --git a/website/docs/features/data-acceleration/refresh-modes/snapshot.md b/website/docs/features/data-acceleration/refresh-modes/snapshot.md
@@ -0,0 +1,61 @@
+---
+title: 'Snapshot Refresh Mode'
+sidebar_label: 'Snapshot'
+description: 'Reload acceleration data exclusively from the snapshot store.'
+sidebar_position: 5
+pagination_prev: null
+pagination_next: null
+---
+
+The `snapshot` refresh mode creates a read-only acceleration that reloads exclusively from the [snapshot store](../snapshots). The federated source is never queried for refreshes — instead, the runtime polls the snapshot store on a configurable interval and atomically swaps in newer snapshots when available.
+
+Use `snapshot` when:
+
+- A separate writer publishes acceleration snapshots to object storage.
+- Read replicas need fast, source-independent startup and refresh.
+- The federated source should not be queried by the replica (e.g., edge nodes, security boundaries, or to reduce source load).
+
+## Configuration
+
+```yaml
+snapshots:
+  enabled: true
+  location: s3://my-bucket/snapshots/
+  params:
+    s3_auth: iam_role
+
+datasets:
+  - from: postgres:public.my_table
+    name: my_table
+    acceleration:
+      enabled: true
+      engine: duckdb
+      mode: file
+      refresh_mode: snapshot
+      refresh_check_interval: 30s # Poll interval; defaults to 1m
+      snapshots: enabled
+      params:
+        duckdb_file: /nvme/my_table.db
+```
+
+## Requirements
+
+- `acceleration.snapshots` must be `enabled` or `bootstrap_only`.
+- The acceleration engine must be a snapshot-capable file-based engine: **DuckDB**, **SQLite**, or **Cayenne**.
+
+## Behavior
+
+- On startup, the runtime bootstraps from the most recent snapshot, as in other snapshot-enabled modes.
+- After bootstrap, the runtime polls the snapshot store at `refresh_check_interval` (default: 60s) for newer snapshots.
+- When a newer snapshot is found, its schema is validated against the current acceleration schema before downloading.
+- The accelerator file is swapped atomically — queries continue to be served from the previous snapshot until the swap completes.
+- `INSERT INTO` statements are rejected with an error since the acceleration is driven exclusively from snapshots.
+
+:::tip
+Use `refresh_mode: snapshot` for read-only replicas that should not access the federated source — for example, edge nodes that receive snapshots from a centralized writer.
+:::
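+
+For context, a writer instance that publishes the snapshots might look like the sketch below. It assumes that `snapshots: enabled` on a `full`-mode acceleration causes the writer to publish snapshots to the shared location; the refresh interval and file path are illustrative.
+
+```yaml
+snapshots:
+  enabled: true
+  location: s3://my-bucket/snapshots/ # Same location the replicas poll
+  params:
+    s3_auth: iam_role
+
+datasets:
+  - from: postgres:public.my_table
+    name: my_table
+    acceleration:
+      enabled: true
+      engine: duckdb
+      mode: file
+      refresh_mode: full # The writer queries the federated source
+      refresh_check_interval: 10m # Assumed interval
+      snapshots: enabled # Assumed to publish snapshots after each refresh
+      params:
+        duckdb_file: /nvme/my_table.db
+```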
+
+## Related Topics
+
+- [Acceleration Snapshots](../snapshots)
+- [Refresh Interval](../data-refresh#refresh-interval)