4 changes: 4 additions & 0 deletions website/docs/features/data-acceleration/data-refresh.md
@@ -33,7 +33,11 @@ Spice supports five modes to refresh/update local data from a connected data sou

Learn more about each mode:

- [Full Mode](./refresh-modes/full)
- [Append Mode](./refresh-modes/append)
- [Changes Mode](./refresh-modes/changes)
- [Caching Mode](./refresh-modes/caching)
- [Snapshot Mode](./refresh-modes/snapshot)

Example:

85 changes: 85 additions & 0 deletions website/docs/features/data-acceleration/refresh-modes/append.md
@@ -0,0 +1,85 @@
---
title: 'Append Refresh Mode'
sidebar_label: 'Append'
description: 'Incrementally append new rows to an accelerated dataset.'
sidebar_position: 2
pagination_prev: null
pagination_next: null
---

The `append` refresh mode incrementally adds new rows to the acceleration on each refresh. It is designed for append-only or immutable datasets such as time-series, event, and log data.

Use `append` when:

- New rows are continuously added to the source and existing rows are not modified or deleted.
- A monotonic time or sequence column is available to identify new rows.
- The full dataset is too large to refresh in `full` mode on each interval.

## Configuration

`append` mode requires a [`time_column`](../../../reference/spicepod/datasets#time_column) to identify new rows. On each refresh, Spice reads only the source rows whose `time_column` value is greater than `max(time_column)` in the acceleration.

```yaml
datasets:
  - from: databricks:my_dataset
    name: accelerated_dataset
    time_column: created_at
    acceleration:
      enabled: true
      refresh_mode: append
      refresh_check_interval: 10m
```

## Late-Arriving Data

To account for clock skew or late-arriving rows, configure an overlap window with [`acceleration.refresh_append_overlap`](../../../reference/spicepod/datasets#accelerationrefresh_append_overlap). Rows within the overlap are re-read on each refresh.
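
A minimal sketch of an overlap window, reusing the dataset from the Configuration example. The `5m` value is an illustrative assumption, not a recommended default; size it to the lateness expected from the source.

```yaml
datasets:
  - from: databricks:my_dataset
    name: accelerated_dataset
    time_column: created_at
    acceleration:
      enabled: true
      refresh_mode: append
      refresh_check_interval: 10m
      # Assumed value: re-read the most recent 5 minutes on each refresh
      # so that late-arriving rows are picked up.
      refresh_append_overlap: 5m
```

Because rows inside the overlap are re-read on every refresh, a wider window means more data read from the source each time.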

## Partition Pruning with `time_partition_column`

Datasets partitioned by a less-granular time column (day, month, year) can specify [`time_partition_column`](../../../reference/spicepod/datasets#time_partition_column) in addition to `time_column` for efficient partition pruning at the source.

```yaml
datasets:
  - from: databricks:my_dataset
    name: accelerated_dataset
    time_column: created_at
    time_format: iso8601
    time_partition_column: created_at_day
    time_partition_format: date
```

## Append Only Modified Files

For object-store sources, set `time_column` or `time_partition_column` to the special value `last_modified` to append only newly created or updated files. Spice uses file metadata to determine which files are new, so unchanged files are not rescanned. This reduces scan time for large datasets.

```yaml
datasets:
  - from: s3://my_bucket/my_dataset
    name: accelerated_dataset
    time_column: last_modified
    params:
      file_format: parquet
    acceleration:
      refresh_mode: append
      refresh_check_interval: 10m
```

If `last_modified` exists as a column in the data, the column value takes precedence over file metadata.

File-metadata-based appends are supported for connectors that accept the [file format parameter](../../../reference/file_format), such as `s3://`, `abfs://`, and `file://`.

## Readiness with Snapshots

Append-mode accelerations that define a `time_column` wait to report ready until the first append refresh completes after [snapshot bootstrap](../snapshots). This keeps the dataset out of rotation until the freshest data is available while still benefiting from snapshot-assisted startup.

## Combining with Upserts

Pair `refresh_mode: append` with a `primary_key` and `on_conflict: upsert` to handle source rows that are occasionally updated. See [End-to-End Incremental Ingestion Example](../data-refresh#end-to-end-incremental-ingestion-example).
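
A sketch of how this pairing might look. The `order_id` key column, the `updated_at` time column, and the `duckdb` engine are hypothetical choices, and the column-keyed form of `on_conflict` shown here is an assumption; confirm the exact syntax in the datasets reference before relying on it.

```yaml
datasets:
  - from: databricks:my_dataset
    name: accelerated_dataset
    # Hypothetical column that advances whenever a row is inserted or updated.
    time_column: updated_at
    acceleration:
      enabled: true
      engine: duckdb
      refresh_mode: append
      refresh_check_interval: 10m
      # Hypothetical key column that uniquely identifies a row.
      primary_key: order_id
      # Assumed shape: replace the existing row when the key already exists.
      on_conflict:
        order_id: upsert
```

With a configuration along these lines, a source row that is re-read with an existing key would replace the accelerated row rather than add a duplicate.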

## Related Topics

- [Refresh Interval](../data-refresh#refresh-interval)
- [Refresh on Startup](../data-refresh#refresh-on-startup)
- [Refresh Retries](../data-refresh#refresh-retries)
- [Retention Policy](../data-refresh#retention-policy)
- [Refresh Data Window](../data-refresh#refresh-data-window)
44 changes: 44 additions & 0 deletions website/docs/features/data-acceleration/refresh-modes/changes.md
@@ -0,0 +1,44 @@
---
title: 'Changes Refresh Mode'
sidebar_label: 'Changes'
description: 'Apply incremental inserts, updates, and deletes via Change Data Capture.'
sidebar_position: 3
pagination_prev: null
pagination_next: null
---

The `changes` refresh mode applies incremental inserts, updates, and deletes from a [Change Data Capture (CDC)](../../cdc) source. Unlike `append`, `changes` mode reflects modifications and deletions in the acceleration, keeping it consistent with sources where rows mutate over time.

Use `changes` when:

- The source supports CDC (e.g., a database with a transaction log).
- Rows in the source are updated or deleted, not just inserted.
- The acceleration must reflect the current state of the source row-for-row.

## Configuration

`refresh_mode: changes` requires a CDC-capable data connector. Spice supports CDC via [PostgreSQL Logical Replication](../../cdc/postgres-replication), [DynamoDB Streams](../../../components/data-connectors/dynamodb#streams), [Apache Kafka](../../../components/data-connectors/kafka), and [Debezium](../../../components/data-connectors/debezium). See [Supported Data Connectors](../../cdc#supported-data-connectors) for details.

```yaml
datasets:
  - from: debezium:cdc.public.customer_orders
    name: customer_orders
    acceleration:
      enabled: true
      refresh_mode: changes
      engine: duckdb
      mode: file
```

The Debezium connector streams change events from a Kafka topic produced by Debezium. Each event is applied to the acceleration in order, preserving inserts, updates, and deletes from the source.

## Behavior

- The acceleration is bootstrapped from the source snapshot, then continuously updated from the change stream.
- `refresh_check_interval`, `refresh_cron`, on-demand refresh, `refresh_data_window`, and `retention_period` do not apply — updates are driven by the change stream rather than periodic polling.
- In `changes` mode, [`refresh_sql`](../data-refresh#refresh-sql) can only change which columns are selected; it cannot apply row filters.
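
As a sketch of the `refresh_sql` restriction above, the fragment below narrows the accelerated columns and deliberately omits a `WHERE` clause. The column names `order_id`, `status`, and `updated_at` are hypothetical.

```yaml
datasets:
  - from: debezium:cdc.public.customer_orders
    name: customer_orders
    acceleration:
      enabled: true
      refresh_mode: changes
      engine: duckdb
      mode: file
      # Column selection only; row filters cannot be applied in changes mode.
      refresh_sql: SELECT order_id, status, updated_at FROM customer_orders
```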

## Related Topics

- [Change Data Capture](../../cdc)
- [Debezium Data Connector](../../../components/data-connectors/debezium)
46 changes: 46 additions & 0 deletions website/docs/features/data-acceleration/refresh-modes/full.md
@@ -0,0 +1,46 @@
---
title: 'Full Refresh Mode'
sidebar_label: 'Full'
description: 'Replace the entire accelerated dataset on each refresh.'
sidebar_position: 1
pagination_prev: null
pagination_next: null
---

The `full` refresh mode replaces the entire accelerated dataset on every refresh. It is the default refresh mode and the simplest way to keep an acceleration in sync with its source.

Use `full` when:

- The dataset is small enough to be re-read on each refresh.
- Source rows can be inserted, updated, or deleted, and incremental tracking is not available.
- Strong consistency with the source is preferred over minimizing source load.

## Configuration

```yaml
datasets:
  - from: databricks:my_dataset
    name: accelerated_dataset
    acceleration:
      enabled: true
      refresh_mode: full
      refresh_check_interval: 10m
```

On each refresh, the runtime issues a single `SELECT` against the source, materializes the result into the acceleration engine, and atomically swaps the new data in.

## Behavior

- Each refresh re-reads the source in full, subject to any [`refresh_sql`](../data-refresh#refresh-sql) and [`refresh_data_window`](../data-refresh#refresh-data-window) filters, which are pushed down to limit the data transferred.
- Queries continue to be served from the previous result set until the new refresh completes.
- Supported with all data connectors and all acceleration engines.
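
A sketch of a `full` refresh that limits the data read from the source. The `city` column, the `created_at` time column, and the `1d` window are illustrative assumptions; whether `refresh_sql` and `refresh_data_window` suit a given dataset depends on the connector and schema.

```yaml
datasets:
  - from: databricks:my_dataset
    name: accelerated_dataset
    # Assumed time column; refresh_data_window is evaluated against it.
    time_column: created_at
    acceleration:
      enabled: true
      refresh_mode: full
      refresh_check_interval: 10m
      # Assumed window: only keep rows from the last day on each refresh.
      refresh_data_window: 1d
      # Hypothetical row filter that is pushed down to the source.
      refresh_sql: SELECT * FROM accelerated_dataset WHERE city = 'Seattle'
```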

## Related Topics

For cross-cutting refresh behavior that applies to `full` mode, see:

- [Refresh Interval](../data-refresh#refresh-interval)
- [Refresh on Startup](../data-refresh#refresh-on-startup)
- [Refresh Retries](../data-refresh#refresh-retries)
- [Retention Policy](../data-refresh#retention-policy)
- [Behavior on Zero Results](../data-refresh#behavior-on-zero-results)
20 changes: 20 additions & 0 deletions website/docs/features/data-acceleration/refresh-modes/index.md
@@ -0,0 +1,20 @@
---
title: 'Refresh Modes'
sidebar_label: 'Refresh Modes'
description: 'Refresh modes for accelerated datasets in Spice.'
sidebar_position: 2
pagination_prev: null
pagination_next: null
---

Spice supports five modes to refresh accelerated datasets. `full` is the default.

| Mode | Description | Example |
| ----------------------------- | ---------------------------------------------------- | ---------------------------------------------------------------- |
| [`full`](./full.md) | Replace/overwrite the entire dataset on each refresh | A table of users |
| [`append`](./append.md) | Append/add data to the dataset on each refresh | Append-only, immutable datasets, such as time-series or log data |
| [`changes`](./changes.md) | Apply incremental inserts, updates, and deletes | Customer order lifecycle table |
| [`caching`](./caching.md) | Read-through caching for HTTP-based datasets | API search results or dynamic content endpoints |
| [`snapshot`](./snapshot.md) | Reload exclusively from the snapshot store | Read-only replicas bootstrapped from centralized snapshots |

For cross-cutting refresh behavior — refresh intervals, on-demand refresh, retries, retention, and behavior on zero results — see [Data Refresh](../data-refresh.md).
61 changes: 61 additions & 0 deletions website/docs/features/data-acceleration/refresh-modes/snapshot.md
@@ -0,0 +1,61 @@
---
title: 'Snapshot Refresh Mode'
sidebar_label: 'Snapshot'
description: 'Reload acceleration data exclusively from the snapshot store.'
sidebar_position: 5
pagination_prev: null
pagination_next: null
---

The `snapshot` refresh mode creates a read-only acceleration that reloads exclusively from the [snapshot store](../snapshots). The federated source is never queried for refreshes — instead, the runtime polls the snapshot store on a configurable interval and atomically swaps in newer snapshots when available.

Use `snapshot` when:

- A separate writer publishes acceleration snapshots to object storage.
- Read replicas need fast, source-independent startup and refresh.
- The federated source should not be queried by the replica (e.g., edge nodes, security boundaries, or to reduce source load).

## Configuration

```yaml
snapshots:
  enabled: true
  location: s3://my-bucket/snapshots/
  params:
    s3_auth: iam_role

datasets:
  - from: postgres:public.my_table
    name: my_table
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      refresh_mode: snapshot
      refresh_check_interval: 30s # Poll interval; defaults to 1m
      snapshots: enabled
      params:
        duckdb_file: /nvme/my_table.db
```
```

## Requirements

- `acceleration.snapshots` must be `enabled` or `bootstrap_only`.
- The acceleration engine must be a snapshot-capable file-based engine: **DuckDB**, **SQLite**, or **Cayenne**.
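
A sketch of a dataset that meets both requirements using SQLite and `bootstrap_only`. It assumes the top-level `snapshots` block from the Configuration example is also present; the `sqlite_file` parameter name and the file path are assumptions to verify against the SQLite accelerator reference.

```yaml
datasets:
  - from: postgres:public.my_table
    name: my_table
    acceleration:
      enabled: true
      # SQLite is one of the snapshot-capable file-based engines listed above.
      engine: sqlite
      mode: file
      refresh_mode: snapshot
      # Either `enabled` or `bootstrap_only` satisfies the requirement.
      snapshots: bootstrap_only
      params:
        # Assumed parameter name and path for the local accelerator file.
        sqlite_file: /data/my_table.sqlite
```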

## Behavior

- On startup, the runtime bootstraps from the most recent snapshot, as in other snapshot-enabled modes.
- After bootstrap, the runtime polls the snapshot store at `refresh_check_interval` (default: 60s) for newer snapshots.
- When a newer snapshot is found, its schema is validated against the current acceleration schema before downloading.
- The accelerator file is swapped atomically — queries continue to be served from the previous snapshot until the swap completes.
- `INSERT INTO` statements are rejected with an error since the acceleration is driven exclusively from snapshots.

:::tip
Use `refresh_mode: snapshot` for read-only replicas that should not access the federated source — for example, edge nodes that receive snapshots from a centralized writer.
:::

## Related Topics

- [Acceleration Snapshots](../snapshots)
- [Refresh Interval](../data-refresh#refresh-interval)