Works with v1.0+
Serve queries instantly while a large table accelerates in the background by registering the same source as two datasets: one federated and one accelerated.
Tip: Keep the Data Acceleration documentation handy while following this guide.
Use dual-dataset registration when all of the following apply:
- The table is large (hundreds of millions of rows or more) and the initial acceleration load takes minutes to hours.
- Queries against the federated source during the loading window are too slow for your application to tolerate (high latency, timeouts, or expensive per-query costs on the source).
- You need deterministic control over when your application switches from the federated path to the accelerated path (for example, a blue-green cutover, or gating behind a feature flag).
| Scenario | Better alternative |
|---|---|
| The table loads in seconds or a few minutes and brief federation latency is acceptable. | Use a single dataset with ready_state: on_registration. Queries fall back to the federated source automatically until acceleration finishes. No application-side routing needed. |
| You never need to query the table before acceleration completes. | Use a single accelerated dataset with the default ready_state: on_load. The runtime will report the dataset as ready only after loading is done. |
| You want fast restarts but already have a prior acceleration file. | Use acceleration snapshots to bootstrap from a pre-built file. The dataset is ready in seconds. |
- Two table names. Your application must be aware of both `<table>` and `<table>_accelerated` and include logic to switch between them when the accelerated copy is ready.
- Double registration overhead. The runtime registers two datasets against the same source. This is lightweight (metadata only for the federated entry), but it adds entries to the dataset catalog.
- Source queried twice during load. Until the acceleration finishes, the federated dataset still routes to the source. If the source charges per query or per byte scanned, plan accordingly.
- Spice CLI installed.
spice init dual-dataset-registration
cd dual-dataset-registration

Replace the generated spicepod.yaml with the following configuration. It registers the public NYC taxi-trips Parquet dataset twice: once as a plain federated table and once with DuckDB file-mode acceleration.
version: v1
kind: Spicepod
name: dual-dataset-registration
datasets:
  # Federated table: available immediately at startup
  - from: s3://spiceai-public-datasets/taxi_trips/
    name: taxi_trips
    params:
      file_format: parquet
  # Accelerated copy: loads in the background
  - from: s3://spiceai-public-datasets/taxi_trips/
    name: taxi_trips_accelerated
    params:
      file_format: parquet
    ready_state: on_registration # Required for runtime to be ready while loading
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      refresh_check_interval: 30m

Key points:
- `taxi_trips` is a federated dataset with no acceleration. It is ready the moment the runtime starts.
- `taxi_trips_accelerated` points to the same source but has `acceleration.enabled: true`. The runtime begins loading data into a local DuckDB file as soon as it starts.
- `ready_state: on_registration` is required on the accelerated dataset so the runtime becomes ready immediately. Without it the runtime waits for the acceleration to finish loading before marking itself ready, which blocks the federated dataset from serving queries.
- `mode: file` writes the accelerated data to disk instead of memory, avoiding out-of-memory crashes for large tables.
- `refresh_check_interval: 30m` re-checks the source every 30 minutes after the initial load.
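The key points above amount to a few invariants that can be checked before starting the runtime. The following is a minimal sketch, assuming the Spicepod has already been parsed into a Python dict (for example by a YAML library). `check_dual_registration` is a hypothetical helper written for this guide, not part of Spice, and the `_accelerated` suffix is just the naming convention used here.

```python
from __future__ import annotations


def check_dual_registration(spicepod: dict, table: str,
                            suffix: str = "_accelerated") -> list[str]:
    """Return a list of problems with the dual-dataset setup (empty if none)."""
    by_name = {d.get("name"): d for d in spicepod.get("datasets", [])}
    federated = by_name.get(table)
    accelerated = by_name.get(table + suffix)

    problems = []
    if federated is None:
        problems.append(f"missing federated dataset '{table}'")
    if accelerated is None:
        problems.append(f"missing accelerated dataset '{table}{suffix}'")
    if problems:
        return problems

    # Both entries must read from the same source.
    if federated.get("from") != accelerated.get("from"):
        problems.append("datasets point at different sources")
    # The plain entry must stay federated so it is ready at startup.
    if federated.get("acceleration", {}).get("enabled"):
        problems.append(f"'{table}' should stay federated (no acceleration)")
    if not accelerated.get("acceleration", {}).get("enabled"):
        problems.append(f"'{table}{suffix}' has acceleration disabled")
    # Without on_registration the runtime waits for the load to finish.
    if accelerated.get("ready_state") != "on_registration":
        problems.append(f"'{table}{suffix}' needs ready_state: on_registration")
    return problems


# The configuration from this guide, written as the dict a YAML parser would produce.
SOURCE = "s3://spiceai-public-datasets/taxi_trips/"
spicepod = {
    "version": "v1",
    "kind": "Spicepod",
    "name": "dual-dataset-registration",
    "datasets": [
        {"from": SOURCE, "name": "taxi_trips", "params": {"file_format": "parquet"}},
        {
            "from": SOURCE,
            "name": "taxi_trips_accelerated",
            "params": {"file_format": "parquet"},
            "ready_state": "on_registration",
            "acceleration": {"enabled": True, "engine": "duckdb", "mode": "file",
                             "refresh_check_interval": "30m"},
        },
    ],
}

print(check_dual_registration(spicepod, "taxi_trips"))  # prints []
```

A check like this could run in CI so that a missing `ready_state` is caught before it delays runtime readiness in production.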
spice run

Shortly after startup you will see both datasets register. The federated table is ready immediately, while the accelerated copy begins loading:
2025-03-24T10:00:01.123456Z INFO runtime::init::dataset: Dataset taxi_trips registered (s3://spiceai-public-datasets/taxi_trips/), results cache enabled.
2025-03-24T10:00:01.234567Z INFO runtime::init::dataset: Dataset taxi_trips_accelerated registered (s3://spiceai-public-datasets/taxi_trips/), acceleration (duckdb:file), results cache enabled.
2025-03-24T10:00:01.234890Z INFO runtime::accelerated_table::refresh_task: Loading data for dataset taxi_trips_accelerated

In a new terminal, open the Spice SQL REPL:
spice sql

Run a query against the federated table. It hits the S3 source directly, so it works right away, even while the acceleration is still loading:

SELECT COUNT(*) AS total_trips FROM taxi_trips;

+-------------+
| total_trips |
+-------------+
| 1547741 |
+-------------+
Queries against the federated table work but have higher latency since every query reads from the remote source.
Use the datasets status API to determine when the accelerated copy is ready:
curl "http://localhost:8090/v1/datasets?status=true" | jq

While still loading:

[
  {
    "from": "s3://spiceai-public-datasets/taxi_trips/",
    "name": "taxi_trips",
    "status": "Ready"
  },
  {
    "from": "s3://spiceai-public-datasets/taxi_trips/",
    "name": "taxi_trips_accelerated",
    "status": "Refreshing"
  }
]

When the acceleration finishes:
[
  {
    "from": "s3://spiceai-public-datasets/taxi_trips/",
    "name": "taxi_trips",
    "status": "Ready"
  },
  {
    "from": "s3://spiceai-public-datasets/taxi_trips/",
    "name": "taxi_trips_accelerated",
    "status": "Ready"
  }
]

The Spice runtime logs also confirm when the load completes:
2025-03-24T10:03:45.678901Z INFO runtime::accelerated_table::refresh_task: Loaded 1,547,741 rows () for dataset taxi_trips_accelerated in 3m 44s.

Once taxi_trips_accelerated reports Ready, run the same query against the accelerated copy:
SELECT COUNT(*) AS total_trips FROM taxi_trips_accelerated;

+-------------+
| total_trips |
+-------------+
| 1547741 |
+-------------+
The result is the same, but the query runs against local DuckDB storage and is significantly faster — especially for analytical queries scanning many rows.
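To put a number on the difference, the same query can be timed against both tables. The sketch below is illustrative only: it assumes the runtime accepts a SQL string via `POST /v1/sql` on the default HTTP port, and `run_sql`, `time_query`, and `compare` are hypothetical helper names. The query runner is injectable so the timing logic can be exercised without a running runtime.

```python
import time
import urllib.request
from typing import Callable, Dict, Iterable

# Assumption: the runtime's HTTP SQL endpoint on the default port.
SQL_URL = "http://localhost:8090/v1/sql"


def run_sql(sql: str, url: str = SQL_URL) -> bytes:
    """POST a SQL string to the runtime and return the raw response body."""
    req = urllib.request.Request(url, data=sql.encode("utf-8"), method="POST")
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.read()


def time_query(sql: str, runner: Callable[[str], object] = run_sql) -> float:
    """Return the wall-clock seconds taken by one execution of `sql`."""
    start = time.perf_counter()
    runner(sql)
    return time.perf_counter() - start


def compare(tables: Iterable[str],
            template: str = "SELECT COUNT(*) AS total_trips FROM {table}",
            runner: Callable[[str], object] = run_sql) -> Dict[str, float]:
    """Time the same query template against each table name."""
    return {t: time_query(template.format(table=t), runner) for t in tables}


# Example (requires a running runtime with both datasets registered):
# for table, seconds in compare(["taxi_trips", "taxi_trips_accelerated"]).items():
#     print(f"{table}: {seconds:.3f}s")
```

A single run is noisy, and the results cache shown as enabled in the startup logs can mask the difference on repeated identical queries, so treat the output as a rough indication rather than a benchmark.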
In your application, poll the status endpoint and route queries accordingly. A minimal example:
#!/bin/bash
# Poll until the accelerated table is ready, then switch.
TABLE="taxi_trips"
while true; do
  STATUS=$(curl -s "http://localhost:8090/v1/datasets?status=true" \
    | jq -r '.[] | select(.name == "taxi_trips_accelerated") | .status')
  if [ "$STATUS" = "Ready" ]; then
    TABLE="taxi_trips_accelerated"
    echo "Switched to accelerated table."
    break
  fi
  sleep 5
done
# Use $TABLE for subsequent queries.

In production, implement this check in your application's startup or health-check loop rather than a shell script.
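One way the application-side check might look is sketched below in Python. It relies only on the status endpoint and the `name` and `status` fields shown earlier in this guide. The `TableRouter` class, its parameters, and the throttling interval are hypothetical choices for illustration, not part of any Spice SDK.

```python
import json
import time
import urllib.request
from typing import Callable, Dict, Optional

STATUS_URL = "http://localhost:8090/v1/datasets?status=true"


def fetch_statuses(url: str = STATUS_URL) -> Dict[str, str]:
    """Return {dataset name: status} from the datasets status API."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return {d["name"]: d.get("status") for d in json.load(resp)}


class TableRouter:
    """Use the federated table until the accelerated copy reports Ready."""

    def __init__(self, federated: str, accelerated: str,
                 fetch: Callable[[], Dict[str, str]] = fetch_statuses,
                 min_interval: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.federated = federated
        self.accelerated = accelerated
        self._fetch = fetch
        self._min_interval = min_interval  # seconds between status checks
        self._clock = clock
        self._last_check: Optional[float] = None
        self._switched = False

    def table(self) -> str:
        """Return the table name to use for the next query."""
        if self._switched:
            return self.accelerated
        now = self._clock()
        if self._last_check is None or now - self._last_check >= self._min_interval:
            self._last_check = now
            try:
                if self._fetch().get(self.accelerated) == "Ready":
                    self._switched = True
            except (OSError, ValueError, KeyError):
                pass  # status API unreachable or malformed: stay federated
        return self.accelerated if self._switched else self.federated


# Example (requires a running runtime):
# router = TableRouter("taxi_trips", "taxi_trips_accelerated")
# sql = f"SELECT COUNT(*) AS total_trips FROM {router.table()}"
```

Like the shell loop, the switch here is one-way: once the accelerated table has reported Ready, the router stops polling. Checking lazily inside `table()` avoids a background thread, at the cost of an occasional status request on the query path. If you need the deterministic cutover described earlier, a feature flag could be checked alongside the status before switching.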
You registered the same S3 source as two datasets — a federated table for immediate queries and an accelerated table for fast local reads once loaded. This dual-dataset pattern keeps your application responsive during cold starts while still delivering the full performance benefits of local acceleration for large tables.
For tables that load quickly or where brief federation latency is acceptable, prefer the simpler ready_state: on_registration approach on a single dataset instead.