Commit d779dc8

claudespicelukekim
authored and committed
docs: Replace load: on_demand with deferred dataset initialization
`load: on_demand` has been removed from the runtime (spiceai/spiceai#10669). The replacement is `ready_state: on_registration` combined with explicit `columns[].type` declarations, which enables deferred initialization — the source connector is not created until the first query references the dataset. Source: spiceai/spiceai#10669
1 parent 15d2f9c commit d779dc8

1 file changed

Lines changed: 26 additions & 22 deletions

website/docs/reference/spicepod/datasets.md

````diff
@@ -255,7 +255,7 @@ Not all connectors support specifying an `unsupported_type_action`. When specifi
 
 Supports one of two values:
 
-- `on_registration`: Mark the dataset as ready immediately, and queries on this table will fall back to the underlying source directly until the initial acceleration is complete
+- `on_registration`: Mark the dataset as ready immediately, and queries on this table will fall back to the underlying source directly until the initial acceleration is complete. When combined with fully declared [`columns[].type`](#columnstype) entries, enables [deferred dataset initialization](#deferred-dataset-initialization) — the source connector is not created until the first query.
 - `on_load`: Mark the dataset as ready only after the initial acceleration. Queries against the dataset will return an error before the load has been completed.
 
 ```yaml
@@ -303,34 +303,38 @@ LIMIT 1;
 
 If the monitoring query fails a warning is emitted in the logs, an error is propagated to the `task_history` table and the `dataset_unavailable_time_ms` metric is incremented for the failing dataset.
 
-## `load`
+## Deferred dataset initialization
 
-Optional. Controls when the dataset is loaded by the runtime. Defaults to `on_startup`.
+Datasets can defer connector creation and schema inference until the first query by combining `ready_state: on_registration` with fully declared `columns[].type` entries. When every column has an explicit type, the runtime registers a placeholder table with the declared Arrow schema at startup — SQL planning and federation analysis work against this schema **without contacting the source**. On first query, the placeholder is swapped for the real provider.
 
-- `on_startup` (default): The dataset is initialized during runtime startup — the connector is created, schema is inferred, and acceleration (if configured) begins immediately.
-- `on_demand`: The dataset is **not** initialized at startup. Initialization is deferred until the first SQL query that references the dataset, or until an explicit refresh is triggered via `POST /v1/datasets/{name}/acceleration/refresh`.
-
-When a dataset is configured with `load: on_demand`, the runtime:
-- Parses and validates the dataset configuration at startup, but does **not** create the connector, infer the schema, or start any refresh tasks.
-- Reports the dataset status as `NotLoaded` until it is triggered.
-- On the first query (or explicit refresh), initializes the dataset transparently — subsequent queries proceed normally.
-- Coordinates concurrent triggers so the dataset is only initialized once.
+A dataset is eligible for deferred initialization when:
+- It is read-only.
+- `ready_state: on_registration` is set.
+- It has no embedding or full-text-search columns.
+- Every column has an explicit [`columns[].type`](#columnstype).
 
 ```yaml
 datasets:
-  - from: postgres:public.large_table
-    name: large_table
-    load: on_demand
-    params:
-      pg_host: localhost
-      pg_port: 5432
-      pg_db: my_db
-      pg_user: ${secrets:pg_user}
-      pg_pass: ${secrets:pg_pass}
+  - from: https://api.example.com/data.json
+    name: my_data
+    ready_state: on_registration
+    columns:
+      - name: id
+        type: bigint
+      - name: name
+        type: text
+      - name: created_at
+        type: timestamptz
 ```
 
-:::tip
-Use `load: on_demand` for large or infrequently accessed datasets to reduce startup time and resource consumption. The dataset will be loaded transparently on first access.
+When deferred initialization is active, the runtime:
+- Registers the dataset immediately with the declared schema — queries can reference the table in planning before the source is contacted.
+- On the first query that references the dataset, initializes the connector and loads the real data transparently.
+- Coordinates concurrent triggers so the dataset is only initialized once.
+- Supports acceleration — after deferred initialization, the acceleration table, refresh loop, and health monitor are set up normally.
+
+:::warning[Breaking change]
+`load: on_demand` has been removed. Replace it with `ready_state: on_registration` combined with explicit `columns[].type` declarations.
 :::
 
 ## `acceleration`
````
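
Migration note: the breaking change in this commit can be sketched by converting the removed `postgres:public.large_table` example to the new form. This is an illustration under stated assumptions, not part of the commit. The column names `id`, `payload`, and `updated_at` are hypothetical placeholders, and the sketch assumes the table is read-only and has no embedding or full-text-search columns, as the eligibility list requires.

```yaml
datasets:
  - from: postgres:public.large_table
    name: large_table
    # Replaces the removed `load: on_demand` setting.
    ready_state: on_registration
    params:
      pg_host: localhost
      pg_port: 5432
      pg_db: my_db
      pg_user: ${secrets:pg_user}
      pg_pass: ${secrets:pg_pass}
    # Every column needs an explicit type for deferred initialization to apply.
    # The column names below are hypothetical; use the real table's columns.
    columns:
      - name: id
        type: bigint
      - name: payload
        type: text
      - name: updated_at
        type: timestamptz
```

The commit does not say how the runtime behaves when some columns lack a `type`. The eligibility list only implies that such a dataset would not qualify for deferred initialization.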
