docs: Replace load: on_demand with deferred dataset initialization
`load: on_demand` has been removed from the runtime (spiceai/spiceai#10669).
The replacement is `ready_state: on_registration` combined with explicit
`columns[].type` declarations, which enables deferred initialization —
the source connector is not created until the first query references the
dataset.
Source: spiceai/spiceai#10669
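
As a rough migration sketch (not taken from the PR itself), the `postgres` dataset that previously used `load: on_demand` might be rewritten along the following lines. The column names and types are hypothetical placeholders for the real table's schema, and the sketch assumes the dataset meets the other eligibility criteria described in the docs change (read-only, no embedding or full-text-search columns).

```yaml
datasets:
  # Previously this dataset set `load: on_demand`, which the runtime no longer accepts.
  - from: postgres:public.large_table
    name: large_table
    # Mark the dataset ready at registration time instead of after the first load.
    ready_state: on_registration
    params:
      pg_host: localhost
      pg_port: 5432
      pg_db: my_db
      pg_user: ${secrets:pg_user}
      pg_pass: ${secrets:pg_pass}
    columns:
      # Hypothetical columns: every column of the real table needs an explicit type
      # for deferred initialization to apply.
      - name: id
        type: bigint
      - name: name
        type: text
      - name: created_at
        type: timestamptz
```

If any column is left without a `type`, the dataset would presumably not qualify for deferred initialization, since the eligibility rules require every column to be typed.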
File changed: website/docs/reference/spicepod/datasets.md (26 additions, 22 deletions)
@@ -255,7 +255,7 @@ Not all connectors support specifying an `unsupported_type_action`. When specifi
 
 Supports one of two values:
 
-- `on_registration`: Mark the dataset as ready immediately, and queries on this table will fall back to the underlying source directly until the initial acceleration is complete
+- `on_registration`: Mark the dataset as ready immediately, and queries on this table will fall back to the underlying source directly until the initial acceleration is complete. When combined with fully declared [`columns[].type`](#columnstype) entries, enables [deferred dataset initialization](#deferred-dataset-initialization) — the source connector is not created until the first query.
 - `on_load`: Mark the dataset as ready only after the initial acceleration. Queries against the dataset will return an error before the load has been completed.
 
 ```yaml
@@ -303,34 +303,38 @@ LIMIT 1;
 
 If the monitoring query fails a warning is emitted in the logs, an error is propagated to the `task_history` table and the `dataset_unavailable_time_ms` metric is incremented for the failing dataset.
 
-## `load`
+## Deferred dataset initialization
 
-Optional. Controls when the dataset is loaded by the runtime. Defaults to `on_startup`.
+Datasets can defer connector creation and schema inference until the first query by combining `ready_state: on_registration` with fully declared `columns[].type` entries. When every column has an explicit type, the runtime registers a placeholder table with the declared Arrow schema at startup — SQL planning and federation analysis work against this schema **without contacting the source**. On first query, the placeholder is swapped for the real provider.
 
-- `on_startup` (default): The dataset is initialized during runtime startup — the connector is created, schema is inferred, and acceleration (if configured) begins immediately.
-- `on_demand`: The dataset is **not** initialized at startup. Initialization is deferred until the first SQL query that references the dataset, or until an explicit refresh is triggered via `POST /v1/datasets/{name}/acceleration/refresh`.
-
-When a dataset is configured with `load: on_demand`, the runtime:
-- Parses and validates the dataset configuration at startup, but does **not** create the connector, infer the schema, or start any refresh tasks.
-- Reports the dataset status as `NotLoaded` until it is triggered.
-- On the first query (or explicit refresh), initializes the dataset transparently — subsequent queries proceed normally.
-- Coordinates concurrent triggers so the dataset is only initialized once.
+A dataset is eligible for deferred initialization when:
+- It is read-only.
+- `ready_state: on_registration` is set.
+- It has no embedding or full-text-search columns.
+- Every column has an explicit [`columns[].type`](#columnstype).
 
 ```yaml
 datasets:
-  - from: postgres:public.large_table
-    name: large_table
-    load: on_demand
-    params:
-      pg_host: localhost
-      pg_port: 5432
-      pg_db: my_db
-      pg_user: ${secrets:pg_user}
-      pg_pass: ${secrets:pg_pass}
+  - from: https://api.example.com/data.json
+    name: my_data
+    ready_state: on_registration
+    columns:
+      - name: id
+        type: bigint
+      - name: name
+        type: text
+      - name: created_at
+        type: timestamptz
 ```
 
-:::tip
-Use `load: on_demand` for large or infrequently accessed datasets to reduce startup time and resource consumption. The dataset will be loaded transparently on first access.
+When deferred initialization is active, the runtime:
+- Registers the dataset immediately with the declared schema — queries can reference the table in planning before the source is contacted.
+- On the first query that references the dataset, initializes the connector and loads the real data transparently.
+- Coordinates concurrent triggers so the dataset is only initialized once.
+- Supports acceleration — after deferred initialization, the acceleration table, refresh loop, and health monitor are set up normally.
+
+:::warning[Breaking change]
+`load: on_demand` has been removed. Replace it with `ready_state: on_registration` combined with explicit `columns[].type` declarations.