Enhance documentation on schema inference and evolution across components, including data accelerators, data connectors, views, and troubleshooting sections. (#1394)
website/docs/components/data-accelerators/index.md
Lines changed: 8 additions & 0 deletions

@@ -87,6 +87,14 @@ In-memory limitations can be mitigated by storing acceleration data on disk, whi

 :::

+## Schema Handling
+
+Data accelerators store the schema that Spice infers from the data source at startup. This schema is fixed for the lifetime of the runtime process and defines the column names, data types, and nullability of the accelerated table.
+
+If the source schema changes while the runtime is running (for example, new columns are added or data types change), subsequent data refreshes into the accelerator will fail because the incoming data no longer matches the schema of the accelerated table. Restart the runtime to re-infer the schema and re-initialize the accelerated table.
+
+For details on how schema inference works per connector and recommendations for managing schema drift, see [Schema Inference](data-connectors#schema-inference).
website/docs/components/data-connectors/index.md
Lines changed: 35 additions & 0 deletions

@@ -177,6 +177,41 @@ SELECT * FROM partitioned_data WHERE year = '2024' AND month = '01';
 ```

 Partition pruning improves query performance by reading only the relevant files.
+
+## Schema Inference
+
+Spice infers the schema for each dataset from its data source at startup. The inferred schema defines the column names, data types, and nullability used by the dataset for the lifetime of that runtime process.
+
+Schema inference happens once, when the dataset is first registered. Some connectors support tuning the inference behavior with connector-specific parameters:
+
+| Connector | Parameter | Default | Description |
+| --- | --- | --- | --- |
+| [Kafka](data-connectors/kafka) | `schema_infer_max_records` | 10 | Number of messages sampled to infer the JSON schema |
+| [DynamoDB](data-connectors/dynamodb) | `schema_infer_max_records` | 10 | Number of items sampled to infer the schema |
+| [MongoDB](data-connectors/mongodb) | `mongodb_num_docs_to_infer_schema` | 400 | Number of documents sampled to infer the schema |
+| [CSV files](../reference/file_format) | `csv_schema_infer_max_records` | 1000 | Number of rows sampled to infer the CSV schema |
+
+For connectors that read self-describing formats (Parquet, Arrow, Avro), the schema is read directly from file metadata and does not require sampling.
+
+### Runtime Schema Changes
+
+Spice does not apply schema changes at runtime. If the source schema changes while the runtime is running (for example, new columns are added, columns are removed, or data types change), subsequent data refreshes will fail with an error such as:
+
+```
+Failed to load data for dataset <name>: Cannot cast struct field ...
+```
+
+This behavior is by design. Blocking runtime schema evolution protects accelerated tables from unintentional or breaking schema changes that could corrupt data or produce unexpected query results.
+
+To apply a new source schema, restart the Spice runtime. On startup, Spice re-infers the schema from the source and re-initializes the dataset with the updated column definitions.
+
+:::tip[Recommendation]
+Pin a known-good schema version in the data source or use the [`columns`](../reference/spicepod/datasets#columns) configuration to explicitly define the expected columns. This makes schema expectations explicit and produces clear errors if the source drifts.
+:::
+
+:::note
+Runtime schema evolution controls are planned for a future release. When available, schema evolution will remain off by default.
+:::
 | Name | Parameter | Supported | Is Document Format |
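Reviewer note on the sampling parameters in the Schema Inference table: the sketch below illustrates, in general terms, why the sample size matters for schemaless sources. It is not Spice's implementation. The `infer_schema` function, the record layout, and the `discount` field are all hypothetical, and types are reduced to Python type names for brevity.

```python
from typing import Any


def infer_schema(records: list[dict[str, Any]], max_records: int) -> dict[str, str]:
    """Infer {field: type name} from the first `max_records` records only."""
    schema: dict[str, str] = {}
    for record in records[:max_records]:
        for field, value in record.items():
            # The first observed type wins; records outside the sample are never seen.
            schema.setdefault(field, type(value).__name__)
    return schema


# Hypothetical source: the "discount" field first appears in the 12th record.
records: list[dict[str, Any]] = [{"id": i, "price": 9.99} for i in range(11)]
records.append({"id": 11, "price": 4.50, "discount": 0.1})

small_sample = infer_schema(records, max_records=10)
large_sample = infer_schema(records, max_records=400)

print(sorted(small_sample))  # ['id', 'price']
print(sorted(large_sample))  # ['discount', 'id', 'price']
```

With a sample of 10 the late-arriving field is absent from the inferred schema, which is the situation a larger sampling value is meant to avoid for sparse or heterogeneous documents.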
website/docs/components/views/index.md
Lines changed: 8 additions & 0 deletions

@@ -44,3 +44,11 @@ views:
 - Views are read-only; insert, update, and delete operations are not supported.
 - Performance depends on SQL complexity and underlying data.
 - Ensure queries are optimized to prevent slow execution.
+
+## Schema Inference and Evolution
+
+Views derive their schema from the SQL query that defines them. When a view is accelerated, Spice materializes this schema into the acceleration engine at startup.
+
+If the underlying dataset schemas change while the runtime is running, the accelerated view will fail to refresh because the materialized schema no longer matches the source data. Restart the runtime to re-derive the view schema from the updated datasets.
+
+For more detail on schema inference and runtime schema changes, see the [Data Connectors schema inference](data-connectors#schema-inference) documentation.
website/docs/faq/index.md
Lines changed: 8 additions & 0 deletions

@@ -115,3 +115,11 @@ Yes. Spice accelerates data locally using Apache Arrow, Spice Cayenne (Vortex),
 ## 21. How can developers contribute to Spice?

 Developers can contribute by submitting code, documentation, or raising issues on [GitHub](https://github.com/spiceai/spiceai). See [CONTRIBUTING.md](https://github.com/spiceai/spiceai/blob/trunk/CONTRIBUTING) for guidelines.
+
+## 22. Does Spice support schema evolution?
+
+Spice infers the schema for datasets and views at startup and does not apply runtime schema changes by default. If the source schema changes while the runtime is running (for example, columns are added, removed, or their types change), data refreshes will fail with a schema mismatch error rather than silently applying the new schema. This behavior is intentional: it prevents unintentional or breaking schema changes from propagating into accelerated tables.
+
+To pick up a new source schema, restart the Spice runtime. On startup, Spice re-infers the schema from the source, and the accelerated table is re-initialized with the updated schema.
+
+Runtime schema evolution controls are planned for a future release but will remain off by default to guard against unexpected schema drift.
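Reviewer note on FAQ 22: the three kinds of change it names (columns added, removed, or retyped) can be detected ahead of a restart with a simple comparison. The sketch below is an illustration only and does not use any Spice API. The `diff_schemas` helper, the column names, and the Arrow-style type strings are assumptions, and how the two schemas are obtained is left to the reader.

```python
def diff_schemas(startup: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {column: type} mappings and report how `current` drifted."""
    added = sorted(current.keys() - startup.keys())
    removed = sorted(startup.keys() - current.keys())
    # A column present in both schemas counts as retyped when its type string differs.
    retyped = sorted(
        col for col in startup.keys() & current.keys() if startup[col] != current[col]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


# Hypothetical schemas: one captured at startup, one read from the source later.
startup_schema = {"id": "Int64", "name": "Utf8", "amount": "Float64"}
current_schema = {"id": "Int64", "amount": "Utf8", "region": "Utf8"}

drift = diff_schemas(startup_schema, current_schema)
has_drift = any(drift.values())

print(drift)      # {'added': ['region'], 'removed': ['name'], 'retyped': ['amount']}
print(has_drift)  # True
```

A check like this could run in a deployment pipeline so that a restart is scheduled deliberately when drift is detected, instead of being discovered through a failed refresh.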
website/docs/features/data-acceleration/index.md
Lines changed: 2 additions & 0 deletions

@@ -30,6 +30,8 @@ Consider a high-volume e-trading frontend application backed by an AWS RDS datab

 **Refresh Latency**: The `refresh_check_interval` controls how frequently the runtime checks for new data. Shorter intervals increase load on the source database. For real-time requirements, use [Change Data Capture (CDC)](../cdc/index.md) instead of polling.

+**Schema Changes**: Spice infers the schema for each accelerated dataset at startup and does not apply schema changes at runtime. If the source schema changes (columns added, removed, or types changed) while the runtime is running, data refreshes will fail rather than silently applying the new schema. To pick up source schema changes, restart the runtime. See [Schema Inference](../components/data-connectors#schema-inference) for details.
+
 **Engine Selection**: Choose the acceleration engine based on workload characteristics:
website/docs/reference/spicepod/datasets.md
Lines changed: 6 additions & 0 deletions

@@ -207,6 +207,12 @@ Spice emits a warning if the `time_column` from the data source is incompatible

 (Optional) Define the format of the `time_partition_column`. For instance, if the physical partitions follow a date format (YYYY-MM-DD), set this value to `date`. The same format options as `time_format` are supported for `time_partition_column`.

+## Schema Inference and Evolution
+
+Spice infers the dataset schema from the data source at startup. The inferred schema defines the column names, data types, and nullability used for the lifetime of that runtime process. Schema changes at the source are not applied at runtime, so data refreshes will fail if the source schema drifts. Restart the runtime to re-infer the schema.
+
+For connector-specific inference parameters, runtime schema change behavior, and recommendations, see [Schema Inference](../../components/data-connectors#schema-inference).
+
 ## `unsupported_type_action`

 Optional. Specifies the action to take when a data type that is not supported by the data connector is encountered.
website/docs/troubleshooting/index.md
Lines changed: 1 addition & 1 deletion

@@ -33,7 +33,7 @@ Check the runtime output for error messages. Common causes:

 - **Incorrect credentials**: Verify connection parameters (`pg_host`, `pg_user`, `pg_pass`, etc.) and that secrets are configured correctly. Use `--verbose` to see secret resolution logs.
 - **Network connectivity**: Ensure the Spice runtime can reach the data source. Test with `curl` or `psql` from the same host.
-- **Schema mismatch**: If the source schema changed since the last refresh, the accelerated table may fail to update. Restart the runtime to re-initialize the dataset.
+- **Schema mismatch**: If the source schema changed since the last refresh (for example, columns were added, removed, or types changed), the accelerated table will fail to update. Spice infers the dataset schema at startup and intentionally blocks runtime schema evolution to protect against unintentional or breaking changes. Restart the runtime so Spice re-infers the schema from the source. See [Schema Inference](components/data-connectors#schema-inference) for details.

 Review the `task_history` table for detailed error messages:
website/versioned_docs/version-1.11.x/components/data-accelerators/index.md
Lines changed: 8 additions & 0 deletions

@@ -87,6 +87,14 @@ In-memory limitations can be mitigated by storing acceleration data on disk, whi

 :::

+## Schema Handling
+
+Data accelerators store the schema that Spice infers from the data source at startup. This schema is fixed for the lifetime of the runtime process and defines the column names, data types, and nullability of the accelerated table.
+
+If the source schema changes while the runtime is running (for example, new columns are added or data types change), subsequent data refreshes into the accelerator will fail because the incoming data no longer matches the schema of the accelerated table. Restart the runtime to re-infer the schema and re-initialize the accelerated table.
+
+For details on how schema inference works per connector and recommendations for managing schema drift, see [Schema Inference](data-connectors#schema-inference).
website/versioned_docs/version-1.11.x/components/data-connectors/index.md
Lines changed: 35 additions & 0 deletions

@@ -176,6 +176,41 @@ SELECT * FROM partitioned_data WHERE year = '2024' AND month = '01';
 ```

 Partition pruning improves query performance by reading only the relevant files.
+
+## Schema Inference
+
+Spice infers the schema for each dataset from its data source at startup. The inferred schema defines the column names, data types, and nullability used by the dataset for the lifetime of that runtime process.
+
+Schema inference happens once, when the dataset is first registered. Some connectors support tuning the inference behavior with connector-specific parameters:
+
+| Connector | Parameter | Default | Description |
+| --- | --- | --- | --- |
+| [Kafka](data-connectors/kafka) | `schema_infer_max_records` | 10 | Number of messages sampled to infer the JSON schema |
+| [DynamoDB](data-connectors/dynamodb) | `schema_infer_max_records` | 10 | Number of items sampled to infer the schema |
+| [MongoDB](data-connectors/mongodb) | `mongodb_num_docs_to_infer_schema` | 400 | Number of documents sampled to infer the schema |
+| [CSV files](../reference/file_format) | `csv_schema_infer_max_records` | 1000 | Number of rows sampled to infer the CSV schema |
+
+For connectors that read self-describing formats (Parquet, Arrow, Avro), the schema is read directly from file metadata and does not require sampling.
+
+### Runtime Schema Changes
+
+Spice does not apply schema changes at runtime. If the source schema changes while the runtime is running (for example, new columns are added, columns are removed, or data types change), subsequent data refreshes will fail with an error such as:
+
+```
+Failed to load data for dataset <name>: Cannot cast struct field ...
+```
+
+This behavior is by design. Blocking runtime schema evolution protects accelerated tables from unintentional or breaking schema changes that could corrupt data or produce unexpected query results.
+
+To apply a new source schema, restart the Spice runtime. On startup, Spice re-infers the schema from the source and re-initializes the dataset with the updated column definitions.
+
+:::tip[Recommendation]
+Pin a known-good schema version in the data source or use the [`columns`](../reference/spicepod/datasets#columns) configuration to explicitly define the expected columns. This makes schema expectations explicit and produces clear errors if the source drifts.
+:::
+
+:::note
+Runtime schema evolution controls are planned for a future release. When available, schema evolution will remain off by default.
+:::
 | Name | Parameter | Supported | Is Document Format |
website/versioned_docs/version-1.11.x/components/views/index.md
Lines changed: 8 additions & 0 deletions

@@ -44,3 +44,11 @@ views:
 - Views are read-only; insert, update, and delete operations are not supported.
 - Performance depends on SQL complexity and underlying data.
 - Ensure queries are optimized to prevent slow execution.
+
+## Schema Inference and Evolution
+
+Views derive their schema from the SQL query that defines them. When a view is accelerated, Spice materializes this schema into the acceleration engine at startup.
+
+If the underlying dataset schemas change while the runtime is running, the accelerated view will fail to refresh because the materialized schema no longer matches the source data. Restart the runtime to re-derive the view schema from the updated datasets.
+
+For more detail on schema inference and runtime schema changes, see the [Data Connectors schema inference](data-connectors#schema-inference) documentation.