Enhance documentation on schema inference and evolution across components, including data accelerators, data connectors, views, and troubleshooting sections. (#1394)

lukekim · web-flow · commit 4f396ce8208e · 2026-03-04T20:49:56.000Z
diff --git a/website/docs/components/data-accelerators/index.md b/website/docs/components/data-accelerators/index.md
@@ -87,6 +87,14 @@ In-memory limitations can be mitigated by storing acceleration data on disk, whi
 
 :::
 
+## Schema Handling
+
+Data accelerators store the schema that Spice infers from the data source at startup. This schema is fixed for the lifetime of the runtime process and defines the column names, data types, and nullability of the accelerated table.
+
+If the source schema changes while the runtime is running (for example, new columns are added or data types change), subsequent data refreshes into the accelerator will fail because the incoming data no longer matches the schema of the accelerated table. Restart the runtime to re-infer the schema and re-initialize the accelerated table.
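+
+As a rough sketch of where this applies, consider an accelerated dataset that refreshes on an interval. The source table and dataset name below are placeholders, and connection parameters are omitted:
+
+```yaml
+datasets:
+  # Placeholder source and name; connection params omitted for brevity.
+  - from: postgres:public.orders
+    name: orders
+    acceleration:
+      enabled: true
+      # Each refresh loads data into the table created from the
+      # schema inferred at startup.
+      refresh_check_interval: 10s
+```
+
+In this sketch, a column type change in the source table after startup would cause the next refresh to fail until the runtime is restarted.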
+
+For details on how schema inference works per connector and recommendations for managing schema drift, see [Schema Inference](data-connectors#schema-inference).
+
 ## Data Accelerator Docs
 
 import DocCardList from '@theme/DocCardList';
diff --git a/website/docs/components/data-connectors/index.md b/website/docs/components/data-connectors/index.md
@@ -177,6 +177,41 @@ SELECT * FROM partitioned_data WHERE year = '2024' AND month = '01';
 ```
 
 Partition pruning improves query performance by reading only the relevant files.
+
+## Schema Inference
+
+Spice infers the schema for each dataset from its data source at startup. The inferred schema defines the column names, data types, and nullability used by the dataset for the lifetime of that runtime process.
+
+Schema inference happens once, when the dataset is first registered. Some connectors support tuning the inference behavior with connector-specific parameters:
+
+| Connector                             | Parameter                          | Default | Description                                         |
+| ------------------------------------- | ---------------------------------- | ------- | --------------------------------------------------- |
+| [Kafka](data-connectors/kafka)        | `schema_infer_max_records`         | 10      | Number of messages sampled to infer the JSON schema |
+| [DynamoDB](data-connectors/dynamodb)  | `schema_infer_max_records`         | 10      | Number of items sampled to infer the schema         |
+| [MongoDB](data-connectors/mongodb)    | `mongodb_num_docs_to_infer_schema` | 400     | Number of documents sampled to infer the schema     |
+| [CSV files](../reference/file_format) | `csv_schema_infer_max_records`     | 1000    | Number of rows sampled to infer the CSV schema      |
+
+For connectors that read self-describing formats (Parquet, Arrow, Avro), the schema is read directly from file metadata and does not require sampling.
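+
+As a minimal sketch, a sampling parameter from the table above might be set under a dataset's `params`. The bucket path and dataset name are placeholders, and the value `5000` is only an illustration, not a recommendation:
+
+```yaml
+datasets:
+  # Placeholder source path and name; substitute real values.
+  - from: s3://my-bucket/events/
+    name: events
+    params:
+      file_format: csv
+      # Sample more rows than the default (1000) before the schema is fixed.
+      csv_schema_infer_max_records: 5000
+```
+
+A larger sample gives type inference more values to work from, though it may lengthen startup.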
+
+### Runtime Schema Changes
+
+Spice does not apply schema changes at runtime. If the source schema changes while the runtime is running — for example, new columns are added, columns are removed, or data types change — subsequent data refreshes will fail with an error such as:
+
+```
+Failed to load data for dataset <name>: Cannot cast struct field ...
+```
+
+This behavior is by design. Blocking runtime schema evolution protects accelerated tables from unintentional or breaking schema changes that could corrupt data or produce unexpected query results.
+
+To apply a new source schema, restart the Spice runtime. On startup, Spice re-infers the schema from the source and re-initializes the dataset with the updated column definitions.
+
+:::tip[Recommendation]
+Pin a known-good schema version in the data source or use the [`columns`](../reference/spicepod/datasets#columns) configuration to explicitly define the expected columns. This makes schema expectations explicit and produces clear errors if the source drifts.
+:::
+
+:::note
+Runtime schema evolution controls are planned for a future release. When available, schema evolution will remain off by default.
+:::
+
 | Name                                          | Parameter              | Supported | Is Document Format |
 | --------------------------------------------- | ---------------------- | --------- | ------------------ |
 | [Apache Parquet](https://parquet.apache.org/) | `file_format: parquet` | ✅         | ❌                  |
diff --git a/website/docs/components/views/index.md b/website/docs/components/views/index.md
@@ -44,3 +44,11 @@ views:
 - Views are read-only; insert, update, and delete operations are not supported.
 - Performance depends on SQL complexity and underlying data.
 - Ensure queries are optimized to prevent slow execution.
+
+## Schema Inference and Evolution
+
+Views derive their schema from the SQL query that defines them. When a view is accelerated, Spice materializes this schema into the acceleration engine at startup.
+
+If the underlying dataset schemas change while the runtime is running, the accelerated view will fail to refresh because the materialized schema no longer matches the source data. Restart the runtime to re-derive the view schema from the updated datasets.
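+
+A minimal sketch of an accelerated view, assuming views accept the same `acceleration` block as datasets. The view name, the query, and the `orders` dataset are hypothetical:
+
+```yaml
+views:
+  # Hypothetical view over a hypothetical `orders` dataset.
+  - name: order_totals
+    # The view schema (order_id, total) is derived from this query.
+    sql: SELECT order_id, total FROM orders
+    acceleration:
+      # The derived schema is materialized at startup.
+      enabled: true
+```
+
+In this sketch, if the type of `total` changes in `orders` while the runtime is running, the view refresh would be expected to fail until a restart.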
+
+For more detail on schema inference and runtime schema changes, see the [Data Connectors schema inference](data-connectors#schema-inference) documentation.
diff --git a/website/docs/faq/index.md b/website/docs/faq/index.md
@@ -115,3 +115,11 @@ Yes. Spice accelerates data locally using Apache Arrow, Spice Cayenne (Vortex),
 ## 21. How can developers contribute to Spice?
 
 Developers can contribute by submitting code, documentation, or raising issues on [GitHub](https://github.com/spiceai/spiceai). See [CONTRIBUTING.md](https://github.com/spiceai/spiceai/blob/trunk/CONTRIBUTING) for guidelines.
+
+## 22. Does Spice support schema evolution?
+
+Spice infers the schema for datasets and views at startup and does not apply schema changes at runtime. If the source schema changes while the runtime is running (for example, columns are added or removed, or their types change), data refreshes fail with a schema mismatch error rather than silently applying the new schema. This behavior is intentional: it keeps unintentional or breaking schema changes from propagating into accelerated tables.
+
+To pick up a new source schema, restart the Spice runtime. On startup, Spice re-infers the schema from the source, and the accelerated table is re-initialized with the updated schema.
+
+Runtime schema evolution controls are planned for a future release but will remain off by default to guard against unexpected schema drift.
diff --git a/website/docs/features/data-acceleration/index.md b/website/docs/features/data-acceleration/index.md
@@ -30,6 +30,8 @@ Consider a high-volume e-trading frontend application backed by an AWS RDS datab
 
 **Refresh Latency**: The `refresh_check_interval` controls how frequently the runtime checks for new data. Shorter intervals increase load on the source database. For real-time requirements, use [Change Data Capture (CDC)](../cdc/index.md) instead of polling.
 
+**Schema Changes**: Spice infers the schema for each accelerated dataset at startup and does not apply schema changes at runtime. If the source schema changes (columns added, removed, or types changed) while the runtime is running, data refreshes will fail rather than silently applying the new schema. To pick up source schema changes, restart the runtime. See [Schema Inference](../components/data-connectors#schema-inference) for details.
+
 **Engine Selection**: Choose the acceleration engine based on workload characteristics:
 
 | Engine     | Best For                                           | Mode               |
diff --git a/website/docs/reference/spicepod/datasets.md b/website/docs/reference/spicepod/datasets.md
@@ -207,6 +207,12 @@ Spice emits a warning if the `time_column` from the data source is incompatible
 
 (Optional) Define the format of the `time_partition_column`. For instance, if the physical partitions follow a date format (YYYY-MM-DD), set this value to `date`. The same format options as `time_format` are supported for `time_partition_column`.
 
+## Schema Inference and Evolution
+
+Spice infers the dataset schema from the data source at startup. The inferred schema defines the column names, data types, and nullability used for the lifetime of that runtime process. Schema changes at the source are not applied at runtime — data refreshes will fail if the source schema drifts. Restart the runtime to re-infer the schema.
+
+For connector-specific inference parameters, runtime schema change behavior, and recommendations, see [Schema Inference](../../components/data-connectors#schema-inference).
+
 ## `unsupported_type_action`
 
 Optional. Specifies the action to take when a data type that is not supported by the data connector is encountered.
diff --git a/website/docs/troubleshooting/index.md b/website/docs/troubleshooting/index.md
@@ -33,7 +33,7 @@ Check the runtime output for error messages. Common causes:
 
 - **Incorrect credentials**: Verify connection parameters (`pg_host`, `pg_user`, `pg_pass`, etc.) and that secrets are configured correctly. Use `--verbose` to see secret resolution logs.
 - **Network connectivity**: Ensure the Spice runtime can reach the data source. Test with `curl` or `psql` from the same host.
-- **Schema mismatch**: If the source schema changed since the last refresh, the accelerated table may fail to update. Restart the runtime to re-initialize the dataset.
+- **Schema mismatch**: If the source schema changed after the runtime started (for example, columns were added or removed, or data types changed), the accelerated table will fail to update. Spice infers the dataset schema at startup and intentionally blocks runtime schema evolution to protect against unintentional or breaking changes. Restart the runtime so Spice re-infers the schema from the source. See [Schema Inference](components/data-connectors#schema-inference) for details.
 
 Review the `task_history` table for detailed error messages:
 
diff --git a/website/versioned_docs/version-1.11.x/components/data-accelerators/index.md b/website/versioned_docs/version-1.11.x/components/data-accelerators/index.md
@@ -87,6 +87,14 @@ In-memory limitations can be mitigated by storing acceleration data on disk, whi
 
 :::
 
+## Schema Handling
+
+Data accelerators store the schema that Spice infers from the data source at startup. This schema is fixed for the lifetime of the runtime process and defines the column names, data types, and nullability of the accelerated table.
+
+If the source schema changes while the runtime is running (for example, new columns are added or data types change), subsequent data refreshes into the accelerator will fail because the incoming data no longer matches the schema of the accelerated table. Restart the runtime to re-infer the schema and re-initialize the accelerated table.
+
+For details on how schema inference works per connector and recommendations for managing schema drift, see [Schema Inference](data-connectors#schema-inference).
+
 ## Data Accelerator Docs
 
 import DocCardList from '@theme/DocCardList';
diff --git a/website/versioned_docs/version-1.11.x/components/data-connectors/index.md b/website/versioned_docs/version-1.11.x/components/data-connectors/index.md
@@ -176,6 +176,41 @@ SELECT * FROM partitioned_data WHERE year = '2024' AND month = '01';
 ```
 
 Partition pruning improves query performance by reading only the relevant files.
+
+## Schema Inference
+
+Spice infers the schema for each dataset from its data source at startup. The inferred schema defines the column names, data types, and nullability used by the dataset for the lifetime of that runtime process.
+
+Schema inference happens once, when the dataset is first registered. Some connectors support tuning the inference behavior with connector-specific parameters:
+
+| Connector                             | Parameter                          | Default | Description                                         |
+| ------------------------------------- | ---------------------------------- | ------- | --------------------------------------------------- |
+| [Kafka](data-connectors/kafka)        | `schema_infer_max_records`         | 10      | Number of messages sampled to infer the JSON schema |
+| [DynamoDB](data-connectors/dynamodb)  | `schema_infer_max_records`         | 10      | Number of items sampled to infer the schema         |
+| [MongoDB](data-connectors/mongodb)    | `mongodb_num_docs_to_infer_schema` | 400     | Number of documents sampled to infer the schema     |
+| [CSV files](../reference/file_format) | `csv_schema_infer_max_records`     | 1000    | Number of rows sampled to infer the CSV schema      |
+
+For connectors that read self-describing formats (Parquet, Arrow, Avro), the schema is read directly from file metadata and does not require sampling.
+
+### Runtime Schema Changes
+
+Spice does not apply schema changes at runtime. If the source schema changes while the runtime is running — for example, new columns are added, columns are removed, or data types change — subsequent data refreshes will fail with an error such as:
+
+```
+Failed to load data for dataset <name>: Cannot cast struct field ...
+```
+
+This behavior is by design. Blocking runtime schema evolution protects accelerated tables from unintentional or breaking schema changes that could corrupt data or produce unexpected query results.
+
+To apply a new source schema, restart the Spice runtime. On startup, Spice re-infers the schema from the source and re-initializes the dataset with the updated column definitions.
+
+:::tip[Recommendation]
+Pin a known-good schema version in the data source or use the [`columns`](../reference/spicepod/datasets#columns) configuration to explicitly define the expected columns. This makes schema expectations explicit and produces clear errors if the source drifts.
+:::
+
+:::note
+Runtime schema evolution controls are planned for a future release. When available, schema evolution will remain off by default.
+:::
+
 | Name                                          | Parameter              | Supported | Is Document Format |
 | --------------------------------------------- | ---------------------- | --------- | ------------------ |
 | [Apache Parquet](https://parquet.apache.org/) | `file_format: parquet` | ✅         | ❌                  |
diff --git a/website/versioned_docs/version-1.11.x/components/views/index.md b/website/versioned_docs/version-1.11.x/components/views/index.md
@@ -44,3 +44,11 @@ views:
 - Views are read-only; insert, update, and delete operations are not supported.
 - Performance depends on SQL complexity and underlying data.
 - Ensure queries are optimized to prevent slow execution.
+
+## Schema Inference and Evolution
+
+Views derive their schema from the SQL query that defines them. When a view is accelerated, Spice materializes this schema into the acceleration engine at startup.
+
+If the underlying dataset schemas change while the runtime is running, the accelerated view will fail to refresh because the materialized schema no longer matches the source data. Restart the runtime to re-derive the view schema from the updated datasets.
+
+For more detail on schema inference and runtime schema changes, see the [Data Connectors schema inference](data-connectors#schema-inference) documentation.
diff --git a/website/versioned_docs/version-1.11.x/faq/index.md b/website/versioned_docs/version-1.11.x/faq/index.md
@@ -115,3 +115,11 @@ Yes. Spice accelerates data locally using Apache Arrow, Spice Cayenne (Vortex),
 ## 21. How can developers contribute to Spice?
 
 Developers can contribute by submitting code, documentation, or raising issues on [GitHub](https://github.com/spiceai/spiceai). See [CONTRIBUTING.md](https://github.com/spiceai/spiceai/blob/trunk/CONTRIBUTING) for guidelines.
+
+## 22. Does Spice support schema evolution?
+
+Spice infers the schema for datasets and views at startup and does not apply schema changes at runtime. If the source schema changes while the runtime is running (for example, columns are added or removed, or their types change), data refreshes fail with a schema mismatch error rather than silently applying the new schema. This behavior is intentional: it keeps unintentional or breaking schema changes from propagating into accelerated tables.
+
+To pick up a new source schema, restart the Spice runtime. On startup, Spice re-infers the schema from the source, and the accelerated table is re-initialized with the updated schema.
+
+Runtime schema evolution controls are planned for a future release but will remain off by default to guard against unexpected schema drift.
diff --git a/website/versioned_docs/version-1.11.x/features/data-acceleration/index.md b/website/versioned_docs/version-1.11.x/features/data-acceleration/index.md
@@ -30,6 +30,8 @@ Consider a high-volume e-trading frontend application backed by an AWS RDS datab
 
 **Refresh Latency**: The `refresh_check_interval` controls how frequently the runtime checks for new data. Shorter intervals increase load on the source database. For real-time requirements, use [Change Data Capture (CDC)](../cdc/index.md) instead of polling.
 
+**Schema Changes**: Spice infers the schema for each accelerated dataset at startup and does not apply schema changes at runtime. If the source schema changes (columns added, removed, or types changed) while the runtime is running, data refreshes will fail rather than silently applying the new schema. To pick up source schema changes, restart the runtime. See [Schema Inference](../components/data-connectors#schema-inference) for details.
+
 **Engine Selection**: Choose the acceleration engine based on workload characteristics:
 
 | Engine     | Best For                                           | Mode               |
diff --git a/website/versioned_docs/version-1.11.x/reference/spicepod/datasets.md b/website/versioned_docs/version-1.11.x/reference/spicepod/datasets.md
@@ -207,6 +207,12 @@ Spice emits a warning if the `time_column` from the data source is incompatible
 
 (Optional) Define the format of the `time_partition_column`. For instance, if the physical partitions follow a date format (YYYY-MM-DD), set this value to `date`. The same format options as `time_format` are supported for `time_partition_column`.
 
+## Schema Inference and Evolution
+
+Spice infers the dataset schema from the data source at startup. The inferred schema defines the column names, data types, and nullability used for the lifetime of that runtime process. Schema changes at the source are not applied at runtime — data refreshes will fail if the source schema drifts. Restart the runtime to re-infer the schema.
+
+For connector-specific inference parameters, runtime schema change behavior, and recommendations, see [Schema Inference](../../components/data-connectors#schema-inference).
+
 ## `unsupported_type_action`
 
 Optional. Specifies the action to take when a data type that is not supported by the data connector is encountered.
diff --git a/website/versioned_docs/version-1.11.x/troubleshooting/index.md b/website/versioned_docs/version-1.11.x/troubleshooting/index.md
@@ -33,7 +33,7 @@ Check the runtime output for error messages. Common causes:
 
 - **Incorrect credentials**: Verify connection parameters (`pg_host`, `pg_user`, `pg_pass`, etc.) and that secrets are configured correctly. Use `--verbose` to see secret resolution logs.
 - **Network connectivity**: Ensure the Spice runtime can reach the data source. Test with `curl` or `psql` from the same host.
-- **Schema mismatch**: If the source schema changed since the last refresh, the accelerated table may fail to update. Restart the runtime to re-initialize the dataset.
+- **Schema mismatch**: If the source schema changed after the runtime started (for example, columns were added or removed, or data types changed), the accelerated table will fail to update. Spice infers the dataset schema at startup and intentionally blocks runtime schema evolution to protect against unintentional or breaking changes. Restart the runtime so Spice re-infers the schema from the source. See [Schema Inference](components/data-connectors#schema-inference) for details.
 
 Review the `task_history` table for detailed error messages: