Add metadata columns to v1.11.x docs

lukekim · lukekim · commit 3045ebe87987 · 2026-03-11T10:29:32.000-07:00
diff --git a/website/versioned_docs/version-1.11.x/components/data-connectors/index.md b/website/versioned_docs/version-1.11.x/components/data-connectors/index.md
@@ -177,6 +177,91 @@ SELECT * FROM partitioned_data WHERE year = '2024' AND month = '01';
 
 Partition pruning improves query performance by reading only the relevant files.
 
+### Metadata Columns
+
+File-based connectors can expose per-file object store metadata as virtual columns in the dataset schema. These columns are not stored in the data files; they are derived from object store file metadata at query time.
+
+#### Available Columns
+
+| Column          | Type                   | Description                     |
+| --------------- | ---------------------- | ------------------------------- |
+| `location`      | `Utf8`                 | Full URI of the source file     |
+| `last_modified` | `Timestamp(µs, "UTC")` | When the file was last modified |
+| `size`          | `UInt64`               | File size in bytes              |
+
+#### Enabling Metadata Columns
+
+Metadata columns are enabled by adding a `metadata` section to the dataset definition with each desired column set to `enabled`:
+
+```yaml
+datasets:
+  - from: s3://bucket/data/
+    name: my_data
+    params:
+      file_format: parquet
+    metadata:
+      location: enabled
+      last_modified: enabled
+      size: enabled
+```
+
+Each column can be individually enabled or omitted:
+
+```yaml
+metadata:
+  location: enabled     # Only add the location column
+```
+
+:::note
+If the data files already contain a column with the same name as a metadata column (e.g., a Parquet file with a `size` column), that metadata column is skipped to avoid a name conflict, even when it is set to `enabled`.
+:::
+
+#### Querying Metadata Columns
+
+Once enabled, metadata columns appear alongside the regular data columns:
+
+```sql
+SELECT * FROM my_data LIMIT 3;
+```
+
+```shell
++----+---------+------+-------+-----+----------------------+--------------------------------------------------------------+------+
+| id | value   | year | month | day | last_modified        | location                                                     | size |
++----+---------+------+-------+-----+----------------------+--------------------------------------------------------------+------+
+| 0  | value_0 | 2022 | 1     | 1   | 2024-10-10T05:36:59Z | s3://bucket/data/year=2022/month=1/day=1/data_0.parquet      | 2317 |
+| 1  | value_1 | 2022 | 1     | 1   | 2024-10-10T05:36:59Z | s3://bucket/data/year=2022/month=1/day=1/data_0.parquet      | 2317 |
+| 2  | value_2 | 2022 | 1     | 1   | 2024-10-10T05:36:59Z | s3://bucket/data/year=2022/month=1/day=1/data_0.parquet      | 2317 |
++----+---------+------+-------+-----+----------------------+--------------------------------------------------------------+------+
+```
+
+Metadata columns can be used in filters, projections, aggregations, and joins like any other column:
+
+```sql
+-- Filter by file location
+SELECT id, value FROM my_data
+WHERE location = 's3://bucket/data/year=2022/month=1/day=1/data_0.parquet';
+
+-- Find files modified after a given timestamp
+SELECT DISTINCT location, last_modified FROM my_data
+WHERE last_modified > '2024-01-01T00:00:00Z';
+
+-- Aggregate by file
+SELECT location, COUNT(*) AS row_count, size
+FROM my_data
+GROUP BY location, size
+ORDER BY location;
+```
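+
+Because `location`, `last_modified`, and `size` repeat on every row read from the same file, summing `size` directly over rows counts each file once per row. A minimal sketch, assuming the `my_data` dataset above with all three metadata columns enabled, that collapses to one row per file before aggregating:
+
+```sql
+-- Dataset-level totals: number of files, total bytes, newest modification time.
+-- The subquery keeps one row per source file so SUM(size) counts each file once.
+SELECT
+  COUNT(*) AS file_count,
+  SUM(size) AS total_bytes,
+  MAX(last_modified) AS newest_file
+FROM (
+  SELECT DISTINCT location, size, last_modified
+  FROM my_data
+) AS files;
+```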
+
+#### Applicable Connectors
+
+Metadata columns are supported by all file-based connectors:
+
+| Connector Type               | Connectors                        |
+| ---------------------------- | --------------------------------- |
+| **Object Stores**            | S3, Azure Blob (ABFS), HTTP/HTTPS |
+| **Network-Attached Storage** | FTP, SFTP, SMB, NFS               |
+| **Local Storage**            | File                              |
+
 ## Schema Inference
 
 Spice infers the schema for each dataset from its data source at startup. The inferred schema defines the column names, data types, and nullability used by the dataset for the lifetime of that runtime process.
diff --git a/website/versioned_docs/version-1.11.x/components/data-connectors/s3.md b/website/versioned_docs/version-1.11.x/components/data-connectors/s3.md
@@ -262,6 +262,89 @@ Use `schema_source_path` to speed up dataset registration by specifying a URL to
     schema_source_path: s3://spiceai-demo-datasets/taxi_trips/2014/1/trips_01.parquet # or s3://spiceai-demo-datasets/taxi_trips/2014/1/
 ```
 
+### Metadata Columns Example
+
+Metadata columns expose per-file S3 object metadata (`location`, `last_modified`, `size`) as virtual columns in query results. See [Metadata Columns](./#metadata-columns) for full details.
+
+```yaml
+- from: s3://spiceai-public-datasets/hive_partitioned_data/
+  name: hive_data
+  params:
+    file_format: parquet
+    hive_partitioning_enabled: true
+  metadata:
+    location: enabled
+    last_modified: enabled
+    size: enabled
+```
+
+Query metadata alongside regular data:
+
+```sql
+SELECT id, value, location, size, last_modified
+FROM hive_data
+ORDER BY id
+LIMIT 5;
+```
+
+```shell
++----+---------+-------------------------------------------------------------------------------------------+------+----------------------+
+| id | value   | location                                                                                  | size | last_modified        |
++----+---------+-------------------------------------------------------------------------------------------+------+----------------------+
+| 0  | value_0 | s3://spiceai-public-datasets/hive_partitioned_data/year=2022/month=1/day=1/data_0.parquet | 2317 | 2024-10-10T05:36:59Z |
+| 1  | value_1 | s3://spiceai-public-datasets/hive_partitioned_data/year=2022/month=1/day=1/data_0.parquet | 2317 | 2024-10-10T05:36:59Z |
+| 2  | value_2 | s3://spiceai-public-datasets/hive_partitioned_data/year=2022/month=1/day=1/data_0.parquet | 2317 | 2024-10-10T05:36:59Z |
+| 3  | value_3 | s3://spiceai-public-datasets/hive_partitioned_data/year=2022/month=1/day=1/data_0.parquet | 2317 | 2024-10-10T05:36:59Z |
+| 4  | value_4 | s3://spiceai-public-datasets/hive_partitioned_data/year=2022/month=1/day=1/data_0.parquet | 2317 | 2024-10-10T05:36:59Z |
++----+---------+-------------------------------------------------------------------------------------------+------+----------------------+
+```
+
+Filter by a specific file:
+
+```sql
+SELECT id, value FROM hive_data
+WHERE location = 's3://spiceai-public-datasets/hive_partitioned_data/year=2023/month=2/day=2/data_1.parquet'
+ORDER BY id;
+```
+
+```shell
++----+---------+
+| id | value   |
++----+---------+
+| 10 | value_0 |
+| 11 | value_1 |
+| 12 | value_2 |
+| 13 | value_3 |
+| 14 | value_4 |
+| 15 | value_5 |
+| 16 | value_6 |
+| 17 | value_7 |
+| 18 | value_8 |
+| 19 | value_9 |
++----+---------+
+```
+
+Aggregate per file:
+
+```sql
+SELECT location, COUNT(*) AS row_count, size
+FROM hive_data
+GROUP BY location, size
+ORDER BY location;
+```
+
+```shell
++-------------------------------------------------------------------------------------------+-----------+------+
+| location                                                                                  | row_count | size |
++-------------------------------------------------------------------------------------------+-----------+------+
+| s3://spiceai-public-datasets/hive_partitioned_data/year=2022/month=1/day=1/data_0.parquet | 10        | 2317 |
+| s3://spiceai-public-datasets/hive_partitioned_data/year=2022/month=1/day=2/data_4.parquet | 10        | 2319 |
+| s3://spiceai-public-datasets/hive_partitioned_data/year=2022/month=3/day=3/data_2.parquet | 10        | 2319 |
+| s3://spiceai-public-datasets/hive_partitioned_data/year=2023/month=2/day=2/data_1.parquet | 10        | 2319 |
+| s3://spiceai-public-datasets/hive_partitioned_data/year=2023/month=4/day=1/data_3.parquet | 10        | 2319 |
++-------------------------------------------------------------------------------------------+-----------+------+
+```
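+
+Metadata columns can also be combined with Hive partition columns. The following is a sketch only: it assumes `hive_partitioning_enabled: true` exposes `year` as a queryable column on `hive_data`, and that partition values compare as strings, as in the partition pruning example in the data connectors overview. It lists the files behind a single partition:
+
+```sql
+-- List each file under year=2023 with its size and modification time.
+-- DISTINCT collapses the per-row repetition of the metadata columns.
+SELECT DISTINCT location, size, last_modified
+FROM hive_data
+WHERE year = '2023'
+ORDER BY location;
+```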
+
 ## Secrets
 
 Spice integrates with multiple secret stores to help manage sensitive data securely. For detailed information on supported secret stores, refer to the [secret stores documentation](../secret-stores/). Additionally, learn how to use referenced secrets in component parameters by visiting the [using referenced secrets guide](../secret-stores/#using-secrets).
diff --git a/website/versioned_docs/version-1.11.x/reference/spicepod/datasets.md b/website/versioned_docs/version-1.11.x/reference/spicepod/datasets.md
@@ -830,15 +830,41 @@ Optional. If enabled, the content of each chunk will be trimmed to remove leadin
 
 ## `metadata` {#metadata}
 
-Optional. Additional key-value metadata for the dataset. Used as part of the [Semantic Data Model](../../features/semantic-model).
-
-```yaml
-datasets:
-  - from: spice.ai/eth.recent_blocks
-    name: eth.recent_blocks
-    metadata:
-      instructions: The last 128 blocks.
-```
+Optional. Additional key-value metadata for the dataset.
+
+The `metadata` field serves two purposes:
+
+1. **Semantic metadata** — Arbitrary key-value pairs used as part of the [Semantic Data Model](../../features/semantic-model).
+
+    ```yaml
+    datasets:
+      - from: spice.ai/eth.recent_blocks
+        name: eth.recent_blocks
+        metadata:
+          instructions: The last 128 blocks.
+    ```
+
+2. **File metadata columns** — For [file-based connectors](../../components/data-connectors/#metadata-columns) (S3, ABFS, File, FTP, SFTP, SMB, NFS, HTTP/HTTPS), the following reserved keys enable virtual columns that expose per-file object store metadata in query results:
+
+    | Key             | Value     | Column Type            | Description                     |
+    | --------------- | --------- | ---------------------- | ------------------------------- |
+    | `location`      | `enabled` | `Utf8`                 | Full URI of the source file     |
+    | `last_modified` | `enabled` | `Timestamp(µs, "UTC")` | When the file was last modified |
+    | `size`          | `enabled` | `UInt64`               | File size in bytes              |
+
+    ```yaml
+    datasets:
+      - from: s3://bucket/data/
+        name: my_data
+        params:
+          file_format: parquet
+        metadata:
+          location: enabled
+          last_modified: enabled
+          size: enabled
+    ```
+
+    If a data file already contains a column with the same name as a metadata column, the metadata column is not added.
 
 ## `vectors`