Commit 3045ebe

Add metadata columns to v1.11.x docs
1 parent 078d278 commit 3045ebe

3 files changed

Lines changed: 203 additions & 9 deletions

website/versioned_docs/version-1.11.x/components/data-connectors/index.md

Lines changed: 85 additions & 0 deletions
@@ -177,6 +177,91 @@ SELECT * FROM partitioned_data WHERE year = '2024' AND month = '01';

 Partition pruning improves query performance by reading only the relevant files.

+### Metadata Columns
+
+File-based connectors can expose per-file object store metadata as virtual columns in the dataset schema. These columns are not stored in the data files; they are derived from object store file metadata at query time.
+
+#### Available Columns
+
+| Column          | Type                   | Description                     |
+| --------------- | ---------------------- | ------------------------------- |
+| `location`      | `Utf8`                 | Full URI of the source file     |
+| `last_modified` | `Timestamp(µs, "UTC")` | When the file was last modified |
+| `size`          | `UInt64`               | File size in bytes              |
+
+#### Enabling Metadata Columns
+
+Metadata columns are enabled by adding a `metadata` section to the dataset definition with each desired column set to `enabled`:
+
+```yaml
+datasets:
+  - from: s3://bucket/data/
+    name: my_data
+    params:
+      file_format: parquet
+    metadata:
+      location: enabled
+      last_modified: enabled
+      size: enabled
+```
+
+Each column can be individually enabled or omitted:
+
+```yaml
+metadata:
+  location: enabled # Only add the location column
+```
+
+:::note
+If the data files already contain a column with the same name as a metadata column (e.g., a Parquet file with a `size` column), the metadata column is not added, which avoids a naming conflict.
+:::
+
+#### Querying Metadata Columns
+
+Once enabled, metadata columns appear alongside the regular data columns:
+
+```sql
+SELECT * FROM my_data LIMIT 3;
+```
+
+```shell
++----+---------+------+-------+-----+----------------------+--------------------------------------------------------------+------+
+| id | value   | year | month | day | last_modified        | location                                                     | size |
++----+---------+------+-------+-----+----------------------+--------------------------------------------------------------+------+
+| 0  | value_0 | 2022 | 1     | 1   | 2024-10-10T05:36:59Z | s3://bucket/data/year=2022/month=1/day=1/data_0.parquet      | 2317 |
+| 1  | value_1 | 2022 | 1     | 1   | 2024-10-10T05:36:59Z | s3://bucket/data/year=2022/month=1/day=1/data_0.parquet      | 2317 |
+| 2  | value_2 | 2022 | 1     | 1   | 2024-10-10T05:36:59Z | s3://bucket/data/year=2022/month=1/day=1/data_0.parquet      | 2317 |
++----+---------+------+-------+-----+----------------------+--------------------------------------------------------------+------+
+```
+
+Metadata columns can be used in filters, projections, aggregations, and joins like any other column:
+
+```sql
+-- Filter by file location
+SELECT id, value FROM my_data
+WHERE location = 's3://bucket/data/year=2022/month=1/day=1/data_0.parquet';
+
+-- Find recently modified files
+SELECT DISTINCT location, last_modified FROM my_data
+WHERE last_modified > '2024-01-01T00:00:00Z';
+
+-- Aggregate by file
+SELECT location, COUNT(*) AS row_count, size
+FROM my_data
+GROUP BY location, size
+ORDER BY location;
+```
+
+#### Applicable Connectors
+
+Metadata columns are supported by all file-based connectors:
+
+| Connector Type               | Connectors                        |
+| ---------------------------- | --------------------------------- |
+| **Object Stores**            | S3, Azure Blob (ABFS), HTTP/HTTPS |
+| **Network-Attached Storage** | FTP, SFTP, SMB, NFS               |
+| **Local Storage**            | File                              |
+
 ## Schema Inference

 Spice infers the schema for each dataset from its data source at startup. The inferred schema defines the column names, data types, and nullability used by the dataset for the lifetime of that runtime process.
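
The name-conflict rule stated in the note of the Metadata Columns section added above can be modeled in a few lines. The following is a minimal sketch, not Spice's implementation: `effective_columns` is a hypothetical helper, and the order in which metadata columns are appended is an assumption made only for illustration.

```python
# Hypothetical model of the documented name-conflict rule; not Spice's code.
METADATA_COLUMNS = ("location", "last_modified", "size")


def effective_columns(file_columns, enabled):
    """Return the dataset's column names after applying the conflict rule."""
    columns = list(file_columns)
    for name in METADATA_COLUMNS:
        # A metadata column is appended only when it is enabled and the
        # data file does not already define a column with that name.
        if name in enabled and name not in file_columns:
            columns.append(name)
    return columns


# A Parquet file that already has a `size` column keeps its own `size`;
# only `location` is added as a metadata column.
print(effective_columns(["id", "value", "size"], {"location", "size"}))
```

In this model, a query against `size` would read the value stored in the file rather than the object's size in bytes, which is the behavior the note describes.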

website/versioned_docs/version-1.11.x/components/data-connectors/s3.md

Lines changed: 83 additions & 0 deletions
@@ -262,6 +262,89 @@ Use `schema_source_path` to speed up dataset registration by specifying a URL to
 schema_source_path: s3://spiceai-demo-datasets/taxi_trips/2014/1/trips_01.parquet # or s3://spiceai-demo-datasets/taxi_trips/2014/1/
 ```

+### Metadata Columns Example
+
+Metadata columns expose per-file S3 object metadata (`location`, `last_modified`, `size`) as virtual columns in query results. See [Metadata Columns](./#metadata-columns) for full details.
+
+```yaml
+- from: s3://spiceai-public-datasets/hive_partitioned_data/
+  name: hive_data
+  params:
+    file_format: parquet
+    hive_partitioning_enabled: true
+  metadata:
+    location: enabled
+    last_modified: enabled
+    size: enabled
+```
+
+Query metadata alongside regular data:
+
+```sql
+SELECT id, value, location, size, last_modified
+FROM hive_data
+ORDER BY id
+LIMIT 5;
+```
+
+```shell
++----+---------+-------------------------------------------------------------------------------------------+------+----------------------+
+| id | value   | location                                                                                  | size | last_modified        |
++----+---------+-------------------------------------------------------------------------------------------+------+----------------------+
+| 0  | value_0 | s3://spiceai-public-datasets/hive_partitioned_data/year=2022/month=1/day=1/data_0.parquet | 2317 | 2024-10-10T05:36:59Z |
+| 1  | value_1 | s3://spiceai-public-datasets/hive_partitioned_data/year=2022/month=1/day=1/data_0.parquet | 2317 | 2024-10-10T05:36:59Z |
+| 2  | value_2 | s3://spiceai-public-datasets/hive_partitioned_data/year=2022/month=1/day=1/data_0.parquet | 2317 | 2024-10-10T05:36:59Z |
+| 3  | value_3 | s3://spiceai-public-datasets/hive_partitioned_data/year=2022/month=1/day=1/data_0.parquet | 2317 | 2024-10-10T05:36:59Z |
+| 4  | value_4 | s3://spiceai-public-datasets/hive_partitioned_data/year=2022/month=1/day=1/data_0.parquet | 2317 | 2024-10-10T05:36:59Z |
++----+---------+-------------------------------------------------------------------------------------------+------+----------------------+
+```
+
+Filter by specific file:
+
+```sql
+SELECT id, value FROM hive_data
+WHERE location = 's3://spiceai-public-datasets/hive_partitioned_data/year=2023/month=2/day=2/data_1.parquet'
+ORDER BY id;
+```
+
+```shell
++----+---------+
+| id | value   |
++----+---------+
+| 10 | value_0 |
+| 11 | value_1 |
+| 12 | value_2 |
+| 13 | value_3 |
+| 14 | value_4 |
+| 15 | value_5 |
+| 16 | value_6 |
+| 17 | value_7 |
+| 18 | value_8 |
+| 19 | value_9 |
++----+---------+
+```
+
+Aggregate per file:
+
+```sql
+SELECT location, COUNT(*) AS row_count, size
+FROM hive_data
+GROUP BY location, size
+ORDER BY location;
+```
+
+```shell
++-------------------------------------------------------------------------------------------+-----------+------+
+| location                                                                                  | row_count | size |
++-------------------------------------------------------------------------------------------+-----------+------+
+| s3://spiceai-public-datasets/hive_partitioned_data/year=2022/month=1/day=1/data_0.parquet | 10        | 2317 |
+| s3://spiceai-public-datasets/hive_partitioned_data/year=2022/month=1/day=2/data_4.parquet | 10        | 2319 |
+| s3://spiceai-public-datasets/hive_partitioned_data/year=2022/month=3/day=3/data_2.parquet | 10        | 2319 |
+| s3://spiceai-public-datasets/hive_partitioned_data/year=2023/month=2/day=2/data_1.parquet | 10        | 2319 |
+| s3://spiceai-public-datasets/hive_partitioned_data/year=2023/month=4/day=1/data_3.parquet | 10        | 2319 |
++-------------------------------------------------------------------------------------------+-----------+------+
+```
+
 ## Secrets

 Spice integrates with multiple secret stores to help manage sensitive data securely. For detailed information on supported secret stores, refer to the [secret stores documentation](../secret-stores/). Additionally, learn how to use referenced secrets in component parameters by visiting the [using referenced secrets guide](../secret-stores/#using-secrets).
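
To see why the per-file aggregate in the s3.md example groups by both `location` and `size`, the query can be reproduced locally. This is a rough sketch that uses Python's built-in `sqlite3` purely as a stand-in engine: in Spice the two columns are virtual and derived from object store metadata, whereas here they are stored values. The table contents (two files of ten rows each) are made up for illustration and reuse two paths and sizes from the output above.

```python
import sqlite3

# Stand-in data: (location, size) pairs copied from the example output.
# In Spice these are virtual columns, not stored values.
FILES = [
    ("s3://spiceai-public-datasets/hive_partitioned_data/year=2022/month=1/day=1/data_0.parquet", 2317),
    ("s3://spiceai-public-datasets/hive_partitioned_data/year=2023/month=2/day=2/data_1.parquet", 2319),
]


def rows_per_file(rows_in_each_file=10):
    """Build an in-memory table and run the documented per-file aggregate."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hive_data (id INTEGER, value TEXT, location TEXT, size INTEGER)"
    )
    next_id = 0
    for location, size in FILES:
        for i in range(rows_in_each_file):
            # Every row from the same file repeats that file's location and size.
            conn.execute(
                "INSERT INTO hive_data VALUES (?, ?, ?, ?)",
                (next_id, f"value_{i}", location, size),
            )
            next_id += 1
    result = conn.execute(
        "SELECT location, COUNT(*) AS row_count, size "
        "FROM hive_data GROUP BY location, size ORDER BY location"
    ).fetchall()
    conn.close()
    return result


for location, row_count, size in rows_per_file():
    print(location, row_count, size)
```

Because `size` is constant for all rows of a file, adding it to `GROUP BY` does not split any group; it only makes the column selectable next to the count.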

website/versioned_docs/version-1.11.x/reference/spicepod/datasets.md

Lines changed: 35 additions & 9 deletions
@@ -830,15 +830,41 @@ Optional. If enabled, the content of each chunk will be trimmed to remove leadin

 ## `metadata` {#metadata}

-Optional. Additional key-value metadata for the dataset. Used as part of the [Semantic Data Model](../../features/semantic-model).
-
-```yaml
-datasets:
-  - from: spice.ai/eth.recent_blocks
-    name: eth.recent_blocks
-    metadata:
-      instructions: The last 128 blocks.
-```
+Optional. Additional key-value metadata for the dataset.
+
+The `metadata` field serves two purposes:
+
+1. **Semantic metadata**: Arbitrary key-value pairs used as part of the [Semantic Data Model](../../features/semantic-model).
+
+   ```yaml
+   datasets:
+     - from: spice.ai/eth.recent_blocks
+       name: eth.recent_blocks
+       metadata:
+         instructions: The last 128 blocks.
+   ```
+
+2. **File metadata columns**: For [file-based connectors](../../components/data-connectors/#metadata-columns) (S3, ABFS, File, FTP, SFTP, SMB, NFS, HTTP/HTTPS), the following reserved keys enable virtual columns that expose per-file object store metadata in query results:
+
+   | Key             | Value     | Column Type            | Description                     |
+   | --------------- | --------- | ---------------------- | ------------------------------- |
+   | `location`      | `enabled` | `Utf8`                 | Full URI of the source file     |
+   | `last_modified` | `enabled` | `Timestamp(µs, "UTC")` | When the file was last modified |
+   | `size`          | `enabled` | `UInt64`               | File size in bytes              |
+
+   ```yaml
+   datasets:
+     - from: s3://bucket/data/
+       name: my_data
+       params:
+         file_format: parquet
+       metadata:
+         location: enabled
+         last_modified: enabled
+         size: enabled
+   ```
+
+If a data file already contains a column with the same name as a metadata column, the metadata column is not added.

 ## `vectors`
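
The two roles of the `metadata` field described in the datasets.md change can be illustrated by separating reserved keys from everything else. This is a sketch under stated assumptions, not Spice's parsing logic: `split_metadata` is a hypothetical helper, and the source does not say how a reserved key with a value other than `enabled` is handled, so treating it as semantic metadata here is a guess.

```python
# Reserved keys taken from the table in the `metadata` reference above.
RESERVED_KEYS = {"location", "last_modified", "size"}


def split_metadata(metadata):
    """Split a dataset's `metadata` mapping into (file_columns, semantic)."""
    file_columns = []
    semantic = {}
    for key, value in metadata.items():
        if key in RESERVED_KEYS and value == "enabled":
            # Reserved key set to `enabled`: request a virtual file column.
            file_columns.append(key)
        else:
            # Assumption: any other entry is plain semantic metadata.
            semantic[key] = value
    return file_columns, semantic


columns, semantic = split_metadata(
    {"instructions": "The last 128 blocks.", "location": "enabled", "size": "enabled"}
)
print(columns)
print(semantic)
```

Whether a single dataset may mix semantic keys and reserved keys in practice is not stated in the diff; the example combines them only to show both branches.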
