Commit 62c1bc8

docs: Update file formats and data connectors documentation (#1409)
1 parent ab7a089 commit 62c1bc8

4 files changed

Lines changed: 146 additions & 52 deletions

website/docs/components/data-connectors/index.md

Lines changed: 16 additions & 16 deletions
@@ -55,7 +55,7 @@ Supported Data Connectors include:
 | `odbc` | ODBC | Beta | ODBC |
 | `snowflake` | Snowflake | Beta | Arrow |
 | `spark` | Spark | Beta | [Spark Connect][spark] |
-| `iceberg` | [Apache Iceberg][iceberg] | Beta | Parquet |
+| `iceberg` | [Apache Iceberg][iceberg] | Stable | Parquet |
 | `abfs` | Azure BlobFS | Alpha | Parquet, CSV, JSON |
 | `ftp`, `sftp` | FTP/SFTP | Alpha | Parquet, CSV, JSON |
 | `smb` | SMB | Alpha | Parquet, CSV, JSON |
@@ -117,11 +117,11 @@ datasets:
 | [CSV](../reference/file_format#csv) | `file_format: csv` | Stable | Comma-separated values |
 | JSON | `file_format: json` | Stable | JavaScript Object Notation |
 | [Delta Lake](https://delta.io/) | `file_format: delta` | Stable | Open table format with ACID transactions. Object stores only. |
-| [Apache Iceberg](https://iceberg.apache.org/) | `file_format: iceberg` | Beta | Open table format for large analytic datasets. Object stores only. Requires a [catalog](../catalogs/index.md). |
+| [Apache Iceberg](https://iceberg.apache.org/) | `file_format: iceberg` | Stable | Open table format for large analytic datasets. Object stores only. Requires a [catalog](../catalogs/index.md). |
 | Microsoft Excel | `file_format: xlsx` | Roadmap | Excel spreadsheet format |
 | Markdown | `file_format: md` | Stable | Plain text with formatting (document format) |
 | Text | `file_format: txt` | Stable | Plain text files (document format) |
-| PDF | `file_format: pdf` | Alpha | Portable Document Format (document format) |
+| PDF | `file_format: pdf` | Beta | Portable Document Format (document format) |
 | Microsoft Word | `file_format: docx` | Alpha | Word document format (document format) |
 
 ### Format-Specific Parameters
@@ -186,11 +186,11 @@ File-based connectors can expose per-file object store metadata as virtual colum
 
 #### Available Columns
 
-| Column | Type | Description |
-| ---------------- | ---------------------- | ---------------------------------- |
-| `_location` | `Utf8` | Full URI of the source file |
-| `_last_modified` | `Timestamp(µs, "UTC")` | When the file was last modified |
-| `_size` | `UInt64` | File size in bytes |
+| Column | Type | Description |
+| ---------------- | ---------------------- | ------------------------------- |
+| `_location` | `Utf8` | Full URI of the source file |
+| `_last_modified` | `Timestamp(µs, "UTC")` | When the file was last modified |
+| `_size` | `UInt64` | File size in bytes |
 
 #### Enabling Metadata Columns
 
@@ -259,11 +259,11 @@ ORDER BY _location;
 
 Metadata columns are supported by all file-based connectors:
 
-| Connector Type | Connectors |
-| ---------------------------- | -------------------------------------- |
-| **Object Stores** | S3, Azure Blob (ABFS), HTTP/HTTPS |
-| **Network-Attached Storage** | FTP, SFTP, SMB, NFS |
-| **Local Storage** | File |
+| Connector Type | Connectors |
+| ---------------------------- | --------------------------------- |
+| **Object Stores** | S3, Azure Blob (ABFS), HTTP/HTTPS |
+| **Network-Attached Storage** | FTP, SFTP, SMB, NFS |
+| **Local Storage** | File |
 
 ## Schema Inference
 
@@ -304,12 +304,12 @@ Runtime schema evolution controls are planned for a future release. When availab
 | [Apache Parquet](https://parquet.apache.org/) | `file_format: parquet` | ✅ | ❌ |
 | [CSV](../reference/file_format#csv) | `file_format: csv` | ✅ | ❌ |
 | [Delta Lake](https://delta.io/) | `file_format: delta` | ✅ | ❌ |
-| [Apache Iceberg](https://iceberg.apache.org/) | `file_format: iceberg` | Beta | ❌ |
+| [Apache Iceberg](https://iceberg.apache.org/) | `file_format: iceberg` | | ❌ |
 | JSON | `file_format: json` | ✅ | ❌ |
 | Microsoft Excel | `file_format: xlsx` | Roadmap | ❌ |
 | Markdown | `file_format: md` | ✅ | ✅ |
 | Text | `file_format: txt` | ✅ | ✅ |
-| PDF | `file_format: pdf` | Alpha | ✅ |
+| PDF | `file_format: pdf` | Beta | ✅ |
 | Microsoft Word | `file_format: docx` | Alpha | ✅ |
 
 ### Document Formats {#document-formats}
@@ -320,7 +320,7 @@ Runtime schema evolution controls are planned for a future release. When availab
 Document formats (Markdown, Text, PDF, Word) are handled differently from structured data formats. Each file becomes a row in the resulting table, with the file contents stored in a `content` column.
 
 :::warning[Note]
-Document formats in Alpha (PDF, DOCX) may not parse all structure or text from the underlying documents correctly.
+Document formats in Alpha (DOCX) may not parse all structure or text from the underlying documents correctly.
 :::
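
As an illustration of the document-format behavior described in this hunk, here is a minimal spicepod sketch that loads Markdown files so that each file is expected to become one row, with its text in the `content` column. The bucket path and dataset name are hypothetical, and the `from`/`name`/`params` layout is an assumption based on the usual Spice dataset configuration rather than something shown in this commit.

```yaml
datasets:
  # Hypothetical location and dataset name, used only for illustration.
  - from: s3://example-bucket/handbook/
    name: handbook_docs
    params:
      # Document format: one row per file, file text in the `content` column.
      file_format: md
```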
 
 #### Document Table Schema

website/docs/reference/file_format.md

Lines changed: 57 additions & 11 deletions
@@ -6,15 +6,39 @@ pagination_prev: 'reference/index'
 pagination_next: null
 ---
 
-Spice supports CSV, JSON, Parquet, Delta Lake, and Iceberg data file-formats for data connectors that can read files from a file system or cloud object storage (i.e. [`s3://`](../components/data-connectors/s3), [`abfs://`](../components/data-connectors/abfs), [`file://`](../components/data-connectors/file), etc.). Delta Lake and Iceberg are supported for object store connectors. Iceberg requires a catalog to be configured.
+File-based [data connectors](../components/data-connectors/index.md) (including [`s3://`](../components/data-connectors/s3), [`abfs://`](../components/data-connectors/abfs), [`file://`](../components/data-connectors/file), [`ftp://`](../components/data-connectors/ftp), [`sftp://`](../components/data-connectors/ftp), and others) support multiple structured and document file formats. This page details the format-specific parameters available for each.
 
-The parameters supported for specific file-formats are detailed on this page.
+## Common Parameters
+
+These parameters apply across multiple file formats.
+
+| Parameter | Type | Default | Description |
+| --------------------------- | ------- | -------------- | ----------------------------------------------------------------------------------------------------------------------------- |
+| `file_format` | String | Inferred | Selects the file reader. If omitted, format is inferred from the file extension. See [Supported Formats](#supported-formats). |
+| `file_extension` | String | Derived | Overrides the file extension filter used when listing files. Defaults to the extension matching the resolved format. |
+| `schema_infer_max_records` | Integer | `1000` | Maximum number of records scanned to infer the schema. |
+| `file_compression_type` | String | `UNCOMPRESSED` | File-level compression for CSV, TSV, and JSON files. Valid values: `GZIP`, `BZIP2`, `XZ`, `ZSTD`, `UNCOMPRESSED`. |
+| `hive_partitioning_enabled` | Boolean | `false` | Enables Hive-style partition discovery from directory structure. |
+
+## Supported Formats {#supported-formats}
+
+The `file_format` parameter accepts these values:
+
+| Value | Reader | Default Extension | Notes |
+| --------- | ------------------- | ----------------- | ------------------------------------------- |
+| `parquet` | Apache Parquet | `.parquet` | |
+| `csv` | CSV | `.csv` | Uses `csv_*` parameters. |
+| `tsv` | TSV (tab-delimited) | `.tsv` | Uses `tsv_*` parameters. Delimiter is tab. |
+| `json` | JSON | `.json` | Uses `json_format` to control parsing mode. |
+| `jsonl` | JSON Lines | `.jsonl` | Line-delimited JSON. |
+
+When `file_format` is omitted, Spice infers the format from the dataset path extension. If the extension does not match one of the values above, a configuration error is returned.
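
To make the common parameters and the inference rule added above more concrete, the sketch below sets `file_format` explicitly for a directory path, which has no extension to infer from. The bucket, dataset name, and partition layout are hypothetical, and the `from`/`name`/`params` structure is assumed from the usual spicepod dataset configuration.

```yaml
datasets:
  # Hypothetical bucket and dataset name, used only for illustration.
  - from: s3://example-bucket/events/
    name: events
    params:
      # The path has no file extension, so the reader is selected explicitly
      # instead of relying on extension-based inference.
      file_format: parquet
      # Assumes directories are laid out Hive-style, e.g. year=2024/month=01/.
      hive_partitioning_enabled: true
```

By contrast, a path such as `s3://example-bucket/events/data.parquet` should not need `file_format` at all under the inference rule, since `.parquet` matches one of the listed default extensions.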
 
 ## Parquet
 
-Spice automatically supports reading any Parquet file, regardless of the compression codec or data encoding used.
+Spice reads any Parquet file regardless of the compression codec or data encoding.
 
-Compression codecs:
+Supported compression codecs:
 
 - [`UNCOMPRESSED`](https://parquet.apache.org/docs/file-format/data-pages/compression/#uncompressed)
 - [`SNAPPY`](https://parquet.apache.org/docs/file-format/data-pages/compression/#snappy)
@@ -25,7 +49,7 @@ Compression codecs:
 - [`LZ4_RAW`](https://parquet.apache.org/docs/file-format/data-pages/compression/#lz4_raw)
 - [`ZSTD`](https://parquet.apache.org/docs/file-format/data-pages/compression/#zstd)
 
-Data encodings:
+Supported data encodings:
 
 - [`PLAIN`](https://parquet.apache.org/docs/file-format/data-pages/encodings/#plain-plain--0)
 - [`PLAIN_DICTIONARY` / `RLE_DICTIONARY`](https://parquet.apache.org/docs/file-format/data-pages/encodings/#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
@@ -38,16 +62,38 @@ Data encodings:
 
 ## CSV
 
+### Parameters {#csv}
+
+| Parameter | Type | Default | Description |
+| ------------------------------ | ------- | ------- | ----------------------------------------------------------------------------------------------------- |
+| `csv_has_header` | Boolean | `true` | Whether the first row contains column headers. |
+| `csv_quote` | Char | `"` | Character used to quote fields containing special characters. |
+| `csv_escape` | Char | _none_ | Character used to escape special characters within a field. |
+| `csv_delimiter` | Char | `,` | Character used to separate fields. |
+| `csv_schema_infer_max_records` | Integer | `1000` | **Deprecated.** Use `schema_infer_max_records` instead. Maximum records scanned for schema inference. |
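
As a sketch of how the CSV parameters in this table might be combined, the following dataset reads semicolon-delimited files and uses `schema_infer_max_records` in place of the deprecated `csv_schema_infer_max_records`. The bucket, dataset name, and chosen values are hypothetical, and the `from`/`name`/`params` layout is assumed from the usual spicepod dataset configuration.

```yaml
datasets:
  # Hypothetical bucket and dataset name, used only for illustration.
  - from: s3://example-bucket/exports/
    name: exports
    params:
      file_format: csv
      csv_has_header: true
      # Single-character values are quoted so YAML treats them as strings.
      csv_delimiter: ';'
      csv_quote: '"'
      # Preferred over the deprecated csv_schema_infer_max_records.
      schema_infer_max_records: 5000
```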
+
+## TSV
+
+TSV (tab-separated values) is a first-class format. Set `file_format: tsv` or use a `.tsv` file extension. The delimiter is always tab and cannot be changed.
+
 ### Parameters
 
-- `csv_has_header`: Optional. Indicate if the CSV file has header row. Defaults to `true`
-- `csv_quote`: Optional. A one-character string used to quote fields containing special characters. Defaults to `"`
-- `csv_escape`: Optional. A one-character string used to represent special characters or to include characters that would normally be interpreted as delimiters or new line characters within a field value. Defaults to `null`
-- `csv_schema_infer_max_records`: Optional. A number used to set the limit in terms of records to scan to infer the schema. Defaults to `1000`
-- `csv_delimiter`: Optional. A one-character string used to separate individual fields. Defaults to `,`
+| Parameter | Type | Default | Description |
+| ------------------------------ | ------- | ------- | ----------------------------------------------------------------------------------------------------- |
+| `tsv_has_header` | Boolean | `true` | Whether the first row contains column headers. |
+| `tsv_quote` | Char | `"` | Character used to quote fields containing special characters. |
+| `tsv_escape` | Char | _none_ | Character used to escape special characters within a field. |
+| `tsv_schema_infer_max_records` | Integer | `1000` | **Deprecated.** Use `schema_infer_max_records` instead. Maximum records scanned for schema inference. |
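
A comparable sketch for TSV, assuming headerless tab-separated log files, is shown below. No delimiter parameter appears because the delimiter is fixed to tab. The bucket, dataset name, and values are hypothetical, and the `from`/`name`/`params` layout is assumed from the usual spicepod dataset configuration.

```yaml
datasets:
  # Hypothetical bucket and dataset name, used only for illustration.
  - from: s3://example-bucket/logs/
    name: access_logs
    params:
      file_format: tsv
      # The first row is data rather than column names in this example.
      tsv_has_header: false
      # Preferred over the deprecated tsv_schema_infer_max_records.
      schema_infer_max_records: 2000
```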
 
 ## JSON
 
+Set `file_format: json` for JSON files. Use the `json_format` parameter to select the parsing mode.
+
 ### Parameters
 
-- `json_format`: Optional. Specifies the JSON format to parse. Valid values are `array`, `ndjson`, and `jsonl`. Defaults to `jsonl`
+| Parameter | Type | Default | Description |
+| -------------- | ------- | ------- | ----------------------------------------------------------------------------------------------------------------- |
+| `json_format` | String | `jsonl` | Parsing mode. `jsonl`, `ndjson`, and `ldjson` produce line-delimited JSON. `array` parses a top-level JSON array. |
+| `flatten_json` | Boolean | `false` | When `true`, nested JSON objects are flattened with `.` as a separator (e.g., `address.city`). |
+
+Setting `file_format: jsonl` uses the DataFusion JSON Lines reader directly, without `json_format` or `flatten_json` support.
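
To illustrate the difference between the two JSON paths described above, the sketch below uses `file_format: json`, since `json_format` and `flatten_json` are documented as unavailable with `file_format: jsonl`. The file path and dataset name are hypothetical, and the `from`/`name`/`params` layout is assumed from the usual spicepod dataset configuration.

```yaml
datasets:
  # Hypothetical file and dataset name, used only for illustration.
  - from: s3://example-bucket/customers.json
    name: customers
    params:
      # `json` (not `jsonl`) so that the two parameters below are honored.
      file_format: json
      # The file is assumed to hold one top-level JSON array of objects.
      json_format: array
      # Nested objects become dotted columns, e.g. address.city.
      flatten_json: true
```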
