Commit 47b93e9

docs: Document compressed file extension auto-detection
Listing data connectors (S3, ABFS, GCS, HTTP/HTTPS, FTP, SFTP, SMB, NFS, file) now auto-detect both file format and compression codec from compound extensions like `.csv.gz`, `.jsonl.zst`, `.tsv.xz`, `.json.bz2`. No `file_compression_type` parameter is required for the recognized suffixes (`.gz`, `.bz2`, `.xz`, `.zst`).

- Updates the Common Parameters table: `file_extension` and `file_compression_type` now mention compound-extension inference.
- Adds a new "Compressed File Extensions" section with the format/compression mapping and an example.

Reflects the behavior introduced in spiceai/spiceai#10809.

1 parent 98eb3b9 commit 47b93e9

1 file changed

Lines changed: 33 additions & 2 deletions

website/docs/reference/file_format.md

@@ -15,9 +15,9 @@ These parameters apply across multiple file formats.
 | Parameter | Type | Default | Description |
 | --------------------------- | ------- | -------------- | ----------------------------------------------------------------------------------------------------------------------------- |
 | `file_format` | String | Inferred | Selects the file reader. If omitted, format is inferred from the file extension. See [Supported Formats](#supported-formats). |
-| `file_extension` | String | Derived | Overrides the file extension filter used when listing files. Defaults to the extension matching the resolved format. |
+| `file_extension` | String | Derived | Overrides the file extension filter used when listing files. Defaults to the extension matching the resolved format. Compound extensions like `.csv.gz` set both format and compression at once. |
 | `schema_infer_max_records` | Integer | `1000` | Maximum number of records scanned to infer the schema. |
-| `file_compression_type` | String | `UNCOMPRESSED` | File-level compression for CSV, TSV, and JSON files. Valid values: `GZIP`, `BZIP2`, `XZ`, `ZSTD`, `UNCOMPRESSED`. |
+| `file_compression_type` | String | Inferred | File-level compression for CSV, TSV, and JSON files. Valid values: `GZIP`, `BZIP2`, `XZ`, `ZSTD`, `UNCOMPRESSED`. If unset, Spice infers compression from compound file extensions such as `.gz`, `.bz2`, `.xz`, and `.zst` (see [Compressed File Extensions](#compressed-file-extensions)). |
 | `hive_partitioning_enabled` | Boolean | `false` | Enables Hive-style partition discovery from directory structure. |

 ## Supported Formats {#supported-formats}
@@ -34,6 +34,37 @@ The `file_format` parameter accepts these values:

 When `file_format` is omitted, Spice infers the format from the dataset path extension. If the extension does not match one of the values above, a configuration error is returned.

+## Compressed File Extensions {#compressed-file-extensions}
+
+For CSV, TSV, JSON, and JSONL files, Spice auto-detects both the file format and the compression codec from compound file extensions. Files matching one of the recognized format extensions followed by a recognized compression suffix are read transparently, so no `file_compression_type` parameter is required.
+
+Recognized compression suffixes: `.gz` (GZIP), `.bz2` (BZIP2), `.xz` (XZ), `.zst` (ZSTD).
+
+| Path / extension       | Format inferred | Compression inferred |
+| ---------------------- | --------------- | -------------------- |
+| `data.csv.gz`          | `csv`           | `GZIP`               |
+| `events.jsonl.zst`     | `jsonl`         | `ZSTD`               |
+| `report.tsv.xz`        | `tsv`           | `XZ`                 |
+| `data.json.bz2`        | `json`          | `BZIP2`              |
+| `data.csv`             | `csv`           | `UNCOMPRESSED`       |
+
+The `.ndjson` and `.ldjson` extensions are also accepted as JSON Lines aliases; for example, `events.ndjson.gz` is read as JSONL with GZIP compression.
+
+Auto-detection works the same way when reading a single file or listing a directory: object-store listings are filtered by the compound extension, and the same compression codec is applied to every file matched.
+
+To override auto-detection (for example, to force a specific compression codec), set `file_compression_type` explicitly. Explicit values always take precedence over the inferred value.
+
+```yaml
+datasets:
+  - from: s3://my-bucket/exports/
+    name: exports
+    params:
+      file_format: csv # Listing only includes matching files
+      file_extension: .csv.gz # Reads gzipped CSVs in the directory
+```
+
+Auto-detection applies to listing data connectors (S3, ABFS, GCS, HTTP/HTTPS, FTP, SFTP, SMB, NFS, file) and to the HTTPS connector's auto-format detection. Parquet handles compression internally and is not affected by this setting; see [Parquet](#parquet) for the codecs Parquet supports natively.
+
 ## Parquet

 Spice reads any Parquet file regardless of the compression codec or data encoding.
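
As an illustration of the behavior this commit documents (not part of the diff itself), the following Spicepod fragment sketches the two cases the new section describes: relying on inference from a compound extension, and overriding it with an explicit `file_compression_type`. The bucket name, paths, and dataset names are hypothetical; only the parameter names and values (`file_format`, `file_compression_type`, `GZIP`) come from the documentation above.

```yaml
datasets:
  # Case 1: inference only. The path ends in `.jsonl.zst`, so per the mapping
  # table the format resolves to jsonl and the compression to ZSTD.
  # (Hypothetical bucket and object path.)
  - from: s3://my-bucket/exports/events.jsonl.zst
    name: events

  # Case 2: explicit override. Assumes a directory of gzip-compressed CSV
  # files whose names lack a recognized compression suffix, so the codec
  # cannot be inferred and is set by hand. (Hypothetical directory.)
  - from: s3://my-bucket/legacy/
    name: legacy_exports
    params:
      file_format: csv
      file_compression_type: GZIP # Explicit value takes precedence over inference
```

In the first dataset no `params` block is needed because both format and codec come from the extension. The second shows the override path, which the documentation says always wins over the inferred value.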
