Listing data connectors (S3, ABFS, GCS, HTTP/HTTPS, FTP, SFTP, SMB, NFS,
file) now auto-detect both file format and compression codec from
compound extensions like `.csv.gz`, `.jsonl.zst`, `.tsv.xz`, `.json.bz2`.
No `file_compression_type` parameter is required for the recognized
suffixes (`.gz`, `.bz2`, `.xz`, `.zst`).
- Updates the Common Parameters table: `file_extension` and
`file_compression_type` now mention compound-extension inference.
- Adds a new "Compressed File Extensions" section with the format/
compression mapping and an example.
Reflects the behavior introduced in spiceai/spiceai#10809.
|`file_format`| String | Inferred | Selects the file reader. If omitted, format is inferred from the file extension. See [Supported Formats](#supported-formats). |
-|`file_extension`| String | Derived | Overrides the file extension filter used when listing files. Defaults to the extension matching the resolved format. |
+|`file_extension`| String | Derived | Overrides the file extension filter used when listing files. Defaults to the extension matching the resolved format. Compound extensions like `.csv.gz` set both format and compression at once. |
|`schema_infer_max_records`| Integer |`1000`| Maximum number of records scanned to infer the schema. |
-|`file_compression_type`| String |`UNCOMPRESSED`| File-level compression for CSV, TSV, and JSON files. Valid values: `GZIP`, `BZIP2`, `XZ`, `ZSTD`, `UNCOMPRESSED`. |
+|`file_compression_type`| String | Inferred | File-level compression for CSV, TSV, and JSON files. Valid values: `GZIP`, `BZIP2`, `XZ`, `ZSTD`, `UNCOMPRESSED`. If unset, Spice infers compression from compound file extensions such as `.gz`, `.bz2`, `.xz`, and `.zst` (see [Compressed File Extensions](#compressed-file-extensions)). |
@@ -34,6 +34,37 @@ The `file_format` parameter accepts these values:
When `file_format` is omitted, Spice infers the format from the dataset path extension. If the extension does not match one of the values above, a configuration error is returned.
For CSV, TSV, JSON, and JSONL files, Spice auto-detects both the file format and the compression codec from compound file extensions. Files matching one of the recognized format extensions followed by a recognized compression suffix are read transparently — no `file_compression_type` parameter is required.
The `.ndjson` and `.ldjson` extensions are also accepted as JSON Lines aliases — `events.ndjson.gz` is read as JSONL with GZIP compression.
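As a minimal sketch of the behavior described above, a dataset that points directly at a compound-extension file should need no format or compression parameters. The bucket path and dataset name here are hypothetical and chosen only for illustration:

```yaml
datasets:
  # Hypothetical path; the .ndjson.gz suffix is what drives inference
  - from: s3://my-bucket/logs/events.ndjson.gz
    name: events
    # No params block: format (JSONL) and compression (GZIP)
    # are both inferred from the compound extension
```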
Auto-detection works the same way when reading a single file or listing a directory: object-store listings are filtered by the compound extension, and the same compression codec is applied to every file matched.
To override auto-detection — for example, to force a specific compression codec — set `file_compression_type` explicitly. Explicit values always take precedence over the inferred value.
```yaml
datasets:
  - from: s3://my-bucket/exports/
    name: exports
    params:
      file_format: csv # Listing only includes matching files
      file_extension: .csv.gz # Reads gzipped CSVs in the directory
```
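An explicit codec override might look like the following sketch. It assumes a hypothetical directory of GZIP-compressed CSV files and uses only the parameters and valid values listed in the Common Parameters table; the path and dataset name are illustrative:

```yaml
datasets:
  # Hypothetical path; files are assumed to be GZIP-compressed CSVs
  - from: s3://my-bucket/archive/
    name: archive
    params:
      file_format: csv
      file_compression_type: GZIP # Explicit value takes precedence over the inferred codec
```

Setting the codec explicitly is mainly useful when file names do not reflect how the files are actually compressed.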
Auto-detection applies to listing data connectors (S3, ABFS, GCS, HTTP/HTTPS, FTP, SFTP, SMB, NFS, file) and to the HTTPS connector's auto-format detection. Parquet handles compression internally and is not affected by this setting — see [Parquet](#parquet) for the codecs Parquet supports natively.
## Parquet
Spice reads any Parquet file regardless of the compression codec or data encoding.