| [Delta Lake](https://delta.io/) | `file_format: delta` | Stable | Open table format with ACID transactions. Object stores only. |
-
| [Apache Iceberg](https://iceberg.apache.org/) | `file_format: iceberg` | Beta | Open table format for large analytic datasets. Object stores only. Requires a [catalog](../catalogs/index.md). |
+
| [Apache Iceberg](https://iceberg.apache.org/) | `file_format: iceberg` | Stable | Open table format for large analytic datasets. Object stores only. Requires a [catalog](../catalogs/index.md). |
| Microsoft Excel | `file_format: xlsx` | Roadmap | Excel spreadsheet format |
| Markdown | `file_format: md` | Stable | Plain text with formatting (document format) |
| Text | `file_format: txt` | Stable | Plain text files (document format) |
-
| PDF | `file_format: pdf` | Alpha | Portable Document Format (document format) |
+
| PDF | `file_format: pdf` | Beta | Portable Document Format (document format) |
| Microsoft Word | `file_format: docx` | Alpha | Word document format (document format) |
### Format-Specific Parameters
@@ -186,11 +186,11 @@ File-based connectors can expose per-file object store metadata as virtual colum
| Microsoft Word | `file_format: docx` | Alpha | ✅ |
### Document Formats {#document-formats}
@@ -320,7 +320,7 @@ Runtime schema evolution controls are planned for a future release. When availab
Document formats (Markdown, Text, PDF, Word) are handled differently from structured data formats. Each file becomes a row in the resulting table, with the file contents stored in a `content` column.
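As a minimal sketch of the one-row-per-file behavior described above, a dataset over a folder of Markdown files might be declared as follows. The bucket path and dataset name are hypothetical, and the `datasets` / `from` / `name` / `params` layout assumes the usual Spicepod dataset structure.

```yaml
datasets:
  # Hypothetical location and dataset name, for illustration only.
  - from: s3://example-bucket/docs/
    name: docs
    params:
      # Each Markdown file found under the path becomes one row,
      # with the file text stored in the `content` column.
      file_format: md
```

With a configuration like this, a query such as `SELECT content FROM docs` would be expected to return one row per file.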
:::warning[Note]
-
Document formats in Alpha (PDF, DOCX) may not parse all structure or text from the underlying documents correctly.
+
Alpha-status document formats (DOCX) may not correctly extract all structure or text from the underlying documents.
Spice supports CSV, JSON, Parquet, Delta Lake, and Iceberg data file-formats for data connectors that can read files from a file system or cloud object storage (i.e. [`s3://`](../components/data-connectors/s3), [`abfs://`](../components/data-connectors/abfs), [`file://`](../components/data-connectors/file), etc.). Delta Lake and Iceberg are supported for object store connectors. Iceberg requires a catalog to be configured.
+
File-based [data connectors](../components/data-connectors/index.md) — including [`s3://`](../components/data-connectors/s3), [`abfs://`](../components/data-connectors/abfs), [`file://`](../components/data-connectors/file), [`ftp://`](../components/data-connectors/ftp), [`sftp://`](../components/data-connectors/ftp), and others — support multiple structured and document file formats. This page details the format-specific parameters available for each.
-
The parameters supported for specific file-formats are detailed on this page.
+
## Common Parameters
+
+
These parameters apply across multiple file formats.
|`file_format`| String | Inferred | Selects the file reader. If omitted, format is inferred from the file extension. See [Supported Formats](#supported-formats). |
+
|`file_extension`| String | Derived | Overrides the file extension filter used when listing files. Defaults to the extension matching the resolved format. |
+
|`schema_infer_max_records`| Integer |`1000`| Maximum number of records scanned to infer the schema. |
+
|`file_compression_type`| String |`UNCOMPRESSED`| File-level compression for CSV, TSV, and JSON files. Valid values: `GZIP`, `BZIP2`, `XZ`, `ZSTD`, `UNCOMPRESSED`. |
When `file_format` is omitted, Spice infers the format from the dataset path extension. If the extension does not match one of the values above, a configuration error is returned.
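The difference between inferred and explicit formats can be sketched as follows. This is an illustrative configuration, not taken from the source: the bucket paths and dataset names are hypothetical, and the `datasets` / `params` layout assumes the usual Spicepod dataset structure.

```yaml
datasets:
  # The path ends in .parquet, so file_format can be omitted and
  # the format is inferred from the extension.
  - from: s3://example-bucket/events/2024.parquet
    name: events

  # A directory path has no extension to infer from, so the format
  # is set explicitly to avoid a configuration error.
  - from: s3://example-bucket/logs/
    name: logs
    params:
      file_format: csv
      # Scan more records than the default of 1000 when inferring the schema.
      schema_infer_max_records: 5000
```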
## Parquet
-
Spice automatically supports reading any Parquet file, regardless of the compression codec or data encoding used.
+
Spice reads any Parquet file regardless of the compression codec or data encoding.
|`csv_has_header`| Boolean |`true`| Whether the first row contains column headers. |
+
|`csv_quote`| Char |`"`| Character used to quote fields containing special characters. |
+
|`csv_escape`| Char |_none_| Character used to escape special characters within a field. |
+
|`csv_delimiter`| Char |`,`| Character used to separate fields. |
+
|`csv_schema_infer_max_records`| Integer |`1000`|**Deprecated.** Use `schema_infer_max_records` instead. Maximum records scanned for schema inference. |
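A hedged example of how the CSV parameters in this table might be combined for a semicolon-delimited export. The bucket path and dataset name are hypothetical, and the `datasets` / `params` layout assumes the usual Spicepod dataset structure.

```yaml
datasets:
  # Hypothetical location and dataset name, for illustration only.
  - from: s3://example-bucket/exports/
    name: exports
    params:
      file_format: csv
      # The first row holds column names (this is also the default).
      csv_has_header: true
      # Fields are separated by semicolons instead of the default comma.
      csv_delimiter: ';'
      # Preferred over the deprecated csv_schema_infer_max_records.
      schema_infer_max_records: 2000
```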
+
+
## TSV
+
+
TSV (tab-separated values) is a first-class format. Set `file_format: tsv` or use a `.tsv` file extension. The delimiter is always tab and cannot be changed.
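A minimal sketch of a TSV dataset, assuming the usual Spicepod dataset structure. The bucket path and dataset name are hypothetical. No delimiter parameter appears because the tab delimiter is fixed.

```yaml
datasets:
  # Hypothetical location and dataset name, for illustration only.
  - from: s3://example-bucket/measurements/
    name: measurements
    params:
      # Needed here because the directory path carries no .tsv extension.
      file_format: tsv
      # These files are assumed to have no header row.
      tsv_has_header: false
```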
+
### Parameters
-
-`csv_has_header`: Optional. Indicate if the CSV file has header row. Defaults to `true`
-
-`csv_quote`: Optional. A one-character string used to quote fields containing special characters. Defaults to `"`
-
-`csv_escape`: Optional. A one-character string used to represent special characters or to include characters that would normally be interpreted as delimiters or new line characters within a field value. Defaults to `null`
-
-`csv_schema_infer_max_records`: Optional. A number used to set the limit in terms of records to scan to infer the schema. Defaults to `1000`
-
-`csv_delimiter`: Optional. A one-character string used to separate individual fields. Defaults to `,`
|`tsv_has_header`| Boolean |`true`| Whether the first row contains column headers. |
+
|`tsv_quote`| Char |`"`| Character used to quote fields containing special characters. |
+
|`tsv_escape`| Char |_none_| Character used to escape special characters within a field. |
+
|`tsv_schema_infer_max_records`| Integer |`1000`|**Deprecated.** Use `schema_infer_max_records` instead. Maximum records scanned for schema inference. |
## JSON
+
Set `file_format: json` for JSON files. Use the `json_format` parameter to select the parsing mode.
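A minimal sketch of selecting a JSON parsing mode, assuming the usual Spicepod dataset structure. The bucket path and dataset name are hypothetical, and the example assumes `array` remains a valid `json_format` value, as in the earlier parameter description.

```yaml
datasets:
  # Hypothetical location and dataset name, for illustration only.
  - from: s3://example-bucket/api-dumps/
    name: api_dumps
    params:
      file_format: json
      # Assumed value: each file holds a single top-level JSON array
      # of records rather than one JSON object per line.
      json_format: array
```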
+
### Parameters
-
-`json_format`: Optional. Specifies the JSON format to parse. Valid values are `array`, `ndjson`, and `jsonl`. Defaults to `jsonl`