| 1 | +# Generic File Source Options |
| 2 | + |
| 3 | +These options apply to `read_parquet`, `read_csv`, and `read_iceberg`. They are not tied to any single connector or format. Other readers (`read_json`, `read_warc`, `read_text`) do not support these options. |
| 4 | + |
| 5 | +## Ignoring Corrupt Files |
| 6 | + |
| 7 | +When reading large collections of files, some files may be unreadable: corrupt, truncated, or deleted between the time Daft lists them and the time it reads them. By default, Daft raises an error and halts the query. With `ignore_corrupt_files` enabled, Daft instead skips qualifying files, logs a warning for each one, and continues the query with the remaining data. |
| 8 | + |
| 9 | +### Enabling `ignore_corrupt_files` |
| 10 | + |
| 11 | +Pass `ignore_corrupt_files=True` to any of the supported reader functions: |
| 12 | + |
| 13 | +```python |
| 14 | +import daft |
| 15 | + |
| 16 | +# Parquet / CSV (glob-based) |
| 17 | +df = daft.read_parquet("s3://my-bucket/data/**/*.parquet", ignore_corrupt_files=True) |
| 18 | +df = daft.read_csv("s3://my-bucket/data/**/*.csv", ignore_corrupt_files=True) |
| 19 | + |
| 20 | +# Iceberg |
| 21 | +from pyiceberg.table import StaticTable |
| 22 | +table = StaticTable.from_metadata("s3://bucket/iceberg/metadata.json") |
| 23 | +df = daft.read_iceberg(table, ignore_corrupt_files=True) |
| 24 | + |
| 25 | +df.collect() |
| 26 | +``` |
| 27 | + |
| 28 | +### What counts as "corrupt" |
| 29 | + |
| 30 | +Daft skips a file when it encounters a problem that is specific to the file itself and cannot be resolved by retrying: |
| 31 | + |
| 32 | +| Category | Examples | |
| 33 | +|---|---| |
| 34 | +| **Invalid format** | Bad Parquet magic bytes, truncated footer, mismatched row/column counts | |
| 35 | +| **Corrupt data** | Unreadable row group, invalid CSV encoding, wrong field count in a row | |
| 36 | +| **Missing file** | File deleted between listing and reading (e.g. concurrent compaction or partition overwrite) | |
| 37 | + |
| 38 | +Daft does **not** skip files for problems that originate outside the file itself. Network errors are transient and should be retried, and permission errors are not a property of the file's contents: |
| 39 | + |
| 40 | +| Category | Examples | |
| 41 | +|---|---| |
| 42 | +| **Network errors** | Connection reset, read timeout, throttled I/O | |
| 43 | +| **Permission errors** | Access denied, insufficient credentials | |
| 44 | + |
| 45 | +This distinction matters. Silently skipping files on a permission error would mask a misconfiguration that needs human attention. |
| 46 | + |
| 47 | +### Observability: knowing what was skipped |
| 48 | + |
| 49 | +`ignore_corrupt_files` is designed around the principle that **errors should be visible, not hidden**. Daft provides two complementary observability mechanisms. |
| 50 | + |
| 51 | +#### Python warning logs |
| 52 | + |
| 53 | +Daft emits a `WARNING`-level log message for every skipped file, including the file path and the reason: |
| 54 | + |
| 55 | +``` |
| 56 | +WARNING daft.io - Skipping corrupt Parquet file s3://my-bucket/data/bad.parquet: ... |
| 57 | +WARNING daft.io - Skipping corrupt CSV chunk in s3://my-bucket/data/partial.csv: ... |
| 58 | +``` |
| 59 | + |
| 60 | +You can see these with standard Python logging: |
| 61 | + |
| 62 | +```python |
| 63 | +import logging |
| 64 | +logging.basicConfig(level=logging.WARNING) |
| 65 | +``` |
| 66 | + |
| 67 | +#### `df.skipped_corrupt_files` — programmatic access |
| 68 | + |
| 69 | +After materializing the dataframe with `.collect()`, the `skipped_corrupt_files` property returns the skipped files as a list of `(path, reason, partial)` tuples, so your pipeline code can act on them: |
| 70 | + |
| 71 | +```python |
| 72 | +df = daft.read_parquet("s3://my-bucket/data/**/*.parquet", ignore_corrupt_files=True) |
| 73 | +df.collect() |
| 74 | + |
| 75 | +skipped = df.skipped_corrupt_files # list[tuple[str, str, bool]] |
| 76 | +for path, reason, partial in skipped: |
| 77 | +    tag = " (partial)" if partial else "" |
| 78 | +    print(f"Skipped{tag}: {path}\n Reason: {reason}") |
| 79 | +``` |
| 80 | + |
| 81 | +Each entry is a `(path, reason, partial)` tuple. When `partial` is `True`, some batches from the file were already emitted before the corruption was detected — the file was not fully skipped. This can happen when corruption appears in a later row group. |
| 82 | + |
| 83 | +`skipped_corrupt_files` is available after calling `.collect()` on the dataframe. Other execution methods such as `.count_rows()` do not populate this property, because they operate on an internal dataframe rather than materializing the original one. |
| 84 | + |
| 85 | +### Handling skipped files in production |
| 86 | + |
| 87 | +Because `skipped_corrupt_files` is plain Python data, you can plug it directly into your existing alerting or data-quality workflows: |
| 88 | + |
| 89 | +```python |
| 90 | +import daft |
| 91 | + |
| 92 | +df = daft.read_parquet("s3://my-bucket/nightly/**/*.parquet", ignore_corrupt_files=True) |
| 93 | +df.collect().write_parquet("s3://my-bucket/processed/")  # collect() on df so that skipped_corrupt_files is populated |
| 94 | + |
| 95 | +skipped = df.skipped_corrupt_files |
| 96 | +if skipped: |
| 97 | +    # Option 1: send an alert (send_alert is a placeholder for your own alerting hook) |
| 98 | +    send_alert(f"{len(skipped)} file(s) skipped during nightly run", details=skipped) |
| 99 | + |
| 100 | +    # Option 2: push to a dead-letter queue for later reprocessing (dead_letter_queue and TODAY are placeholders) |
| 101 | +    for path, reason, partial in skipped: |
| 102 | +        dead_letter_queue.put({"path": path, "reason": reason, "partial": partial, "run": TODAY}) |
| 103 | +``` |
| 104 | + |
| 105 | +This pattern — **errors visible, impact contained, tooling to fix** — lets automated batch jobs complete reliably while still surfacing problems for human review. |
| 106 | + |
| 107 | +!!! warning "Do not use `ignore_corrupt_files` as a catch-all" |
| 108 | +    This option is designed for files that are genuinely unreadable. It should not be used to suppress transient I/O errors (network issues, throttling); Daft already retries those automatically. If you find yourself needing `ignore_corrupt_files` for a large fraction of your files, investigate the root cause rather than silencing the errors. |
| 109 | + |
| 110 | +### Supported formats |
| 111 | + |
| 112 | +| Format | File-level skip | Within-file error skip | |
| 113 | +|---|---|---| |
| 114 | +| Parquet (`read_parquet`) | Yes (bad footer, wrong magic bytes, file too small) | Yes (corrupt row group data) | |
| 115 | +| CSV (`read_csv`) | Yes (unreadable file, truncated) | Yes (bad encoding, wrong field count in chunk) | |
| 116 | +| Iceberg (`read_iceberg`) | Yes (data files go through the Rust Parquet reader) | Yes | |
| 117 | + |
| 118 | +!!! note "Iceberg delete files" |
| 119 | +    Corruption in Iceberg *delete files* is not covered. If a delete file is unreadable, Daft will raise an error regardless of `ignore_corrupt_files`. Delete files are small metadata structures, and corruption there generally indicates a more serious catalog inconsistency. |
| 120 | + |
| 121 | +!!! note "Count pushdown" |
| 122 | +    When `ignore_corrupt_files` is enabled for Parquet, count pushdown is disabled. This means `df.count()` will read all row-group data instead of using the metadata-only optimization, which may be slower on large datasets. |