feat: add ignore_corrupt_files option to read_parquet, read_csv and read_iceberg#6520
Conversation
Greptile SummaryThis PR adds an
Confidence Score: 3/5The happy-path (single collect() with ignore_corrupt_files=False) is unchanged. The two issues found are scoped to the new feature itself: the parquet corruption classifier is too permissive for IO errors, and the skipped-file snapshot in multi-input-id plan scenarios includes data from concurrent queries. Two real defects in the new feature's core contracts: src/daft-parquet/src/read.rs (is_parquet_corrupt IoError arm) and src/daft-local-execution/src/run.rs (try_finish should_remove=false branch) Important Files Changed
Sequence DiagramsequenceDiagram
participant Py as Python (DataFrame)
participant NE as NativeExecutor
participant PS as PlanState
participant STS as ScanTaskSource
participant Reader as Parquet/CSV Reader
participant Mutex as skipped_corrupt_files (Arc<Mutex>)
Py->>NE: run(plan, input_id)
NE->>PS: "create PlanState with Arc<Mutex>"
PS->>STS: ScanTaskSource::new(skipped_corrupt_files.clone())
loop For each scan task
STS->>Reader: read_scan_task(url, skipped_corrupt_files.clone())
alt "File is corrupt and ignore_corrupt_files=true"
Reader->>Mutex: lock().push((path, reason))
Reader-->>STS: Ok(empty_stream)
else File readable
Reader-->>STS: Ok(record_batch_stream)
end
end
Py->>NE: try_finish(fingerprint, input_id)
NE->>PS: task_handle.await (wait for all scan tasks)
PS->>Mutex: lock().clone() to skipped
NE-->>Py: ExecutionStats with skipped_corrupt_files
Py->>Py: df.skipped_corrupt_files returns metadata.skipped_corrupt_files
Reviews (2): Last reviewed commit: "ignore corrupt files" | Re-trigger Greptile |
|
Hi @chenghuichen , thanks for putting this together!
Fully agree with you. It seems like the main problem here is job failure when the data is bad. Would something like #6446 be of interest to you? For example, if it was possible to resume processing in a way that skips the rows that have been already processed, would you be OK failing the job? This would mean that the source needs to be fixed outside of daft before daft can process the data. I ask because although highlighting skipped files is a step towards not failing silently, failing the job is still louder and may still be preferable. For example, from the perspective of a new daft user that has no idea about file skipping. |
Thanks for the pointer — #6446 is a more principled solution to the underlying problem. Happy to close this PR if the team decides that's the right direction for #6468. |
Hi @rohitkulshreshtha @chenghuichen Suppose our daft job A failed. |
|
maybe This is syntactically invalid. When One might argue that users could simply delete the corrupted files manually. However, when dealing with a large number of damaged files, how can users identify each one individually? Moreover, as data consumers, deleting source data files is often not permitted — it's an illegitimate operation. |
|
Thanks for pushing on this @chenghuichen, and sorry for the long silence. To directly answer your 3/31 question: we don't want to close this, we want to land it. On the point @rohitkulshreshtha raised about Checkpoint V2 (#6446): looking at the proposal against this PR, I don't think they're substitutes. Checkpoint V2 filters rows by key via an anti-join over source inputs the store has seen succeed; it has no notion of whether a source file is parseable. The scenarios this PR targets (truncated parquet footer, missing file from concurrent compaction, damaged output from a prior Daft write that crashed mid-flight, which is the case @everySympathy flagged) hit the same corrupt file on every re-run, including runs that use a checkpoint. The two features compose rather than compete: checkpointing resumes progress, ignore_corrupt_files avoids re-aborting on the same bad file. The silent-data-loss concern is real, but the design here addresses it head-on: default is On @caican00's read_json request: valid case, but let's keep this PR scoped to parquet/csv/iceberg. @caican00 feel free open a follow-up issue so we don't lose it. The remaining blocker before I do a full review pass is a rebase onto main. Feel free to ping me once that's done! |
abf4e63 to
72a8e5a
Compare
|
Thanks for the detailed feedback @desmondcheongzx! Rebased onto main and ready for review. One small naming change also included: renamed |
|
@caican00 Thanks for the |
desmondcheongzx
left a comment
There was a problem hiding this comment.
Thanks for rebasing @chenghuichen! Took a closer look at your PR. Good stuff!
Left some comments that I think we should address
fa4e8e9 to
93e43b3
Compare
Merging this PR will not alter performance
Comparing Footnotes
|
|
@desmondcheongzx All updated. |
077cd52 to
cae8505
Compare
|
Rebased again. |
|
Hey @desmondcheongzx :) - Would you be able to have a look at this? Thanks! |
|
@desmondcheongzx - Checking to see if you are able to review this. Thanks! |
|
Rebased again. Reopened for re-trigger CI. |
a9f4f3e to
bd74c75
Compare
| // Accumulate skipped files from completed tasks so they are available in export_metrics(). | ||
| if let TaskEvent::Completed { ref stats, .. } = event | ||
| && !stats.skipped_corrupt_files.is_empty() | ||
| && let Ok(mut v) = self.skipped_corrupt_files.lock() | ||
| { | ||
| v.extend(stats.skipped_corrupt_files.iter().cloned()); | ||
| } | ||
|
|
There was a problem hiding this comment.
This cross-task accumulation (and the dedup in export_metrics) is the distributed half of the feature, but all the tests run on the local runner — so this path isn't exercised. Since the whole point is surfacing skips reliably, the distributed merge feels like the riskiest place to leave uncovered. Is there a good way to get a test on it here, or does that need the multi-worker harness? Happy to defer if it's disproportionate, just want to make sure it's a conscious call rather than an accidental gap.
|
Nicely done overall. Thanks for this contribution. Some non-blocking minor things: Code
Docs
|
81d5a57 to
db5b237
Compare
|
@rohitkulshreshtha Thanks for the review! Addressed the string-matching fragility, IoError denylist, and unreachable code — For partial reads: added a Count-pushdown stays disabled when Distributed test coverage: the cross-task accumulation and dedup in |
|
CI failures are pre-existing flaky tests on main (HuggingFace 429 rate limits). |
5693973 to
fc2b3ed
Compare
rohitkulshreshtha
left a comment
There was a problem hiding this comment.
Thanks, this all looks good — the typed-error classification (Error::Arrow → CorruptFile + the UnexpectedEof allowlist) is much more robust than the string matching, the dead count-pushdown branch is gone, and the partial flag + multi-row-group test cover the partial-read case I was worried about.
On the distributed coverage: totally fair to follow up separately. Thanks for driving this all the way.
fc2b3ed to
1639b94
Compare
Thank you very much! |
* origin/main: (115 commits) feat: add ignore_corrupt_files option to read_parquet, read_csv and read_iceberg (Eventual-Inc#6520) fix(deps): gate vllm to Linux so macOS/Windows resolve without CUDA wheels (Eventual-Inc#7095) fix: pass options in Gravitino PostgreSQL read method (Eventual-Inc#7047) feat(ray): Implement dynamic scale-in for RaySwordfishActor (Eventual-Inc#5903) feat(delta-lake): support column mapping for reads (Eventual-Inc#7005) feat(functions): add string distance/similarity functions (Eventual-Inc#7068) test(parquet): cover read_parquet edge cases (Eventual-Inc#7085) refactor(checkpoint): drop "seal" vocabulary from Rust API surface (Eventual-Inc#7078) fix(asof-join): use unknown clustering spec instead of hash (Eventual-Inc#7075) docs: standardize Slack links to use daft.ai/slack (Eventual-Inc#7066) feat: add try_cast function for safe type conversion (Eventual-Inc#6960) refactor(file): rename File byte-range fields to position/size (Eventual-Inc#6747) fix(ray): configure worker startup timeout on runner (Eventual-Inc#7055) feat(shuffle): default flight shuffle compression to lz4 (Eventual-Inc#7071) feat(iceberg): support branch and tag reads (Eventual-Inc#7042) fix(shuffle): concat recordbatches before repartition (Eventual-Inc#7064) perf: update jemalloc 5.3.0 → 5.3.1 to fix muzzy decay performance bug (Eventual-Inc#7059) feat: thread assume_sorted_and_aligned_partitions parameter through ASOF join (Eventual-Inc#7067) fix(flight-shuffle): reduce coordinator memory to O(map_tasks + partitions) (Eventual-Inc#7056) refactor(distributed): rename needs_hash_repartition to can_skip_hash_repartition (Eventual-Inc#7053) ... # Conflicts: # daft/checkpoint.py # src/daft-distributed/src/pipeline_node/limit.rs # src/daft-distributed/src/pipeline_node/stage_checkpoint_keys.rs # src/daft-distributed/src/scheduling/task.rs # src/daft-local-execution/src/pipeline.rs # src/daft-local-execution/src/sinks/blocking_sink.rs # src/daft-local-execution/src/sources/scan_task.rs
Motivation
Large data lakes accumulate corrupt or missing files over time. Without a skip option, a single bad file aborts an entire overnight batch job.
A simple skip flag, however, is itself a data quality hazard — a job that appears to succeed while quietly dropping data is worse than one that fails loudly. This PR treats observability as a first-class requirement: every skipped file is surfaced as structured data via
df.skipped_files: list[tuple[str, str]], available after any executing action. Pipeline code can iterate the(path, reason)pairs directly to alert, dead-letter queue, or audit log — skipped files are never silently discarded. The design goal is errors visible, impact contained, tooling to fix.What gets skipped
Network errors, timeouts, and permission errors are never swallowed — those should be retried or fixed, not silenced.
Observability
WARNINGlog per skipped file (path + reason)df.skipped_files: list[tuple[str, str]]available after any executing action, for alerting or dead-letter queuingWhat this PR does not do
_corrupt_recordcolumn (Parquet is binary columnar — there is no "raw corrupt row string" to preserve; for CSV, file-level skip is the right granularity for theignore_corrupt_filessemantic)ignore_corrupt_filesis a per-call parameter; a session default can be added later if there is demandpython_factory_func_scan_task(Lance, Paimon LSM-merge fallback, etc) are not covered. TheSkippedFilesCollectorlives in the Rust execution context and is unreachable from the Python callsite where those reads happen. A follow-up PR will introduce a sideband mechanism for this.Tests
tests/io/test_ignore_corrupt_files.py— Parquet, CSV, and Iceberg cases covering: corrupt files skipped, default raises, schema inference fallback, correctCOUNT(*), anddf.skipped_filespopulated/empty as expected.Docs
docs/connectors/ignore-corrupt-files.mdRelated Issues
Closes #6468