Fix Iceberg _pos to be file-global instead of task-local on split files #14808

Open
res-life wants to merge 14 commits into NVIDIA:release/26.06 from res-life:iceberg-fix-corruption

@res-life
Collaborator

@res-life res-life commented May 15, 2026

Fixes #14807.

Description

When Iceberg splits a single Parquet data file across multiple read tasks, GPU-side _pos values were computed task-locally, starting from 0 in each task, instead of as the file-global row position. On copy-on-write (CoW) reads this produces wrong _pos column values; on merge-on-read (MoR) reads with positional deletes it causes silent data corruption, because the wrong _pos is then matched against the delete file's position list.

Root cause

reader.getRowGroups() returns row groups filtered by the split's byte range (set via ParquetReadOptions.withRange). ParquetFileReader.readFooter also applies the range filter while parsing the footer, so getFooter().getBlocks() on a ranged reader returns only the range's row groups too. Accumulating cumulative row counts over a range-filtered block list yields task-local offsets starting at 0.
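
The effect can be reproduced with a small model. This is a plain-Python sketch, not the plugin's Scala: `RowGroup`, `blocks_in_range`, `cumulative_first_rows`, and all byte offsets are hypothetical, and the range filter is approximated as "the row group's starting offset falls inside the range".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroup:
    starting_pos: int  # absolute byte offset of the row group in the file
    row_count: int


def blocks_in_range(blocks, start, end):
    """Hypothetical stand-in for the range filter applied by a ranged reader."""
    return [b for b in blocks if start <= b.starting_pos < end]


def cumulative_first_rows(blocks):
    """First-row index of each block, accumulated over whatever list is passed in."""
    first_rows, next_row = [], 0
    for block in blocks:
        first_rows.append(next_row)
        next_row += block.row_count
    return first_rows


# A file with three row groups of 500 rows each (offsets are made up).
file_blocks = [RowGroup(4, 500), RowGroup(4100, 500), RowGroup(8200, 500)]

# The second scan task only sees the row groups inside its byte range.
task_blocks = blocks_in_range(file_blocks, 4096, 12288)

# Accumulating over the filtered list restarts at 0 (the bug).
task_local = cumulative_first_rows(task_blocks)

# Accumulating over the whole file gives the positions _pos needs.
file_global = cumulative_first_rows(file_blocks)[1:]
```

In this model the task-local list starts at 0 while the file-global list for the same two row groups starts at 500, which is the gap between what the buggy path emitted and what `_pos` should contain.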

Fix

Open a second ParquetFileReader without the range to enumerate every row group in the file, build a map from absolute byte offset (getStartingPos) to file-global first-row index, then look up each filtered block via its getStartingPos. Both FetchRowPosition (the row-position metadata column) and positional-delete matching now see absolute positions within the file regardless of split.
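
The lookup described above can be sketched as follows. This is an illustrative Python model under assumptions, not the Scala in reader.scala: blocks are represented as hypothetical `(starting_pos, row_count)` tuples and `global_first_rows` is a made-up helper name.

```python
def global_first_rows(all_blocks):
    """Map absolute starting byte offset -> file-global index of the block's first row."""
    first_rows = {}
    next_row = 0
    for starting_pos, row_count in all_blocks:
        first_rows[starting_pos] = next_row
        next_row += row_count
    return first_rows


# Every row group in the file, as an unranged reader would list them (made-up values).
all_blocks = [(4, 500), (4100, 500), (8200, 500)]

# The row groups a ranged reader returns for one split.
filtered_blocks = [(4100, 500), (8200, 500)]

lookup = global_first_rows(all_blocks)
blocks_first_row_indices = [lookup[pos] for pos, _ in filtered_blocks]
```

The byte offset works as a key because it identifies the same physical row group in both the ranged and the unranged footer view.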

Also makes FetchRowPosition.execute retry-safe under withRetryNoSplit: the per-block state (curBlockIndex, processedRowCount, processedBlockRowCounts) is advanced in locals and committed to the processor only after CudfColumnVector.fromLongs succeeds, so an OOM in column allocation does not leave the processor in a partially-advanced state.
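
A minimal sketch of the "advance in locals, commit after allocation" pattern, written in Python for illustration. `RowPositionState`, `fetch_row_position`, and the `allocate` callback are hypothetical; `MemoryError` stands in for a GPU OOM raised by the column allocation.

```python
class RowPositionState:
    """Hypothetical holder for the processor's persistent counter."""

    def __init__(self):
        self.processed_row_count = 0


def fetch_row_position(state, first_row, num_rows, allocate):
    # Work on local copies only; the shared state is not touched yet.
    start = first_row + state.processed_row_count
    new_processed = state.processed_row_count + num_rows
    column = allocate(range(start, start + num_rows))  # may raise MemoryError
    # Commit only after the allocation succeeded.
    state.processed_row_count = new_processed
    return column


def failing_once():
    """Build an allocator that fails on its first call and succeeds afterwards."""
    calls = {"n": 0}

    def allocate(values):
        calls["n"] += 1
        if calls["n"] == 1:
            raise MemoryError("simulated OOM")
        return list(values)

    return allocate


state = RowPositionState()
allocate = failing_once()
try:
    fetch_row_position(state, 500, 3, allocate)
except MemoryError:
    pass
count_after_failure = state.processed_row_count

retried = fetch_row_position(state, 500, 3, allocate)
```

Because the failed call never wrote to `state`, the retry starts from the same counter value and produces the same positions a first-time success would have.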

Test plan

  • New test_iceberg_read_pos_with_split_file in integration_tests/src/main/python/iceberg/iceberg_test.py forces a single small file to be split across multiple scan tasks via write.parquet.row-group-size-bytes = 4096, read.split.target-size = 4096, read.split.open-file-cost = 0, and asserts GPU _pos matches CPU _pos over all 1500 rows.
  • New test_iceberg_read_mor_with_pos_deletes_split_file exercises the silent-corruption path directly: v2 Merge-on-Read table with positional delete files under the same split conditions; asserts GPU and CPU return the same surviving rows after DELETE.

Checklists

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
    (Please provide the names of the existing tests in the PR description.)
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

Signed-off-by: Chong Gao <chongg@nvidia.com>

Chong Gao added 2 commits May 14, 2026 16:56
Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

@res-life res-life marked this pull request as ready for review May 20, 2026 01:11
@res-life res-life self-assigned this May 20, 2026
@sameerz sameerz requested a review from abellina May 20, 2026 01:12
@res-life res-life requested review from abellina and firestarman and removed request for abellina May 20, 2026 01:12
@sameerz sameerz requested review from gerashegalov and removed request for firestarman May 20, 2026 01:13
@sameerz sameerz added the task Work required that improves the product but is not user facing label May 20, 2026
@greptile-apps
Contributor

greptile-apps Bot commented May 20, 2026

Greptile Summary

This PR fixes a data-correctness bug where Iceberg _pos values on split Parquet files were computed task-locally (starting at 0) instead of as file-global row indices. On CoW reads this produced wrong _pos output; on MoR reads with positional deletes it caused silent row corruption. The fix opens a second range-free ParquetFileReader to build a startingPos → global-first-row map for the split case and uses that map when looking up each filtered block. An independent OOM-retry correctness fix is also included: FetchRowPosition.execute now advances counters in locals and commits them only after CudfColumnVector.fromLongs succeeds, and process() snapshots/restores those counters around the entire withRetryNoSplit block so a retry cannot see pre-advanced state.

  • reader.scala: filterParquetBlocks conditionally opens a second footer-only reader (via file.copy(split = None)) when _pos is needed and the file is split, mapping getStartingPos → file-global first-row index by physical byte offset — a stable identifier across range-filtered footers.
  • GpuParquetReaderPostProcessor.scala: Two-level OOM-retry hardening: local-variable advancement in FetchRowPosition.execute plus snapshot/restore of all three counters at the withRetryNoSplit boundary in process().
  • GpuReaderFactory.scala / GpuCoalescingIcebergParquetReader.scala: _pos-projecting scans are routed away from the coalescing reader (!hasRowPositionMetadata added to canUseCoalescing); a defensive post-processor identity check is added to checkIfNeedToSplitDataBlock as a safety net for future changes.

Confidence Score: 5/5

This PR is safe to merge. It fixes a real data-corruption bug in Iceberg split-file reads, uses resource-safe wrappers throughout, and the OOM-retry hardening is correctly structured at both the action and batch levels.

The two-level OOM protection (locals-first in FetchRowPosition.execute plus snapshot/restore at the withRetryNoSplit boundary) is logically complete. The second-reader path for global row-position lookup is correctly guarded to run only when _pos is actually projected and the file is split, and both split and non-split paths are covered by the new unit and integration tests.

No files require special attention — the logic in reader.scala and GpuParquetReaderPostProcessor.scala is well-covered by the new unit and integration tests.

Important Files Changed

| Filename | Overview |
| --- | --- |
| iceberg/common/src/main/scala/com/nvidia/spark/rapids/iceberg/parquet/reader.scala | Core fix: conditionally opens a second range-free ParquetFileReader to build a startingPos→global-row-index map for split files when _pos is projected; falls back to task-local accumulation otherwise. Logic is correct and resource-safe via withResource. |
| iceberg/common/src/main/scala/com/nvidia/spark/rapids/iceberg/parquet/GpuParquetReaderPostProcessor.scala | OOM-retry hardening for FetchRowPosition: locals-first counter advancement in execute() and snapshot/restore in process(); both levels are needed and complement each other correctly. |
| iceberg/common/src/main/scala/org/apache/iceberg/spark/source/GpuReaderFactory.scala | Adds !hasRowPositionMetadata to canUseCoalescing and disableCombining guards; correctly routes _pos-projecting scans to single-file/multi-thread readers. |
| iceberg/common/src/main/scala/com/nvidia/spark/rapids/iceberg/parquet/GpuCoalescingIcebergParquetReader.scala | Defensive post-processor identity check added at the top of checkIfNeedToSplitDataBlock; curExtra/nextExtra extraction moved up to support the new check without behavioral change to existing logic. |
| integration_tests/src/main/python/iceberg/iceberg_test.py | Two new integration tests covering the CoW _pos split-file bug and the MoR positional-delete silent-corruption path; both use assert_gpu_and_cpu_are_equal_collect for proper GPU execution verification. |
| tests/src/test/spark350/scala/com/nvidia/spark/rapids/iceberg/GpuPostProcessorRetrySuite.scala | New retry suite using RmmSparkRetrySuiteBase; injects OOM after FetchRowPosition.fromLongs to verify snapshot/restore correctness; validates both the retried batch and the follow-up batch. |
| tests/src/test/spark350/scala/com/nvidia/spark/rapids/iceberg/GpuPostProcessorSuite.scala | Adds createMultiBlockParquetInfo helper and a new FetchRowPosition multi-block test covering file-global offsets, intra-block, cross-block, and trailing-block batches. |

Sequence Diagram

sequenceDiagram
    participant RF as GpuReaderFactory
    participant FPB as filterParquetBlocks
    participant RR as RangedReader
    participant FR as FullReader
    participant PP as PostProcessor

    RF->>RF: hasRowPositionMetadata check
    alt _pos projected
        RF->>FPB: filterParquetBlocks(file, requiredSchema)
        FPB->>RR: file.newReader with range
        RR-->>FPB: filteredBlocks (range-only row groups)
        FPB->>FR: "file.copy(split=None).newReader"
        FR-->>FPB: getFooter.getBlocks (all row groups)
        FPB->>FPB: build startingPos to fileGlobalFirstRow map
        FPB->>FPB: lookup each filteredBlock by startingPos
        FPB-->>PP: blocksFirstRowIndices with global offsets
        PP->>PP: FetchRowPosition emits file-global _pos
    else no _pos
        FPB->>RR: file.newReader with range
        RR-->>FPB: filteredBlocks
        FPB->>FPB: accumulate task-local indices
        FPB-->>PP: blocksFirstRowIndices task-local
    end
    Note over PP: OOM Retry
    PP->>PP: snapshot curBlockIndex and counters
    PP->>PP: withRetryNoSplit restores snapshot on retry
    PP->>PP: FetchRowPosition advances locals, commits after fromLongs

Reviews (8): Last reviewed commit: "Fix _pos retry test field id collision w..."

@res-life res-life requested a review from a team May 20, 2026 03:12
Comment thread integration_tests/src/main/python/iceberg/iceberg_test.py Outdated
Chong Gao added 6 commits May 21, 2026 09:46
…acker id

- reader.scala: replace the bug-history comment with a brief description of the
  current logic.
- GpuParquetReaderPostProcessor.scala: name the enclosing withRetryNoSplit
  block instead of an unattributed "lambda".
- iceberg_test.py: drop internal tracker id from the regression test comment
  and describe the scenario the test exercises rather than reader internals.

Signed-off-by: Chong Gao <res_life@163.com>
Exercises the silent-corruption path on Merge-on-Read tables directly:
single Parquet data file split across multiple scan tasks plus a positional
delete file. The query does not project _pos; the Iceberg reader injects
ROW_POSITION into the read schema to match against the delete list, so an
incorrect _pos would silently drop or resurface rows without any user
projection of the metadata column. The test asserts the GPU result matches
CPU over the surviving rows.
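
To make the "drop or resurface" failure concrete, here is a small Python model of positional-delete matching. It is a sketch under assumptions: `apply_position_deletes`, the row labels, and the delete positions are all hypothetical, and the real matching happens inside the Iceberg reader rather than in a list comprehension.

```python
def apply_position_deletes(rows, first_row, deleted_positions):
    """Keep rows whose file-global position is not in the delete list."""
    return [
        row
        for offset, row in enumerate(rows)
        if first_row + offset not in deleted_positions
    ]


# Second split of a data file: it holds the rows at file positions 500..504.
split_rows = ["r500", "r501", "r502", "r503", "r504"]

# Positions recorded in the positional delete file (file-global by definition).
deleted_positions = {2, 501}

# Correct: the split knows its first row is at file position 500.
correct = apply_position_deletes(split_rows, 500, deleted_positions)

# Buggy: the split assumes it starts at 0, so position 2 hits the wrong row.
buggy = apply_position_deletes(split_rows, 0, deleted_positions)
```

With task-local positions the deleted row `r501` resurfaces and the live row `r502` is dropped, and both result lists have the same length, which is why the corruption is silent.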

Signed-off-by: Chong Gao <res_life@163.com>
A bare HashMap lookup on the filtered block's getStartingPos throws
NoSuchElementException with only the offset, leaving distributed-job
operators without a file path to diagnose against. Use getOrElse and raise
an IllegalStateException naming the file and offset instead.
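
The same idea in a Python sketch, where `dict.get` plays the role of `getOrElse` and `RuntimeError` stands in for `IllegalStateException`. The helper name `first_row_for`, the example path, and the message wording are hypothetical.

```python
def first_row_for(first_rows, starting_pos, file_path):
    """Look up a block's file-global first row, failing with a diagnosable message."""
    first_row = first_rows.get(starting_pos)
    if first_row is None:  # 0 is a valid first row, so compare against None
        raise RuntimeError(
            f"No row group starts at byte offset {starting_pos} in file {file_path}"
        )
    return first_row


first_rows = {4: 0, 4100: 500}
path = "/warehouse/db/tbl/data/part-00000.parquet"  # hypothetical path

found = first_row_for(first_rows, 4, path)

try:
    first_row_for(first_rows, 9999, path)
    message = None
except RuntimeError as err:
    message = str(err)
```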

Signed-off-by: Chong Gao <res_life@163.com>
…ile-global

The previous fix used reader.getFooter on the same ParquetFileReader that was
opened with ParquetReadOptions.withRange(start, end) for split scan tasks.
ParquetFileReader.readFooter applies the range filter while parsing the
footer, so getFooter().getBlocks() on a ranged reader returns only the
row groups inside the range. Accumulating row counts over that filtered
list reintroduces the task-local first-row indices the original commit was
meant to eliminate — visible directly when the data file has multiple row
groups and Iceberg's planner splits it across tasks, and indirectly on
MoR reads where the resulting wrong _pos values mismatch the positional
delete file and drop or resurface rows silently.

When file.split is defined, open a separate reader without the range to
enumerate every row group in the file with its absolute byte offset
(getStartingPos), build the file-global first-row-index map from that,
then look up the ranged reader's filtered blocks against it. When file
has no split, fall back to the existing single-reader path.

Signed-off-by: Chong Gao <res_life@163.com>
FetchRowPosition.execute already advances state in locals and commits to the
processor only after CudfColumnVector.fromLongs succeeds, but the wrapping
withRetryNoSplit covers the entire safeMap iteration in
GpuParquetReaderPostProcessor.process. A later field action (UpCast,
FillNull, GpuColumnVector.from, ...) in the same iteration can OOM and
trigger a retry of the whole block, at which point execute() would run again
against already-advanced counters and emit _pos values off by numRows.

Snapshot curBlockIndex / processedRowCount / processedBlockRowCounts before
entering withRetryNoSplit and restore them at the top of each attempt, so
the retry restarts from the same processor state regardless of which action
inside the block raised the OOM. Tighten the execute() comment to describe
the local-commit contract rather than claim full retry safety in isolation.
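
The snapshot/restore contract can be illustrated with a Python model. Everything here is a hypothetical stand-in: `with_retry` mimics the role of withRetryNoSplit by rerunning the whole attempt after a `MemoryError`, and the two actions model a position-emitting action followed by a later action that runs out of memory once.

```python
class Counters:
    def __init__(self):
        self.cur_block_index = 0
        self.processed_row_count = 0


def with_retry(attempt, max_attempts=3):
    """Hypothetical stand-in for withRetryNoSplit: rerun the attempt after an OOM."""
    for remaining in range(max_attempts - 1, -1, -1):
        try:
            return attempt()
        except MemoryError:
            if remaining == 0:
                raise


def fetch_row_position(state, num_rows):
    start = state.processed_row_count
    state.processed_row_count += num_rows  # committed once this action succeeds
    return list(range(start, start + num_rows))


def make_fill_null(fail_times):
    remaining = {"n": fail_times}

    def fill_null(state, num_rows):
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise MemoryError("simulated OOM in a later field action")
        return [None] * num_rows

    return fill_null


def process(state, num_rows, actions):
    snapshot = (state.cur_block_index, state.processed_row_count)

    def attempt():
        # Restore at the top of every attempt so a rerun sees the original state.
        state.cur_block_index, state.processed_row_count = snapshot
        return [action(state, num_rows) for action in actions]

    return with_retry(attempt)


state = Counters()
columns = process(state, 3, [fetch_row_position, make_fill_null(fail_times=1)])
```

Without the restore line, the second attempt in this model would start from a counter of 3 and emit positions 3, 4, 5, which is the "off by numRows" shape the commit message describes.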

Signed-off-by: Chong Gao <res_life@163.com>
Existing GpuPostProcessorSuite cases always build a ParquetFileInfoWithBlockMeta
with a single block whose first-row index is 0, so they never exercise the
multi-block state-traversal logic in FetchRowPosition.execute (incrementing
localBlockIndex, picking blocksFirstRowIndices(localBlockIndex), accumulating
processedBlockRowCounts across blocks). The integration tests cover this
end-to-end but are slow to run.

Add a createMultiBlockParquetInfo helper that builds non-zero-based
blocksFirstRowIndices and a new test that drives the processor across two
blocks via three successive process() calls. The test asserts file-global
_pos values 500..799, 800..999, 1000..1299 — directly catching regressions
where _pos restarts at the task-local 0 or fails to advance into the next
block.
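
The expected ranges can be checked against a small Python model of the block traversal. This is a sketch, not the Scala test: `make_row_position_emitter` is a hypothetical name, and the block layout (first rows 500 and 1000, row counts 500 and 400) is an assumption chosen to be consistent with the `_pos` ranges quoted above.

```python
def make_row_position_emitter(blocks_first_row_indices, block_row_counts):
    """Return a function that emits file-global positions batch by batch."""
    state = {"block": 0, "offset_in_block": 0}

    def emit(num_rows):
        positions = []
        while num_rows > 0:
            block = state["block"]
            available = block_row_counts[block] - state["offset_in_block"]
            take = min(num_rows, available)
            start = blocks_first_row_indices[block] + state["offset_in_block"]
            positions.extend(range(start, start + take))
            state["offset_in_block"] += take
            num_rows -= take
            # Move to the next block once the current one is exhausted.
            if state["offset_in_block"] == block_row_counts[block]:
                state["block"] += 1
                state["offset_in_block"] = 0
        return positions

    return emit


emit = make_row_position_emitter([500, 1000], [500, 400])
batches = [emit(300), emit(200), emit(300)]
```

The first two batches stay inside the first block and the third starts at the second block's first row, so a regression that restarts at 0 or fails to advance blocks changes at least one emitted value.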

Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

Comment thread integration_tests/src/main/python/iceberg/iceberg_test.py
Chong Gao added 2 commits May 22, 2026 15:16
filterParquetBlocks now opens the second range-less Parquet reader only
when the required schema actually projects ROW_POSITION (either by the
user or by a positional delete filter). Ordinary Iceberg scans go back
to the cheap ranged-reader accounting and avoid the extra footer fetch
that was paid on every data-file scan task.
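
A Python sketch of the guard, counting how often the extra footer read happens. The function name, the `load_all_blocks` callback, and the `(starting_pos, row_count)` tuples are hypothetical; only the condition (ROW_POSITION projected and the file split) comes from the commit message.

```python
def blocks_first_row_indices(filtered_blocks, load_all_blocks,
                             projects_row_position, is_split):
    if projects_row_position and is_split:
        # Expensive path: one extra footer read to learn every row group.
        first_rows, next_row = {}, 0
        for starting_pos, row_count in load_all_blocks():
            first_rows[starting_pos] = next_row
            next_row += row_count
        return [first_rows[pos] for pos, _ in filtered_blocks]
    # Cheap path: accumulate over the blocks the ranged reader already returned.
    indices, next_row = [], 0
    for _, row_count in filtered_blocks:
        indices.append(next_row)
        next_row += row_count
    return indices


footer_fetches = {"count": 0}


def load_all_blocks():
    footer_fetches["count"] += 1
    return [(4, 500), (4100, 500), (8200, 500)]


filtered = [(4100, 500), (8200, 500)]

plain_scan = blocks_first_row_indices(
    filtered, load_all_blocks, projects_row_position=False, is_split=True)
fetches_after_plain_scan = footer_fetches["count"]

pos_scan = blocks_first_row_indices(
    filtered, load_all_blocks, projects_row_position=True, is_split=True)
fetches_after_pos_scan = footer_fetches["count"]
```

In this model the task-local indices of the plain scan are harmless because nothing consumes them as `_pos`.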

Signed-off-by: Chong Gao <res_life@163.com>
GpuCoalescingIcebergParquetReader.checkIfNeedToSplitDataBlock now splits
whenever two adjacent blocks carry distinct GpuParquetReaderPostProcessor
instances. Each Iceberg split owns its own post-processor with private
block metadata and _pos counters, so two splits of the same physical
Parquet file share a file path but must not be coalesced under one
finalize call — the first split's post-processor would otherwise emit
wrong _pos values and index past its block array for rows that belong
to the second split.
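
The splitting rule can be sketched in Python as grouping adjacent blocks by the identity of their post-processor. `split_by_post_processor` and the `(block_name, post_processor)` pairs are hypothetical; plain `object()` instances stand in for GpuParquetReaderPostProcessor.

```python
def split_by_post_processor(blocks):
    """Group adjacent blocks into chunks that share one post-processor instance.

    blocks: list of (block_name, post_processor) pairs in read order.
    """
    chunks = []
    for name, post_processor in blocks:
        # Identity (`is`), not equality: each split owns its own instance.
        if chunks and chunks[-1][0] is post_processor:
            chunks[-1][1].append(name)
        else:
            chunks.append((post_processor, [name]))
    return [names for _, names in chunks]


p0, p1 = object(), object()  # one post-processor per Iceberg split
chunks = split_by_post_processor([("rg0", p0), ("rg1", p0), ("rg2", p1)])
```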

Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

@nvauto
Collaborator

nvauto commented May 25, 2026

NOTE: release/26.06 has been created from main. Please retarget your PR to release/26.06 if it should be included in the release.

@res-life res-life changed the base branch from main to release/26.06 May 25, 2026 09:58
@sameerz sameerz added bug Something isn't working and removed task Work required that improves the product but is not user facing labels May 26, 2026
@johnnyzhon johnnyzhon added the QA_BUG_FIX Dev team pull this request to fix some QA bug label May 27, 2026
@gerashegalov gerashegalov requested review from a team and gerashegalov May 27, 2026 06:39
… splits

The earlier "Split coalescing chunks per post-processor identity" check was
effectively dead code: GpuMultiFileReader.populateCurrentBlockChunk only
calls checkIfNeedToSplitDataBlock when the next block's filePath differs
from the current chunk's filePath. Two Iceberg splits of the same physical
Parquet file share their real path, so the parent skipped the override and
appended the next block straight into the chunk while extraInfo (set once
at currentFile == null) stayed pinned to the first split's post-processor.
Finalize then ran P0 over rows from P1 — the original _pos failure shape
was still live.

Tag the path used by the parent equality check with the Iceberg split
range when building ParquetSingleDataBlockMeta. Same-file same-split blocks
still share the synthetic path and coalesce normally; same-file
different-split blocks now present as different "files" to the parent, hit
the path-change branch, invoke the override, and the postProcessor
identity check forces a chunk split where the new chunk picks up its own
extraInfo. The tag is only used by the parent for in-memory equality
(currentFile in populateCurrentBlockChunk); the real path is preserved on
the post-processor and remains the value materialized for the _file
metadata column.

Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

Chong Gao added 2 commits May 29, 2026 16:16
… _pos retry test

Three follow-ups to PR review on the Iceberg _pos split-file fix:

1. GpuCoalescingIcebergParquetReader: the block comment around the
   synthetic split-tagged path claimed the tagged path was only used for
   the parent coalescer's equality checks. That was wrong — it is also
   stored as ParquetSingleDataBlockMeta.filePath, used as a key in the
   parent's per-file block map, logged, and passed to fileIO.newInputFile
   when blocks are read. The file still opens correctly because Hadoop
   FileSystem implementations strip the URI fragment when resolving the
   underlying path, but that is an implicit invariant. Rewrite the
   comment to document the full propagation, the FileSystem-fragment
   assumption it depends on, and the FileCache side effect (per-split
   cache keys, no cross-split sharing). Also point at the actual class
   name (MultiFileCoalescingPartitionReaderBase) instead of the file
   name. Behavior unchanged.

2. GpuPostProcessorSuite "FetchRowPosition emits file-global _pos across
   multiple blocks": the original batch shape (300 / 200 / 300 rows
   against blocks of 500 and 400) consumed block 0 in whole-batch
   increments, so no single process() call ever spanned a block
   boundary. The inner `if (curRowPos >= curBlockRowEnd)` branch in
   FetchRowPosition.execute was never reached mid-loop, leaving a
   mutation like `>` for `>=` undetected. Restructure the batches to
   300 / 100 / 350 / 150, so the third batch straddles block 0 → block
   1, and tighten assertPosRange to verify every element of each batch
   rather than just the first and last — a middle-row off-by-one would
   otherwise slip through.

3. Add GpuPostProcessorRetrySuite to exercise the snapshot/restore of
   FetchRowPosition counters around withRetryNoSplit in
   GpuParquetReaderPostProcessor.process. The fix exists because
   FetchRowPosition commits its counter advance to the processor as
   soon as fromLongs() succeeds, but a later field action in the same
   safeMap iteration (UpCast, FillNull, GpuColumnVector.from, ...) can
   OOM and cause withRetryNoSplit to rerun the whole block — without a
   pre-block snapshot/restore, the rerun would see already-advanced
   counters and emit _pos off by numRows. The new test projects _pos
   plus a missing optional field (which lowers to a FillNull action),
   injects a single GPU retry-OOM with skipCount=1 so
   FetchRowPosition.fromLongs succeeds and the next allocation
   (FillNull) fails, then asserts the produced _pos column matches the
   pre-retry expected sequence. A follow-up batch with no injected OOM
   confirms the counters end up where a single successful 300-row call
   would have left them — catching either a missing restore or an
   accidental double restore.

Signed-off-by: Chong Gao <res_life@163.com>
The prior "tag the path with the split range" approach broke regular
Iceberg scans: Hadoop FileSystem implementations (including the local
filesystem) do NOT strip URI fragments when resolving the underlying
file, so the tagged path `…parquet#iceberg-split=…` flowed through
ParquetSingleDataBlockMeta into fileIO.newInputFile and failed with
FileNotFoundException on every coalesced read.
test_iceberg_parquet_read_round_trip_all_types[COALESCING] reproduced
this on local files.
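
A loose analogy in Python, using the local filesystem through the standard library rather than Hadoop's FileSystem API: a `#...` suffix appended to a path is treated as part of the file name, so the tagged path names a file that does not exist. The tag value `0-4096` is a made-up example, not the format the reverted commit used.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    real_path = os.path.join(tmp, "data.parquet")
    with open(real_path, "wb") as f:
        f.write(b"PAR1")

    # Hypothetical split tag appended as a URI-style fragment.
    tagged_path = real_path + "#iceberg-split=0-4096"

    real_exists = os.path.exists(real_path)
    tagged_exists = os.path.exists(tagged_path)

    try:
        open(tagged_path, "rb").close()
        open_error = None
    except FileNotFoundError as err:
        open_error = err
```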

Back out the synthetic-path mechanism and instead route _pos-projecting
scans away from the coalescing reader at the factory:

- GpuReaderFactory.calcThreadConf: add `!hasRowPositionMetadata` to
  canUseCoalescing. Positional-delete and eq-delete cases are already
  excluded by hasNoDeletes, so this closes the only remaining path
  where coalescing could see ROW_POSITION in the required schema. The
  multi-thread and single-file readers finalize per IcebergPartitionedFile
  via findIcebergFile, so each split's post-processor is the one that
  finalizes its rows — same-file multi-split _pos is correct there.

- GpuCoalescingIcebergParquetReader.createParquetReader: revert to using
  info.filePath directly in ParquetSingleDataBlockMeta. Keep the
  postProcessor-identity check in checkIfNeedToSplitDataBlock as a
  defense-in-depth safety net (and update its comment to reflect that
  the _pos corruption it originally guarded is now prevented upstream
  by the factory routing change).

Signed-off-by: Chong Gao <res_life@163.com>
The second projected field in GpuPostProcessorRetrySuite used
`rowPosId + 1` for its field id. ROW_POSITION.fieldId() is
Integer.MAX_VALUE - 2 and FILE_PATH.fieldId() is Integer.MAX_VALUE - 1,
so `rowPosId + 1` collided with FILE_PATH. ActionBuildingVisitor
therefore lowered the second field to FetchFilePath, not FillNull, and
the test was no longer exercising the intended "later FillNull
allocation OOMs after FetchRowPosition committed counter state" retry
path documented in the comment.

Use an ordinary non-metadata field id (1) so the second field stays
optional/missing/non-constant and lowers to FillNull as intended.
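
The collision is simple arithmetic, shown here in Python with the field ids quoted in the commit message. The variable names are hypothetical; `2**31 - 1` is the value of Java's `Integer.MAX_VALUE`.

```python
INT_MAX = 2**31 - 1  # Java's Integer.MAX_VALUE

ROW_POSITION_ID = INT_MAX - 2  # as stated in the commit message
FILE_PATH_ID = INT_MAX - 1     # as stated in the commit message

old_second_field_id = ROW_POSITION_ID + 1  # the id the test used before the fix
new_second_field_id = 1                    # ordinary, non-metadata id

metadata_ids = {ROW_POSITION_ID, FILE_PATH_ID}
collides_before = old_second_field_id in metadata_ids
collides_after = new_second_field_id in metadata_ids
```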

Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

Labels

bug: Something isn't working
QA_BUG_FIX: Dev team pull this request to fix some QA bug

Development

Successfully merging this pull request may close these issues.

[BUG] Iceberg _pos is task-local on split data files, causing silent data corruption on MoR reads with positional deletes

6 participants