misc: Fix Projector stream offset mapping for multi-column projection #539

Open
xiaoxmeng wants to merge 1 commit into facebookincubator:main from xiaoxmeng:export-D95328184

@xiaoxmeng (Contributor)

Summary:
CONTEXT: When projecting multiple top-level columns (e.g., int_traits_map and
long_traits_map), the Projector crashes during deserialization because the
projected schema's stream offsets don't match the actual data layout.

The root cause: buildProjectedSchema assigned output offsets sequentially via
depth-first traversal (all int_traits_map streams first, then long_traits_map).
But project() copies input streams to output positions based on
inputStreamIndices_ (sorted by input offset). When sibling subtrees' input
offsets interleave numerically (e.g., long_traits_map nulls=2 falls between
int_traits_map nulls=1 and its first child at offset 3), the depth-first
traversal assigns wrong output offsets, causing the Deserializer to
misinterpret stream data types and crash.
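
The mismatch can be reproduced with a small standalone sketch. This is not the Projector code: `Subtree`, `assignDepthFirst`, and `positionsInSortedOrder` are hypothetical names, and every offset other than 1, 2, and 3 is invented for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical model: each projected column lists the input stream offsets
// of its subtree in depth-first order.
using Subtree = std::vector<uint32_t>;

// Models the old behaviour described above: output offsets are handed out
// sequentially while walking the columns depth-first.
std::map<uint32_t, uint32_t> assignDepthFirst(
    const std::vector<Subtree>& columns) {
  std::map<uint32_t, uint32_t> inputToOutput;
  uint32_t next = 0;
  for (const auto& column : columns) {
    for (uint32_t inputOffset : column) {
      // Assumption: every input stream belongs to exactly one subtree.
      assert(inputToOutput.count(inputOffset) == 0);
      inputToOutput[inputOffset] = next++;
    }
  }
  return inputToOutput;
}

// Models where the data actually lands: streams are copied in ascending
// input-offset order, so the output position is the rank of the offset.
std::map<uint32_t, uint32_t> positionsInSortedOrder(
    const std::vector<Subtree>& columns) {
  std::vector<uint32_t> sorted;
  for (const auto& column : columns) {
    sorted.insert(sorted.end(), column.begin(), column.end());
  }
  std::sort(sorted.begin(), sorted.end());
  std::map<uint32_t, uint32_t> inputToOutput;
  for (uint32_t i = 0; i < sorted.size(); ++i) {
    inputToOutput[sorted[i]] = i;
  }
  return inputToOutput;
}
```

With the assumed layout int_traits_map = {1, 3, 4} and long_traits_map = {2, 5, 6}, the depth-first pass maps input offset 2 to output offset 3, while the sorted copy places that stream at position 1. The schema and the data disagree for every stream after the first.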

WHAT: Replace sequential offset allocation in buildProjectedSchema with an
offset map (input offset → output index) derived from sorted
inputStreamIndices_. This ensures projected schema offsets exactly match the
data layout produced by project(), regardless of how input stream offsets
interleave across sibling subtrees.
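
A minimal sketch of the offset-map idea, assuming the selected input offsets are available as a plain vector. `buildOffsetMap` and `projectedOffset` are hypothetical helpers for illustration, not the functions added by this change.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Builds input offset -> output index from the selected input stream
// offsets. Sorting first makes the output index equal to the rank of the
// input offset, which is the order a sorted copy would write the streams in.
std::unordered_map<uint32_t, uint32_t> buildOffsetMap(
    std::vector<uint32_t> inputStreamIndices) {
  std::sort(inputStreamIndices.begin(), inputStreamIndices.end());
  // Drop duplicates so each input stream gets exactly one output slot.
  inputStreamIndices.erase(
      std::unique(inputStreamIndices.begin(), inputStreamIndices.end()),
      inputStreamIndices.end());

  std::unordered_map<uint32_t, uint32_t> offsetMap;
  offsetMap.reserve(inputStreamIndices.size());
  for (uint32_t i = 0; i < inputStreamIndices.size(); ++i) {
    offsetMap.emplace(inputStreamIndices[i], i);
  }
  return offsetMap;
}

// Looks up the projected offset for a schema node while the schema is
// traversed in any order (depth-first included).
uint32_t projectedOffset(
    const std::unordered_map<uint32_t, uint32_t>& offsetMap,
    uint32_t inputOffset) {
  auto it = offsetMap.find(inputOffset);
  assert(it != offsetMap.end()); // node must be part of the projection
  return it->second;
}
```

Because the lookup depends only on the input offset, the traversal order used to build the projected schema no longer affects the offsets it assigns.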

Differential Revision: D95328184

meta-cla bot added the CLA Signed label on Mar 5, 2026. (This label is managed by the Meta Open Source bot.)

meta-codesync bot commented on Mar 5, 2026:

@xiaoxmeng has exported this pull request. If you are a Meta employee, you can view the originating Diff in D95328184.

Labels: CLA Signed, fb-exported, meta-exported