fix: Spark Schema Evolution Fix for nested columns #14075
Describe the issue this Pull Request addresses
Currently, when Spark reads a Parquet file whose schema differs from the requested schema, the read can fail at runtime. This happens specifically when a column contains a record and the query selects a subset of that record's fields: if the record's schema in a particular data file contains none of the requested fields, the read errors out.
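The failure mode can be illustrated with a minimal sketch. This is not the actual Hudi/Spark API; the dict-based schemas and the `project` helper are hypothetical stand-ins for Parquet schema projection:

```python
# Hypothetical, simplified schema representations (not the real Spark/Parquet types).
# The query selects address.zip, but an older data file only has address.street.
requested = {"id": "long", "address": {"zip": "string"}}
file_schema = {"id": "long", "address": {"street": "string"}}

def project(requested, file_schema):
    """Naively intersect the requested fields with a file's schema."""
    out = {}
    for name, typ in requested.items():
        if name not in file_schema:
            continue
        if isinstance(typ, dict):
            # Recursing into a record can produce an empty struct
            # when the file has none of the requested sub-fields.
            out[name] = project(typ, file_schema[name])
        else:
            out[name] = typ
    return out

print(project(requested, file_schema))
# {'id': 'long', 'address': {}} — the empty record is what triggers the runtime error
```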
Summary and Changelog
The Spark Parquet reader is updated to drop any selected fields that would resolve to records with no sub-fields.
This is validated by enabling the FileGroupReader path in one of the existing tests.
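The fix can be sketched as a pruning pass over the projected schema. Again, this is a hypothetical illustration using simplified dict-based schemas, not the actual reader code:

```python
def prune_empty_records(requested, file_schema):
    """Project the requested schema onto a file's schema, dropping any
    nested record that ends up with no sub-fields (instead of leaving an
    empty struct that would fail at read time)."""
    out = {}
    for name, typ in requested.items():
        if name not in file_schema:
            continue
        if isinstance(typ, dict):
            nested = prune_empty_records(typ, file_schema[name])
            if nested:  # drop fields that resolved to empty records
                out[name] = nested
        else:
            out[name] = typ
    return out

requested = {"id": "long", "address": {"zip": "string"}}
file_schema = {"id": "long", "address": {"street": "string"}}
print(prune_empty_records(requested, file_schema))
# {'id': 'long'} — the empty 'address' record is pruned rather than read
```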
Impact
Fixes a bug in the FileGroupReader path for Spark reads.
Risk Level
Low
Documentation Update
Contributor's checklist