Contributor

@the-other-tim-brown the-other-tim-brown commented Oct 10, 2025

Describe the issue this Pull Request addresses

Currently, when Spark reads a Parquet file whose schema differs from the requested schema, the read can fail at runtime. The failure occurs specifically when a column contains a record and the query selects a subset of that record's fields. If the record's schema in that particular data file does not have any of the requested fields, the read fails with an error.
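To make the failing condition concrete, here is a minimal sketch that models schemas as nested Python dicts (field name mapped to a type string, or to another dict for a record). The helper name `structs_without_overlap` and the dict representation are assumptions made for illustration only; the actual reader works on Spark and Parquet schema types, not on dicts.

```python
from typing import Any, Dict, List


def structs_without_overlap(
    requested: Dict[str, Any], file_schema: Dict[str, Any], prefix: str = ""
) -> List[str]:
    """Return dotted paths of requested records whose selected sub-fields
    are all missing from the matching record in the file schema."""
    problems: List[str] = []
    for name, req_type in requested.items():
        file_type = file_schema.get(name)
        # Only records present on both sides can hit the described case.
        if not isinstance(req_type, dict) or not isinstance(file_type, dict):
            continue
        path = prefix + name
        if not set(req_type) & set(file_type):
            # None of the requested sub-fields exist in this data file.
            problems.append(path)
        else:
            problems.extend(
                structs_without_overlap(req_type, file_type, path + ".")
            )
    return problems


# Hypothetical schemas: the query wants address.zip, but this file's
# address record only has street and city.
requested = {"id": "long", "address": {"zip": "string"}}
file_schema = {"id": "long", "address": {"street": "string", "city": "string"}}
print(structs_without_overlap(requested, file_schema))
```

In this example the `address` column is the one that would trigger the problem, because the file's `address` record shares no sub-field with the selection.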

Summary and Changelog

The Spark Parquet reader is updated to remove a field from the selected fields if selecting it would result in a record with no sub-fields.
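The pruning idea can be sketched as follows. This is not the code from this PR: it assumes schemas are modelled as nested Python dicts (field name mapped to a type string, or to another dict for a record), and `prune_empty_structs` is a hypothetical helper name. What the reader does with a dropped column afterwards is outside the scope of the sketch.

```python
from typing import Any, Dict


def prune_empty_structs(
    requested: Dict[str, Any], file_schema: Dict[str, Any]
) -> Dict[str, Any]:
    """Drop requested records that would have no sub-fields left
    once matched against the schema of a specific data file."""
    pruned: Dict[str, Any] = {}
    for name, req_type in requested.items():
        file_type = file_schema.get(name)
        if isinstance(req_type, dict) and isinstance(file_type, dict):
            # Keep only the sub-fields this file actually contains,
            # then recurse so nested records are pruned the same way.
            present = {k: v for k, v in req_type.items() if k in file_type}
            nested = prune_empty_structs(present, file_type)
            if nested:
                pruned[name] = nested
            # Otherwise the record would be empty, so it is left out.
        else:
            # Non-record fields are passed through unchanged.
            pruned[name] = req_type
    return pruned


# Hypothetical schemas for illustration.
requested = {
    "id": "long",
    "address": {"zip": "string"},
    "name": {"first": "string", "middle": "string"},
}
file_schema = {
    "id": "long",
    "address": {"street": "string"},
    "name": {"first": "string", "last": "string"},
}
print(prune_empty_structs(requested, file_schema))
```

Here `address` is removed because none of its requested sub-fields exist in the file, while `name` is kept with only the `first` sub-field.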

This is validated by enabling the FileGroupReader path in one of the existing tests.

Impact

Fixes a bug in the FileGroupReader path for Spark reads.

Risk Level

Low

Documentation Update

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Oct 10, 2025
@hudi-bot

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@the-other-tim-brown the-other-tim-brown changed the title from "Spark Schema on read fix" to "bug: Spark Schema on read fix" Oct 13, 2025
@the-other-tim-brown the-other-tim-brown changed the title from "bug: Spark Schema on read fix" to "bug: Spark Schema Evolution Fix for nested columns" Oct 13, 2025
@the-other-tim-brown the-other-tim-brown marked this pull request as ready for review October 13, 2025 22:54
@the-other-tim-brown the-other-tim-brown changed the title from "bug: Spark Schema Evolution Fix for nested columns" to "fix: Spark Schema Evolution Fix for nested columns" Oct 13, 2025
