@yzeng1618
Contributor

Purpose of this pull request

Fix Hive Source initialization failures when the Hive table (or the selected partitions) contains no readable data files, e.g. a newly created empty table, empty partitions, or filters that match no files.

Before this PR:

  • TEXT tables may fail during source initialization with:
    • java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    • caused by file read strategy referencing fileNames.get(0) while fileNames is empty.
  • ORC/PARQUET tables may build an empty (0-column) schema when no files exist, which can later fail Hive Sink initialization with:
    • java.lang.IllegalArgumentException
    • from FileSinkConfig precondition check (empty rowType).
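
The first failure mode can be reproduced in isolation. The following is a minimal sketch, not the actual SeaTunnel code: the class `FirstFileGuard` and its methods are hypothetical names introduced only to show why an unguarded `fileNames.get(0)` fails on an empty list and what a guarded lookup might look like.

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of SeaTunnel.
final class FirstFileGuard {
    private FirstFileGuard() {}

    /** Unguarded access: throws IndexOutOfBoundsException when fileNames is empty. */
    static String firstFileUnsafe(List<String> fileNames) {
        return fileNames.get(0);
    }

    /** Guarded access: returns Optional.empty() when there is no file to sample. */
    static Optional<String> firstFile(List<String> fileNames) {
        return fileNames.isEmpty() ? Optional.empty() : Optional.of(fileNames.get(0));
    }

    public static void main(String[] args) {
        List<String> fileNames = Collections.emptyList();
        try {
            firstFileUnsafe(fileNames);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unguarded access failed: " + e.getClass().getSimpleName());
        }
        System.out.println("guarded access present: " + firstFile(fileNames).isPresent());
    }
}
```

With the guarded form, the caller can decide what to do when no file is available instead of failing during initialization.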

Does this PR introduce any user-facing change?

Yes.

This PR adds a Hive Metastore (HMS) schema fallback when the table/partition location contains no data files:

  • Build a stable non-empty schema from HMS (table columns + optional partition columns), so the job can start and process 0 rows gracefully.
  • Keep the partition key metadata in CatalogTable.
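
The fallback described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `HmsSchemaFallback`, `Column`, `fallbackSchema`, and `partitionKeys` are hypothetical names, and the column lists stand in for whatever the metastore client returns (in the Hive Metastore API these would typically come from `Table.getSd().getCols()` and `Table.getPartitionKeys()`).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch for illustration only; not part of SeaTunnel.
final class HmsSchemaFallback {
    private HmsSchemaFallback() {}

    /** Simplified stand-in for a metastore column: a name and a Hive type string. */
    static final class Column {
        final String name;
        final String type;

        Column(String name, String type) {
            this.name = name;
            this.type = type;
        }
    }

    /** Builds an ordered name -> type schema from metastore metadata alone. */
    static Map<String, String> fallbackSchema(
            List<Column> tableColumns,
            List<Column> partitionColumns,
            boolean includePartitionColumns) {
        Map<String, String> schema = new LinkedHashMap<>();
        for (Column c : tableColumns) {
            schema.put(c.name, c.type);
        }
        if (includePartitionColumns) {
            for (Column c : partitionColumns) {
                // Do not overwrite a data column that happens to share the name.
                schema.putIfAbsent(c.name, c.type);
            }
        }
        if (schema.isEmpty()) {
            // Fail early rather than hand a 0-column schema to a downstream sink.
            throw new IllegalStateException("metastore returned no columns");
        }
        return schema;
    }

    /** Partition key names are kept separately, whether or not they are in the schema. */
    static List<String> partitionKeys(List<Column> partitionColumns) {
        List<String> keys = new ArrayList<>();
        for (Column c : partitionColumns) {
            keys.add(c.name);
        }
        return keys;
    }

    public static void main(String[] args) {
        List<Column> cols = Arrays.asList(new Column("id", "bigint"), new Column("name", "string"));
        List<Column> parts = Arrays.asList(new Column("dt", "string"));
        System.out.println(fallbackSchema(cols, parts, true));
        System.out.println(partitionKeys(parts));
    }
}
```

Keeping insertion order (here via `LinkedHashMap`) is one way to make the schema stable across runs; the real implementation may achieve this differently.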

How was this patch tested?

  1. Added unit test:
  • HiveSourceConfigEmptyFilesTest
  2. Added e2e coverage:
  • Empty TEXT table read -> Assert (0 rows)
  • Empty PARQUET table, Hive -> Hive job initialization
