zephyr: widen inferred parquet schema via pa.unify_schemas#5142

Merged
ravwojdyla merged 1 commit into main from rav-zephyr-unify-schemas
Apr 23, 2026

@ravwojdyla-agent
Contributor

  • the writers' _accumulate_tables infers its schema from the first micro-batch (_MICRO_BATCH_SIZE=8 records). If those records all have None for an optional field, the field is pinned to pa.null(), and later records with real values crash with ArrowInvalid: Invalid null value
    • real-world case: the nested field metadata.gha_language in common-pile/stackv2 (959 null / 1041 str across ~2000 records) was failing deterministically
  • separately, pa.Table.from_pylist silently drops top-level keys missing from the pinned schema, so any new column appearing in a later batch was truncated without a signal [1]
  • on mismatch, widen the inferred schema via pa.unify_schemas and rebuild the batch against the widened schema; reconcile earlier chunks on yield via concat_tables(promote_options="permissive")
  • genuine type conflicts (e.g. int vs string for the same field) still raise, with both schemas and the inference origin shown, so operators can diagnose without extra instrumentation
  • explicit caller-provided schemas are a contract: mismatches raise without silent widening

Test plan

  • test_write_parquet_file_widens_null_to_concrete_type — null→string widening succeeds and lands the widened schema on disk
  • test_write_parquet_file_captures_fields_appearing_in_later_batches — new field survives to disk instead of being silently dropped
  • test_write_parquet_file_raises_on_incompatible_type_conflict — int vs string still surfaces as a clear error

🤖 Generated with Claude Code

Footnotes

  1. this silent-drop behavior was a latent data-loss bug; the new extra-keys detection catches it and routes through the same widen path.

@ravwojdyla-agent ravwojdyla-agent added the agent-generated Created by automation/agent label Apr 23, 2026
@ravwojdyla ravwojdyla force-pushed the rav-zephyr-unify-schemas branch 2 times, most recently from d59176a to 82bb2fa Compare April 23, 2026 23:03
``_accumulate_tables`` infers its schema from the first micro-batch
(``_MICRO_BATCH_SIZE=8``). If those first records happen to have ``None``
for a field — or to lack a field that appears later — downstream batches
that would legitimately widen the schema either crashed with
``ArrowInvalid: Invalid null value`` or (in the new-field case) were
silently truncated by ``pa.Table.from_pylist``.

Extract the schema-mismatch raise into ``_raise_schema_mismatch`` and
add a ``_build_table`` helper that unify-widens the inferred schema on
mismatch and reconciles chunks on yield via
``concat_tables(promote_options="permissive")``. Genuine incompatibilities
(e.g. int vs string) still raise with both schemas + inference origin
shown, so operators can diagnose without extra instrumentation.

An explicit caller-provided schema is treated as a contract: mismatches
raise without silent widening.

Tests cover: null→concrete widening, new-field-appears-later (previously
silently dropped), and int-vs-string conflict surfacing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ravwojdyla ravwojdyla force-pushed the rav-zephyr-unify-schemas branch from 82bb2fa to 6094008 Compare April 23, 2026 23:08
@ravwojdyla ravwojdyla requested review from rjpower and yonromai April 23, 2026 23:09
@ravwojdyla ravwojdyla merged commit 0497c64 into main Apr 23, 2026
39 checks passed
@ravwojdyla ravwojdyla deleted the rav-zephyr-unify-schemas branch April 23, 2026 23:17
3 participants