zephyr: widen inferred parquet schema via pa.unify_schemas by ravwojdyla-agent · Pull Request #5142 · marin-community/marin

ravwojdyla-agent · 2026-04-23T22:49:53Z

- writers' _accumulate_tables infers schema from the first _MICRO_BATCH_SIZE=8 records, so if those records have None for an optional field, the field gets pinned to pa.null() and later records with real values crash with ArrowInvalid: Invalid null value
  - real-world case: common-pile/stackv2's nested metadata.gha_language (959 null / 1041 str across ~2000 records) was deterministically failing
- separately, pa.Table.from_pylist silently drops top-level keys missing from the pinned schema, so any new column appearing in a later batch was being truncated without a signal ¹
- on mismatch, unify via pa.unify_schemas and rebuild the batch against the widened schema; reconcile prior chunks on yield via concat_tables(promote_options="permissive")
- genuine type conflicts (e.g. int vs string for the same field) still raise with both schemas + inference origin shown, so operators can diagnose without extra instrumentation
- explicit caller-provided schemas are a contract: mismatches raise without silent widening

Test plan

test_write_parquet_file_widens_null_to_concrete_type — null→string widening succeeds and lands the widened schema on disk
test_write_parquet_file_captures_fields_appearing_in_later_batches — new field survives to disk instead of being silently dropped
test_write_parquet_file_raises_on_incompatible_type_conflict — int vs string still surfaces as a clear error

🤖 Generated with Claude Code

¹ this silent-drop behavior was a latent data-loss bug; the new extra-keys detection catches it and routes through the same widen path.

``_accumulate_tables`` infers its schema from the first micro-batch (``_MICRO_BATCH_SIZE=8``). If those first records happen to have ``None`` for a field, or to lack a field that appears later, downstream batches that would legitimately widen the schema either crashed with ``ArrowInvalid: Invalid null value`` or (in the new-field case) were silently truncated by ``pa.Table.from_pylist``.

Extract the schema-mismatch raise into ``_raise_schema_mismatch`` and add a ``_build_table`` helper that unify-widens the inferred schema on mismatch and reconciles chunks on yield via ``concat_tables(promote_options="permissive")``.

Genuine incompatibilities (e.g. int vs string) still raise with both schemas + inference origin shown, so operators can diagnose without extra instrumentation. An explicit caller-provided schema is treated as a contract: mismatches raise without silent widening.

Tests cover: null→concrete widening, new-field-appears-later (previously silently dropped), and int-vs-string conflict surfacing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ravwojdyla-agent added the agent-generated Created by automation/agent label Apr 23, 2026

ravwojdyla force-pushed the rav-zephyr-unify-schemas branch 2 times, most recently from d59176a to 82bb2fa Compare April 23, 2026 23:03

ravwojdyla force-pushed the rav-zephyr-unify-schemas branch from 82bb2fa to 6094008 Compare April 23, 2026 23:08

ravwojdyla requested review from rjpower and yonromai April 23, 2026 23:09

rjpower approved these changes Apr 23, 2026


ravwojdyla merged commit 0497c64 into main Apr 23, 2026
39 checks passed

ravwojdyla deleted the rav-zephyr-unify-schemas branch April 23, 2026 23:17

ravwojdyla merged 1 commit into main from rav-zephyr-unify-schemas
