Commit 0497c64
zephyr: widen inferred parquet schema via pa.unify_schemas (#5142)
* writers' ``_accumulate_tables`` infers schema from the first
``_MICRO_BATCH_SIZE=8`` records — so if those records have ``None`` for
an optional field, the field gets pinned to ``pa.null()`` and later
records with real values crash with ``ArrowInvalid: Invalid null value``
* real-world case: ``common-pile/stackv2``'s nested
``metadata.gha_language`` (959 null / 1041 str across ~2000 records) was
deterministically failing
* separately, ``pa.Table.from_pylist`` **silently drops** top-level keys
missing from the pinned schema — any new column appearing in a later
batch was being truncated without a signal [^1]
* on mismatch, unify via ``pa.unify_schemas`` and rebuild the batch
against the widened schema; reconcile prior chunks on yield via
``concat_tables(promote_options="permissive")``
* genuine type conflicts (e.g. ``int`` vs ``string`` for the same field)
still raise with both schemas + inference origin shown, so operators can
diagnose without extra instrumentation
* explicit caller-provided schemas are a contract — mismatches raise
without silent widening
## Test plan
- [x] `test_write_parquet_file_widens_null_to_concrete_type` —
null→string widening succeeds and lands the widened schema on disk
- [x] `test_write_parquet_file_captures_fields_appearing_in_later_batches` —
new field survives to disk instead of being silently dropped
- [x] `test_write_parquet_file_raises_on_incompatible_type_conflict` —
int vs string still surfaces as a clear error
[^1]: this silent-drop behavior was a latent data-loss bug; the new
extra-keys detection catches it and routes through the same widen path.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Rafal Wojdyla <ravwojdyla@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

1 parent 91972ca commit 0497c64
2 files changed
Lines changed: 110 additions & 27 deletions