fix: parallel arrow with differing schema #392
caseyclements merged 5 commits into mongodb-labs:main from
Conversation
Pull request overview
This PR fixes two interrelated bugs that occur when using PyMongoArrow's schema inference with multi-batch data and the parallelism option (introduced in #384):

- `lib.pyx` fix: When the inferred type for a field needs to be promoted from int32 to int64 mid-stream (from #383), the old code was silently discarding all previously-appended values by creating a fresh builder. The fix correctly migrates existing values to the new `Int64Builder`.
- `api.py` fix: When using parallel batch processing (`"threads"` or `"processes"`), different workers could infer different schemas (e.g., one batch has int32, another has int64), causing `pa.concat_tables` with `promote_options="default"` to raise `ArrowTypeError`. Changing to `promote_options="permissive"` resolves this.
Changes:
- Fix value-loss bug in `BuilderManager.parse_document` when promoting the int32 → int64 builder during schema inference
- Fix schema-incompatibility error in parallel paths by using `promote_options="permissive"` in `pa.concat_tables`
- Add integration test covering the multi-batch, differing-schema scenario for all three `parallelism` modes
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `bindings/python/pymongoarrow/lib.pyx` | Preserves previously-appended int32 values when promoting to `Int64Builder` during schema inference |
| `bindings/python/pymongoarrow/api.py` | Uses `promote_options="permissive"` so parallel workers with differing int32/int64 schemas can be concatenated |
| `bindings/python/test/test_arrow.py` | Adds `test_find_multiple_batches_of_different_schema` covering the edge case for all parallelism modes |
caseyclements
left a comment
Hi @CoderJoshDK
Thanks for identifying and fixing this bug! In order to merge this, we simply need to fix the tests.
The problem you encountered: when you tried to iterate directly over a PyArrow array with `for val in old_array:`, you got `TypeError: an integer is required` because:
1. Iterating over a PyArrow array yields `Int64Scalar` objects, not Python integers
2. The C++ `Append()` method expects a C `int64_t` type, not a Python object
3. Cython can't automatically convert `Int64Scalar` to `int64_t`
The fix: use `.to_numpy()` to get a numpy array, which Cython can efficiently convert to C types:

```cython
values_np = old_array.to_numpy(zero_copy_only=False)
for i in range(n_vals):
    (builder).builder.get().Append(values_np[i])
```
This approach:
• ✅ Avoids the BSON encode/decode overhead from the original PR
• ✅ Uses efficient numpy → C conversion
• ✅ Is zero-copy when there are no nulls
• ✅ Adds only ~34-62% overhead compared to no promotion at all
I've created INTPYTHON-933 to track this. I've also got a proposed fix, but I have to run for the day!
Hey, thanks for getting on this quickly! I am a bit confused by your comment. For now, I have a simple patch where I cast val with […]. The main thing that I can see being more efficient is skipping […]. Either way, I look forward to your review and insight!
Ok, this lint failure was related to #393, hahaha, but glad to see everything else pass. Sorry for the sloppiness of some of these details!
caseyclements
left a comment
This looks great now, @CoderJoshDK. The change you made sidesteps the N BSON round-trips of the original submission, so we're all set!
Hi @CoderJoshDK. I've just released this as part of v1.13.0. Thanks again for your contribution!
Changes in this PR
#384 introduced parallel batch processing for PyMongoArrow, and #383 handles the case where the inferred schema should be promoted to Int64 (if previously Int32). However, they did not work together on the edge case where not all batches were promoted.
While I was creating this PR, I found a bug where even the no-parallelism path was discarding existing values. This was because the builder was being filled with values and then a new one was constructed, replacing all of the values appended so far.
This closes #390
Test Plan
I tested this in two ways: I added the unit tests, and I ran the fix against a real cluster I was having issues with.
Checklist
Checklist for Author
I did not update the changelog because this fix still fits under the new changes already listed there.
Checklist for Reviewer