
fix: parallel arrow with differing schema#392

Merged
caseyclements merged 5 commits into mongodb-labs:main from CoderJoshDK:parallel-schema
Mar 12, 2026

Conversation

Contributor

@CoderJoshDK CoderJoshDK commented Mar 9, 2026

Changes in this PR

#384 introduces parallel batch processing for PyMongoArrow, and #383 handles a case where the inferred schema should be promoted to Int64 (if previously Int32). However, they did not work together in the edge case where not all batches were promoted.

While creating this PR, I also found a bug where the non-parallel path was discarding existing values:

Error

>       self.assertTrue(
            table_off.equals(table_proc),
            msg=f"tables differ:\n{table_off}\n\n{table_proc}",
        )
E       AssertionError: False is not true : tables differ:
E       pyarrow.Table
E       _id: int32
E       value: int64
E       ----
E       _id: [[0,1,2,3,4,...,46,47,48,49,50]]
E       value: [[null,null,null,null,null,...,null,null,null,null,1099511627776]]
E
E       pyarrow.Table
E       _id: int32
E       value: int64
E       ----
E       _id: [[0,1,2,3,4,5,6,7,8,9],[10,11,12,13,14,15,16,17,18,19],...,[40,41,42,43,44,45,46,47,48,49],[50]]
E       value: [[0,1,2,3,4,5,6,7,8,9],[10,11,12,13,14,15,16,17,18,19],...,[40,41,42,43,44,45,46,47,48,49],[1099511627776]]

test/test_arrow.py:1467: AssertionError

This happened because the builder was filled with values and then a new builder was constructed in its place, discarding everything appended so far.

This closes #390

Test Plan

I tested this in two ways. I added the unit tests and I ran this against a real cluster I was having issues with.

Checklist

Checklist for Author

  • Did you update the changelog (if necessary)?
I did not update the changelog because this fix is covered by the entries for the new changes it patches
  • Is there test coverage?
  • Is any followup work tracked in a JIRA ticket? If so, add link(s).

Checklist for Reviewer

  • Does the title of the PR reference a JIRA Ticket?
  • Do you fully understand the implementation? (Would you be comfortable explaining how this code works to someone else?)
  • Is all relevant documentation (README or docstring) updated?

@CoderJoshDK CoderJoshDK requested a review from a team as a code owner March 9, 2026 00:19
@CoderJoshDK CoderJoshDK requested review from caseyclements and removed request for a team March 9, 2026 00:19
@caseyclements caseyclements requested a review from Copilot March 9, 2026 21:28

Copilot AI left a comment


Pull request overview

This PR fixes two interrelated bugs that occur when using PyMongoArrow's schema inference with multi-batch data and the parallelism option (introduced in #384):

  1. lib.pyx fix: When the inferred type for a field needs to be promoted from int32 to int64 mid-stream (from #383), the old code was silently discarding all previously-appended values by creating a fresh builder. The fix correctly migrates existing values to the new Int64Builder.
  2. api.py fix: When using parallel batch processing ("threads" or "processes"), different workers could infer different schemas (e.g., one batch has int32, another has int64), causing pa.concat_tables with promote_options="default" to raise ArrowTypeError. Changing to promote_options="permissive" resolves this.

Changes:

  • Fix value-loss bug in BuilderManager.parse_document when promoting the int32 → int64 builder during schema inference
  • Fix schema-incompatibility error in parallel paths by using promote_options="permissive" in pa.concat_tables
  • Add integration test covering multi-batch, differing-schema scenario for all three parallelism modes

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File | Description
bindings/python/pymongoarrow/lib.pyx | Preserves previously-appended int32 values when promoting to Int64Builder during schema inference
bindings/python/pymongoarrow/api.py | Uses promote_options="permissive" so parallel workers with differing int32/int64 schemas can be concatenated
bindings/python/test/test_arrow.py | Adds test_find_multiple_batches_of_different_schema covering the edge case for all parallelism modes


Comment thread: bindings/python/pymongoarrow/lib.pyx (outdated)
Contributor

@caseyclements caseyclements left a comment


Hi @CoderJoshDK

Thanks for identifying and fixing this bug! In order to merge this, we simply need to fix the tests.

The problem you encountered: when you tried to iterate directly over the PyArrow array with for val in old_array:, you got TypeError: an integer is required because:

 1. Iterating over a PyArrow array yields Int64Scalar objects, not Python integers
 2. The C++ Append() method expects a C int64_t type, not a Python object
 3. Cython can't automatically convert Int64Scalar to int64_t

The fix: Use .to_numpy() to get a numpy array, which Cython can efficiently convert to C types:

values_np = old_array.to_numpy(zero_copy_only=False)
for i in range(n_vals):
    (builder).builder.get().Append(values_np[i])

This approach:
• ✅ Avoids the BSON encode/decode overhead from the original PR
• ✅ Uses efficient numpy → C conversion
• ✅ Is zero-copy when there are no nulls
• ✅ Adds only ~34-62% overhead compared to no promotion at all

@caseyclements
Contributor

I've created INTPYTHON-933 to track this. I've also got a proposed fix, but I have to run for the day!

@CoderJoshDK
Contributor Author

Hey, thanks for getting on this quickly! I am a bit confused by your comment. For now, I have a simple patch that casts val with to_py().
Casting to numpy still seems to require finishing the builder and then checking for a null (NaN?) value.
To make sure the overflow logic also handles None correctly, I added a new row to the test(s).

The main thing I can see being more efficient is skipping builder.finish(), but I am unsure how to do that.

Either way, I look forward to your review and insight!

@CoderJoshDK
Contributor Author

Ok, this lint failure was related to #393, haha, but glad to see everything else pass.
(My local version of ruff is newer than this branch's, and I forgot to ignore the other auto-format changes.)

Sorry for the sloppiness of some of these details!

Contributor

@caseyclements caseyclements left a comment


This looks great now, @CoderJoshDK. The change you made sidesteps the N BSON round-trips of the original submission, so we're all set!

@caseyclements caseyclements merged commit 3187a56 into mongodb-labs:main Mar 12, 2026
52 checks passed
@caseyclements
Contributor

Hi @CoderJoshDK . I've just released this as part of v1.13.0. Thanks again for your contribution!



Development

Successfully merging this pull request may close these issues.

New schema inference doesn't work with parallelism
