
fix: parallel arrow with differing schema#392

Merged
caseyclements merged 5 commits into mongodb-labs:main from CoderJoshDK:parallel-schema
Mar 12, 2026

Conversation

Contributor

@CoderJoshDK CoderJoshDK commented Mar 9, 2026

Changes in this PR

#384 introduces parallel batch processing for PyMongoArrow, and #383 handles a case where the inferred schema should be promoted to Int64 (if previously Int32). However, they did not work together in the edge case where not all batches were promoted.

While creating this PR, I also found a bug where the non-parallel path was discarding existing values:

Error

>       self.assertTrue(
            table_off.equals(table_proc),
            msg=f"tables differ:\n{table_off}\n\n{table_proc}",
        )
E       AssertionError: False is not true : tables differ:
E       pyarrow.Table
E       _id: int32
E       value: int64
E       ----
E       _id: [[0,1,2,3,4,...,46,47,48,49,50]]
E       value: [[null,null,null,null,null,...,null,null,null,null,1099511627776]]
E
E       pyarrow.Table
E       _id: int32
E       value: int64
E       ----
E       _id: [[0,1,2,3,4,5,6,7,8,9],[10,11,12,13,14,15,16,17,18,19],...,[40,41,42,43,44,45,46,47,48,49],[50]]
E       value: [[0,1,2,3,4,5,6,7,8,9],[10,11,12,13,14,15,16,17,18,19],...,[40,41,42,43,44,45,46,47,48,49],[1099511627776]]

test/test_arrow.py:1467: AssertionError

This happened because the builder was filled with values and then a new builder was constructed in its place, discarding everything appended so far.

This closes #390

Test Plan

I tested this in two ways. I added the unit tests and I ran this against a real cluster I was having issues with.

Checklist

Checklist for Author

  • Did you update the changelog (if necessary)?
I did not update the changelog because this fix is covered by the entries for the new changes it patches
  • Is there test coverage?
  • Is any followup work tracked in a JIRA ticket? If so, add link(s).

Checklist for Reviewer

  • Does the title of the PR reference a JIRA Ticket?
  • Do you fully understand the implementation? (Would you be comfortable explaining how this code works to someone else?)
  • Is all relevant documentation (README or docstring) updated?

@CoderJoshDK CoderJoshDK requested a review from a team as a code owner March 9, 2026 00:19
@CoderJoshDK CoderJoshDK requested review from caseyclements and removed request for a team March 9, 2026 00:19
@caseyclements caseyclements requested a review from Copilot March 9, 2026 21:28

Copilot AI left a comment


Pull request overview

This PR fixes two interrelated bugs that occur when using PyMongoArrow's schema inference with multi-batch data and the parallelism option (introduced in #384):

  1. lib.pyx fix: When the inferred type for a field needs to be promoted from int32 to int64 mid-stream (from #383), the old code was silently discarding all previously-appended values by creating a fresh builder. The fix correctly migrates existing values to the new Int64Builder.
  2. api.py fix: When using parallel batch processing ("threads" or "processes"), different workers could infer different schemas (e.g., one batch has int32, another has int64), causing pa.concat_tables with promote_options="default" to raise ArrowTypeError. Changing to promote_options="permissive" resolves this.

Changes:

  • Fix value-loss bug in BuilderManager.parse_document when promoting the int32 → int64 builder during schema inference
  • Fix schema-incompatibility error in parallel paths by using promote_options="permissive" in pa.concat_tables
  • Add integration test covering multi-batch, differing-schema scenario for all three parallelism modes

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File | Description
bindings/python/pymongoarrow/lib.pyx | Preserves previously-appended int32 values when promoting to Int64Builder during schema inference
bindings/python/pymongoarrow/api.py | Uses promote_options="permissive" so parallel workers with differing int32/int64 schemas can be concatenated
bindings/python/test/test_arrow.py | Adds test_find_multiple_batches_of_different_schema covering the edge case for all parallelism modes


Comment thread: bindings/python/pymongoarrow/lib.pyx (outdated)
Contributor

@caseyclements caseyclements left a comment


Hi @CoderJoshDK

Thanks for identifying and fixing this bug! In order to merge this, we simply need to fix the tests.

The problem you encountered: when you tried to iterate directly over the PyArrow array with for val in old_array:, you got TypeError: an integer is required because:

 1. Iterating over a PyArrow array yields Int64Scalar objects, not Python integers
 2. The C++ Append() method expects a C int64_t type, not a Python object
 3. Cython can't automatically convert Int64Scalar to int64_t

The fix: Use .to_numpy() to get a numpy array, which Cython can efficiently convert to C types:

values_np = old_array.to_numpy(zero_copy_only=False)
for i in range(n_vals):
    (builder).builder.get().Append(values_np[i])

This approach:
• ✅ Avoids the BSON encode/decode overhead from the original PR
• ✅ Uses efficient numpy → C conversion
• ✅ Is zero-copy when there are no nulls
• ✅ Adds only ~34-62% overhead compared to no promotion at all

@caseyclements
Contributor

I've created INTPYTHON-933 to track this. I've also got a proposed fix, but I have to run for the day!

@CoderJoshDK
Contributor Author

Hey, thanks for getting on this quickly! I am a bit confused by your comment. For now, I have a simple patch that casts val with to_py().
Casting to numpy still seems to require finishing the builder and then checking for a null (NaN?) value.
To make sure the overflow logic also handles None correctly, I added a new row to the test(s).

The main thing I can see being more efficient is skipping builder.finish(), but I am unsure how to do that.

Either way, I look forward to your review and insight!

@CoderJoshDK
Contributor Author

Ok, this lint failure was related to #393, haha, but glad to see everything else pass.
(My local version of ruff is newer than this branch's, and I forgot to ignore the other auto-format changes.)

Sorry for the sloppiness of some of these details!

Contributor

@caseyclements caseyclements left a comment


This looks great now, @CoderJoshDK. The change you made sidesteps the N BSON round-trips of the original submission, so we're all set!

@caseyclements caseyclements merged commit 3187a56 into mongodb-labs:main Mar 12, 2026
52 checks passed
@caseyclements
Contributor

Hi @CoderJoshDK . I've just released this as part of v1.13.0. Thanks again for your contribution!



Development

Successfully merging this pull request may close these issues.

New schema inference doesn't work with parallelism
