Numba adapter for PySpark, possible implementation routes via PyArrow and / or Awkward Array #3760

Goykhman · 2025-12-03T20:01:01Z

Goykhman
Dec 3, 2025

Dear Awkward Array Developers,

Let's say we have a Python application that executes a calculation over large distributed datasets. Let's say that the application is orchestrated and parallelized by PySpark.

Each partition of the distributed dataset(s) will be given as an argument to the 'main' function (entry point) of the application.

This is at the high level. So we need to call the 'main' function - map it over / apply it to each partition - and then combine the results.

The results will be distributed too, and possibly written into a distributed storage system (by PySpark).

Moving data from / to Spark space (JVM process) into Python space (the interpreter process) is a bit of a bottleneck (relatively speaking, of course, but let's hold on to this point for the sake of the argument).

Namely, let's say our application's 'main' function has all computationally heavy bits JIT-compiled using numba. To that end, the calculator of our application is a JIT'd function taking contiguous data arrays as its parameters (such as, numpy or awkward arrays), and it returns low-level arrays of data too.

In light of that, we probably would rather have the data move from JVM to Python Interpreter in a serialized memory-efficient way.

This way the array(s) of data that we give as argument(s) to our JIT'd calculator come in as-is (up to being wrapped and pointed at by a Python object proxy, like a numpy/awkward array), without having to be assembled entry-by-entry in a Python Interpreter for-loop by iterating over the individual entries' Python objects.

PySpark furnishes mapInArrow that appears to be a good access point to the DataFrame's partitions, giving users a handle to the low-level contiguous memory arrays of the data (and an assortment of metadata, such a bitmaps of nulls, or offsets for non-uniformly sized arrays).

If we take this route, the only question is how to bridge the remaining gap between the PyArrow arrays and the numba typing context.

I have started on this path in the numbarrow repo, link above to the demo.

I am particularly interested in the cases when some data is NULL in the original DataFrame, which constraints our ability to apply 'to_numpy' methods of both arrow and awkward without the penalty of the copy. I was thinking that taking the view over the array of data as-is and retaining the bitmap to identify (in)valid array entries might be a way to allow for the missing data while avoiding copies.

To summarize, I see two goals here, one, to avoid the copies (especially the in-interpreter copies through de-serialized Python objects), two, to maintain consistency of nopython mode both in the I/O as well as in the computation stages.

Looking forward to any feedback! Would be happy to collaborate on this, it would be great to shape this into a utility.

Thank you.

Mikhail Goykhman

ianna · 2025-12-14T09:47:07Z

ianna
Dec 14, 2025
Maintainer

@Goykhman - Thank you again for sharing your detailed outline. Awkward Array does implement an array view, with each array numba type generating its own view that is hashed and then lowered to Numba. This mechanism is designed to avoid unnecessary copies while enabling nopython‑mode compilation paths.

In addition, we provide an Awkward extension to PyArrow, which allows Arrow buffers and metadata (including null bitmaps and offsets) to be represented directly in Awkward’s layout classes. This extension is the natural bridge between Arrow and Awkward, and it should help in your scenario of passing contiguous memory arrays into JIT‑compiled functions without incurring Python‑level object overhead.

The next step is to look into what is needed to make the view work for your case—particularly ensuring that Arrow’s null handling integrates cleanly with Awkward’s extension and that the lowered views remain compatible with Numba. That way, you can avoid copies while retaining validity information and maintain consistency in nopython.

We’d be glad to collaborate on exploring this integration and shaping it into a utility.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Numba adapter for PySpark, possible implementation routes via PyArrow and / or Awkward Array #3760

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Numba adapter for PySpark, possible implementation routes via PyArrow and / or Awkward Array #3760

Uh oh!

Goykhman Dec 3, 2025

Replies: 1 comment

Uh oh!

ianna Dec 14, 2025 Maintainer

Goykhman
Dec 3, 2025

ianna
Dec 14, 2025
Maintainer