polars-numba: Write Numba functions for Polars

While Polars does have support for normal Numba functions, these are not ideal: Polars native data type is Arrow, but Numba by default works on NumPy arrays. As a result, using Numba:

Requires converting Arrow columns to NumPy arrays, and back, which can increase memory usage.
Does not support missing data correctly, since that concept doesn't exist in NumPy. This is especially a problem for operations that aren't just one-value-at-a-time, for a example a max() or windowed function can give the wrong results.
Supports fewer data types, compared to Arrow's more complex support.

This package aims to fix that, by integrating Numba support for Arrow with Polars, building on the Numba support from the Awkward Array project.

Memory usage

Awkward Array's memory representation is compatible with Arrow. When doing read-only operations, then, it requires zero additional memory (zero-copy) because it can just use the Arrow data structure directly, with no copying.

When creating a new Series, however, peak memory usage may be twice as high as the resulting Series, perhaps because of the temporary ArrayBuilder object required by the current Awkward Array API. This ought to be fixable with a new API.

Current status: proof-of-concept

Currently this is just enough code for users to try it out and provide feedback. Given sufficient positive feedback we will turn this into a real package on PyPI, do supporting work in other packages if necessary (e.g. improvements to Awkward Array's Numba code), etc.

How to install:

$ pip install git+https://github.com/g-research/polars-numba.git

Or add package git+https://github.com/g-research/polars-numba.git as a dependency to your requirements.txt/pyproject.toml/etc..

Using `polars-numba`

Again, this is just a proof-of-concept, but you can see usage examples:

examples.py shows the basic API.
examples2.py has examples with datetimes, lists, and structs.

To see API documentation on how to create complex results, see the API documentation for ArrayBuilder.

Chunking vs whole series

In most cases, the function called with @arrow_jit will be called with the full Series. The only exception is when you explicitly opted-in to receiving partial batches of the Series, by using @arrow_jit(is_elementwise=True, returns_scalar=False); both options are necessary. (The is_elementwise parameter is passed to map_batches().)

See example_chunking.py and look at its output to get a sense of what happens.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.github/workflows		.github/workflows
src/polars_numba		src/polars_numba
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
example_chunking.py		example_chunking.py
example_parallelism.py		example_parallelism.py
examples.py		examples.py
examples2.py		examples2.py
examples_memory_usage.py		examples_memory_usage.py
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

polars-numba: Write Numba functions for Polars

Memory usage

Current status: proof-of-concept

Using `polars-numba`

Chunking vs whole series

About

Releases

Packages

Contributors 2

Languages

License

G-Research/polars-numba

Folders and files

Latest commit

History

Repository files navigation

polars-numba: Write Numba functions for Polars

Memory usage

Current status: proof-of-concept

Using polars-numba

Chunking vs whole series

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Using `polars-numba`

Packages