Arrow Support #8329

@wiredfool

Following on to some of the discussion in #1888, specifically here: #1888 (comment)

Rationale

Arrow is the emerging memory layout for zero-copy sharing of data in the new data ecosystem. It is an uncompressed columnar format, specifically designed for interop between different implementations and languages. It can be viewed as the spiritual successor to the existing NumPy array interface that we provide. The Arrow format is supported by NumPy 2, pandas 2, Polars, PyArrow, arro3, and others in the Python ecosystem.

What Support means

  • The ability to export an image to an Arrow array and read/process that data with no memory copies.
  • The ability to read an image from Arrow array storage with no memory copies.
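To make "no memory copies" concrete, here is a small standard-library sketch. The bytearray is only a stand-in for image storage (an assumption for illustration; Pillow's real storage lives in C), and memoryview plays the role of a zero-copy consumer.

```python
# A bytearray stands in for the image's pixel storage (illustrative assumption).
storage = bytearray([20, 21, 67, 255, 17, 18, 62, 255])

zero_copy_view = memoryview(storage)  # shares the same memory as storage
copied = bytes(storage)               # an independent copy of the data

storage[0] = 99                       # mutate the "image" in place

print(zero_copy_view[0])  # the view observes the change
print(copied[0])          # the copy still holds the old value
```

A zero-copy export behaves like the view: the consumer reads the producer's memory directly, which is also why the lifetime of that memory matters.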

Technical Details

(Apache docs are here: https://arrow.apache.org/docs/format/Columnar.html)

An Arrow Schema is a set of metadata, containing type information, and potentially child schemas. An Arrow Array has an (implicitly) associated schema, metadata about the length of the storage, as well as a buffer of a contiguously allocated chunk of memory for the data. The Arrow Array will generally have the same parent/child arrangement as the schema structure.

  • obj.__arrow_c_schema__() must return a PyCapsule with an arrow_schema name and an arrow schema struct.
  • obj.__arrow_c_array__(requested_schema=None) must return a tuple of a schema capsule as above and a PyCapsule with an arrow_array name and an arrow array struct. The requested schema is advisory: the caller may request a format, but the producer is not required to honor it.
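As a rough illustration of what those capsules wrap, here is a ctypes mirror of the two structs from the Arrow C data interface, plus a schema capsule built with CPython's PyCapsule_New. This is a sketch, not what the PR does: it assumes CPython, and it leaves release as NULL, which a real consumer would treat as an already released schema. It only shows the struct layout and the capsule name.

```python
import ctypes


class ArrowSchema(ctypes.Structure):
    """ctypes mirror of struct ArrowSchema (Arrow C data interface)."""


ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),       # e.g. b"C" for uint8
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_char_p),
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", ctypes.c_void_p),      # void (*release)(struct ArrowSchema*)
    ("private_data", ctypes.c_void_p),
]


class ArrowArray(ctypes.Structure):
    """ctypes mirror of struct ArrowArray (Arrow C data interface)."""


ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ctypes.c_void_p),      # void (*release)(struct ArrowArray*)
    ("private_data", ctypes.c_void_p),
]

# CPython capsule API, reached through ctypes for illustration only.
PyCapsule_New = ctypes.pythonapi.PyCapsule_New
PyCapsule_New.restype = ctypes.py_object
PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]

PyCapsule_GetPointer = ctypes.pythonapi.PyCapsule_GetPointer
PyCapsule_GetPointer.restype = ctypes.c_void_p
PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]

# The capsule name is not copied, so it must outlive the capsule.
SCHEMA_CAPSULE_NAME = b"arrow_schema"

# Schema for a flat uint8 array; release stays NULL in this sketch.
schema = ArrowSchema(format=b"C", name=b"", n_children=0)
capsule = PyCapsule_New(ctypes.addressof(schema), SCHEMA_CAPSULE_NAME, None)

# A consumer looks the pointer up by name and reads the struct in place.
ptr = PyCapsule_GetPointer(capsule, SCHEMA_CAPSULE_NAME)
roundtrip = ArrowSchema.from_address(ptr)
print(roundtrip.format)
```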

The lifetime of the Schema and Array structures is dependent on the caller -- so there are release callbacks that must be called when the caller is done with the memory. This complicates the lifetime of our image storage.

We have two cases at the moment:

  1. single channel image
  2. multichannel image

A single channel image can be encoded as a single array of height*width items, using the type of the underlying storage. (e.g., uint8/int32/float32).
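A minimal sketch of that mapping, assuming the Pillow modes L, I and F for the uint8, int32 and float32 cases. The format strings are the Arrow C data interface codes for those types; the helper names are made up for illustration.

```python
# Pillow mode -> Arrow C data interface format string.
# "C" is uint8, "i" is int32, "f" is float32.
ARROW_FORMAT = {"L": "C", "I": "i", "F": "f"}


def single_channel_layout(mode, width, height):
    """Return (arrow_format, array_length) for a single channel image."""
    return ARROW_FORMAT[mode], width * height


def flat_index(x, y, width):
    """Index of pixel (x, y) in the flat height*width array."""
    return y * width + x


print(single_channel_layout("L", 640, 480))
print(flat_index(10, 2, 640))
```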

A multichannel image can be encoded in a similar manner, using 4*height*width items in the array. The caller would be responsible for knowing that there are 4 elements per pixel. It's also possible to use a parent type of a fixed-size list (FixedSizeList) of 4 elements, with a child array of 4*height*width elements. Fixed-size lists are statically defined, so the underlying array is still the same contiguous block of memory.

Flat:

<pyarrow.lib.FixedSizeListArray object at 0x106ad4280>
[
    20,
    21,
    67,
    255,
    17,
    18,
    62,
    255,
...

Nested:

<pyarrow.lib.FixedSizeListArray object at 0x106ad4280>
[
  [
    20,
    21,
    67,
    255
  ],
  [
    17,
    18,
    62,
    255
  ],

An alternate encoding of a multichannel image would be to use a struct of channels, e.g. Struct[r,g,b,a]. This would require 4 child arrays, each allocated in a contiguous chunk, as in planar image storage. This is not compatible with our current storage.
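The three encodings can be compared on the same eight bytes from the listings above. This is a standard-library analogy rather than Arrow itself: memoryview stands in for the flat and fixed-size list views, and the per-channel split shows why the struct layout needs copies.

```python
# Two RGBA pixels, interleaved, as in the listings above.
pixels = bytearray([20, 21, 67, 255, 17, 18, 62, 255])

# Flat: one array of 4*height*width uint8 items (Arrow format "C").
flat = memoryview(pixels)

# Nested: the same memory seen as rows of 4 items, comparable to an
# Arrow fixed-size list ("+w:4") over a uint8 child array.
nested = flat.cast("B", shape=[len(pixels) // 4, 4])

# Planar: one buffer per channel. Strided slicing has to copy the bytes.
planes = {name: bytes(pixels[i::4]) for i, name in enumerate("rgba")}

pixels[0] = 99  # mutate the interleaved storage in place

print(flat[0], nested[0, 0])  # both views see the change
print(planes["r"][0])         # the planar copy does not
```

The flat and nested views share storage with the image, while the planar version is a separate allocation per channel.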

While our core storage is generally compatible with the flat and fixed-size list layouts, there are three issues:

  1. The block allocator in ImagingAllocateArray packs a number of scanlines into a 16 MB block, leaving empty space at the end of the block. This limits the array length to less than one 16 MB block. This is not an issue with the single-chunk ImagingAllocateBlock, which allocates the image in one chunk. (Note the naming: our array allocator is the one that uses multiple blocks, while Arrow arrays work fully with our block allocator. Naming is hard.) It may be possible to work around this with the streaming interface.
  2. Some modes have line length padding (BGR;15, BGR;24), and will not work without copying.
  3. Some modes have ignored pixel bands (LA/PA). This is a documentation issue for consumers.
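The first two caveats come down to simple arithmetic. The sketch below uses hypothetical helper names, takes the 16 MB block size from the description above, and ignores any alignment the real allocator applies, so treat the numbers as approximate.

```python
BLOCK_SIZE = 16 * 1024 * 1024  # 16 MB block, as described above


def lines_per_block(width, pixel_size, block_size=BLOCK_SIZE):
    """Whole scanlines that fit in one block (alignment ignored)."""
    return block_size // (width * pixel_size)


def fits_in_one_block(width, height, pixel_size, block_size=BLOCK_SIZE):
    """True if every scanline of the image lands in a single block."""
    return height <= lines_per_block(width, pixel_size, block_size)


def has_line_padding(width, bytes_per_pixel, linesize):
    """True if each stored line is longer than its packed pixel data."""
    return linesize != width * bytes_per_pixel


print(fits_in_one_block(1024, 1024, 4))  # 4 MB of pixel data
print(fits_in_one_block(4096, 4096, 4))  # 64 MB of pixel data
print(has_line_padding(5, 3, 16))        # 15 bytes of pixels in a 16 byte line
```

An image that fails the first check would span several blocks, so it could not be exposed as one contiguous array without a copy. Padded lines break contiguity in the same way.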

Implementation Notes

PR #8330 implements Pillow->Arrow export for images that don't trip the above caveats.

There are no additional build or runtime dependencies. The Arrow structures are designed to be copied into a header and used from there (licensing is not an issue, as those fragments are under the Apache License). There is an additional test dependency on PyArrow at the moment. In theory, NumPy 2 could be used for this, but I'm not sure whether we'd be testing the legacy array access or the Arrow access.

The lifetime of the core imaging struct is now separated from the imaging Python Object. There's effectively a refcount implemented for this -- there's an initial 1 for the image->im reference, every arrow array that references an image increments it, and calling ImagingDelete decrements it.
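The counting scheme can be modeled in a few lines of Python. CoreImage and export_arrow are hypothetical names used only for this sketch. The real logic lives in C, where the decrement happens in ImagingDelete and the Arrow release callback.

```python
class CoreImage:
    """Hypothetical stand-in for the core imaging struct."""

    def __init__(self, data):
        self.data = data
        self.refcount = 1   # the initial reference held by image->im
        self.freed = False

    def export_arrow(self):
        """Take a reference for an exported array; return its release callback."""
        self.refcount += 1
        released = False

        def release():
            nonlocal released
            if not released:  # a release callback must only act once
                released = True
                self.delete()

        return release

    def delete(self):
        """Analogue of ImagingDelete: drop one reference, free at zero."""
        self.refcount -= 1
        if self.refcount == 0:
            self.data = None
            self.freed = True


core = CoreImage(bytearray(8))
release = core.export_arrow()  # an Arrow consumer now holds a reference
core.delete()                  # the Python image object goes away
print(core.freed)              # storage is still alive for the consumer
release()                      # the consumer is done with the memory
print(core.freed)              # now the storage is freed
```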

Outstanding Questions

For consumers of data -- what's the most useful format?

  • Flat array arr[(y*(width)+x)*4 + channel]
  • or Fixed Pixel array arr[y*(width)+x][channel]?
  • Would it make sense to embed this into a set of FixedArrays that are a line length, arr[y][x][channel]?
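For comparison, the three indexing schemes address the same byte. The sketch below uses memoryview shapes as a stand-in for the Arrow layouts (an analogy, not Arrow itself). Note that memoryview takes a tuple index, arr[y, x, c], where the list above writes arr[y][x][c].

```python
width, height, channels = 3, 2, 4
data = bytes(range(width * height * channels))  # dummy pixel values 0..23

flat = memoryview(data)
per_pixel = flat.cast("B", shape=[height * width, channels])
per_line = flat.cast("B", shape=[height, width, channels])

x, y, c = 2, 1, 3
a = flat[(y * width + x) * channels + c]  # flat array
b = per_pixel[y * width + x, c]           # fixed pixel array
d = per_line[y, x, c]                     # lines of pixels

print(a, b, d)  # the same element under all three schemes
```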
