Description
Related issue: #11462
We would like to transfer a cuDF dataframe between a JVM process and a Python process without copying data. This is primarily aimed at PySpark environments, where Spark runs in a JVM process and executes user-defined Python functions. The current solution in Spark serializes the dataframe into the Arrow format and performs an inter-process transfer on the host, which is inefficient in both memory usage and computation time. Our proposed solution is to use the CUDA IPC mechanism to transfer only metadata between the two processes, without copying data between host and device or between different processes. The message exchanged is a JSON document that describes some properties of the dataframe along with an encoded CUDA IPC handle.
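For context, the underlying mechanism looks roughly like the following. This is a minimal sketch using Numba's existing CUDA IPC support, not cuDF API, and it assumes both processes run on the same CUDA device (CUDA IPC is also Linux-only):

```python
import multiprocessing as mp

import numpy as np
from numba import cuda


def consumer(ipc_handle):
    # Opening the handle maps the exporter's device allocation into this
    # process; nothing is copied until copy_to_host() is called.
    with ipc_handle as darr:
        print(darr.copy_to_host())  # -> [0 1 2 3]


if __name__ == "__main__":
    darr = cuda.to_device(np.arange(4))
    ipc_handle = darr.get_ipc_handle()  # picklable, can cross process boundaries
    ctx = mp.get_context("spawn")  # avoid fork() with an active CUDA context
    p = ctx.Process(target=consumer, args=(ipc_handle,))
    p.start()
    p.join()
```

The proposal below is essentially this round trip, but with the handle hex-encoded inside a JSON message so it can also cross a language boundary (JVM to Python).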
Describe the solution you'd like
We would like to add a pair of round-trip methods at the C++/C level, one that generates an IPC message describing the dataframe and one that reconstructs the dataframe from that message, along with wrappers in Java/Python:
```python
df = cudf.DataFrame({"a": [1, 2, 3]})
msg = df.to_ipc()

# In a different process, but on the same CUDA device:
df = cudf.DataFrame.from_ipc(msg)
```
As a quick design of the message, for lack of a better term, we can "jsonify" the `__dataframe__` protocol, using an encoded CUDA IPC handle in place of a raw device pointer. The message sent between processes could look similar to:
```json
{
  "columns": [
    {
      "describe_categorical": {
        "is_dictionary": true,
        "is_ordered": true,
        "mapping": {"a": 0}
      },
      "describe_null": [3, null],
      "dtype": [0, 64, "i", "="],
      "buffers": [
        {"ipc_handle": "aeb6df622e4d6ed84dac526c8815c52c5bb2a34855a765270d6e1a502dc6574b"}
      ],
      "metadata": null,
      "null_count": null,
      "offset": 0,
      "size": 128
    },
    {
      ...
    }
  ]
}
```
We might be able to reuse some of the logic in the `df_protocol.py` module when implementing this feature; a rough sketch of the producer side follows.
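To make the shape of the implementation concrete, here is a minimal Python sketch that builds the message above by walking the interchange protocol. `_export_ipc_handle` is a hypothetical helper (it would wrap `cudaIpcGetMemHandle` and hex-encode the 64-byte handle), and exact attribute spellings vary across protocol versions (e.g. `size()` is a property in some implementations):

```python
import json


def _export_ipc_handle(ptr: int, nbytes: int) -> str:
    """Hypothetical helper: call cudaIpcGetMemHandle on the device
    pointer and hex-encode the resulting 64-byte handle."""
    raise NotImplementedError


def to_ipc(df) -> str:
    """Sketch of the producer side: jsonify the __dataframe__ protocol,
    swapping raw device pointers for encoded CUDA IPC handles."""
    columns = []
    for col in df.__dataframe__().get_columns():
        buffers = []
        # get_buffers() returns {"data": ..., "validity": ..., "offsets": ...};
        # entries other than "data" may be None.
        for entry in col.get_buffers().values():
            if entry is None:
                continue
            buf, _dtype = entry
            buffers.append(
                {"ipc_handle": _export_ipc_handle(buf.ptr, buf.bufsize)}
            )
        columns.append(
            {
                "dtype": col.dtype,  # (kind, bitwidth, format string, endianness)
                "describe_null": col.describe_null,
                "null_count": col.null_count,
                "offset": col.offset,
                "size": col.size(),  # a property in older protocol versions
                "buffers": buffers,
            }
        )
    return json.dumps({"columns": columns})
```

The consumer side (`from_ipc`) would reverse this: parse the JSON, open each handle with `cudaIpcOpenMemHandle`, and assemble columns from the resulting device buffers without any copy.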
Describe alternatives you've considered
- As suggested by @jakirkham, I looked into the `serialize` and `deserialize` methods. They are good starting points, but not suitable for transferring ownership between two different language environments due to their use of pickle for type serialization.
- As suggested by @shwina, I looked into the dataframe protocol, which is very close to what we want, except for its Python-only interface and its use of raw pointers.
- The `mapInArrow` method from PySpark, which still requires a full data transfer.