Description
Related issue: #11462
We would like to transfer a cuDF dataframe between a JVM process and a Python process without copying data. This is primarily aimed at PySpark environments, where Spark runs in a JVM process and executes user-defined Python functions. The current solution in Spark serializes the dataframe into the Arrow format and performs an inter-process transfer on the host, which is inefficient in both memory usage and computation time. Our proposed solution is to use the CUDA IPC mechanism to transfer only metadata between the two processes, without copying data between host and device or between different processes. The message exchanged is a JSON document that describes some properties of the dataframe along with an encoded CUDA IPC handle.
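For context, the underlying mechanism looks roughly like the following. This is a minimal sketch using Numba's existing CUDA IPC support, not cuDF API, and it assumes both processes run on the same CUDA device (CUDA IPC is also Linux-only):

```python
import multiprocessing as mp

import numpy as np
from numba import cuda


def consumer(ipc_handle):
    # Opening the handle maps the exporter's device allocation into this
    # process; nothing is copied until copy_to_host() is called.
    with ipc_handle as darr:
        print(darr.copy_to_host())  # -> [0 1 2 3]


if __name__ == "__main__":
    darr = cuda.to_device(np.arange(4))
    ipc_handle = darr.get_ipc_handle()  # picklable, can cross process boundaries
    ctx = mp.get_context("spawn")  # avoid fork() with an active CUDA context
    p = ctx.Process(target=consumer, args=(ipc_handle,))
    p.start()
    p.join()
```

The proposal below is essentially this round trip, but with the handle hex-encoded inside a JSON message so it can also cross a language boundary (JVM to Python).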
Describe the solution you'd like
We would like to add a pair of round-trip methods at the C++/C level, one that generates an IPC message describing the dataframe and one that reconstructs the dataframe from that message, along with wrappers in Java/Python:
```python
df = cudf.DataFrame({"a": [1, 2, 3]})
msg = df.to_ipc()

# In a different process, but on the same CUDA device:
df = cudf.DataFrame.from_ipc(msg)
```
As a quick design of the message, for lack of a better term, we can "jsonify" the `__dataframe__` protocol, using an encoded CUDA IPC handle in place of a raw device pointer. The message sent between processes could look similar to:
```json
{
  "columns": [
    {
      "describe_categorical": {
        "is_dictionary": true,
        "is_ordered": true,
        "mapping": {"a": 0}
      },
      "describe_null": [3, null],
      "dtype": [0, 64, "i", "="],
      "buffers": [
        {"ipc_handle": "aeb6df622e4d6ed84dac526c8815c52c5bb2a34855a765270d6e1a502dc6574b"}
      ],
      "metadata": null,
      "null_count": null,
      "offset": 0,
      "size": 128
    },
    {
      ...
    }
  ]
}
```
We might be able to reuse some of the logic in the `df_protocol.py` module when implementing this feature; a rough sketch of the producer side follows.
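To make the shape of the implementation concrete, here is a minimal Python sketch that builds the message above by walking the interchange protocol. `_export_ipc_handle` is a hypothetical helper (it would wrap `cudaIpcGetMemHandle` and hex-encode the 64-byte handle), and exact attribute spellings vary across protocol versions (e.g. `size()` is a property in some implementations):

```python
import json


def _export_ipc_handle(ptr: int, nbytes: int) -> str:
    """Hypothetical helper: call cudaIpcGetMemHandle on the device
    pointer and hex-encode the resulting 64-byte handle."""
    raise NotImplementedError


def to_ipc(df) -> str:
    """Sketch of the producer side: jsonify the __dataframe__ protocol,
    swapping raw device pointers for encoded CUDA IPC handles."""
    columns = []
    for col in df.__dataframe__().get_columns():
        buffers = []
        # get_buffers() returns {"data": ..., "validity": ..., "offsets": ...};
        # entries other than "data" may be None.
        for entry in col.get_buffers().values():
            if entry is None:
                continue
            buf, _dtype = entry
            buffers.append(
                {"ipc_handle": _export_ipc_handle(buf.ptr, buf.bufsize)}
            )
        columns.append(
            {
                "dtype": col.dtype,  # (kind, bitwidth, format string, endianness)
                "describe_null": col.describe_null,
                "null_count": col.null_count,
                "offset": col.offset,
                "size": col.size(),  # a property in older protocol versions
                "buffers": buffers,
            }
        )
    return json.dumps({"columns": columns})
```

The consumer side (`from_ipc`) would reverse this: parse the JSON, open each handle with `cudaIpcOpenMemHandle`, and assemble columns from the resulting device buffers without any copy.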
Describe alternatives you've considered
- As suggested by @jakirkham, I looked into the `serialize` and `deserialize` methods. They are good starting points, but not suitable for transferring ownership between two different language environments due to their use of pickle for type serialization.
- As suggested by @shwina, I looked into the dataframe protocol, which is very close to what we want, except for its Python-only interface and its use of raw pointers.
- The `mapInArrow` method from PySpark, which still requires a full data transfer.