Basic Arrow.jl-based collect and createDataFrame #115
Functions collect_arrow, collect_tuples, and collect_df are provided, which all use Arrow.jl and Spark's Arrow support to transfer data from Spark to Julia. collect_arrow returns the raw Arrow.jl table, collect_df returns a DataFrame from DataFrames.jl, and collect_tuples returns a simple Vector of named tuples.
createDataFrame now has overloads which accept a DataFrame or an abstract Table.
This version creates a temporary file for each transfer, but I actually think it's preferable in many ways to socket-based transfer:
* Simpler :)
* Arrow.jl will mmap the file, so it can in theory handle slightly-larger-than-RAM datasets
* or, if you have /tmp in tmpfs (RAM disk), it will just mmap the chunk of memory, without additional copying on the Julia side
This commit still includes 2 versions of both collectToArrow and fromArrow, since I couldn't yet decide which is better.
Thanks a lot! It's great to see Arrow getting back to Spark.jl!
I'm totally fine with files. If I remember correctly, PySpark uses (used?) both sockets and files, in different places or with different settings. Anyway, I realized that it's not necessarily a good idea to follow the PySpark or SparkR design, since many decisions in them were made specifically for those languages or due to certain conditions that may not apply to our case. So let's do whatever is best for Julia.
Can't we just generate, say, a big Parquet file?
Requires.jl was designed to do exactly this, though I haven't used it for years now and don't know its status.
Functions collect_arrow, collect_tuples, and collect_df are provided, which all use Arrow.jl and Spark's Arrow support to transfer data from Spark to Julia.
collect_arrow returns the raw Arrow.jl table, collect_df returns a DataFrame from DataFrames.jl, and collect_tuples returns a simple Vector of named tuples.
createDataFrame now has overloads which accept a DataFrame or an abstract Table.
This version creates a temporary file for each transfer, but I actually think it's preferable in many ways to socket-based transfer:
* Simpler :)
* Arrow.jl will mmap the file, so it can in theory handle slightly-larger-than-RAM datasets
* or, if you have /tmp in tmpfs (RAM disk), it will just mmap the chunk of memory, without additional copying on the Julia side
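To make the file-based idea concrete, here is a minimal sketch of the Julia side only. It does not use Spark at all: Julia writes the Arrow file itself as a stand-in for what Spark would produce, and the variable names are purely illustrative, not the PR's actual code. It shows the three result shapes discussed above (raw Arrow table, DataFrame, Vector of named tuples) coming from one memory-mapped temporary file.

```julia
using Arrow, DataFrames

# Stand-in for a Spark result; in the PR the file would be written by Spark.
source = (id = [1, 2, 3], name = ["a", "b", "c"])

# Hypothetical temporary transfer file.
path = tempname() * ".arrow"
Arrow.write(path, source)

# Reading from a file path lets Arrow.jl memory-map the data.
tbl = Arrow.Table(path)

# Shape of a collect_df-style result: a DataFrame built from the Arrow table.
df = DataFrame(tbl)

# Shape of a collect_tuples-style result: a Vector of named tuples, one per row.
rows = NamedTuple.(eachrow(df))
```

The file is left in place here because the Arrow table may still reference the mapped memory; a real implementation would need to decide when it is safe to delete it.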
However, if you think sockets would be preferable, I can change it (PySpark and SparkR use sockets).
A few things are missing.
I included 2 versions for both collectToArrow and fromArrow, since I couldn't yet decide which is better.
I added DataFrames.jl to the dependencies, but it doesn't seem right to depend on this fairly non-trivial library just for collect_df. Does Julia support something like optional dependencies? It would seem nicer to import DataFrames only if it's already installed, and otherwise error out in collect_df.
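As a rough illustration of the optional-dependency question, this is how it might look with Requires.jl, the package mentioned in the conversation. Everything named here (the module OptionalDFSketch, the function collect_df_sketch, the DF_CONSTRUCTOR slot) is hypothetical and not part of Spark.jl; the UUID is the one DataFrames.jl is registered under and should be double-checked against its Project.toml.

```julia
module OptionalDFSketch

using Requires

# Holds the DataFrame constructor once DataFrames.jl has been loaded.
const DF_CONSTRUCTOR = Ref{Any}(nothing)

# Hypothetical stand-in for collect_df: errors out unless DataFrames is loaded.
function collect_df_sketch(table)
    ctor = DF_CONSTRUCTOR[]
    ctor === nothing &&
        error("collect_df_sketch needs DataFrames.jl; run `using DataFrames` first")
    return ctor(table)
end

function __init__()
    # Requires.jl evaluates this block when DataFrames is loaded in the session,
    # whether that happens before or after this module is loaded.
    @require DataFrames="a93c6f00-e57d-5684-b7b6-d8193f3e46c0" begin
        DF_CONSTRUCTOR[] = DataFrames.DataFrame
    end
end

end # module
```

With this layout DataFrames.jl does not have to be a hard dependency: users who never call the DataFrame-returning function never load it, and those who do get a clear error until they run `using DataFrames`.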