I was wondering if it would be pertinent to add a halve method for a GroupedDataFrame (as a package extension). It is fairly simple to write:
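import SplittablesBase
using DataFrames: GroupedDataFrame

function SplittablesBase.halve(gdf::GroupedDataFrame)
    # Split the group keys in two, then index the groups by each half
    left, right = SplittablesBase.halve(keys(gdf))
    return (gdf[left], gdf[right])
end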
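To make the two steps inside that method concrete, here is a small sketch on toy data. The column names and values are made up for illustration, and it assumes the generic `AbstractVector` method of `SplittablesBase.halve` applies to `keys(gdf)` and that a `GroupedDataFrame` can be indexed by the resulting views of group keys.

```julia
using DataFrames
import SplittablesBase

# Hypothetical toy data: 5 groups with 2 rows each
df = DataFrame(key = repeat(["a", "b", "c", "d", "e"], inner = 2), x = 1:10)
gdf = groupby(df, :key)

# Step 1: split the vector of group keys into two halves
left, right = SplittablesBase.halve(keys(gdf))

# Step 2: index the GroupedDataFrame by each half of the keys
gdf_left, gdf_right = gdf[left], gdf[right]

# Every group ends up in exactly one of the two halves
length(gdf_left) + length(gdf_right) == length(gdf)
```

Each half is itself a `GroupedDataFrame`, so it can be halved again, which is what a divide-and-conquer reducer needs.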
It is useful because DataFrames still decides for itself when to spawn threads in its combine method, even when the threads=true kwarg is set. I have recently run into a few cases where the DataFrames implementation ran single-threaded, and writing a Folds-based reducer like this used all CPU cores and sped up my computations:
init = DataFrame(...) # empty, with the correct column names and types
Folds.mapreduce(vcat, groupby(df, :key); init) do subdf
    ...
end
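For reference, a minimal end-to-end sketch of that pattern on toy data, checked against `combine`. The data, the `total` column, and the per-group sum are placeholders I picked for illustration; the `halve` method is the one proposed above, defined locally here so the snippet stands on its own.

```julia
using DataFrames
import Folds, SplittablesBase

# Proposed method, defined locally (not currently provided by any package, as far as I know)
function SplittablesBase.halve(gdf::GroupedDataFrame)
    left, right = SplittablesBase.halve(keys(gdf))
    return (gdf[left], gdf[right])
end

# Hypothetical toy data: 4 groups with 3 rows each
df = DataFrame(key = repeat(1:4, inner = 3), x = 1:12)

# Empty accumulator with the same column names and types as each per-group result
init = DataFrame(key = Int[], total = Int[])

# Each group is mapped to a one-row DataFrame; the pieces are joined with vcat
result = Folds.mapreduce(vcat, groupby(df, :key); init) do subdf
    DataFrame(key = first(subdf.key), total = sum(subdf.x))
end

# Same computation through combine, for comparison
expected = combine(groupby(df, :key), :x => sum => :total)
sort(result, :key) == sort(expected, :key)
```

Since `vcat` returns a new DataFrame rather than mutating its arguments, sharing the same empty `init` across tasks should be safe here.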