Support GroupedDataFrame? #4

@nateybear

I was wondering whether it would make sense to add a `halve` method for `GroupedDataFrame` (as a package extension). It is fairly simple to write:

```julia
function halve(gdf::GroupedDataFrame)
    (left, right) = halve(keys(gdf))
    return (gdf[left], gdf[right])
end
```

This is useful because DataFrames decides for itself when to spawn threads in its `combine` method, even when the `threads=true` kwarg is set. I have recently found a few cases where the DataFrames implementation runs single-threaded, and writing a Folds-based reducer like this used all CPU cores and sped up my computations:

```julia
init = DataFrame(...) # empty, correct columns and types
Folds.mapreduce(vcat, groupby(df, :key); init) do subdf
    ...
end
```
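
For concreteness, here is a minimal sketch of that reducer pattern on a toy table. The column names (`key`, `x`, `total`) and the data are made up for illustration. It folds over the integer group indices `1:length(gdf)` rather than over the `GroupedDataFrame` itself, so it does not rely on the `halve` method proposed above:

```julia
using DataFrames
using Folds

# Toy data (hypothetical): three groups keyed by :key.
df = DataFrame(key = [1, 1, 2, 2, 3], x = [10, 20, 30, 40, 50])
gdf = groupby(df, :key)

# Empty accumulator with the same columns and element types
# as the per-group results produced below.
init = DataFrame(key = Int[], total = Int[])

# Map each group index to a one-row DataFrame, then reduce with vcat.
# A range of indices is already splittable, so Folds can divide the work.
result = Folds.mapreduce(vcat, 1:length(gdf); init) do i
    subdf = gdf[i]  # SubDataFrame for the i-th group
    DataFrame(key = [first(subdf.key)], total = [sum(subdf.x)])
end
```

With a `halve` method for `GroupedDataFrame`, the index range could presumably be replaced by the grouped data frame itself, as in the snippet above.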
