Very slow approx_size for DataFrames #48

Open
@DrChainsaw

Description

When benchmarking a parallel application that uses Dagger, MemPool.approx_size seems to be the bottleneck because it falls back to Base.summarysize.

Here is a quick MWE:

julia> using BenchmarkTools, DataFrames, MemPool

julia> df = DataFrame(a=1:1_000_000, b=randn(1_000_000), c=repeat([:aa], 1_000_000));

julia> @benchmark MemPool.approx_size($df)
BenchmarkTools.Trial: 
  memory estimate:  61.03 MiB
  allocs estimate:  1999540
  --------------
  minimum time:     110.895 ms (4.59% GC)
  median time:      119.604 ms (2.47% GC)
  mean time:        122.978 ms (2.83% GC)
  maximum time:     146.009 ms (1.46% GC)
  --------------
  samples:          41
  evals/sample:     1

Here is a sketch of an alternative implementation which is much faster:

julia> function MemPool.approx_size(df::DataFrame)
           dsize = mapreduce(MemPool.approx_size, +, eachcol(df))
           namesize = mapreduce(MemPool.approx_size, +, names(df))
           return dsize + namesize
       end

julia> @benchmark MemPool.approx_size($df)
BenchmarkTools.Trial: 
  memory estimate:  704 bytes
  allocs estimate:  13
  --------------
  minimum time:     535.700 μs (0.00% GC)
  median time:      636.800 μs (0.00% GC)
  mean time:        664.967 μs (0.00% GC)
  maximum time:     1.525 ms (0.00% GC)
  --------------
  samples:          7499
  evals/sample:     1

The above implementation is not 100% correct, but I hope it shows that there is some potential for improvement.
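One concrete gap, for instance: mapreduce without an init throws on empty collections, so the method above errors for a zero-column DataFrame. A minimal fix is to pass init=0:

function MemPool.approx_size(df::DataFrame)
    # init=0 makes the reductions safe for zero-column DataFrames
    dsize = mapreduce(MemPool.approx_size, +, eachcol(df); init=0)
    namesize = mapreduce(MemPool.approx_size, +, names(df); init=0)
    return dsize + namesize
end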

I don't know if there is some interface, e.g. Tables.jl, that could be used to avoid the dependency on DataFrames.
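If that works out, a generic method via the Tables.jl column interface might look something like the rough sketch below. approx_table_size is just a hypothetical helper name, and how MemPool would actually hook this into dispatch is an open question:

using Tables, MemPool

# Hypothetical helper: sum the approximate sizes of the columns and the
# column names of any Tables.jl-compatible table.
function approx_table_size(t)
    cols = Tables.columns(t)
    colnames = Tables.columnnames(cols)
    dsize = sum((MemPool.approx_size(Tables.getcolumn(cols, n)) for n in colnames); init=0)
    namesize = sum((MemPool.approx_size(String(n)) for n in colnames); init=0)
    return dsize + namesize
end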
