Description
I think one could greatly increase buy-in for MLUtils.jl if every Tables.jl-compatible table automatically implemented the "data container" API. For performance, one would still want to implement the API for concrete table types as well, but having it "just work" for all tables would be nice. I guess, since "table" is itself just an interface rather than an abstract type, this would need to be implemented as part of the data container API, right? As Tables.jl is very lightweight, I don't see that as a big issue (and I could probably find someone to help with the integration).
Even so, there seems to be a problem implementing the interface for certain tables. MLUtils.jl interprets tuples in a very specific way. For example, shuffleobs((x1, x2)) treats x1 and x2 as separate data containers, which are shuffled together using the same permutation of observation indices. But some tables are tuples. The following example is even a tuple-table whose elements are themselves tables (of a different type):
julia> X
((a = [1, 3], b = [2, 3]), (a = [2, 5], b = [4, 7]))
julia> Tables.istable(X)
true
So is such a tuple a pair of data containers or a single data container? The current API cannot distinguish them.
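To make the ambiguity concrete, here is a minimal sketch using only Base Julia. The helpers obs_as_containers and obs_as_rowtable are hypothetical names invented for illustration; they are not part of MLUtils.jl or Tables.jl. The first mimics my understanding of how tuples (and named tuples) of containers are indexed, and the second treats the tuple as a single row table.

```julia
# The tuple-table from the example above.
X = ((a = [1, 3], b = [2, 3]), (a = [2, 5], b = [4, 7]))

# Reading 1 (hypothetical helper): X is a tuple of data containers.
# Observation i is taken from every element; each NamedTuple element is
# in turn indexed column by column.
obs_as_containers(X::Tuple, i) = map(c -> map(col -> col[i], c), X)

# Reading 2 (hypothetical helper): X is a single row table.
# Observation i is simply the i-th row.
obs_as_rowtable(X::Tuple, i) = X[i]

obs_as_containers(X, 1)  # ((a = 1, b = 2), (a = 2, b = 4))
obs_as_rowtable(X, 1)    # (a = [1, 3], b = [2, 3])
```

Both readings happen to report two observations here, yet they disagree on what the first observation is, so dispatching on Tuple alone cannot pick the right one.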
I wonder:
- How attached are people to the current tuple-based dispatch for coupled multi-container processing?
- Is there a big use-case for tables that are also tuples? @quinnj
Possibly this discussion is related.
Tables that are tuples are problematic elsewhere.