Description
@ablaom I am not sure if this is the best place to start this discussion, but it is a follow-up to https://discourse.julialang.org/t/random-access-to-rows-of-a-table/77386 and JuliaData/Tables.jl#278.
The key point is to avoid creating functions with essentially the same functionality across DataAPI.jl, Tables.jl, and MLUtils.jl (and possibly other ML packages I am not aware of).
Assume for a moment that a Tables.jl table is the source of data for some ML model and you want operations on it to be efficient.
My understanding is that your high-level workflow is the following:
- the user starts with a Tables.jl table.
- then the user performs observation subsetting, feature selection, and feature transformation operations on this table (either eagerly or lazily).
- finally, the user converts the result of step 2 (again, either lazily or eagerly) to an object of some other type that the ML algorithm accepts as input.
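To make the three steps concrete, here is a minimal eager sketch. It assumes the table is a plain `NamedTuple` of vectors (which Tables.jl accepts as a column table); the column names, the `keep` indices, and the choice of a `log` transformation are purely illustrative and not taken from any of the packages discussed.

```julia
using Tables

# Step 1: start from a Tables.jl table (a NamedTuple of vectors is a valid column table).
table = (a = [1.0, 2.0, 3.0, 4.0],
         b = [10.0, 20.0, 30.0, 40.0],
         c = ["w", "x", "y", "z"])

# Step 2a: observation subsetting (eager), by indexing every column with the same rows.
keep = [1, 3, 4]                              # illustrative row indices
cols = Tables.columntable(table)              # NamedTuple of column vectors
subsetted = map(col -> col[keep], cols)

# Step 2b: feature selection, keeping only the numeric columns.
selected = NamedTuple{(:a, :b)}(subsetted)

# Step 2c: feature transformation, replacing column b by its logarithm.
transformed = merge(selected, (b = log.(selected.b),))

# Step 3: convert to the type the ML algorithm accepts, here a plain Matrix.
X = Tables.matrix(transformed)
```

Every step here allocates a copy. The question in this issue is essentially which of these steps should have a shared, possibly lazy, API so that each package does not have to reinvent it.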
The question is:
What functionality do you need in DataAPI.jl and Tables.jl so that this workflow is efficient and you do not need to provide duplicate definitions of concepts in MLUtils.jl (or some other packages)?
Another consideration (raised in the linked discussions) is that I would expect whatever we develop to be consistent with the interfaces that Base Julia already defines (e.g. the iterator interface, abstract vector interface, indexing interface, and view interface).
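As an illustration of what consistency with the Base interfaces could look like, the sketch below wraps a column table in a hypothetical `RowVector` type (not part of any existing package) that subtypes `AbstractVector`. Only `size` and `getindex` are defined by hand; iteration, eager subsetting, and lazy views then come from Base itself. It assumes a non-empty `NamedTuple` of equal-length vectors.

```julia
# Hypothetical wrapper (illustration only): rows of a NamedTuple-of-vectors
# exposed through Base's AbstractVector interface.
struct RowVector{T<:NamedTuple} <: AbstractVector{NamedTuple}
    cols::T
end

# Minimal AbstractArray interface: size, index style, and scalar getindex.
Base.size(r::RowVector) = (length(first(r.cols)),)
Base.IndexStyle(::Type{<:RowVector}) = IndexLinear()
Base.getindex(r::RowVector, i::Int) = map(col -> col[i], r.cols)

rows = RowVector((a = [1, 2, 3], b = [4.0, 5.0, 6.0]))

second = rows[2]               # indexing interface: one observation as a NamedTuple
everything = collect(rows)     # iterator interface, inherited from AbstractVector
eager = rows[[1, 3]]           # eager observation subsetting (copies the rows)
lazy = view(rows, 2:3)         # view interface: lazy subsetting, no copy of the columns
```

A design along these lines would need no new verbs for row access, although it says nothing about feature (column) selection, which would still need its own answer.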