Description
We currently have an efficient and consistent solution to skip missing values for unweighted single-argument functions via `f(skipmissing(x))`. For multiple-argument functions like `cor` we don't have a great solution yet (https://github.com/JuliaLang/Statistics.jl/pull/34). Another case where we don't have a good solution is weighted functions, which are not currently in Statistics but should be imported from StatsBase (https://github.com/JuliaLang/Statistics.jl/issues/87).
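To make the contrast concrete, here is a small sketch with made-up data. It shows the single-argument case working through `skipmissing`, and why skipping each vector independently does not carry over to two-argument functions: the surviving values no longer line up by observation.

```julia
using Statistics

x = [1.0, missing, 3.0, 4.0]
y = [2.0, 5.0, missing, 8.0]

# Single-argument case: the lazy wrapper simply drops the missing entry
mx = mean(skipmissing(x))

# Two-argument case: skipping each vector on its own drops different
# positions, so xs[2] (from position 3) ends up paired with ys[2] (from position 2)
xs = collect(skipmissing(x))
ys = collect(skipmissing(y))

# What a two-argument function needs instead: keep only the positions
# where both values are present
keep = map((a, b) -> !ismissing(a) && !ismissing(b), x, y)
xk, yk = x[keep], y[keep]
```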
A reasonable solution would be to use `f(skipmissing(x), weights=w)`, with a typical definition being:
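
    function f(s::SkipMissing{<:AbstractVector}; weights::AbstractVector)
        # weights must match the original (unskipped) vector
        size(s.x) == size(weights) || throw(DimensionMismatch())
        # indices of non-missing entries (`find` was renamed `findall` in Julia 1.0)
        inds = findall(!ismissing, s.x)
        f(view(s.x, inds), weights=view(weights, inds))
    end
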
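As a runnable illustration of this pattern, the sketch below applies it to a hypothetical `wmean` function. Both the name and its formula are assumptions made for the example, not an existing Statistics or StatsBase API. It also relies on the internal field `x` of `Base.SkipMissing`, as the definition above does.

```julia
# Hypothetical weighted mean on complete data (illustration only)
wmean(x::AbstractVector; weights::AbstractVector) =
    sum(x .* weights) / sum(weights)

# Method for skipmissing input: `weights` refers to the original vector,
# so the weights of missing entries are dropped together with those entries
function wmean(s::Base.SkipMissing{<:AbstractVector}; weights::AbstractVector)
    size(s.x) == size(weights) || throw(DimensionMismatch())
    inds = findall(!ismissing, s.x)
    wmean(view(s.x, inds), weights=view(weights, inds))
end

x = [1.0, missing, 3.0]
w = [1.0, 5.0, 3.0]

# The weight 5.0 is ignored because it belongs to the missing entry
r = wmean(skipmissing(x), weights=w)   # (1*1 + 3*3) / (1 + 3)
```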
That is, we would assume that weights refer to the original vector so that we skip those corresponding to missing entries. This is admittedly a bit weird in terms of implementation as weights are not wrapped in `skipmissing`. A wrapper like `skipmissing(weighted(x, w))` (inspired by what was proposed at JuliaLang/julia#33310) would be cleaner in that regard. But that would still be quite ad-hoc, as `skipmissing` currently only accepts collections (and `weighted` cannot be one since it's not just about multiplying weights and values), and the resulting object would basically be only used for dispatch without implementing any common methods.
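For comparison, the wrapper approach could look roughly like the sketch below. Every name in it (`Weighted`, `weighted`, `SkipMissingWeighted`, `wmean`) is hypothetical and only meant to show the shape of the design. Note that the object returned by `skipmissing` here implements no iteration or indexing; it exists purely so that functions can dispatch on it, which is the ad-hoc aspect mentioned above.

```julia
# Hypothetical wrapper pairing values with their weights (not an existing API)
struct Weighted{X<:AbstractVector,W<:AbstractVector}
    x::X
    w::W
end

function weighted(x::AbstractVector, w::AbstractVector)
    size(x) == size(w) || throw(DimensionMismatch())
    Weighted(x, w)
end

# Hypothetical result of calling skipmissing on a Weighted object;
# it is not a collection and is only used for dispatch
struct SkipMissingWeighted{T<:Weighted}
    parent::T
end

Base.skipmissing(p::Weighted) = SkipMissingWeighted(p)

# Hypothetical weighted mean that skips missing values and their weights
function wmean(s::SkipMissingWeighted)
    x, w = s.parent.x, s.parent.w
    inds = findall(!ismissing, x)
    sum(view(x, inds) .* view(w, inds)) / sum(view(w, inds))
end

r = wmean(skipmissing(weighted([1.0, missing, 3.0], [1.0, 5.0, 3.0])))
```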
The generalization to multiple-argument functions poses the same challenges as `cor`. For these, the simplest solution would be to use a `skipmissing` keyword argument, a bit like `pairwise`. Again, the alternative would be to use wrappers like `skipmissing(weighted(w, x, y))`.
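A minimal sketch of the keyword-argument variant for a two-argument function follows. The function name `mycor` and the choice of a `Bool` keyword are assumptions made for illustration. One detail worth noting: inside the function body the keyword shadows `Base.skipmissing`, so the filtering is written without calling it.

```julia
using Statistics

# Hypothetical two-argument function with a `skipmissing` keyword
function mycor(x::AbstractVector, y::AbstractVector; skipmissing::Bool=false)
    size(x) == size(y) || throw(DimensionMismatch())
    skipmissing || return cor(x, y)
    # Keep only the observations where both values are present
    keep = [i for i in eachindex(x, y) if !ismissing(x[i]) && !ismissing(y[i])]
    cor(x[keep], y[keep])
end

x = [1.0, missing, 3.0, 4.0, 5.0]
y = [2.0, 5.0, missing, 8.0, 10.0]

# Observations 1, 4 and 5 remain, and on those y is exactly 2x
r = mycor(x, y, skipmissing=true)
```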
Overall, the problem is that we have conflicting goals:
- be able to skip missing values with functions that don't have any special support for them using `f(skipmissing(x))`
- use a similar syntax for unweighted and weighted functions, e.g. `f(skipmissing(x))` vs `f(skipmissing(x), weights=w)`, or `f(skipmissing(x))` vs `f(skipmissing(weighted(x, w)))`, or `f(x, skipmissing=true)` vs `f(x, skipmissing=true, weights=w)`
- use a similar syntax for single- and multiple-argument functions, e.g. `f(skipmissing(x))` vs `f(skipmissing(x, y))`, or `f(x, skipmissing=true)` vs `f(x, y, skipmissing=true)`
- use a similar syntax for simple functions operating on vectors (like `mean`) and complex functions operating on whole tables (like `fit(MODEL, ..., data=df, weights=w)` and which skip missing values by default)