Missing values and weighting #88

Open
@nalimilan

We currently have an efficient and consistent solution to skip missing values for unweighted single-argument functions via f(skipmissing(x)). For multiple-argument functions like cor we don't have a great solution yet (https://github.com/JuliaLang/Statistics.jl/pull/34). Another case where we don't have a good solution is weighted functions, which are not currently in Statistics but should be imported from StatsBase (https://github.com/JuliaLang/Statistics.jl/issues/87).

A reasonable solution would be to use f(skipmissing(x), weights=w), with a typical definition being:

function f(s::Base.SkipMissing{<:AbstractVector}; weights::AbstractVector)
    size(s.x) == size(weights) || throw(DimensionMismatch())
    # Indices of the non-missing entries in the original vector
    inds = findall(!ismissing, s.x)
    f(view(s.x, inds), weights=view(weights, inds))
end

That is, we would assume that weights refer to the original vector so that we skip those corresponding to missing entries. This is admittedly a bit weird in terms of implementation as weights are not wrapped in skipmissing. A wrapper like skipmissing(weighted(x, w)) (inspired by what was proposed at JuliaLang/julia#33310) would be cleaner in that regard. But that would still be quite ad-hoc, as skipmissing currently only accepts collections (and weighted cannot be one since it's not just about multiplying weights and values), and the resulting object would basically be only used for dispatch without implementing any common methods.
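To make the first option concrete, here is a minimal sketch of how such a method could behave for a weighted mean. The name wmean is hypothetical and is not an existing Statistics or StatsBase function; the sketch also relies on the internal field x of Base.SkipMissing, exactly as the definition above does.

```julia
# Hypothetical weighted mean, used only for illustration (not an existing API).
wmean(x::AbstractVector; weights::AbstractVector) =
    sum(x .* weights) / sum(weights)

# Method for skipmissing(x): weights are assumed to be aligned with the ORIGINAL vector.
function wmean(s::Base.SkipMissing{<:AbstractVector}; weights::AbstractVector)
    x = s.x  # internal field of Base.SkipMissing (an implementation detail)
    size(x) == size(weights) ||
        throw(DimensionMismatch("x and weights must have the same size"))
    # Keep only the positions where x is observed, and the matching weights.
    inds = findall(!ismissing, x)
    wmean(view(x, inds), weights=view(weights, inds))
end

x = [1.0, missing, 3.0]
w = [1.0, 10.0, 3.0]
# The weight 10.0 is dropped together with the missing entry:
# (1.0*1.0 + 3.0*3.0) / (1.0 + 3.0) == 2.5
result = wmean(skipmissing(x), weights=w)
```

Note that passing weights that were already filtered (of length 2 here) would throw a DimensionMismatch under this convention, which is the "weights refer to the original vector" assumption in action.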

The generalization to multiple-argument functions poses the same challenges as cor. For these, the simplest solution would be to use a skipmissing keyword argument, a bit like pairwise. Again, the alternative would be to use wrappers like skipmissing(weighted(w, x, y)).
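As a rough illustration of the keyword approach for a two-argument function, the sketch below wraps Statistics.cor. The name cor_sketch and its Bool skipmissing keyword are assumptions made for this example only; neither is part of the current Statistics API.

```julia
using Statistics

# Hypothetical wrapper; the `skipmissing` keyword is an assumption for illustration.
# Inside this function the keyword shadows Base.skipmissing, so the function
# Base.skipmissing is deliberately not called here.
function cor_sketch(x::AbstractVector, y::AbstractVector; skipmissing::Bool=false)
    length(x) == length(y) ||
        throw(DimensionMismatch("x and y must have the same length"))
    skipmissing || return cor(x, y)
    # Keep only the positions where BOTH x and y are observed.
    keep = [i for i in eachindex(x, y) if !ismissing(x[i]) && !ismissing(y[i])]
    # identity.(...) narrows the element type now that no missing values remain.
    cor(identity.(x[keep]), identity.(y[keep]))
end

x = [1.0, 2.0, missing, 4.0, 5.0]
y = [2.0, 4.1, 6.0, missing, 9.8]
# Only positions 1, 2 and 5 are complete, so these are the ones used.
r = cor_sketch(x, y, skipmissing=true)
```

A weighted variant would additionally have to subset the weights with the same keep indices, which is where the same alignment question as in the single-argument case comes back.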

Overall, the problem is that we have conflicting goals:

  • be able to skip missing values with functions that don't have any special support for them using f(skipmissing(x))
  • use a similar syntax for unweighted and weighted functions, e.g. f(skipmissing(x)) vs f(skipmissing(x), weights=w), or f(skipmissing(x)) vs f(skipmissing(weighted(x, w))), or f(x, skipmissing=true) vs f(x, skipmissing=true, weights=w)
  • use a similar syntax for single- and multiple-argument functions, e.g. f(skipmissing(x)) vs f(skipmissing(x, y)), or f(x, skipmissing=true) vs f(x, y, skipmissing=true)
  • use a similar syntax for simple functions operating on vectors (like mean) and for complex functions operating on whole tables, which skip missing values by default (like fit(MODEL, ..., data=df, weights=w))
