[Tracking] Add clustering metrics #26

Open
@ablaom

Here I have in mind those metrics that compare a clustering model with independent ground truth (as opposed to "internal" measures of quality, such as the Calinski-Harabasz index). The following look like good candidates:

  • Rand index
  • Hubert & Arabie Adjusted Rand index
  • Mirkin's index
  • Hubert's index
  • Variation of information
  • V-measure
  • Mutual information

The Clustering.jl package already has implementations, which assume the clusters are labelled with integers. The first four are combined into one function that returns a tuple rather than a single measurement, which deviates from the StatisticalMeasures.jl idiom. Here these could either be separate measures, or a single measure with a field selecting the desired variation.
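For reference, here is a rough summary of how the simpler pair-counting indices relate, which is presumably why they share one function. It assumes the conventions described in the Clustering.jl documentation (Rand index as agreement probability, Mirkin's index as disagreement probability, Hubert's index as their difference); the symbols $n$ and $A$ are introduced here for illustration only. With $n$ observations, let $A$ be the number of unordered pairs on which the two clusterings agree, meaning the pair is together in both or apart in both:

$$
\mathrm{RI} = \frac{A}{\binom{n}{2}}, \qquad
\mathrm{Mirkin} = 1 - \mathrm{RI}, \qquad
\mathrm{Hubert} = \mathrm{RI} - (1 - \mathrm{RI}) = 2\,\mathrm{RI} - 1 .
$$

The adjusted Rand index additionally corrects $\mathrm{RI}$ for chance agreement, so it needs the full contingency table and not just $A$.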

Given that the definitions of these measures are pretty simple, I think it's more trouble than it's worth to write and maintain interfaces for the existing code, which would also require making Clustering.jl a (conditional) dependency. I therefore propose new implementations here. The vanilla Rand index would make a great start.
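To give a sense of how little code is involved, here is a minimal sketch of the vanilla Rand index in plain Julia. The names `npairs` and `rand_index_sketch` are hypothetical and not part of StatisticalMeasures.jl or Clustering.jl. The sketch accepts arbitrary hashable labels (not only integers), does no special handling of `missing`, and is not wrapped as a measure.

```julia
# Hypothetical helper: number of unordered pairs among k items.
npairs(k::Integer) = k * (k - 1) ÷ 2

# Hypothetical sketch of the (unadjusted) Rand index: the fraction of
# unordered observation pairs on which the two labelings agree, i.e. the
# pair is in the same cluster in both or in different clusters in both.
function rand_index_sketch(yhat::AbstractVector, y::AbstractVector)
    length(yhat) == length(y) ||
        throw(DimensionMismatch("label vectors must have equal length"))
    n = length(y)
    n > 1 || throw(ArgumentError("need at least two observations"))

    # Contingency counts keyed on labels, so labels need not be integers.
    joint = Dict{Any,Int}()   # counts per (yhat label, y label) cell
    rows = Dict{Any,Int}()    # counts per yhat label
    cols = Dict{Any,Int}()    # counts per y label
    for (a, b) in zip(yhat, y)
        joint[(a, b)] = get(joint, (a, b), 0) + 1
        rows[a] = get(rows, a, 0) + 1
        cols[b] = get(cols, b, 0) + 1
    end

    together_both = sum(npairs, values(joint))
    together_yhat = sum(npairs, values(rows))
    together_y = sum(npairs, values(cols))
    total = npairs(n)

    # Inclusion-exclusion: pairs separated in both labelings.
    apart_both = total - together_yhat - together_y + together_both

    return (together_both + apart_both) / total
end
```

Because only co-membership of pairs matters, relabelling the clusters leaves the result unchanged: `rand_index_sketch([1, 1, 2, 2], ["a", "a", "b", "b"])` gives `1.0`.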

Here's what traits would look like for these measures:

consumes_multiple_observations = true
kind_of_proxy = LearnAPI.LabelAmbiguous()
observation_scitype = Union{Missing, ScientificTypesBase.Finite}
orientation = StatisticalMeasuresBase.Score() # all except variation of information
orientation = StatisticalMeasuresBase.Loss() # variation of information
human_name = ... <string>

For traits not listed above, the fallbacks suffice.
