-
Notifications
You must be signed in to change notification settings - Fork 3
Expand file tree
/
Copy pathREADME.qmd
More file actions
122 lines (97 loc) · 4.78 KB
/
Copy pathREADME.qmd
File metadata and controls
122 lines (97 loc) · 4.78 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
---
title: "aricode"
subtitle: "fast computations of clustering comparison measures"
format: gfm
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
fig.path = "man/figures/"
)
```
<!-- badges: start -->
[](https://github.com/jchiquet/aricode/actions/workflows/R-CMD-check.yaml)
[](https://CRAN.R-project.org/package=aricode)
[](https://codecov.io/gh/jchiquet/aricode)
[](https://www.tidyverse.org/lifecycle/#stable)
[](https://github.com/jchiquet/aricode/commits/master)
<!-- badges: end -->
A package for efficient computations of standard clustering comparison measures
## Installation
Stable version on the [CRAN](https://cran.rstudio.com/web/packages/aricode/).
```{r install_cran, eval = FALSE}
install.packages("aricode")
```
The development version is available via:
```{r install_github, eval = FALSE}
devtools::install_github("jchiquet/aricode")
```
## Description
Computation of measures for clustering comparison (ARI, AMI, NID and even the $\chi^2$ distance) are usually based on the contingency table. Traditional implementations (e.g., function `adjustedRandIndex` of package **mclust**) are in $\Omega(n + u v)$ where
- $n$ is the size of the vectors the classifications of which are to be compared,
- $u$ and $v$ are the respective number of classes in each vectors.
In **aricode** we propose an implementation, based on radix sort, that is in $\Theta(n)$ in time and space.
Importantly, the complexity does not depend on $u$ and $v$.
Our implementation of the ARI for instance is one or two orders of magnitude faster than some standard implementation in `R`.
## Available measures and functions
The functions included in `aricode` are:
- `(A)RI`: computes the (adjusted) Rand index
- `MARI`: computes the modified adjusted rand index as defined in [Sundqvist et al, 2023](https://doi.org/10.1007/s00180-022-01230-7)
- `NID`: computes the normalized information distance
- `NMI`: computes the normalized mutual information
- `NVI`: computes the the normalized variation information
- `AMI`: computes the adjusted mutual information
- `Chi2`: computes the Chi-square statistics
- `Frobenius` compute the Frobenius norm between two classification as defined in [Arlot et al, 2019](https://jmlr.org/papers/v20/16-155.html)
- `entropy`: computes the conditional and joint entropies
- `compare_clustering`: computes all clustering comparison measures at once
- `sort_pairs`: radix sort for pairs of elements
## Timings
Here are some timings to compare the cost of computing the adjusted Rand Index with **aricode** or with the commonly used function `adjustedRandIndex` of the *mclust* package: the cost of the latter can be prohibitive for large vectors:
```{r timings_function, echo=FALSE, message=FALSE, warning=FALSE}
library(aricode)
library(mclust)
library(ggplot2)
time.aricode <- function(times, c1, c2) {
replicate(times, system.time(ARI(c1, c2))[3])
}
time.mclust <- function(times, c1, c2) {
replicate(times, system.time(mclust::adjustedRandIndex(c1, c2))[3])
}
time.method <- function(times, c1, c2, n) {
rbind(
data.frame(time = time.aricode(times, c1, c2), expr = "aricode", n = n),
data.frame(time = time.mclust(times, c1, c2), expr = "mclust", n = n)
)
}
# with similar classif, number of classes grows with n
sim.timings <- function(n, times = 10) {
c1 <- sample(1:(n / 200), n, replace = TRUE)
c2 <- c1
i_change <- sample(1:n, n / 50, replace = FALSE)
c2[i_change] <- c2[rev(i_change)]
out <- time.method(times, c1, c2, n)
data.frame(time = out$time, method = out$expr, n = n)
}
```
```{r timings_run, echo=FALSE, message=FALSE, warning=FALSE, cache=TRUE}
# with similar classif, number of classes grows with n
ns <- sort(c(200 * 2^(3:14), 150 * 2^(3:15)))
timings <- do.call("rbind", lapply(ns, sim.timings))
```
```{r timings_plot, echo=FALSE, message=FALSE, warning=FALSE}
p.timings <- ggplot(timings, aes(x = n, y = time, colour = method)) +
geom_smooth(data = dplyr::filter(timings, n > 1e4), method = "lm") +
geom_point(size = 0.25, alpha = 0.9) +
labs(y = "time (sec.)") +
scale_x_log10(
breaks = scales::trans_breaks("log10", function(x) 10^x),
labels = scales::trans_format("log10", scales::math_format(10^.x))
) +
scale_y_log10(
breaks = scales::trans_breaks("log10", function(x) 10^x),
labels = scales::trans_format("log10", scales::math_format(10^.x))
) +
annotation_logticks()
p.timings + ggtitle("number of classes grows with n") + theme_bw()
```