
The assumed matrix layout for correlation is unintuitive #81

Open

@multimeric

The docs for the correlation methods say:

Let (r, o) be the shape of M:

  • r is the number of random variables;
  • o is the number of observations we have collected for each random variable.

What this implicitly says is that M should be a matrix with r rows, corresponding to random variables, and o columns, corresponding to observations. We know this because ndarray has an explicit definition of rows and columns: the first axis is the row axis and the second axis is the column axis (see, for example, the nrows and ncols methods).
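For concreteness, here is a minimal sketch of the documented layout, assuming ndarray-stats' CorrelationExt (the exact return type of cov has varied between versions; recent releases return a Result, hence the unwrap):

```rust
use ndarray::array;
use ndarray_stats::CorrelationExt;

fn main() {
    // The layout the docs assume: shape (r, o) = (2, 4),
    // i.e. 2 random variables as rows, 4 observations as columns.
    let m = array![
        [1., 2., 3., 4.], // variable 1
        [5., 6., 7., 8.], // variable 2
    ];

    // ddof = 1. gives the sample covariance (divide by o - 1).
    let c = m.cov(1.).unwrap();
    assert_eq!(c.dim(), (2, 2)); // one row/column per variable
    println!("{}", c);
}
```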

However, I find this assumption counter-intuitive. The convention in my experience is the "tidy" layout, in which each row corresponds to an observation and each column to a variable. I refer here to Hadley Wickham's tidy data work, and this figure (e.g. here):

[figure: tidy data layout, with each column a variable and each row an observation]

Secondly, this is also how R works:

> mat
     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8
> nrow(mat)
[1] 4
> ncol(mat)
[1] 2
> cov(mat)
         [,1]     [,2]
[1,] 1.666667 1.666667
[2,] 1.666667 1.666667

Thirdly, in terms of the Rust data science ecosystem, note that polars (as far as I know, the best-supported data frame library in Rust) outputs matrices with the same tidy assumption. If you create a DataFrame with 2 series (which correspond to variables) and 3 rows, and run .to_ndarray(), you will get a (3, 2) ndarray. If you then call .cov() on it, you will get something that is not the covariance matrix you are after, as shown in the sketch below.
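Here is a sketch of that failure mode, using the (3, 2) shape from the polars example (the DataFrame construction and .to_ndarray() call are omitted; the literal array below stands in for what a tidy-layout frame would convert to):

```rust
use ndarray::array;
use ndarray_stats::CorrelationExt;

fn main() {
    // Shape (3, 2): observations as rows, variables as columns,
    // i.e. the layout a tidy DataFrame converts to.
    let tidy = array![
        [1., 5.],
        [2., 6.],
        [3., 7.],
    ];

    // Calling cov directly treats each *row* as a variable and
    // silently yields a 3x3 matrix -- not the covariance you wanted.
    let wrong = tidy.cov(1.).unwrap();
    assert_eq!(wrong.dim(), (3, 3));

    // You have to remember to transpose first to get the 2x2 matrix.
    let right = tidy.t().cov(1.).unwrap();
    assert_eq!(right.dim(), (2, 2));
}
```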

One argument in defence of the current behaviour is numpy.cov, which makes the same assumption; its documentation describes the input as:

A 1-D or 2-D array containing multiple variables and observations. Each row of m represents a variable, and each column a single observation of all those variables.

My suggestion is therefore to consider reversing the assumed dimensions for these methods in the next major (breaking) release. I realise that calling .t() is not difficult, but forgetting to do so produces a valid-looking matrix that can flow into downstream code without the user realising it is not the correct covariance matrix. This happened to me, and I'd like to spare other users the same issue.
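In the meantime, the tidy-layout behaviour is easy to approximate with a small wrapper; cov_tidy below is a hypothetical helper name, not part of any crate, and it assumes the current API where cov returns a Result:

```rust
use ndarray::{Array2, ArrayBase, Data, Ix2};
use ndarray_stats::errors::EmptyInput;
use ndarray_stats::CorrelationExt;

/// Hypothetical helper: covariance for a tidy-layout matrix
/// (rows = observations, columns = variables).
fn cov_tidy<S>(m: &ArrayBase<S, Ix2>, ddof: f64) -> Result<Array2<f64>, EmptyInput>
where
    S: Data<Elem = f64>,
{
    // Transpose so variables become rows, as the current API expects.
    m.t().cov(ddof)
}
```

Of course, this only documents the intended layout at one call site; it does not remove the footgun for users who call cov directly.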
