Description
The docs for the correlation methods say:
Let (r, o) be the shape of M:
- r is the number of random variables;
- o is the number of observations we have collected for each random variable.
What this implicitly says is that "M should be a matrix with r rows, corresponding to random variables, and o columns, corresponding to observations". We know this because ndarray has an explicit definition of rows and columns, whereby the first axis refers to the rows and the second axis is called the column axis. For example, refer to the nrows and ncols functions.
However, I find this assumption counter-intuitive. In my experience the convention is the "tidy" layout, in which each row corresponds to an observation and each column corresponds to a variable. I refer here to Hadley Wickham's work and its figure of the tidy layout (e.g. here).
Also this is how R works:
> mat
     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8
> nrow(mat)
[1] 4
> ncol(mat)
[1] 2
> cov(mat)
         [,1]     [,2]
[1,] 1.666667 1.666667
[2,] 1.666667 1.666667
Thirdly, in terms of the Rust data science ecosystem, note that polars (as far as I know, the best-supported data frame library in Rust) outputs matrices with the same assumption. If you create a DataFrame with 2 series (which correspond to variables) and 3 rows, and run .to_ndarray(), you will get a (3, 2) ndarray. Then, when you call .cov() on it, the 3 rows are treated as the variables, so you get a 3 x 3 matrix rather than the 2 x 2 covariance matrix that you are after.
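To make the mismatch concrete, here is a minimal std-only sketch. It is not the ndarray-stats implementation: `cov_rows_as_variables`, `transpose` and `tidy_example` are hypothetical helpers written for this illustration, operating on `Vec<Vec<f64>>` and using the sample covariance (divisor n - 1) under the rows-are-variables convention quoted from the docs. The input is the 4 x 2 matrix from the R session above.

```rust
/// Sample covariance (divisor n - 1), treating each ROW of `m` as one
/// random variable and each column as one observation.
/// Illustration only; not the ndarray-stats implementation.
fn cov_rows_as_variables(m: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n_obs = m.first().map_or(0, |row| row.len());
    assert!(n_obs >= 2, "need at least two observations per variable");
    assert!(m.iter().all(|row| row.len() == n_obs), "ragged input");

    // Mean of each variable (one per row).
    let means: Vec<f64> = m
        .iter()
        .map(|row| row.iter().sum::<f64>() / n_obs as f64)
        .collect();

    let r = m.len();
    let mut cov = vec![vec![0.0; r]; r];
    for i in 0..r {
        for j in 0..r {
            // Sum of products of deviations over all observations.
            let s: f64 = (0..n_obs)
                .map(|k| (m[i][k] - means[i]) * (m[j][k] - means[j]))
                .sum();
            cov[i][j] = s / (n_obs as f64 - 1.0);
        }
    }
    cov
}

/// Swap rows and columns of a rectangular matrix.
fn transpose(m: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n_cols = m.first().map_or(0, |row| row.len());
    (0..n_cols)
        .map(|c| m.iter().map(|row| row[c]).collect())
        .collect()
}

/// The 4 x 2 matrix from the R session: rows are observations,
/// columns are variables ("tidy" layout).
fn tidy_example() -> Vec<Vec<f64>> {
    vec![
        vec![1.0, 5.0],
        vec![2.0, 6.0],
        vec![3.0, 7.0],
        vec![4.0, 8.0],
    ]
}
```

Passing `tidy_example()` directly to `cov_rows_as_variables` returns a perfectly valid 4 x 4 matrix (every entry is 8.0 here), with no error to signal the wrong orientation. Passing `transpose(&tidy_example())` returns the 2 x 2 matrix with every entry equal to 5/3, which matches the 1.666667 reported by R.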
One argument in defence of the current method is numpy.cov, which makes the same assumption, as it takes:
A 1-D or 2-D array containing multiple variables and observations. Each row of m represents a variable, and each column a single observation of all those variables.
My suggestion is therefore to consider reversing the assumed dimensions for these methods in the next major (breaking) release. I realise that calling .t() is not difficult, but forgetting to do so results in a valid matrix that may continue into downstream code without the user realising that it is not the correct covariance matrix. This happened to me, and I'd like to spare other users from this issue.
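In the meantime, a caller who knows how many variables to expect could guard against the forgotten transpose with a shape check before computing the covariance. The sketch below is only an assumption about what such a user-side guard might look like: `ensure_variables_in_rows` is a hypothetical helper, not part of ndarray or ndarray-stats, and it takes the shape as a plain `(rows, cols)` pair (for example, built from nrows and ncols).

```rust
/// Hypothetical guard: confirm that a matrix of shape (rows, cols) has its
/// variables along the first axis, i.e. follows the (r, o) layout, before it
/// is handed to a routine that assumes that layout.
fn ensure_variables_in_rows(
    shape: (usize, usize),
    n_variables: usize,
) -> Result<(), String> {
    let (rows, cols) = shape;
    if rows == n_variables {
        // Caveat: a square matrix passes this check in either orientation.
        Ok(())
    } else if cols == n_variables {
        Err(format!(
            "shape ({}, {}) looks like observations in rows; transpose first",
            rows, cols
        ))
    } else {
        Err(format!(
            "neither axis of shape ({}, {}) has length {}",
            rows, cols, n_variables
        ))
    }
}
```

With 2 variables, a shape of (2, 3) passes, while the (3, 2) shape from the polars example is rejected with a hint to transpose. The check cannot help when the number of observations happens to equal the number of variables, which is one reason a library-level convention matters more than a guard like this.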