
The assumed matrix layout for correlation is unintuitive #81

Open

@multimeric

The docs for the correlation methods say:

Let (r, o) be the shape of M:

  • r is the number of random variables;
  • o is the number of observations we have collected for each random variable.

What this implicitly says is that M should be a matrix with r rows, corresponding to random variables, and o columns, corresponding to observations. We know this because ndarray has an explicit definition of rows and columns: the first axis is the row axis and the second axis is the column axis (see, for example, the nrows and ncols methods).
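For concreteness, here is a minimal sketch of the documented layout, assuming ndarray-stats' CorrelationExt (the exact return type of cov has varied between versions; recent releases return a Result, hence the unwrap):

```rust
use ndarray::array;
use ndarray_stats::CorrelationExt;

fn main() {
    // The layout the docs assume: shape (r, o) = (2, 4),
    // i.e. 2 random variables as rows, 4 observations as columns.
    let m = array![
        [1., 2., 3., 4.], // variable 1
        [5., 6., 7., 8.], // variable 2
    ];

    // ddof = 1. gives the sample covariance (divide by o - 1).
    let c = m.cov(1.).unwrap();
    assert_eq!(c.dim(), (2, 2)); // one row/column per variable
    println!("{}", c);
}
```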

However, I find this assumption counter-intuitive. The convention in my experience is the "tidy" layout, in which each row corresponds to an observation and each column to a variable. I refer here to Hadley Wickham's tidy data work, and this figure (e.g. here):

[figure: tidy data layout, with each column a variable and each row an observation]

Secondly, this is also how R works:

> mat
     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8
> nrow(mat)
[1] 4
> ncol(mat)
[1] 2
> cov(mat)
         [,1]     [,2]
[1,] 1.666667 1.666667
[2,] 1.666667 1.666667

Thirdly, in terms of the Rust data science ecosystem, note that polars (as far as I know, the best-supported data frame library in Rust) outputs matrices with the same tidy assumption. If you create a DataFrame with 2 series (which correspond to variables) and 3 rows, and run .to_ndarray(), you will get a (3, 2) ndarray. If you then call .cov() on it, you will get something that is not the covariance matrix you are after, as shown in the sketch below.
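Here is a sketch of that failure mode, using the (3, 2) shape from the polars example (the DataFrame construction and .to_ndarray() call are omitted; the literal array below stands in for what a tidy-layout frame would convert to):

```rust
use ndarray::array;
use ndarray_stats::CorrelationExt;

fn main() {
    // Shape (3, 2): observations as rows, variables as columns,
    // i.e. the layout a tidy DataFrame converts to.
    let tidy = array![
        [1., 5.],
        [2., 6.],
        [3., 7.],
    ];

    // Calling cov directly treats each *row* as a variable and
    // silently yields a 3x3 matrix -- not the covariance you wanted.
    let wrong = tidy.cov(1.).unwrap();
    assert_eq!(wrong.dim(), (3, 3));

    // You have to remember to transpose first to get the 2x2 matrix.
    let right = tidy.t().cov(1.).unwrap();
    assert_eq!(right.dim(), (2, 2));
}
```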

One argument in defence of the current behaviour is numpy.cov, which makes the same assumption; its documentation describes the input as:

A 1-D or 2-D array containing multiple variables and observations. Each row of m represents a variable, and each column a single observation of all those variables.

My suggestion is therefore to consider reversing the assumed dimensions for these methods in the next major (breaking) release. I realise that calling .t() is not difficult, but forgetting to do so produces a valid-looking matrix that can flow into downstream code without the user realising it is not the correct covariance matrix. This happened to me, and I'd like to spare other users the same issue.
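In the meantime, the tidy-layout behaviour is easy to approximate with a small wrapper; cov_tidy below is a hypothetical helper name, not part of any crate, and it assumes the current API where cov returns a Result:

```rust
use ndarray::{Array2, ArrayBase, Data, Ix2};
use ndarray_stats::errors::EmptyInput;
use ndarray_stats::CorrelationExt;

/// Hypothetical helper: covariance for a tidy-layout matrix
/// (rows = observations, columns = variables).
fn cov_tidy<S>(m: &ArrayBase<S, Ix2>, ddof: f64) -> Result<Array2<f64>, EmptyInput>
where
    S: Data<Elem = f64>,
{
    // Transpose so variables become rows, as the current API expects.
    m.t().cov(ddof)
}
```

Of course, this only documents the intended layout at one call site; it does not remove the footgun for users who call cov directly.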
