Description
Implement one or more forms of the matrix normal distribution. The general distribution is of the form

```
matrix_normal(matrix[N, P] y | matrix[N, P] mu, cov_matrix[N, N] SigmaRow, cov_matrix[P, P] SigmaCol),
```

where `SigmaRow` and `SigmaCol` are positive definite covariance matrices. The covariance matrix `SigmaRow` controls the covariance of the rows of `y`, whereas `SigmaCol` controls the covariance of the columns.
There are three design decisions about how `SigmaRow` and `SigmaCol` are parameterized, which are whether to use

- diagonal or dense matrices,
- covariance or precision (inverse covariance), and
- Cholesky factors or full matrices.
There is also the issue of how to resolve the non-identifiability of scale between the two matrices, because

```
matrix_normal(y | mu, SigmaRow, SigmaCol) == matrix_normal(y | mu, c * SigmaRow, (1 / c) * SigmaCol).
```
The log density can be expressed as follows (note that the determinant of the `N x N` row covariance picks up the factor `P` and vice versa), but this is not how we'd compute it.

```
log matrix_normal(y | mu, SigmaRow, SigmaCol)
  = - 1 / 2 * tr(inv(SigmaCol) * (y - mu)' * inv(SigmaRow) * (y - mu))
    - P / 2 * log det(SigmaRow)
    - N / 2 * log det(SigmaCol)
    - N * P / 2 * log(2 * pi())
```
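For reference, a naive transcription of that formula into a Stan user-defined function might look like the following. The function name and signature here are just illustrative, and this version would only be useful for testing more efficient implementations against.

```stan
functions {
  // Naive reference implementation of the matrix normal log density.
  // For testing only: it inverts and takes determinants of both
  // covariance matrices directly, which is slow and less stable.
  real matrix_normal_naive_lpdf(matrix y, matrix mu,
                                matrix SigmaRow, matrix SigmaCol) {
    int N = rows(y);
    int P = cols(y);
    return -0.5 * trace(inverse_spd(SigmaCol) * (y - mu)'
                        * inverse_spd(SigmaRow) * (y - mu))
           - 0.5 * P * log_determinant(SigmaRow)
           - 0.5 * N * log_determinant(SigmaCol)
           - 0.5 * N * P * log(2 * pi());
  }
}
```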
If `LRow` and `LCol` are Cholesky factors of `SigmaRow` and `SigmaCol` so that, e.g., `SigmaRow = LRow * LRow'`, and `z` is an `N x P` matrix with entries `z[i, j] ~ normal(0, 1)`, then

```
y = mu + LRow * z * LCol' ~ matrix_normal(mu, SigmaRow, SigmaCol)
```
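A minimal Stan sketch of an RNG built on that representation; the name `matrix_normal_cholesky_rng` and the signature are my choices, not a settled design:

```stan
functions {
  // Draws one N x P variate given the mean and the Cholesky factors
  // LRow, LCol of SigmaRow and SigmaCol. Name and signature are
  // illustrative, not an existing Stan API.
  matrix matrix_normal_cholesky_rng(matrix mu, matrix LRow, matrix LCol) {
    int N = rows(mu);
    int P = cols(mu);
    matrix[N, P] z;
    for (j in 1:P)
      for (i in 1:N)
        z[i, j] = normal_rng(0, 1);
    return mu + LRow * z * LCol';
  }
}
```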
As shown on the Wikipedia page for matrix normal, we can evaluate that trace term as

```
tr(inv(SigmaCol) * (y - mu)' * inv(SigmaRow) * (y - mu))
  = (vec(y) - vec(mu))' * inv(SigmaCol kronecker SigmaRow) * (vec(y) - vec(mu))
```

where `vec(y)` is the column-major flattening of `y`. We then want to use the property of Kronecker products that

```
inv(SigmaCol kronecker SigmaRow) = inv(SigmaCol) kronecker inv(SigmaRow).
```
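This identity also gives a way to test any implementation against `multi_normal_lpdf` on the flattened matrix. Stan has no built-in Kronecker product, so a check would need a helper like the following (the name is illustrative); this is only practical for small `N` and `P`:

```stan
functions {
  // Kronecker product A (x) B, for testing the vec identity on small
  // matrices; not part of Stan's built-in library.
  matrix kronecker_prod(matrix A, matrix B) {
    matrix[rows(A) * rows(B), cols(A) * cols(B)] C;
    int m = rows(B);
    int n = cols(B);
    for (i in 1:rows(A))
      for (j in 1:cols(A))
        C[(i - 1) * m + 1 : i * m, (j - 1) * n + 1 : j * n] = A[i, j] * B;
    return C;
  }
}
```

Then `matrix_normal_naive_lpdf(y | mu, SigmaRow, SigmaCol)` should match `multi_normal_lpdf(to_vector(y) | to_vector(mu), kronecker_prod(SigmaCol, SigmaRow))`, since `to_vector` flattens column-major.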
@bgoodri suggested going further with Cholesky factorizations `LRow` and `LCol`, using the rule for the inverse of a product of square matrices, `inv(A * B) = inv(B) * inv(A)`, and the trace-of-a-product rule, `tr(A' * B) = vec(A)' * vec(B)`, to derive

```
tr(inv(SigmaCol) * (y - mu)' * inv(SigmaRow) * (y - mu))
  = tr(inv(LCol * LCol') * (y - mu)' * inv(LRow * LRow') * (y - mu))
  = tr(inv(LCol') * inv(LCol) * (y - mu)' * inv(LRow') * inv(LRow) * (y - mu))
  = tr((LCol' \ LCol \ (y - mu)') * (LRow' \ LRow \ (y - mu)))
  = dot_product(vec((LCol' \ LCol \ (y - mu)')'), vec(LRow' \ LRow \ (y - mu)))
```

where the last step applies the trace rule with `A = (LCol' \ LCol \ (y - mu)')'`, so that both arguments to `vec` are `N x P`.
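Putting the pieces together, a sketch of the Cholesky-parameterized density in Stan might look like this. The name `matrix_normal_cholesky_lpdf` is my choice, and the trace term is computed as a sum of squares of a single solved matrix, which is a slight rearrangement of the dot product above:

```stan
functions {
  // Matrix normal log density parameterized by Cholesky factors.
  // Uses only triangular solves and log-diagonals: no explicit
  // inverses or determinants. Name and signature are illustrative.
  real matrix_normal_cholesky_lpdf(matrix y, matrix mu,
                                   matrix LRow, matrix LCol) {
    int N = rows(y);
    int P = cols(y);
    // W = inv(LRow) * (y - mu)
    matrix[N, P] W = mdivide_left_tri_low(LRow, y - mu);
    // Vt = inv(LCol) * W', so dot_self(to_vector(Vt)) is the trace term
    matrix[P, N] Vt = mdivide_left_tri_low(LCol, W');
    // log det Sigma = 2 * sum(log(diagonal(L))) when Sigma = L * L'
    return -0.5 * dot_self(to_vector(Vt))
           - P * sum(log(diagonal(LRow)))
           - N * sum(log(diagonal(LCol)))
           - 0.5 * N * P * log(2 * pi());
  }
}
```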
Identifiability will be left up to the user. One way to achieve it is to define one of the matrices to be a correlation matrix with a simplex of scales. That's most efficient when done with a simple product with a Cholesky factor rather than a quadratic form for the full correlation matrix; see the sketch below. A second, more symmetric way to achieve a proper posterior is to decompose both covariance matrices into scales times Cholesky factors and put priors on both sets of scales.
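A minimal sketch of the first approach, assuming the column covariance is the one pinned down (all names here are illustrative):

```stan
data {
  int<lower=1> N;
  int<lower=1> P;
}
parameters {
  cholesky_factor_cov[N] LRow;       // rows carry the overall scale
  cholesky_factor_corr[P] LColCorr;  // columns: correlation only
  simplex[P] col_scales;             // summing to 1 pins the column scale
}
transformed parameters {
  // Cholesky factor of SigmaCol via a cheap diagonal pre-multiply
  // rather than forming the full correlation matrix.
  matrix[P, P] LCol = diag_pre_multiply(col_scales, LColCorr);
}
```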
There is also a matrix Student-t distribution, which should just follow whatever matrix normal does.