Description
Summary:
There are several steps that could be taken to improve memory and speed efficiency of csr_matrix_times_vector
in reverse mode.
Description:
- Introduce a typedef:
  typedef Eigen::Matrix<result_t, -1, 1> vector_t;
- Use Zero to avoid double allocation of var. See Eigen's advanced initialization docs for more. This:
  Eigen::Matrix<result_t, Eigen::Dynamic, 1> result(m);
  result.setZero();
  can be:
  vector_t result = vector_t::Zero(m);
- Remove the redundant check_range:
  for (int nze = u[row] - stan::error_index::value; nze < row_end_in_w;
       ++nze, ++i) {
    check_range("csr_matrix_times_vector", "j", n, v[nze]);
  because all of v was already checked for the same range:
  for (unsigned int i = 0; i < v.size(); ++i)
    check_range("csr_matrix_times_vector", "v[]", n, v[i]);
- We don't need the stan:: namespace within Stan, so we can replace stan::error_index::value with error_index::value.
- We can then refactor this:
Eigen::Matrix<result_t, Eigen::Dynamic, 1> b_sub(idx);
b_sub.setZero();
for (int nze = u[row] - stan::error_index::value; nze < row_end_in_w;
++nze, ++i) {
check_range("csr_matrix_times_vector", "j", n, v[nze]);
b_sub.coeffRef(i) = b.coeffRef(v[nze] - stan::error_index::value);
}
using the typedef in the zero init and removing the redundant check to get
vector_t b_sub = vector_t::Zero(idx);
for (int nze = u[row] - error_index::value; nze < row_end_in_w; ++nze, ++i)
b_sub.coeffRef(i) = b.coeffRef(v[nze] - error_index::value);
Given that it's pulling out a subset of coeffs at this point, there's no way to make this more efficient.
- Instead of defining w_sub, the result of segment should just be passed to dot_product.
- Now, pulling a version of this up into rev, what you could do is, rather than building up w_sub and b_sub, directly build up the memory for the vari for the dot product in place. That is, allocate the vari and copy the vari from the coefficient vector in directly, and do the same for the matrix.
- An even more serious reverse-mode memory optimization would be to store the operands in a single one of the outputs, with a chain() method that lazily computes all the gradients in the reverse pass. This removes all the memory for the arcs in the expression graph (though it may not improve speed, and may even hurt it, because the non-local memory penalty of pulling out the coefficients gets paid twice).
Current Version:
v2.17.0