Skip to content

improve csr_matrix_times_vector efficiency #757

Open
@bob-carpenter

Description

@bob-carpenter

Summary:

There are several steps that could be taken to improve memory and speed efficiency of csr_matrix_times_vector in reverse mode.

Description:

  • Introduce typedef
typedef Eigen::Matrix<result_t, -1, 1> vector_t;
Eigen::Matrix<result_t, Eigen::Dynamic, 1> result(m);
result.setZero();

can be

vector_t result = vector_t::Zero(m);
  • Remove the redundant check_range:
for (int nze = u[row] - stan::error_index::value; nze < row_end_in_w;
         ++nze, ++i) {
      check_range("csr_matrix_times_vector", "j", n, v[nze]);

because all of v was previously checked for the same range previously

  for (unsigned int i = 0; i < v.size(); ++i)
    check_range("csr_matrix_times_vector", "v[]", n, v[i]);
  • We don't need the stan:: namespace within Stan, so we can replace stan::error_index::value with error_index::value.

  • We can then refactor this

Eigen::Matrix<result_t, Eigen::Dynamic, 1> b_sub(idx);
    b_sub.setZero();
    for (int nze = u[row] - stan::error_index::value; nze < row_end_in_w;
         ++nze, ++i) {
      check_range("csr_matrix_times_vector", "j", n, v[nze]);
      b_sub.coeffRef(i) = b.coeffRef(v[nze] - stan::error_index::value);
    }

using the typedef in the zero init and removing the redundant check to get

vector_t b_sub = vector_t::Zero(idx);
for (int nze = u[row] - error_index::value; nze < row_end_in_w; ++nze, ++i)
  b_sub.coeffRef(i) = b.coeffRef(v[nze] - error_index::value);

Given that it's pulling out a subset of coeffs at this point, there's no way to make this more efficient.

  • Instead of defining w_sub, the result of segment should just be passed to dot_product

  • Now, pulling a version of this up into rev, what you could do is rather than building up w_sub and b_sub, directly build up the memory for the vari for the dot product in place. That is, allocate the vari and copy the vari from the coefficient vector in directly and same for matrix

  • An even more serious reverse-mode memory optimization would be to store the operands in a single one of the outputs with a chain method that lazily computes all the gradients in the reverse pass; this removes all the memory of the arcs in the expression graph (though it may not improve speed and may even hurt it because the non-local memory penalty in pulling out the coefficients gets paid twice)

Current Version:

v2.17.0

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions