Description
Summary:
There are several steps that could be taken to improve memory and speed efficiency of csr_matrix_times_vector
in reverse mode.
Description:
- Introduce a typedef:
  typedef Eigen::Matrix<result_t, -1, 1> vector_t;
- Use Zero to avoid double allocation of var. See Eigen's advanced initialization docs for more. This:
  Eigen::Matrix<result_t, Eigen::Dynamic, 1> result(m);
  result.setZero();
  can be:
  vector_t result = vector_t::Zero(m);
- Remove the redundant check_range:
  for (int nze = u[row] - stan::error_index::value; nze < row_end_in_w;
       ++nze, ++i) {
    check_range("csr_matrix_times_vector", "j", n, v[nze]);
  because all of v was already checked for the same range:
  for (unsigned int i = 0; i < v.size(); ++i)
    check_range("csr_matrix_times_vector", "v[]", n, v[i]);
- We don't need the stan:: namespace within Stan, so we can replace stan::error_index::value with error_index::value.
- We can then refactor this:
Eigen::Matrix<result_t, Eigen::Dynamic, 1> b_sub(idx);
b_sub.setZero();
for (int nze = u[row] - stan::error_index::value; nze < row_end_in_w;
++nze, ++i) {
check_range("csr_matrix_times_vector", "j", n, v[nze]);
b_sub.coeffRef(i) = b.coeffRef(v[nze] - stan::error_index::value);
}
using the typedef in the zero init and removing the redundant check to get
vector_t b_sub = vector_t::Zero(idx);
for (int nze = u[row] - error_index::value; nze < row_end_in_w; ++nze, ++i)
b_sub.coeffRef(i) = b.coeffRef(v[nze] - error_index::value);
Given that it's pulling out a subset of coeffs at this point, there's no way to make this more efficient.
- Instead of defining w_sub, the result of segment should just be passed to dot_product.
- Now, pulling a version of this up into rev, what you could do is, rather than building up w_sub and b_sub, directly build up the memory for the vari for the dot product in place. That is, allocate the vari and copy the vari from the coefficient vector in directly, and do the same for the matrix.
- An even more serious reverse-mode memory optimization would be to store the operands in a single one of the outputs, with a chain() method that lazily computes all the gradients in the reverse pass. This removes all the memory for the arcs in the expression graph (though it may not improve speed, and may even hurt it, because the non-local memory penalty of pulling out the coefficients gets paid twice).
Current Version:
v2.17.0