[rocsparse] Fix double-free / use-after-free when copying a csritsv mat_info #7910

Open
kliegeois wants to merge 5 commits into ROCm:develop from kliegeois:csritsv-doublefree

@kliegeois
Contributor

Summary

rocsparse_csritsv_analysis allocates device buffers that are owned by the rocsparse_csritsv_info and freed in its destructor:

  • ptr_end — the submatrix row-offset buffer (allocated when is_submatrix == true), and
  • the zero-pivot position metadata held by the position_t/pivot_info_t base.

rocsparse_copy_mat_info did not copy these correctly, which made a copied rocsparse_mat_info unsafe to use and unsafe to destroy. This PR fixes the copy semantics, adds a regression test that exercises the exact copy/destroy lifetime, and repairs the --memstat build so the bug can be detected deterministically.

Changes

  • rocsparse_csritsv_info.cpp: copy() now deep-copies the owned ptr_end device buffer when the source owns it (is_submatrix == true), and frees any buffer the destination already owns before overwriting it. When ptr_end is merely an alias into csr_row_ptr + 1 (not owned), the shallow copy is kept because the destructor will not free it.
  • rocsparse_position_t.cpp: copy_position_async() now creates the destination position buffer using the source's index type. A freshly created destination has a default/invalid index type; reading it produced a buffer with the wrong element size and corrupted any subsequent solve.
  • rocsparse_memstat.hpp — the tracked allocation macros (rocsparse_hipMalloc / MallocAsync / HostMalloc / MallocManaged) now cast the pointer argument to void**. The memstat entry points take void**, but many call sites pass typed T** that only compiled against the templated hipMalloc used in the non-memstat path. This fixes a pre-existing --memstat build break (e.g. csrmv wg_flags, hyb index buffers) and is what lets memstat observe the double-free.
  • testing_csritsv.cpp — new regression sub-test (runs for general matrices that analyze/solve without a pivot): analyze a source info, rocsparse_copy_mat_info into a fresh info, destroy the source, solve with the copy and compare to the reference, then destroy the copy.
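
The ownership rule behind the first change can be sketched in plain C++. This is only an illustration under stated assumptions: `info_sketch`, `copy_from` and `copy_survives_source` are hypothetical names, and host memory (`std::malloc`) stands in for the device allocation so the sketch runs without a GPU. It is not the rocSPARSE implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for the csritsv info object. Host memory replaces
// device memory so the sketch runs without a GPU.
struct info_sketch
{
    bool        is_submatrix = false;   // true: ptr_end is an owned allocation
    int*        ptr_end      = nullptr; // owned buffer, or alias into a row-pointer array
    std::size_t m            = 0;       // number of entries in ptr_end

    info_sketch()                              = default;
    info_sketch(const info_sketch&)            = delete;
    info_sketch& operator=(const info_sketch&) = delete;

    ~info_sketch()
    {
        if(is_submatrix)
            std::free(ptr_end); // only an owned buffer is freed
    }

    // Release what the destination owns, then deep-copy an owned source
    // buffer or keep the alias for a non-owned one.
    void copy_from(const info_sketch& src)
    {
        if(this == &src)
            return;
        if(is_submatrix)
            std::free(ptr_end); // avoid leaking a previously owned buffer
        ptr_end      = nullptr;
        is_submatrix = src.is_submatrix;
        m            = src.m;
        if(src.is_submatrix && m > 0)
        {
            ptr_end = static_cast<int*>(std::malloc(m * sizeof(int)));
            assert(ptr_end != nullptr);
            std::memcpy(ptr_end, src.ptr_end, m * sizeof(int));
        }
        else if(!src.is_submatrix)
        {
            ptr_end = src.ptr_end; // alias: neither destructor frees it
        }
    }
};

// Lifetime exercised by the regression test:
// copy -> destroy source -> use copy -> destroy copy.
bool copy_survives_source()
{
    info_sketch copy;
    {
        info_sketch src;
        src.is_submatrix = true;
        src.m            = 3;
        src.ptr_end      = static_cast<int*>(std::malloc(3 * sizeof(int)));
        assert(src.ptr_end != nullptr);
        src.ptr_end[0] = 2;
        src.ptr_end[1] = 5;
        src.ptr_end[2] = 9;
        copy.copy_from(src);
    } // src is destroyed here and frees only its own buffer
    return copy.ptr_end[0] == 2 && copy.ptr_end[1] == 5 && copy.ptr_end[2] == 9;
}
```

With a shallow `ptr_end = src.ptr_end` in the owned case, the read after the inner scope would touch freed memory and the second destructor would free the same pointer again.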

How the test fails without the fix

The regression test reproduces the precise lifetime that triggered the bug: copy → destroy source → use copy → destroy copy. It surfaces the two defects in two complementary ways.

1. Wrong result (fails in any build)

With the position-copy defect, the copied info's pivot/position buffer is created with the destination's invalid default index type instead of the source's. Solving with the copy then reads garbage metadata and the iterative solve does not produce the reference solution — hy_iterative.near_check(dy, ...) fails (the copy's dy stays effectively zero / wrong). This makes the test fail deterministically in a normal release build, independent of any sanitizer.
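
A minimal sketch of the index-type defect, assuming a hypothetical `index_type_sketch` enum and `position_sketch` struct (neither is a rocSPARSE type). The sketch also assumes that an invalid index type maps to an element size of 0; the actual wrong size in rocSPARSE is not stated in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical index-type tag; "invalid" models a freshly created destination.
enum class index_type_sketch
{
    invalid,
    i32,
    i64
};

// Element size implied by an index type (0 for invalid is an assumption).
std::size_t index_size(index_type_sketch t)
{
    switch(t)
    {
    case index_type_sketch::i32:
        return sizeof(std::int32_t);
    case index_type_sketch::i64:
        return sizeof(std::int64_t);
    default:
        return 0;
    }
}

// Hypothetical position metadata: records the index type and buffer size.
struct position_sketch
{
    index_type_sketch type  = index_type_sketch::invalid;
    std::size_t       bytes = 0;

    void create(index_type_sketch t)
    {
        type  = t;
        bytes = index_size(t);
    }
};

// Defect: the destination buffer is sized from the destination's own
// (still invalid) index type.
void copy_position_buggy(position_sketch& dst, const position_sketch& src)
{
    (void)src;
    dst.create(dst.type);
}

// Fix: the destination buffer is sized from the source's index type.
void copy_position_fixed(position_sketch& dst, const position_sketch& src)
{
    dst.create(src.type);
}
```

In this sketch the buggy copy leaves the destination with an invalid type and a zero-sized buffer, while the fixed copy matches the source.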

2. Double-free / use-after-free (deterministic under --memstat)

With the shallow ptr_end copy, copy_info shares the same device allocation as src_info:

  • rocsparse_destroy_mat_info(src_info) frees that buffer — the copy is now using freed memory (use-after-free) for the solve.
  • rocsparse_destroy_mat_info(copy_info) frees the same pointer a second time (double-free).

In a normal release build this is latent: HIP 7.0's asynchronous hipFree does not report the second free, so the process does not crash and the bug hides. Built with --memstat and run with ROCSPARSE_MEMSTAT=1, rocSPARSE tracks every allocation by address; the second free looks up an address that was already removed from the tracking map and throws "Cannot remove address from the memstat database." The exception escapes the (noexcept) info destructor and aborts:

[ RUN      ] quick/csritsv.level2/f32_r_1_1_0_NT_ND_L_force_auto_1b_csr_nos4
terminate called after throwing an instance of 'rocsparse_status_'
Aborted (core dumped)        # exit code 134

With the fix in place, the same --memstat run is clean:

[----------] 9 tests from quick/csritsv (... ms total)
[  PASSED  ] 9 tests.
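
The detection mechanism described above, tracking allocations by address and failing when an address is removed twice, can be sketched as follows. `tracker_sketch` and its error message are hypothetical and only mimic the behaviour the PR attributes to memstat; host allocations stand in for device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <stdexcept>
#include <unordered_set>

// Hypothetical address tracker: every live allocation is recorded, and
// releasing an address that is not recorded is reported as an error.
class tracker_sketch
{
    std::unordered_set<void*> live_;

public:
    void* allocate(std::size_t bytes)
    {
        void* p = std::malloc(bytes);
        if(p == nullptr)
            throw std::bad_alloc();
        live_.insert(p);
        return p;
    }

    void release(void* p)
    {
        // erase() returns 0 when the address was never tracked or was
        // already removed, which is exactly the double-free case.
        if(live_.erase(p) == 0)
            throw std::runtime_error("address not found in tracking set");
        std::free(p);
    }

    std::size_t live() const
    {
        return live_.size();
    }
};

// Two owners of one allocation: the second release is detected before
// the underlying free is attempted.
bool second_free_is_detected()
{
    tracker_sketch t;
    void*          shared = t.allocate(64);
    t.release(shared); // "source" destroyed
    try
    {
        t.release(shared); // "copy" destroyed: same address again
    }
    catch(const std::runtime_error&)
    {
        return true;
    }
    return false;
}
```

Because the lookup fails before `std::free` runs, the sketch reports the double-free deterministically instead of relying on the allocator to notice it.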

Test plan

  • --memstat build compiles (./install.sh -c --memstat ...).
  • ROCSPARSE_MEMSTAT=1 ./rocsparse-test --gtest_filter='*quick/csritsv*' → 9 PASSED with the fix.
  • Reintroducing the shallow ptr_end copy → same run aborts (exit 134) via the memstat double-free detection, confirming the regression test catches the bug.
  • Release build: the regression sub-test passes with the fix and fails (wrong result) when the position-copy fix is reverted.
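
The first test-plan item depends on the macro cast listed under Changes. The following sketch shows the general pattern only; `tracked_malloc_sketch` and `SKETCH_MALLOC` are hypothetical names, not the rocSPARSE macros, and `std::malloc` stands in for the HIP allocation call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical tracked entry point. Like the memstat entry points described
// in the PR, it takes void** rather than a typed T**.
int tracked_malloc_sketch(void** ptr, std::size_t bytes)
{
    *ptr = std::malloc(bytes);
    return *ptr != nullptr ? 0 : 1;
}

// The cast lives inside the macro, so call sites can keep passing typed
// pointers. Without it, a std::uint32_t** argument would not compile,
// because T** does not convert implicitly to void**.
#define SKETCH_MALLOC(ptr, bytes) tracked_malloc_sketch((void**)(ptr), (bytes))

// A typed call site, similar in shape to the ones the PR mentions.
bool typed_call_site_compiles()
{
    std::uint32_t* flags  = nullptr;
    const int      status = SKETCH_MALLOC(&flags, 4 * sizeof(std::uint32_t));
    if(status != 0)
        return false;
    flags[0]      = 7u;
    const bool ok = (flags[0] == 7u);
    std::free(flags);
    return ok;
}
```

Keeping the cast in the macro means the same call site builds in both configurations, which is the property the `--memstat` build check verifies.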

kliegeois added 5 commits May 30, 2026 01:55
_rocsparse_csritsv_info::copy() did a shallow copy of the device pointer
ptr_end. When is_submatrix is true that buffer is owned by the info
object and freed in its destructor, so the shallow copy left two info
objects owning the same allocation, causing a double-free/use-after-free
when both were destroyed.

Deep-copy the device buffer when the source owns it (is_submatrix), and
free any buffer the destination already owns before overwriting it to
avoid leaks. The non-owning case (ptr_end aliasing csr_row_ptr + 1) keeps
the shallow copy since the destructor does not free it.

Add a regression sub-test in testing_csritsv that copies an analyzed
csritsv mat_info with rocsparse_copy_mat_info, destroys the source, then
solves with the copy and compares against the reference. This exercises
the exact double-free/use-after-free scenario fixed by deep-copying the
csritsv ptr_end device buffer, and uses the copy after the source is
gone so it can only succeed once the copy owns its own allocation.

While writing the test, a second bug in the same copy path surfaced:
position_t::copy_position_async created the destination position buffer
with the destination's default (invalid) index type instead of the
source's, corrupting the copied info so that solving with it returned a
wrong result. Use the source index type. The new test deterministically
fails without this fix and passes with it.

The memstat allocation entry points take a 'void**', but many call sites
pass typed 'T**' pointers (e.g. uint32_t**, rocsparse_int**) that only
compile against the templated hipMalloc used in the non-memstat path. This
broke the --memstat build across several files (csrmv wg_flags, hyb index
buffers, and others).

Cast inside the rocsparse_hipMalloc/MallocAsync/HostMalloc/MallocManaged
macros so both build configurations share identical call sites. This also
makes the csritsv ptr_end double-free deterministically detectable under
ROCSPARSE_MEMSTAT=1.

Update the copyright end-year to 2026 on the files touched in this branch
and reformat the rocsparse_hipMallocAsync macro to satisfy clang-format-14.

- Document the csritsv mat_info double-free/use-after-free fix and the
  --memstat build fix in the rocSPARSE changelog.
- Synchronize the device before freeing a pre-existing ptr_end buffer in
  _rocsparse_csritsv_info::copy(), matching the destructor's HIP 7.0
  asynchronous-free handling.
- Save and restore the handle pointer mode in the csritsv copy regression
  test so it does not leak state to later code.