[rocsparse] Fix double-free / use-after-free when copying a csritsv mat_info#7910
Open
kliegeois wants to merge 5 commits into
Open
[rocsparse] Fix double-free / use-after-free when copying a csritsv mat_info#7910kliegeois wants to merge 5 commits into
kliegeois wants to merge 5 commits into
Conversation
_rocsparse_csritsv_info::copy() did a shallow copy of the device pointer ptr_end. When is_submatrix is true that buffer is owned by the info object and freed in its destructor, so the shallow copy left two info objects owning the same allocation, causing a double-free/use-after-free when both were destroyed. Deep-copy the device buffer when the source owns it (is_submatrix), and free any buffer the destination already owns before overwriting it to avoid leaks. The non-owning case (ptr_end aliasing csr_row_ptr + 1) keeps the shallow copy since the destructor does not free it.
Add a regression sub-test in testing_csritsv that copies an analyzed csritsv mat_info with rocsparse_copy_mat_info, destroys the source, then solves with the copy and compares against the reference. This exercises the exact double-free/use-after-free scenario fixed by deep-copying the csritsv ptr_end device buffer, and uses the copy after the source is gone so it can only succeed once the copy owns its own allocation. While writing the test, a second bug in the same copy path surfaced: position_t::copy_position_async created the destination position buffer with the destination's default (invalid) index type instead of the source's, corrupting the copied info so that solving with it returned a wrong result. Use the source index type. The new test deterministically fails without this fix and passes with it.
The memstat allocation entry points take a 'void**', but many call sites pass typed 'T**' pointers (e.g. uint32_t**, rocsparse_int**) that only compile against the templated hipMalloc used in the non-memstat path. This broke the --memstat build across several files (csrmv wg_flags, hyb index buffers, and others). Cast inside the rocsparse_hipMalloc/MallocAsync/HostMalloc/MallocManaged macros so both build configurations share identical call sites. This also makes the csritsv ptr_end double-free deterministically detectable under ROCSPARSE_MEMSTAT=1.
Update the copyright end-year to 2026 on the files touched in this branch and reformat the rocsparse_hipMallocAsync macro to satisfy clang-format-14.
- Document the csritsv mat_info double-free/use-after-free fix and the --memstat build fix in the rocSPARSE changelog. - Synchronize the device before freeing a pre-existing ptr_end buffer in _rocsparse_csritsv_info::copy(), matching the destructor's HIP 7.0 asynchronous-free handling. - Save and restore the handle pointer mode in the csritsv copy regression test so it does not leak state to later code.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
rocsparse_csritsv_analysisallocates device buffers that are owned by therocsparse_csritsv_infoand freed in its destructor:ptr_end— the submatrix row-offset buffer (allocated whenis_submatrix == true), andposition_t/pivot_info_tbase.rocsparse_copy_mat_infodid not copy these correctly, which made a copiedrocsparse_mat_infounsafe to use and unsafe to destroy. This PR fixes the copy semantics, adds a regression test that exercises the exact copy/destroy lifetime, and repairs the--memstatbuild so the bug can be detected deterministically.Changes
rocsparse_csritsv_info.cpp—copy()now deep-copies the ownedptr_enddevice buffer when the source owns it (is_submatrix == true), and frees any buffer the destination already owns before overwriting it. Whenptr_endis merely an alias intocsr_row_ptr + 1(not owned), the shallow copy is kept because the destructor will not free it.rocsparse_position_t.cpp—copy_position_async()now creates the destination position buffer using the source's index type. A freshly created destination has a default/invalid index type; reading it produced a buffer with the wrong element size and corrupted any subsequent solve.rocsparse_memstat.hpp— the tracked allocation macros (rocsparse_hipMalloc/MallocAsync/HostMalloc/MallocManaged) now cast the pointer argument tovoid**. The memstat entry points takevoid**, but many call sites pass typedT**that only compiled against the templatedhipMallocused in the non-memstat path. This fixes a pre-existing--memstatbuild break (e.g.csrmvwg_flags, hyb index buffers) and is what lets memstat observe the double-free.testing_csritsv.cpp— new regression sub-test (runs for general matrices that analyze/solve without a pivot): analyze a source info,rocsparse_copy_mat_infointo a fresh info, destroy the source, solve with the copy and compare to the reference, then destroy the copy.How the test fails without the fix
The regression test reproduces the precise lifetime that triggered the bug:
copy → destroy source → use copy → destroy copy. It surfaces the two defects in two complementary ways.1. Wrong result (fails in any build)
With the position-copy defect, the copied info's pivot/position buffer is created with the destination's invalid default index type instead of the source's. Solving with the copy then reads garbage metadata and the iterative solve does not produce the reference solution —
hy_iterative.near_check(dy, ...)fails (the copy'sdystays effectively zero / wrong). This makes the test fail deterministically in a normal release build, independent of any sanitizer.2. Double-free / use-after-free (deterministic under
--memstat)With the shallow
ptr_endcopy,copy_infoshares the same device allocation assrc_info:rocsparse_destroy_mat_info(src_info)frees that buffer — the copy is now using freed memory (use-after-free) for the solve.rocsparse_destroy_mat_info(copy_info)frees the same pointer a second time (double-free).In a normal release build this is latent: HIP 7.0's asynchronous
hipFreedoes not report the second free, so the process does not crash and the bug hides. Built with--memstatand run withROCSPARSE_MEMSTAT=1, rocSPARSE tracks every allocation by address; the second free looks up an address that was already removed from the tracking map and throws "Cannot remove address from the memstat database." The exception escapes the (noexcept) info destructor and aborts:With the fix in place, the same
--memstatrun is clean:Test plan
--memstatbuild compiles (./install.sh -c --memstat ...).ROCSPARSE_MEMSTAT=1 ./rocsparse-test --gtest_filter='*quick/csritsv*'→9 PASSEDwith the fix.ptr_endcopy → same run aborts (exit 134) via the memstat double-free detection, confirming the regression test catches the bug.