[QNN-EP]: Fix inference failures while running with htp_shared_memory (#23892)

quic-ashigarg · Ashish Garg · web-flow · commit 788ca51b044b · 2025-03-04T23:02:58.000-08:00
### Description
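When the enable_htp_shared_memory_allocator feature is enabled,
HtpSharedMemoryAllocator::Free passes the wrong address to rpcmem_free:
it passes the address handed out to the caller instead of the start of
the underlying allocation, which lies AllocationOffsetFromStartOfHeader()
bytes earlier. As a result, the RPC buffers are not freed, which leads to
memory exhaustion.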
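
To illustrate the pattern behind the fix, here is a minimal, self-contained sketch of an allocator that stores a header in front of each allocation. It is not the code in qnn_allocator.cc: `std::malloc`/`std::free` stand in for `rpcmem_alloc`/`rpcmem_free`, and the `AllocationHeader` layout, the body of `AllocationOffsetFromStartOfHeader`, and the names `SketchAlloc`, `SketchFree`, `g_last_raw_alloc`, and `g_last_raw_free` are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Hypothetical bookkeeping stored in front of every allocation.
struct AllocationHeader {
  std::size_t requested_size;
};

// Hypothetical version of the offset helper: header size rounded up so the
// address handed to the caller keeps the strictest fundamental alignment.
constexpr std::size_t AllocationOffsetFromStartOfHeader() {
  constexpr std::size_t alignment = alignof(std::max_align_t);
  return (sizeof(AllocationHeader) + alignment - 1) / alignment * alignment;
}

// Recorded as integers so they can be compared after the memory is released.
static std::uintptr_t g_last_raw_alloc = 0;
static std::uintptr_t g_last_raw_free = 0;

void* SketchAlloc(std::size_t size) {
  const std::size_t offset = AllocationOffsetFromStartOfHeader();
  void* raw = std::malloc(offset + size);  // stand-in for rpcmem_alloc
  if (raw == nullptr) return nullptr;
  g_last_raw_alloc = reinterpret_cast<std::uintptr_t>(raw);
  new (raw) AllocationHeader{size};              // header lives at the raw address
  return static_cast<std::byte*>(raw) + offset;  // caller sees this address
}

void SketchFree(void* allocation_address) {
  if (allocation_address == nullptr) return;
  const std::size_t offset = AllocationOffsetFromStartOfHeader();
  // Step back to the address the underlying allocator actually returned.
  void* raw = static_cast<std::byte*>(allocation_address) - offset;
  auto* header = static_cast<AllocationHeader*>(raw);
  header->~AllocationHeader();
  g_last_raw_free = reinterpret_cast<std::uintptr_t>(raw);
  // Stand-in for rpcmem_free: it must receive raw, not allocation_address.
  std::free(raw);
}
```

In this sketch, handing `allocation_address` directly to the underlying free would give it a pointer it never returned. With `std::free` that is undefined behavior; in the case described here, the symptom reported is that the buffers are simply not released.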

### Motivation and Context
Using the enable_htp_shared_memory_allocator feature for QNN in
GenAI extensions leads to inference failures during the second
prompt. Because GenAI workloads request more memory, the problem
surfaces sooner in GenAI use cases.

Co-authored-by: Ashish Garg <ashigarg@qti.qualcomm.com>
diff --git a/onnxruntime/core/providers/qnn/qnn_allocator.cc b/onnxruntime/core/providers/qnn/qnn_allocator.cc
@@ -181,7 +181,9 @@ void HtpSharedMemoryAllocator::Free(void* allocation_address) {
   // Avoid throwing exceptions as this may be running from a destructor.
   try {
     // take ownership of shared memory and free at end of scope
-    auto shared_memory = WrapSharedMemoryWithUniquePtr(allocation_address, rpcmem_lib_->Api());
+    const size_t allocation_offset = AllocationOffsetFromStartOfHeader();
+    void* raw_allocation_address = (void*)((std::byte*)allocation_address - allocation_offset);
+    auto shared_memory = WrapSharedMemoryWithUniquePtr(raw_allocation_address, rpcmem_lib_->Api());
 
     // destroy header
     allocation_header.~AllocationHeader();