[FEA] Optionally attach stacktrace to runtime exceptions

**Is your feature request related to a problem? Please describe.**

Having a stacktrace when an exception is thrown is extremely useful for debugging. Currently, we only see a short message like this:
```
Caused by: ai.rapids.cudf.CudfException: std::bad_alloc: out_of_memory: RMM failure at: .../libcudf/cmake-build/_deps/rmm-src/cpp/include/rmm/mr/device/pool_memory_resource.hpp:262: Maximum pool size exceeded (failed to allocate 1024.000000 B): Not enough room to grow, current/max/try size = 1024.000000 B, 1024.000000 B, 1024.000000 B
```
If this exception is thrown from a large function with multiple recursive call chains, it is very painful to track down the call path that led to this message.

**Describe the solution you'd like**

Stacktrace support has already been implemented in https://github.com/rapidsai/rmm/pull/596, but a stacktrace is not generated and attached to the runtime exceptions. If it were attached, we could immediately identify the source of the issue. For example:
```
Caused by: ai.rapids.cudf.CudfException: std::bad_alloc: out_of_memory: RMM failure at: .../libcudf/cmake-build/_deps/rmm-src/cpp/include/rmm/mr/device/pool_memory_resource.hpp:262: Maximum pool size exceeded (failed to allocate 1024.000000 B): Not enough room to grow, current/max/try size = 1024.000000 B, 1024.000000 B, 1024.000000 B
        ========== native stack frame ==========
#0: /tmp/cudf6512148284615527050.so : rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>::try_to_expand(unsigned long, unsigned long, rmm::cuda_stream_view)::{lambda(char const*)#1}::operator()(char const*) const+0x204
#1: /tmp/cudf6512148284615527050.so : rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>::try_to_expand(unsigned long, unsigned long, rmm::cuda_stream_view)+0x26b
#2: /tmp/cudf6512148284615527050.so : rmm::mr::detail::stream_ordered_memory_resource<rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>, rmm::mr::detail::coalescing_free_list>::get_block(unsigned long, rmm::mr::detail::stream_ordered_memory_resource<rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>, rmm::mr::detail::coalescing_free_list>::stream_event_pair)+0x45c
#3: /tmp/cudf6512148284615527050.so : rmm::mr::detail::stream_ordered_memory_resource<rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>, rmm::mr::detail::coalescing_free_list>::do_allocate(unsigned long, rmm::cuda_stream_view)+0x65
#4: /tmp/cudf6512148284615527050.so : void* cuda::mr::__4::_Resource_vtable_builder::_Alloc_async<rmm::mr::device_memory_resource>(void*, unsigned long, unsigned long, cuda::__4::stream_ref)+0x1b
#5: /tmp/cudf6512148284615527050.so : rmm::mr::logging_resource_adaptor<rmm::mr::device_memory_resource>::do_allocate(unsigned long, rmm::cuda_stream_view)+0x26
#6: /tmp/cudf6512148284615527050.so : +0x26910c7
#7: /tmp/cudf6512148284615527050.so : void* cuda::mr::__4::_Resource_vtable_builder::_Alloc_async<rmm::mr::device_memory_resource>(void*, unsigned long, unsigned long, cuda::__4::stream_ref)+0x1b
#8: /tmp/cudf6512148284615527050.so : Java_ai_rapids_cudf_Rmm_allocInternal+0xbb
#9: [0x718628fe718e]

```

The stacktrace above is generated from my POC code implemented in cudf: https://github.com/rapidsai/cudf/pull/18512 and is already being used in debugging spark-rapids customer issues. It would be much better to permanently implement this feature here instead of patching it in cudf.



**Additional context**

It is understandable that the production code needs to be clean and optimized. Thus, a stacktrace should be attached to the runtime exception classes only when it is needed. In particular, it should be attached only when `RMM_ENABLE_STACK_TRACES` is defined at compile time.
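
As a rough illustration of the compile-time gating described above, here is a minimal sketch. It is not RMM's or the cudf POC's actual code: the namespace `sketch`, the helper `capture_stacktrace`, and the exception type `traced_error` are hypothetical names, and the trace is collected with glibc's `backtrace` / `backtrace_symbols` from `<execinfo.h>` (so it assumes a glibc-based platform) rather than with the existing RMM stacktrace implementation.

```cpp
#include <cstdlib>
#include <sstream>
#include <stdexcept>
#include <string>

#if defined(RMM_ENABLE_STACK_TRACES)
#include <execinfo.h>  // backtrace, backtrace_symbols (glibc-specific)
#endif

namespace sketch {  // hypothetical namespace, for illustration only

// Hypothetical helper: returns the formatted native frames when
// RMM_ENABLE_STACK_TRACES is defined, and an empty string otherwise.
inline std::string capture_stacktrace()
{
#if defined(RMM_ENABLE_STACK_TRACES)
  constexpr int max_frames = 64;
  void* frames[max_frames];
  int const depth = backtrace(frames, max_frames);
  char** symbols  = backtrace_symbols(frames, depth);  // may return nullptr

  std::ostringstream oss;
  oss << "\n========== native stack frame ==========\n";
  for (int i = 0; i < depth; ++i) {
    oss << "#" << i << ": " << (symbols != nullptr ? symbols[i] : "<unknown>") << "\n";
  }
  std::free(symbols);  // backtrace_symbols allocates with malloc
  return oss.str();
#else
  return {};  // no overhead and no extra text in regular builds
#endif
}

// Hypothetical exception type: the trace (if enabled) is appended to the
// message once, at construction time, so what() stays cheap.
struct traced_error : public std::runtime_error {
  explicit traced_error(std::string const& msg)
    : std::runtime_error(msg + capture_stacktrace())
  {
  }
};

}  // namespace sketch
```

With this shape, a build without the definition produces exactly the original message, while a build with `-DRMM_ENABLE_STACK_TRACES` appends the frames after it. Raw `backtrace_symbols` output is not demangled, so the frames would look less readable than the example earlier in this issue unless they are post-processed.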
