Description
Is your feature request related to a problem? Please describe.
Having a stack trace attached to a thrown exception is extremely useful for debugging. Currently, we only see a short message like this:
Caused by: ai.rapids.cudf.CudfException: std::bad_alloc: out_of_memory: RMM failure at: .../libcudf/cmake-build/_deps/rmm-src/cpp/include/rmm/mr/device/pool_memory_resource.hpp:262: Maximum pool size exceeded (failed to allocate 1024.000000 B): Not enough room to grow, current/max/try size = 1024.000000 B, 1024.000000 B, 1024.000000 B
If this exception is thrown from a large function with multiple recursive call chains, tracking down the line that threw it is very painful.
Describe the solution you'd like
Stack trace support was already implemented in #596, but stack traces are not yet generated and attached to the runtime exceptions. If they were, we could immediately identify the source of an issue. For example:
Caused by: ai.rapids.cudf.CudfException: std::bad_alloc: out_of_memory: RMM failure at: .../libcudf/cmake-build/_deps/rmm-src/cpp/include/rmm/mr/device/pool_memory_resource.hpp:262: Maximum pool size exceeded (failed to allocate 1024.000000 B): Not enough room to grow, current/max/try size = 1024.000000 B, 1024.000000 B, 1024.000000 B
========== native stack frame ==========
#0: /tmp/cudf6512148284615527050.so : rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>::try_to_expand(unsigned long, unsigned long, rmm::cuda_stream_view)::{lambda(char const*)#1}::operator()(char const*) const+0x204
#1: /tmp/cudf6512148284615527050.so : rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>::try_to_expand(unsigned long, unsigned long, rmm::cuda_stream_view)+0x26b
#2: /tmp/cudf6512148284615527050.so : rmm::mr::detail::stream_ordered_memory_resource<rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>, rmm::mr::detail::coalescing_free_list>::get_block(unsigned long, rmm::mr::detail::stream_ordered_memory_resource<rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>, rmm::mr::detail::coalescing_free_list>::stream_event_pair)+0x45c
#3: /tmp/cudf6512148284615527050.so : rmm::mr::detail::stream_ordered_memory_resource<rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>, rmm::mr::detail::coalescing_free_list>::do_allocate(unsigned long, rmm::cuda_stream_view)+0x65
#4: /tmp/cudf6512148284615527050.so : void* cuda::mr::__4::_Resource_vtable_builder::_Alloc_async<rmm::mr::device_memory_resource>(void*, unsigned long, unsigned long, cuda::__4::stream_ref)+0x1b
#5: /tmp/cudf6512148284615527050.so : rmm::mr::logging_resource_adaptor<rmm::mr::device_memory_resource>::do_allocate(unsigned long, rmm::cuda_stream_view)+0x26
#6: /tmp/cudf6512148284615527050.so : +0x26910c7
#7: /tmp/cudf6512148284615527050.so : void* cuda::mr::__4::_Resource_vtable_builder::_Alloc_async<rmm::mr::device_memory_resource>(void*, unsigned long, unsigned long, cuda::__4::stream_ref)+0x1b
#8: /tmp/cudf6512148284615527050.so : Java_ai_rapids_cudf_Rmm_allocInternal+0xbb
#9: [0x718628fe718e]
The stack trace above was generated by my POC code in cudf (rapidsai/cudf#18512), which is already being used to debug spark-rapids customer issues. It would be much better to implement this feature permanently here instead of patching it in cudf.
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
It is understandable that production code needs to stay clean and optimized. The stack trace would therefore be attached to the runtime exception classes only when needed: specifically, only when the compiler flag RMM_ENABLE_STACK_TRACES is defined.
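To illustrate the compile-time gating described above, here is a minimal sketch. It is not RMM's actual implementation or the code from #596: `capture_stack_trace` and `traced_error` are hypothetical names, and the capture uses glibc's `backtrace`/`backtrace_symbols` from `<execinfo.h>`, so it assumes Linux with glibc. The frame format it produces also differs from the example output above.

```cpp
#include <execinfo.h>  // backtrace, backtrace_symbols (glibc; assumed platform)

#include <cstdlib>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not an RMM API): returns the current native call stack
// as text when RMM_ENABLE_STACK_TRACES is defined, and an empty string otherwise.
inline std::string capture_stack_trace()
{
#ifdef RMM_ENABLE_STACK_TRACES
  constexpr int max_frames = 64;
  void* frames[max_frames];
  int const depth = backtrace(frames, max_frames);
  // backtrace_symbols returns a malloc'ed array (or nullptr on failure).
  char** symbols = backtrace_symbols(frames, depth);

  std::ostringstream out;
  out << "\n========== native stack frame ==========\n";
  for (int i = 0; i < depth; ++i) {
    out << "#" << i << ": ";
    if (symbols != nullptr) {
      out << symbols[i];
    } else {
      out << frames[i];  // fall back to the raw address
    }
    out << "\n";
  }
  std::free(symbols);
  return out.str();
#else
  return {};  // no capture cost when the flag is not defined
#endif
}

// Hypothetical exception type: the trace is captured once, at construction,
// and appended to the message returned by what().
class traced_error : public std::runtime_error {
 public:
  explicit traced_error(std::string const& msg)
    : std::runtime_error(msg + capture_stack_trace())
  {
  }
};
```

Building with `-DRMM_ENABLE_STACK_TRACES` turns the capture on; without it, `what()` returns only the original message. With glibc, function names typically show up only if the binary exports its symbols (for example by linking with `-rdynamic`), and the names come back mangled, so a real implementation would likely also demangle them.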