[FEA] Optionally attach stacktrace to runtime exceptions #1894

Open
@ttnghia

Is your feature request related to a problem? Please describe.

A stack trace attached to a thrown exception is extremely useful for debugging. Until now, we have only been able to see a short message like this:

Caused by: ai.rapids.cudf.CudfException: std::bad_alloc: out_of_memory: RMM failure at: .../libcudf/cmake-build/_deps/rmm-src/cpp/include/rmm/mr/device/pool_memory_resource.hpp:262: Maximum pool size exceeded (failed to allocate 1024.000000 B): Not enough room to grow, current/max/try size = 1024.000000 B, 1024.000000 B, 1024.000000 B

If this exception is thrown from a large function with multiple recursive call chains, it is very painful to track down the line that throws it.

Describe the solution you'd like

Stack trace support was already implemented in #596, but a stack trace is not generated and attached to runtime exceptions. If it were, we could immediately identify the source of the issue. For example:

Caused by: ai.rapids.cudf.CudfException: std::bad_alloc: out_of_memory: RMM failure at: .../libcudf/cmake-build/_deps/rmm-src/cpp/include/rmm/mr/device/pool_memory_resource.hpp:262: Maximum pool size exceeded (failed to allocate 1024.000000 B): Not enough room to grow, current/max/try size = 1024.000000 B, 1024.000000 B, 1024.000000 B
        ========== native stack frame ==========
#0: /tmp/cudf6512148284615527050.so : rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>::try_to_expand(unsigned long, unsigned long, rmm::cuda_stream_view)::{lambda(char const*)#1}::operator()(char const*) const+0x204
#1: /tmp/cudf6512148284615527050.so : rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>::try_to_expand(unsigned long, unsigned long, rmm::cuda_stream_view)+0x26b
#2: /tmp/cudf6512148284615527050.so : rmm::mr::detail::stream_ordered_memory_resource<rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>, rmm::mr::detail::coalescing_free_list>::get_block(unsigned long, rmm::mr::detail::stream_ordered_memory_resource<rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>, rmm::mr::detail::coalescing_free_list>::stream_event_pair)+0x45c
#3: /tmp/cudf6512148284615527050.so : rmm::mr::detail::stream_ordered_memory_resource<rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>, rmm::mr::detail::coalescing_free_list>::do_allocate(unsigned long, rmm::cuda_stream_view)+0x65
#4: /tmp/cudf6512148284615527050.so : void* cuda::mr::__4::_Resource_vtable_builder::_Alloc_async<rmm::mr::device_memory_resource>(void*, unsigned long, unsigned long, cuda::__4::stream_ref)+0x1b
#5: /tmp/cudf6512148284615527050.so : rmm::mr::logging_resource_adaptor<rmm::mr::device_memory_resource>::do_allocate(unsigned long, rmm::cuda_stream_view)+0x26
#6: /tmp/cudf6512148284615527050.so : +0x26910c7
#7: /tmp/cudf6512148284615527050.so : void* cuda::mr::__4::_Resource_vtable_builder::_Alloc_async<rmm::mr::device_memory_resource>(void*, unsigned long, unsigned long, cuda::__4::stream_ref)+0x1b
#8: /tmp/cudf6512148284615527050.so : Java_ai_rapids_cudf_Rmm_allocInternal+0xbb
#9: [0x718628fe718e]

The stack trace above was generated by my POC code implemented in cudf (rapidsai/cudf#18512), which is already being used to debug spark-rapids customer issues. It would be much better to implement this feature permanently here instead of patching it into cudf.


Additional context

It is understandable that production code needs to be clean and optimized. Thus, a stack trace should be attached to the runtime exception classes only when needed: specifically, only when the compiler flag RMM_ENABLE_STACK_TRACES is defined.
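
To make the opt-in behavior concrete, here is a minimal sketch of how an exception class could append a native stack trace only when RMM_ENABLE_STACK_TRACES is defined. This is not RMM's existing code and not the POC from rapidsai/cudf#18512: the `sketch` namespace, `capture_stack_trace`, and `traced_error` are hypothetical names, and the capture uses glibc's `backtrace` / `backtrace_symbols` from `<execinfo.h>`, so it assumes Linux with glibc. It also does not demangle symbol names, unlike the example output above.

```cpp
#include <stdexcept>
#include <string>

#if defined(RMM_ENABLE_STACK_TRACES)
#include <execinfo.h>  // glibc: backtrace(), backtrace_symbols()
#include <cstdlib>     // std::free
#endif

namespace sketch {  // hypothetical namespace, for illustration only

// Returns a formatted native stack trace, or an empty string when
// RMM_ENABLE_STACK_TRACES is not defined at compile time.
inline std::string capture_stack_trace()
{
#if defined(RMM_ENABLE_STACK_TRACES)
  constexpr int max_frames = 64;
  void* frames[max_frames];
  int const depth = ::backtrace(frames, max_frames);
  // backtrace_symbols returns a malloc'd array (or nullptr on failure).
  char** symbols = ::backtrace_symbols(frames, depth);

  std::string out{"\n========== native stack frame ==========\n"};
  for (int i = 0; i < depth; ++i) {
    out += "#" + std::to_string(i) + ": ";
    out += (symbols != nullptr) ? symbols[i] : "<unknown>";
    out += "\n";
  }
  std::free(symbols);
  return out;
#else
  return {};
#endif
}

// Hypothetical exception type: the trace is appended to the message at
// construction time, so what() carries it across the JNI boundary unchanged.
class traced_error : public std::runtime_error {
 public:
  explicit traced_error(std::string const& msg)
    : std::runtime_error(msg + capture_stack_trace())
  {
  }
};

}  // namespace sketch
```

With the flag undefined, `traced_error("x").what()` is just `"x"` and no capture code is compiled in, which keeps the production path unchanged. With the flag defined, the trace is captured where the exception is constructed, which is the point of failure for a `throw traced_error(...)` expression.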
