
Profiling error with kineto on an ML workload #1030

Open
@SKPsanjeevi

Description

I am profiling an ML workload using the torch profiler. The code looks like this:

    from torch.profiler import profile, ProfilerActivity

    with profile(activities=[ProfilerActivity.CPU,
                             ProfilerActivity.CUDA],
                 record_shapes=True) as prof:
        main_args = parse_main_args()
        main(main_args, DETECTED_SYSTEM)
    prof.export_chrome_trace("torch_trace.json")
    # print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
    # print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=20))

The code runs fine without the profiler, and it also runs to completion with the torch profiler enabled. However, when execution reaches the export statement, I get the following error:

[mlperf-inference-skps-x86-64-29200:6413 :0:6413] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x55ea1b76a8cc)
==== backtrace (tid:   6413) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000006743c49 libkineto::CuptiCallbackApi::__callback_switchboard()  ???:0
 2 0x00000000067441ba libkineto::callback_switchboard()  CuptiCallbackApi.cpp:0
 3 0x0000000000117456 cuptiEnableAllDomains()  ???:0
 4 0x000000000010f5c4 cuptiGetRecommendedBufferSize()  ???:0
 5 0x000000000010d3a8 cuptiGetRecommendedBufferSize()  ???:0
 6 0x00000000001b295d cudbgApiInit()  ???:0
 7 0x00000000001b393b cudbgApiInit()  ???:0
 8 0x00000000001ae05c cudbgApiInit()  ???:0
 9 0x00000000002d2188 cuStreamWaitEvent()  ???:0
10 0x0000000000027ee8 __cudaRegisterUnifiedTable()  ???:0
11 0x000000000002856d __cudaRegisterUnifiedTable()  ???:0
12 0x0000000000045495 secure_getenv()  ???:0
13 0x0000000000045610 exit()  ???:0
14 0x0000000000029d97 __libc_init_first()  ???:0
15 0x0000000000029e40 __libc_start_main()  ???:0
16 0x000000000024ec65 _start()  ???:0
=================================
/bin/bash: line 1:  6413 Segmentation fault      (core dumped) LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu:/work/build/inference/loadgen/build python3.10 -m code.main --benchmarks=dlrm-v2 --scenarios=offline --action="run_harness" 2>&1
      6414 Done                    | tee /work/build/logs/2025.01.21-19.47.54/stdout.txt
make: *** [Makefile:46: run_harness] Error 139
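
For reference, here is a minimal standalone sketch that exercises the same profiler path. The tiny CUDA matmul is a placeholder of my own for the real MLPerf harness, and I have not verified whether this alone reproduces the crash on this machine:

    import torch
    from torch.profiler import profile, ProfilerActivity

    def workload():
        # Placeholder for main(main_args, DETECTED_SYSTEM)
        x = torch.randn(1024, 1024, device="cuda")
        for _ in range(10):
            x = x @ x
        torch.cuda.synchronize()

    with profile(activities=[ProfilerActivity.CPU,
                             ProfilerActivity.CUDA],
                 record_shapes=True) as prof:
        workload()
    prof.export_chrome_trace("torch_trace.json")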

How can I resolve this error? The machine is a DGX H200 (8 GPUs) running Ubuntu 22.04.4 LTS (Jammy Jellyfish).
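
One variant I am considering, in case the export call after the context exits is the trigger, is writing the trace from the profiler's on_trace_ready callback instead. A sketch of that (untested in this setup; save_trace is just my own handler name, and the workload is again a stand-in for the harness):

    import torch
    from torch.profiler import profile, ProfilerActivity

    def save_trace(prof):
        # Invoked by the profiler while it is still alive; writes the
        # Chrome trace from inside the profiling machinery.
        prof.export_chrome_trace("torch_trace.json")

    with profile(activities=[ProfilerActivity.CPU,
                             ProfilerActivity.CUDA],
                 record_shapes=True,
                 on_trace_ready=save_trace):
        # Placeholder workload; the real call would be
        # main(parse_main_args(), DETECTED_SYSTEM).
        x = torch.randn(1024, 1024, device="cuda")
        x = x @ x
        torch.cuda.synchronize()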
