
Profiling error with kineto on an ML workload #1030

Open
@SKPsanjeevi

Description

I am profiling an ML workload using the torch profiler. The code looks like this:

    from torch.profiler import profile, ProfilerActivity

    with profile(activities=[ProfilerActivity.CPU,
                             ProfilerActivity.CUDA],
                 record_shapes=True) as prof:
        main_args = parse_main_args()
        main(main_args, DETECTED_SYSTEM)
    prof.export_chrome_trace("torch_trace.json")
    # print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
    # print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=20))

The code runs fine without the profiler, and it also runs to completion with the torch profiler enabled. However, when execution reaches the export statement, I get the following error:

[mlperf-inference-skps-x86-64-29200:6413 :0:6413] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x55ea1b76a8cc)
==== backtrace (tid:   6413) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000006743c49 libkineto::CuptiCallbackApi::__callback_switchboard()  ???:0
 2 0x00000000067441ba libkineto::callback_switchboard()  CuptiCallbackApi.cpp:0
 3 0x0000000000117456 cuptiEnableAllDomains()  ???:0
 4 0x000000000010f5c4 cuptiGetRecommendedBufferSize()  ???:0
 5 0x000000000010d3a8 cuptiGetRecommendedBufferSize()  ???:0
 6 0x00000000001b295d cudbgApiInit()  ???:0
 7 0x00000000001b393b cudbgApiInit()  ???:0
 8 0x00000000001ae05c cudbgApiInit()  ???:0
 9 0x00000000002d2188 cuStreamWaitEvent()  ???:0
10 0x0000000000027ee8 __cudaRegisterUnifiedTable()  ???:0
11 0x000000000002856d __cudaRegisterUnifiedTable()  ???:0
12 0x0000000000045495 secure_getenv()  ???:0
13 0x0000000000045610 exit()  ???:0
14 0x0000000000029d97 __libc_init_first()  ???:0
15 0x0000000000029e40 __libc_start_main()  ???:0
16 0x000000000024ec65 _start()  ???:0
=================================
/bin/bash: line 1:  6413 Segmentation fault      (core dumped) LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu:/work/build/inference/loadgen/build python3.10 -m code.main --benchmarks=dlrm-v2 --scenarios=offline --action="run_harness" 2>&1
      6414 Done                    | tee /work/build/logs/2025.01.21-19.47.54/stdout.txt
make: *** [Makefile:46: run_harness] Error 139
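
For reference, here is a minimal standalone sketch that exercises the same profiler path. The tiny CUDA matmul is a placeholder of my own for the real MLPerf harness, and I have not verified whether this alone reproduces the crash on this machine:

    import torch
    from torch.profiler import profile, ProfilerActivity

    def workload():
        # Placeholder for main(main_args, DETECTED_SYSTEM)
        x = torch.randn(1024, 1024, device="cuda")
        for _ in range(10):
            x = x @ x
        torch.cuda.synchronize()

    with profile(activities=[ProfilerActivity.CPU,
                             ProfilerActivity.CUDA],
                 record_shapes=True) as prof:
        workload()
    prof.export_chrome_trace("torch_trace.json")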

How can I resolve this error? The machine is a DGX H200 (8 GPUs) running Ubuntu 22.04.4 LTS (Jammy Jellyfish).
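
One variant I am considering, in case the export call after the context exits is the trigger, is writing the trace from the profiler's on_trace_ready callback instead. A sketch of that (untested in this setup; save_trace is just my own handler name, and the workload is again a stand-in for the harness):

    import torch
    from torch.profiler import profile, ProfilerActivity

    def save_trace(prof):
        # Invoked by the profiler while it is still alive; writes the
        # Chrome trace from inside the profiling machinery.
        prof.export_chrome_trace("torch_trace.json")

    with profile(activities=[ProfilerActivity.CPU,
                             ProfilerActivity.CUDA],
                 record_shapes=True,
                 on_trace_ready=save_trace):
        # Placeholder workload; the real call would be
        # main(parse_main_args(), DETECTED_SYSTEM).
        x = torch.randn(1024, 1024, device="cuda")
        x = x @ x
        torch.cuda.synchronize()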
