Description
Component / Area
runtime
Issue Type (optional)
Runtime Crash
Observed
When running the Inference Server with 32 processes on a single Galaxy device, a segmentation fault is observed. Although the server may still start because of the restart mechanism on the Inference Server side, these failures increase the server warm-up time and should be resolved.
Expected
No segmentation fault.
1. Steps (exact commands)
The issue was observed on the Inference Server when running 32 workers on a single Galaxy.
2. Input data / link or description
CI runs:
(BGE-Large-En) https://github.com/tenstorrent/tt-shield/actions/runs/21837489932/job/63017897729
(SDXL) https://github.com/tenstorrent/tt-shield/actions/runs/20646315010/job/59287068847
3. Frequency
Occasionally
1. Software Versions
/
2. Hardware Details
Wormhole Galaxy
Is this a regression?
Unknown
Regression Details
No response
Logs & Diagnostics
[22130886e376:00293] *** Process received signal ***
[22130886e376:00293] Signal: Segmentation fault (11)
[22130886e376:00293] Signal code: Address not mapped (1)
[22130886e376:00293] Failing at address: 0x1c0
[22130886e376:00293] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7c8a8a40d520]
2026-01-01 23:42:28.214 | info | BuildKernels | Skipping deleting built cache (build.cpp:111)
2026-01-01 23:42:28,216 - INFO - Device 13: Found 1 available TTNN devices: [0]
2026-01-01 23:42:28.216 | info | Distributed | Using custom mesh graph descriptor: /home/container_app_user/tt-metal/tt_metal/fabric/mesh_graph_descriptors/n150_mesh_graph_descriptor.textproto (metal_context.cpp:830)
2026-01-01 23:42:28,217 - INFO - Device 5: Found 1 available TTNN devices: [0]
2026-01-01 23:42:28.217 | info | Fabric | TopologyMapper mapping start (mesh=0): n_log=1, n_phys=1, log_deg_hist={0:1}, phys_deg_hist={0:1} (topology_mapper_utils.cpp:171)
[22130886e376:00293] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x1a684d)[0x7c8a8a57184d]
[22130886e376:00293] [ 2] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x178cb4)[0x7c897a674cb4]
2026-01-01 23:42:28.217 | info | Distributed | Using custom mesh graph descriptor: /home/container_app_user/tt-metal/tt_metal/fabric/mesh_graph_descriptors/n150_mesh_graph_descriptor.textproto (metal_context.cpp:830)
[22130886e376:00293] [ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa6616)[0x7c897a5a2616]
[22130886e376:00293] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSt10filesystem10remove_allERKNS_7__cxx114pathE+0xc2)[0x7c897a679da2]
[22130886e376:00293] [ 5] /home/container_app_user/tt-metal/build/lib/libtt_metal.so(_ZN2tt8tt_metal9inspector6LoggerC2ERKNSt10filesystem7__cxx114pathE+0x229)[0x7c885eade719]
[22130886e376:00293] [ 6] /home/container_app_user/tt-metal/build/lib/libtt_metal.so(_ZN2tt8tt_metal9inspector4DataC2Ev+0x33)[0x7c885ead4b63]
[22130886e376:00293] [ 7] /home/container_app_user/tt-metal/build/lib/libtt_metal.so(_ZN2tt8tt_metal9Inspector10initializeEv+0x3b)[0x7c885ead90cb]
[22130886e376:00293] [ 8] /home/container_app_user/tt-metal/build/lib/libtt_metal.so(_ZN2tt8tt_metal12MetalContext10initializeERKNS0_18DispatchCoreConfigEhRKSt6vectorIjSaIjEEmb+0x3e9)[0x7c885e8c5799]
[22130886e376:00293] [ 9] /home/container_app_user/tt-metal/build/lib/libtt_metal.so(_ZN2tt8tt_metal12MetalContext25initialize_device_managerERKSt6vectorIiSaIiEEhmmRKNS0_18DispatchCoreConfigESt4spanIKjLm18446744073709551615EEmbb+0xb3)[0x7c885e8c5313]
2026-01-01 23:42:28.218 | info | Fabric | TopologyMapper mapping start (mesh=0): n_log=1, n_phys=1, log_deg_hist={0:1}, phys_deg_hist={0:1} (topology_mapper_utils.cpp:171)
[22130886e376:00293] [10] /home/container_app_user/tt-metal/build/lib/libtt_metal.so(_ZN2tt8tt_metal6detail13CreateDevicesERKSt6vectorIiSaIiEEhmmRKNS0_18DispatchCoreConfigERKS2_IjSaIjEEmbbb+0x112)[0x7c885edf4732]
[22130886e376:00293] [11] /home/container_app_user/tt-metal/build/lib/libtt_metal.so(_ZN2tt8tt_metal11distributed10MeshDevice13ScopedDevicesC1ERKSt6vectorINS1_11MaybeRemoteIiEESaIS6_EESA_mmmmRKNS0_18DispatchCoreConfigE+0xaf)[0x7c885ebeddcf]
[22130886e376:00293] [12] /home/container_app_user/tt-metal/build/lib/libtt_metal.so(_ZN2tt8tt_metal11distributed10MeshDevice6createERKNS1_16MeshDeviceConfigEmmmRKNS0_18DispatchCoreConfigESt4spanIKjLm18446744073709551615EEm+0x848)[0x7c885ebef628]
[22130886e376:00293] [13] /home/container_app_user/tt-metal/build/lib/_ttnncpp.so(_ZN4ttnn11distributed16open_mesh_deviceEmmmRKN2tt8tt_metal18DispatchCoreConfigERKSt8optionalINS2_11distributed9MeshShapeEERKS6_INS7_14MeshCoordinateEERKSt6vectorIiSaIiEEm+0x75)[0x7c88197baea5]
[22130886e376:00293] [14] /home/container_app_user/tt-metal/ttnn/ttnn/_ttnn.so(+0x2ca29c)[0x7c8a10e3629c]
[22130886e376:00293] [15] /home/container_app_user/tt-metal/ttnn/ttnn/_ttnn.so(+0x28a778)[0x7c8a10df6778]
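Because the crash backtrace is interleaved with info logs from the other worker processes, it can help to filter the output down to the frames of the crashing process. The following is a minimal sketch, assuming the `[host:pid] [ N] module(symbol+0xoffset)[0xaddress]` frame layout seen in the log above; the function name `extract_frames` and both regular expressions are illustrative and not part of tt-metal or the Inference Server. It also tolerates a frame whose body lands on a later line, which can happen when several processes write to the same stream at once.

```python
import re

# "[host:pid] [ N] <rest>": the per-frame prefix seen in the crash log (assumed layout).
PREFIX_RE = re.compile(r"^\[[^\]:]+:(?P<pid>\d+)\] \[\s*(?P<idx>\d+)\] (?P<rest>.*)$")
# "module(symbol+0xoffset)[0xaddress]": the frame body; the symbol may be empty.
BODY_RE = re.compile(
    r"^(?P<module>\S+?)\((?P<symbol>[^()+]*)\+0x[0-9a-f]+\)\[0x[0-9a-f]+\]$"
)


def extract_frames(log_text, pid):
    """Return (index, module, mangled_symbol) tuples for one process's backtrace."""
    frames = []
    pending_idx = None  # frame index whose body has not been seen yet
    for raw in log_text.splitlines():
        line = raw.strip()
        prefix = PREFIX_RE.match(line)
        if prefix:
            pending_idx = None
            if int(prefix.group("pid")) != pid:
                continue
            body = BODY_RE.match(prefix.group("rest"))
            if body:
                frames.append(
                    (int(prefix.group("idx")), body.group("module"), body.group("symbol"))
                )
            else:
                # Another process's output followed the prefix; wait for the body.
                pending_idx = int(prefix.group("idx"))
        elif pending_idx is not None:
            body = BODY_RE.match(line)
            if body:
                frames.append((pending_idx, body.group("module"), body.group("symbol")))
                pending_idx = None
    return frames


# Shortened sample in the same shape as the log above.
sample = """\
[22130886e376:00293] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x1a684d)[0x7c8a8a57184d]
[22130886e376:00293] [ 2] 2026-01-01 23:42:28.217 | info | Distributed | Using custom mesh graph descriptor (metal_context.cpp:830)
/lib/x86_64-linux-gnu/libstdc++.so.6(+0x178cb4)[0x7c897a674cb4]
[22130886e376:00293] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSt10filesystem10remove_allERKNS_7__cxx114pathE+0xc2)[0x7c897a679da2]
"""
frames = extract_frames(sample, pid=293)
for idx, module, symbol in frames:
    print(idx, module, symbol or "<no symbol>")
```

The mangled symbols recovered this way can then be passed to a demangler such as `c++filt` to make the frames easier to read.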
Priority
None
Impact
This issue was observed on multiple models, including SDXL and BGE-Large, which are P0 for UF releases.