Description
We previously ran our experiments on v5e TPU nodes with torch_xla==2.4.
While upgrading to v6e, we ran into several issues. First, running the same code fails with the following error at the call to xmp.spawn(_mp_fn, args=(), start_method='fork'):
File "/usr/lib/python3.10/concurrent/futures/process.py", line 611, in __init__
raise ValueError("max_workers must be greater than 0")
ValueError: max_workers must be greater than 0
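For context, this message is raised by the standard library's ProcessPoolExecutor whenever it is asked to create zero (or fewer) workers. The sketch below reproduces the message without any TPU involved; try_pool is a hypothetical helper name used only for illustration. Our guess, which we have not confirmed, is that torch_xla 2.4 computed a worker count of 0 on v6e, for example because it did not detect any local devices.

```python
from concurrent.futures import ProcessPoolExecutor


def try_pool(num_workers):
    """Return the error message raised for this worker count, or None on success.

    Hypothetical helper: it only demonstrates where the message comes from,
    not how torch_xla computes its worker count.
    """
    try:
        pool = ProcessPoolExecutor(max_workers=num_workers)
    except ValueError as exc:
        return str(exc)
    # No work was submitted, so shutting down is immediate.
    pool.shutdown()
    return None


if __name__ == "__main__":
    # A worker count of zero triggers the same ValueError as in the traceback.
    print(try_pool(0))
```

If that guess is right, the error is a symptom of device discovery failing rather than a problem in the training code itself.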
Since v6e is a newer TPU generation, we upgraded to a torch_xla 2.6.0 nightly build (2.6.0.dev20241201) with the following commands:
pip install "torch_xla[tpu] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.6.0.dev20241201-cp310-cp310-linux_x86_64.whl" -f https://storage.googleapis.com/libtpu-releases/index.html -f https://storage.googleapis.com/libtpu-wheels/index.html
pip3 install torch==2.6.0.dev20241201+cpu torchvision==0.20.0.dev20241201+cpu --index-url https://download.pytorch.org/whl/nightly/cpu
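Because the two pip commands pull from different indexes, it may be worth confirming which versions actually ended up installed. The following is a minimal sketch using only the standard library; installed_versions is a hypothetical helper, and the distribution names in PACKAGES (in particular "libtpu" and "libtpu-nightly") are assumptions about how the wheels are named in a given environment.

```python
from importlib import metadata

# Assumed distribution names; adjust them to match what `pip list` shows.
PACKAGES = ("torch", "torch_xla", "torchvision", "libtpu", "libtpu-nightly")


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Including this output in a bug report makes it easier to spot a mismatch between torch, torch_xla, and libtpu builds.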
After updating the code for the newer API (i.e. switching to xla.launch), none of the intermediate print statements registered with xm.add_step_closure executed during the training run.
We suspect this is because execution never reaches a barrier, so we tried to force one by calling xm.optimizer_step(optimizer, barrier=True) instead of xm.optimizer_step(optimizer, barrier=False). That leads to the following errors:
F0120 12:29:19.154320 34242 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
*** Check failure stack trace: ***
@ 0x7c16a78a55c4 (unknown)
@ 0x7c16a78a50f8 (unknown)
@ 0x7c16a7ada9a9 (unknown)
@ 0x7c16a1b9a092 (unknown)
@ 0x7c16a1b9d225 (unknown)
@ 0x7c16a1b9bf84 (unknown)
@ 0x7c16a1b9fda0 (unknown)
@ 0x7c16a1ba04fa (unknown)
@ 0x7c169fbd07de (unknown)
@ 0x7c169fbd35cc (unknown)
@ 0x7c169fbd6d57 (unknown)
@ 0x7c16a7469733 (unknown)
@ 0x7c16a746f9f6 (unknown)
@ 0x7c16a74784c5 (unknown)
@ 0x7c16a7723583 (unknown)
@ 0x7c18a0094ac3 (unknown)
https://symbolize.stripped_domain/r/?trace=7c16a78a55c4,7c16a78a50f7,7c16a7ada9a8,7c16a1b9a091,7c16a1b9d224,7c16a1b9bf83,7c16a1b9fd9f,7c16a1ba04f9,7c169fbd07dd,7c169fbd35cb,7c169fbd6d56,7c16a7469732,7c16a746f9f5,7c16a74784c4,7c16a7723582,7c18a0094ac2&map=
https://symbolize.stripped_domain/r/?trace=7c18a00969fc,7c18a004251f&map=
*** SIGABRT received by PID 28602 (TID 34242) on cpu 20 from PID 28602; ***
E0120 12:29:19.315807 34242 coredump_hook.cc:301] RAW: Remote crash data gathering hook invoked.
E0120 12:29:19.315822 34242 coredump_hook.cc:340] RAW: Skipping coredump since rlimit was 0 at process start.
E0120 12:29:19.315829 34242 client.cc:269] RAW: Coroner client retries enabled, will retry for up to 30 sec.
E0120 12:29:19.315832 34242 coredump_hook.cc:396] RAW: Sending fingerprint to remote end.
E0120 12:29:19.315861 34242 coredump_hook.cc:405] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0120 12:29:19.315865 34242 coredump_hook.cc:457] RAW: Dumping core locally.
F0120 12:29:19.830939 34285 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
*** Check failure stack trace: ***
@ 0x7d32016a55c4 (unknown)
@ 0x7d32016a50f8 (unknown)
@ 0x7d32018da9a9 (unknown)
@ 0x7d31fb99a092 (unknown)
@ 0x7d31fb99d225 (unknown)
@ 0x7d31fb99bf84 (unknown)
@ 0x7d31fb99fda0 (unknown)
@ 0x7d31fb9a04fa (unknown)
@ 0x7d31f99d07de (unknown)
@ 0x7d31f99d35cc (unknown)
@ 0x7d31f99d6d57 (unknown)
@ 0x7d3201269733 (unknown)
@ 0x7d320126f9f6 (unknown)
@ 0x7d32012784c5 (unknown)
@ 0x7d3201523583 (unknown)
@ 0x7d33fa094ac3 (unknown)
https://symbolize.stripped_domain/r/?trace=7d32016a55c4,7d32016a50f7,7d32018da9a8,7d31fb99a091,7d31fb99d224,7d31fb99bf83,7d31fb99fd9f,7d31fb9a04f9,7d31f99d07dd,7d31f99d35cb,7d31f99d6d56,7d3201269732,7d320126f9f5,7d32012784c4,7d3201523582,7d33fa094ac2&map=
https://symbolize.stripped_domain/r/?trace=7d33fa0969fc,7d33fa04251f&map=
*** SIGABRT received by PID 28601 (TID 34285) on cpu 134 from PID 28601; ***
E0120 12:29:19.847869 34285 coredump_hook.cc:301] RAW: Remote crash data gathering hook invoked.
E0120 12:29:19.847887 34285 coredump_hook.cc:340] RAW: Skipping coredump since rlimit was 0 at process start.
E0120 12:29:19.847894 34285 client.cc:269] RAW: Coroner client retries enabled, will retry for up to 30 sec.
E0120 12:29:19.847898 34285 coredump_hook.cc:396] RAW: Sending fingerprint to remote end.
E0120 12:29:19.847931 34285 coredump_hook.cc:405] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0120 12:29:19.847935 34285 coredump_hook.cc:457] RAW: Dumping core locally.
F0120 12:29:23.305308 34355 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
*** Check failure stack trace: ***
@ 0x70cb5d4a55c4 (unknown)
@ 0x70cb5d4a50f8 (unknown)
@ 0x70cb5d6da9a9 (unknown)
@ 0x70cb5779a092 (unknown)
@ 0x70cb5779d225 (unknown)
@ 0x70cb5779bf84 (unknown)
@ 0x70cb5779fda0 (unknown)
@ 0x70cb577a04fa (unknown)
@ 0x70cb557d07de (unknown)
@ 0x70cb557d35cc (unknown)
@ 0x70cb557d6d57 (unknown)
@ 0x70cb5d069733 (unknown)
@ 0x70cb5d06f9f6 (unknown)
@ 0x70cb5d0784c5 (unknown)
@ 0x70cb5d323583 (unknown)
@ 0x70cd55e94ac3 (unknown)
https://symbolize.stripped_domain/r/?trace=70cb5d4a55c4,70cb5d4a50f7,70cb5d6da9a8,70cb5779a091,70cb5779d224,70cb5779bf83,70cb5779fd9f,70cb577a04f9,70cb557d07dd,70cb557d35cb,70cb557d6d56,70cb5d069732,70cb5d06f9f5,70cb5d0784c4,70cb5d323582,70cd55e94ac2&map=
https://symbolize.stripped_domain/r/?trace=70cd55e969fc,70cd55e4251f&map=
*** SIGABRT received by PID 28603 (TID 34355) on cpu 65 from PID 28603; ***
E0120 12:29:23.405821 34355 coredump_hook.cc:301] RAW: Remote crash data gathering hook invoked.
E0120 12:29:23.405837 34355 coredump_hook.cc:340] RAW: Skipping coredump since rlimit was 0 at process start.
E0120 12:29:23.405845 34355 client.cc:269] RAW: Coroner client retries enabled, will retry for up to 30 sec.
E0120 12:29:23.405848 34355 coredump_hook.cc:396] RAW: Sending fingerprint to remote end.
E0120 12:29:23.405884 34355 coredump_hook.cc:405] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0120 12:29:23.405888 34355 coredump_hook.cc:457] RAW: Dumping core locally.
F0120 12:29:19.154320 34242 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
E0120 12:29:24.483387 34242 process_state.cc:806] RAW: Raising signal 6 with default behavior
F0120 12:29:24.800956 34326 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
*** Check failure stack trace: ***
@ 0x75e6f8ea55c4 (unknown)
@ 0x75e6f8ea50f8 (unknown)
@ 0x75e6f90da9a9 (unknown)
@ 0x75e6f319a092 (unknown)
@ 0x75e6f319d225 (unknown)
@ 0x75e6f319bf84 (unknown)
@ 0x75e6f319fda0 (unknown)
@ 0x75e6f31a04fa (unknown)
@ 0x75e6f11d07de (unknown)
@ 0x75e6f11d35cc (unknown)
@ 0x75e6f11d6d57 (unknown)
@ 0x75e6f8a69733 (unknown)
@ 0x75e6f8a6f9f6 (unknown)
@ 0x75e6f8a784c5 (unknown)
@ 0x75e6f8d23583 (unknown)
@ 0x75e8f1694ac3 (unknown)
https://symbolize.stripped_domain/r/?trace=75e6f8ea55c4,75e6f8ea50f7,75e6f90da9a8,75e6f319a091,75e6f319d224,75e6f319bf83,75e6f319fd9f,75e6f31a04f9,75e6f11d07dd,75e6f11d35cb,75e6f11d6d56,75e6f8a69732,75e6f8a6f9f5,75e6f8a784c4,75e6f8d23582,75e8f1694ac2&map=
https://symbolize.stripped_domain/r/?trace=75e8f16969fc,75e8f164251f&map=
*** SIGABRT received by PID 28598 (TID 34326) on cpu 53 from PID 28598; ***
E0120 12:29:24.820820 34326 coredump_hook.cc:301] RAW: Remote crash data gathering hook invoked.
E0120 12:29:24.820834 34326 coredump_hook.cc:340] RAW: Skipping coredump since rlimit was 0 at process start.
E0120 12:29:24.820839 34326 client.cc:269] RAW: Coroner client retries enabled, will retry for up to 30 sec.
E0120 12:29:24.820842 34326 coredump_hook.cc:396] RAW: Sending fingerprint to remote end.
E0120 12:29:24.825330 34326 coredump_hook.cc:405] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0120 12:29:24.825456 34326 coredump_hook.cc:457] RAW: Dumping core locally.
F0120 12:29:19.830939 34285 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
E0120 12:29:26.228373 34285 process_state.cc:806] RAW: Raising signal 6 with default behavior
F0120 12:29:24.800956 34326 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
E0120 12:29:27.231930 34326 process_state.cc:806] RAW: Raising signal 6 with default behavior
F0120 12:29:23.305308 34355 fusion_emitter.cc:6694] Check failed: fusion_util::IsFusibleUnalignedDUS(user, target_)
E0120 12:29:27.291814 34355 process_state.cc:806] RAW: Raising signal 6 with default behavior
The repo is quite large, and I haven't been able to reproduce this with a minimal example. Do you have any advice on how to troubleshoot this, or have you seen this error before?
Is it possible to use torch_xla 2.4 on v6e compute nodes?
Thank you very much for your help.