🐛 Bug
We are seeing a random crash (likely memory corruption) in sentencepiece with torch-xla 2.9 when loading a sentencepiece vocab:
#import torch
import torch_xla
import sentencepiece as spm
sp_model = spm.SentencePieceProcessor("/home/ubuntu/souseki_sentencepiece.model")
On Ubuntu 22.04 we get:
(test_venv_py310) ubuntu@ip-172-31-1-215:~$ python repro2.py
WARNING:root:Defaulting to PJRT_DEVICE=CPU
terminate called after throwing an instance of 'std::system_error'
what(): Invalid argument
Aborted (core dumped)
In repeated runs, it sometimes works without any crash:
(test_venv_py310) ubuntu@ip-172-31-1-215:~$ python repro2.py
WARNING:root:Defaulting to PJRT_DEVICE=CPU
(test_venv_py310) ubuntu@ip-172-31-1-215:~$
If we uncomment the import torch line, we also see:
WARNING:root:Defaulting to PJRT_DEVICE=CPU
terminate called after throwing an instance of 'std::system_error'
what(): Invalid argument
Aborted (core dumped)
(test_venv_py310) ubuntu@ip-172-31-1-215:~$ python repro2.py
WARNING:root:Defaulting to PJRT_DEVICE=CPU
(test_venv_py310) ubuntu@ip-172-31-1-215:~$ python repro2.py
WARNING:root:Defaulting to PJRT_DEVICE=CPU
(test_venv_py310) ubuntu@ip-172-31-1-215:~$ vi repro2.py
(test_venv_py310) ubuntu@ip-172-31-1-215:~$ python repro2.py
WARNING:root:Defaulting to PJRT_DEVICE=CPU
terminate called after throwing an instance of 'c10::Error'
what(): kernels_.find(DispatchKey::Undefined) == kernels_.end() INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp":278, please report a bug to PyTorch.
Exception raised from hasKernelForAnyDispatchKey at /pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:278 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x70c71717cb80 in /home/ubuntu/test_venv_py310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x69 (0x70c71710f095 in /home/ubuntu/test_venv_py310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::impl::OperatorEntry::hasKernelForAnyDispatchKey(c10::DispatchKeySet) const + 0x6a (0x70c6fc1dd2ba in /home/ubuntu/test_venv_py310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10::impl::OperatorEntry::computeDispatchTableEntryWithDebug(c10::Dispatcher const&, c10::DispatchKey) const + 0x124 (0x70c6fc1e0d04 in /home/ubuntu/test_venv_py310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10::impl::OperatorEntry::computeDispatchTableEntry(c10::Dispatcher const&, c10::DispatchKey) const + 0x9 (0x70c6fc1e0ea9 in /home/ubuntu/test_venv_py310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10::impl::OperatorEntry::updateDispatchTableEntry_(c10::Dispatcher const&, c10::DispatchKey) + 0x38 (0x70c6fc1e0ee8 in /home/ubuntu/test_venv_py310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10::impl::OperatorEntry::updateDispatchTable_(c10::Dispatcher const&, c10::DispatchKey) + 0x95 (0x70c6fc1e1095 in /home/ubuntu/test_venv_py310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10::Dispatcher::deregisterImpl_(c10::OperatorHandle const&, c10::OperatorName const&, std::optional<c10::DispatchKey>, std::_List_iterator<c10::impl::AnnotatedKernel>) + 0x27 (0x70c6fc1d2a17 in /home/ubuntu/test_venv_py310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x19d2b31 (0x70c6fc1d2b31 in /home/ubuntu/test_venv_py310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #9: torch::detail::TorchLibraryInit::~TorchLibraryInit() + 0x38 (0x70c5dc23a698 in /home/ubuntu/test_venv_py310/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so)
frame #10: <unknown function> + 0x45495 (0x70c71ea45495 in /lib/x86_64-linux-gnu/libc.so.6)
frame #11: on_exit + 0 (0x70c71ea45610 in /lib/x86_64-linux-gnu/libc.so.6)
frame #12: <unknown function> + 0x29d97 (0x70c71ea29d97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #13: __libc_start_main + 0x80 (0x70c71ea29e40 in /lib/x86_64-linux-gnu/libc.so.6)
<omitting python frames>
Aborted (core dumped)
You may have to run it several times to trigger the crash; a small driver loop that automates this is sketched after the transcript below. It is also strange that the runs become stable (no crash) if we either uninstall or install accelerate, depending on the environment.
(test_venv_py310) ubuntu@ip-172-31-1-215:~$ pip uninstall accelerate
Found existing installation: accelerate 1.11.0
Uninstalling accelerate-1.11.0:
Would remove:
/home/ubuntu/test_venv_py310/bin/accelerate
/home/ubuntu/test_venv_py310/bin/accelerate-config
/home/ubuntu/test_venv_py310/bin/accelerate-estimate-memory
/home/ubuntu/test_venv_py310/bin/accelerate-launch
/home/ubuntu/test_venv_py310/bin/accelerate-merge-weights
/home/ubuntu/test_venv_py310/lib/python3.10/site-packages/accelerate-1.11.0.dist-info/*
/home/ubuntu/test_venv_py310/lib/python3.10/site-packages/accelerate/*
Proceed (Y/n)? y
Successfully uninstalled accelerate-1.11.0
(test_venv_py310) ubuntu@ip-172-31-1-215:~$ python repro2.py
WARNING:root:Defaulting to PJRT_DEVICE=CPU
(test_venv_py310) ubuntu@ip-172-31-1-215:~$ python repro2.py
WARNING:root:Defaulting to PJRT_DEVICE=CPU
(test_venv_py310) ubuntu@ip-172-31-1-215:~$ python repro2.py
WARNING:root:Defaulting to PJRT_DEVICE=CPU
(test_venv_py310) ubuntu@ip-172-31-1-215:~$ python repro2.py
WARNING:root:Defaulting to PJRT_DEVICE=CPU
(test_venv_py310) ubuntu@ip-172-31-1-215:~$ python repro2.py
WARNING:root:Defaulting to PJRT_DEVICE=CPU
(test_venv_py310) ubuntu@ip-172-31-1-215:~$ python repro2.py
WARNING:root:Defaulting to PJRT_DEVICE=CPU
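Since the crash is intermittent, a small driver loop makes the failure rate easier to measure. This is only a sketch: repro2.py is the four-line script above, and the run count of 20 is arbitrary.

# run_repro.py -- run the reproducer repeatedly and count crashes.
# Assumes repro2.py is the four-line script shown above.
import subprocess
import sys

runs, crashes = 20, 0
for i in range(runs):
    proc = subprocess.run(
        [sys.executable, "repro2.py"], capture_output=True, text=True
    )
    if proc.returncode != 0:  # SIGABRT shows up as a nonzero/negative code
        crashes += 1
        tail = proc.stderr.strip().splitlines()
        print(f"run {i}: exit {proc.returncode}: {tail[-1] if tail else ''}")
print(f"{crashes}/{runs} runs crashed")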
By bisecting torch-xla commits, I narrowed it down to the August OpenXLA pin bump 748ac9b, which includes an update to protobuf 6.31.1 (openxla/xla@72a784f):
b098be87dde58fe48e5effe72c0bb6b9b4ba5b6e bad 8/22/2025
748ac9b1032cea9499f8062a10607eceb4a84cb7 bad 8/22/2025
6b6ef5c7d757f955565b2083c48d936bfd758dcd good 8/22/2025
b84c83b46615f767e6d94cda959db8178ddd95b5 good 8/21/2025
0f56dec9a33a993d4c14cb755bdd25490cabba21 good 8/19/2025
a1c6ee92c85e8b0955c20892ed68f032a6015c09 good 8/16/2025
Building torch-xla with DEBUG=1 also avoids the sentencepiece crash, so investigating this with a debug build is not an option.
Looking at sentencepiece, which bundles its own copy of protobuf, I see that the last update there was to version 3.14, five years ago (google/sentencepiece@152a87f).
Compiling the latest sentencepiece didn't help, and I don't know how to update its bundled protobuf-lite.
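If the root cause is a clash between the two protobuf copies at dynamic-link time, one way to check is to compare the protobuf symbols each extension module exports. A rough sketch; both .so paths are assumptions based on this venv's layout, and if sentencepiece statically links protobuf-lite with hidden visibility the overlap may legitimately be empty:

# proto_syms.py -- compare the protobuf symbols dynamically exported by the
# two extension modules, to look for collisions between torch-xla's
# protobuf 6.31.1 and sentencepiece's bundled protobuf-lite 3.14.
# Both .so paths below are assumptions; adjust to your site-packages layout.
import subprocess

LIBS = [
    "/home/ubuntu/test_venv_py310/lib/python3.10/site-packages/"
    "_XLAC.cpython-310-x86_64-linux-gnu.so",
    "/home/ubuntu/test_venv_py310/lib/python3.10/site-packages/"
    "sentencepiece/_sentencepiece.cpython-310-x86_64-linux-gnu.so",
]

def proto_symbols(path):
    # nm -D prints the dynamic symbol table; Itanium-mangled names in the
    # google::protobuf namespace contain the substring "6google8protobuf".
    out = subprocess.run(["nm", "-D", path], capture_output=True, text=True)
    return {ln.split()[-1] for ln in out.stdout.splitlines()
            if "6google8protobuf" in ln}

xla_syms, spm_syms = (proto_symbols(p) for p in LIBS)
shared = xla_syms & spm_syms
print(f"{len(xla_syms)} protobuf symbols in _XLAC, "
      f"{len(spm_syms)} in _sentencepiece, {len(shared)} shared")
for sym in sorted(shared)[:10]:
    print("  ", sym)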
To Reproduce
Install the packages (assuming a Python 3.10 environment):
pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.9.0rc1-cp310-cp310-linux_x86_64.whl
pip install accelerate torch==2.9 sentencepiece
Download any sentencepiece model, for example this one from Hugging Face:
cd ~/
wget https://huggingface.co/ganchengguang/RoBERTa-base-japanese-sentencepiece/resolve/main/souseki_sentencepiece.model
The model argument to SentencePieceProcessor needs to be an absolute path; change it to match your environment.
#import torch
import torch_xla
import sentencepiece as spm
sp_model = spm.SentencePieceProcessor("/home/ubuntu/souseki_sentencepiece.model")
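One more untested diagnostic (a suggestion, not a fix): flip the import order. If the crash comes from dynamic-symbol interposition between the two protobuf copies, loading sentencepiece before torch_xla could change which symbol definitions win:

# repro2_flipped.py -- same reproducer with sentencepiece imported first,
# as a data point for the symbol-interposition hypothesis (diagnostic only).
# The model path is environment-specific.
import sentencepiece as spm
import torch_xla

sp_model = spm.SentencePieceProcessor("/home/ubuntu/souseki_sentencepiece.model")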
Expected behavior
No crash
Environment
- Reproducible on XLA backend [CPU/TPU]: CPU
- torch_xla version: 2.9