[torch-xla 2.9RC1] Random crash in sentencepiece with torch-xla 2.9 when doing vocab loading #9691

@jeffhataws

Description

🐛 Bug

We are seeing a random crash (likely memory corruption) in sentencepiece with torch-xla 2.9 when loading a sentencepiece vocab:

#import torch
import torch_xla
import sentencepiece as spm
sp_model = spm.SentencePieceProcessor("/home/ubuntu/souseki_sentencepiece.model")

On Ubuntu 22 we get:

(test_venv_py310) ubuntu@ip-172-31-1-215:~$ python repro2.py
WARNING:root:Defaulting to PJRT_DEVICE=CPU
terminate called after throwing an instance of 'std::system_error'
  what():  Invalid argument
Aborted (core dumped)

In repeated runs, it sometimes works without any crash:

(test_venv_py310) ubuntu@ip-172-31-1-215:~$ python repro2.py
WARNING:root:Defaulting to PJRT_DEVICE=CPU
(test_venv_py310) ubuntu@ip-172-31-1-215:~$ 

If we uncomment import torch, we still see intermittent crashes, sometimes with a different error (a c10::Error at interpreter exit):

WARNING:root:Defaulting to PJRT_DEVICE=CPU
terminate called after throwing an instance of 'std::system_error'
  what():  Invalid argument
Aborted (core dumped)
(test_venv_py310) ubuntu@ip-172-31-1-215:~$ python repro2.py
WARNING:root:Defaulting to PJRT_DEVICE=CPU
(test_venv_py310) ubuntu@ip-172-31-1-215:~$ python repro2.py
WARNING:root:Defaulting to PJRT_DEVICE=CPU
(test_venv_py310) ubuntu@ip-172-31-1-215:~$ vi repro2.py 
(test_venv_py310) ubuntu@ip-172-31-1-215:~$ python repro2.py
WARNING:root:Defaulting to PJRT_DEVICE=CPU
terminate called after throwing an instance of 'c10::Error'
  what():  kernels_.find(DispatchKey::Undefined) == kernels_.end() INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp":278, please report a bug to PyTorch. 
Exception raised from hasKernelForAnyDispatchKey at /pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:278 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x70c71717cb80 in /home/ubuntu/test_venv_py310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x69 (0x70c71710f095 in /home/ubuntu/test_venv_py310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::impl::OperatorEntry::hasKernelForAnyDispatchKey(c10::DispatchKeySet) const + 0x6a (0x70c6fc1dd2ba in /home/ubuntu/test_venv_py310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10::impl::OperatorEntry::computeDispatchTableEntryWithDebug(c10::Dispatcher const&, c10::DispatchKey) const + 0x124 (0x70c6fc1e0d04 in /home/ubuntu/test_venv_py310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10::impl::OperatorEntry::computeDispatchTableEntry(c10::Dispatcher const&, c10::DispatchKey) const + 0x9 (0x70c6fc1e0ea9 in /home/ubuntu/test_venv_py310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10::impl::OperatorEntry::updateDispatchTableEntry_(c10::Dispatcher const&, c10::DispatchKey) + 0x38 (0x70c6fc1e0ee8 in /home/ubuntu/test_venv_py310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10::impl::OperatorEntry::updateDispatchTable_(c10::Dispatcher const&, c10::DispatchKey) + 0x95 (0x70c6fc1e1095 in /home/ubuntu/test_venv_py310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10::Dispatcher::deregisterImpl_(c10::OperatorHandle const&, c10::OperatorName const&, std::optional<c10::DispatchKey>, std::_List_iterator<c10::impl::AnnotatedKernel>) + 0x27 (0x70c6fc1d2a17 in /home/ubuntu/test_venv_py310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x19d2b31 (0x70c6fc1d2b31 in /home/ubuntu/test_venv_py310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #9: torch::detail::TorchLibraryInit::~TorchLibraryInit() + 0x38 (0x70c5dc23a698 in /home/ubuntu/test_venv_py310/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so)
frame #10: <unknown function> + 0x45495 (0x70c71ea45495 in /lib/x86_64-linux-gnu/libc.so.6)
frame #11: on_exit + 0 (0x70c71ea45610 in /lib/x86_64-linux-gnu/libc.so.6)
frame #12: <unknown function> + 0x29d97 (0x70c71ea29d97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #13: __libc_start_main + 0x80 (0x70c71ea29e40 in /lib/x86_64-linux-gnu/libc.so.6)
<omitting python frames>

Aborted (core dumped)

You may have to run the script several times to trigger the crash. Strangely, the behavior becomes stable if we either uninstall or install accelerate, depending on the environment.

(test_venv_py310) ubuntu@ip-172-31-1-215:~$ pip uninstall accelerate
Found existing installation: accelerate 1.11.0
Uninstalling accelerate-1.11.0:
  Would remove:
    /home/ubuntu/test_venv_py310/bin/accelerate
    /home/ubuntu/test_venv_py310/bin/accelerate-config
    /home/ubuntu/test_venv_py310/bin/accelerate-estimate-memory
    /home/ubuntu/test_venv_py310/bin/accelerate-launch
    /home/ubuntu/test_venv_py310/bin/accelerate-merge-weights
    /home/ubuntu/test_venv_py310/lib/python3.10/site-packages/accelerate-1.11.0.dist-info/*
    /home/ubuntu/test_venv_py310/lib/python3.10/site-packages/accelerate/*
Proceed (Y/n)? y
  Successfully uninstalled accelerate-1.11.0
(test_venv_py310) ubuntu@ip-172-31-1-215:~$ python repro2.py
WARNING:root:Defaulting to PJRT_DEVICE=CPU
(test_venv_py310) ubuntu@ip-172-31-1-215:~$ python repro2.py
WARNING:root:Defaulting to PJRT_DEVICE=CPU
(test_venv_py310) ubuntu@ip-172-31-1-215:~$ python repro2.py
WARNING:root:Defaulting to PJRT_DEVICE=CPU
(test_venv_py310) ubuntu@ip-172-31-1-215:~$ python repro2.py
WARNING:root:Defaulting to PJRT_DEVICE=CPU
(test_venv_py310) ubuntu@ip-172-31-1-215:~$ python repro2.py
WARNING:root:Defaulting to PJRT_DEVICE=CPU
(test_venv_py310) ubuntu@ip-172-31-1-215:~$ python repro2.py
WARNING:root:Defaulting to PJRT_DEVICE=CPU

By bisecting torch-xla commits, I narrowed the regression down to 748ac9b, the August OpenXLA pin upgrade, which pulls in an update to protobuf 6.31.1 (openxla/xla@72a784f):

b098be87dde58fe48e5effe72c0bb6b9b4ba5b6e    bad     8/22/2025
748ac9b1032cea9499f8062a10607eceb4a84cb7    bad     8/22/2025
6b6ef5c7d757f955565b2083c48d936bfd758dcd    good    8/22/2025
b84c83b46615f767e6d94cda959db8178ddd95b5    good    8/21/2025
0f56dec9a33a993d4c14cb755bdd25490cabba21    good    8/19/2025
a1c6ee92c85e8b0955c20892ed68f032a6015c09    good    8/16/2025

Building torch-xla with DEBUG=1 also avoids the sentencepiece crash, so a debug build is not an option for investigating this.

Looking at sentencepiece, which bundles its own copy of protobuf, I see that the bundled copy was last updated to version 3.14 five years ago (google/sentencepiece@152a87f).

Compiling the latest sentencepiece didn't help, and I don't know how to update the bundled protobuf-lite there.
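
To sanity-check the duplicate-protobuf theory, one can count the protobuf symbols each extension module exposes. A minimal sketch, assuming typical wheel layouts (_XLAC*.so at the top of site-packages, as in the traceback above; _sentencepiece*.so inside the sentencepiece package) and binutils' nm on PATH:

# Sketch under the assumptions stated above; the glob patterns are
# guesses and may need adjusting per install.
import glob
import os
import subprocess
import sysconfig

site = sysconfig.get_paths()["purelib"]
patterns = ["_XLAC*.so", os.path.join("sentencepiece", "_sentencepiece*.so")]
for pattern in patterns:
    for so in glob.glob(os.path.join(site, pattern)):
        out = subprocess.run(["nm", "-D", so], capture_output=True, text=True)
        hits = [s for s in out.stdout.splitlines() if "protobuf" in s]
        print(f"{os.path.basename(so)}: {len(hits)} protobuf symbols")

If both modules carry their own protobuf symbols, an ODR-style clash between the two copies would be consistent with the random corruption seen here.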

To Reproduce

Install the packages (assuming a Python 3.10 environment):

pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.9.0rc1-cp310-cp310-linux_x86_64.whl
pip install accelerate torch==2.9 sentencepiece
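
Optionally, sanity-check the installed versions first. The guarded google.protobuf import is an assumption here, since the Python protobuf package may not be present; torch_xla is deliberately not imported so the check itself cannot trigger the crash.

# Print the versions involved, useful when comparing environments.
import torch
import sentencepiece

print("torch:", torch.__version__)
print("sentencepiece:", sentencepiece.__version__)
try:
    # The Python protobuf package is optional in this environment.
    import google.protobuf
    print("protobuf (python):", google.protobuf.__version__)
except ImportError:
    print("protobuf (python): not installed")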

Download any sentencepiece model, for example from Hugging Face:

cd ~/
wget https://huggingface.co/ganchengguang/RoBERTa-base-japanese-sentencepiece/resolve/main/souseki_sentencepiece.model

The model argument to SentencePieceProcessor needs to be an absolute path; change it to match your environment.

#import torch
import torch_xla
import sentencepiece as spm
sp_model = spm.SentencePieceProcessor("/home/ubuntu/souseki_sentencepiece.model")
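
One experiment worth trying (not verified in this report) is to reverse the import order, so that sentencepiece's statically linked protobuf-lite is loaded before torch_xla's:

# Experiment: import sentencepiece before torch_xla (order swapped
# relative to the repro above).
import sentencepiece as spm
import torch_xla

sp_model = spm.SentencePieceProcessor("/home/ubuntu/souseki_sentencepiece.model")

If the crash disappears with this ordering, that would further point at a protobuf symbol clash between the two extensions.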

Expected behavior

No crash

Environment

  • Reproducible on XLA backend [CPU/TPU]: CPU
  • torch_xla version: 2.9
