Labels: type:bug (Something isn't working)
Description
Environment
- TPU: v5litepod-32 (32 chips, 8 hosts)
- JAX: 0.9.0
- jaxlib: 0.9.0
- orbax-checkpoint: 0.11.32
- Python: 3.12
- OS: Ubuntu 22.04 (tpu-ubuntu2204-base)
Problem
When converting a HuggingFace Llama 8B checkpoint to MaxText format using llama_or_mistral_ckpt.py, the checkpoint save fails. The layer conversion completes successfully (32/32 layers), but the Orbax save operation then aborts during the atomic file rename, leaving an incomplete checkpoint on disk.
Error Message
ValueError: NOT_FOUND: Error writing "params.params.decoder.decoder_norm.scale/c/0" in OCDBT database at local file "/home/user/maxtext-checkpoints/r1-distill-llama-8b/0.orbax-checkpoint-tmp/items.orbax-checkpoint-tmp/ocdbt.process_0/": Failed to rename fd: 14 "/home/user/maxtext-checkpoints/r1-distill-llama-8b/0.orbax-checkpoint-tmp/items.orbax-checkpoint-tmp/ocdbt.process_0/d/8f39efd1e47e6a0ada381d99719b652c.__lock" to: "/home/user/maxtext-checkpoints/r1-distill-llama-8b/0.orbax-checkpoint-tmp/items.orbax-checkpoint-tmp/ocdbt.process_0/d/8f39efd1e47e6a0ada381d99719b652c" [OS error 2: ENOENT No such file or directory] [source locations='tensorstore/internal/os/file_util_posix.cc:406\ntensorstore/kvstore/file/file_key_value_store.cc:662\ntensorstore/kvstore/transaction.cc:133']
Full Stack Trace
File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/futures/future.py", line 309, in _target_setting_result
self._result = target()
File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/futures/future.py", line 426, in <lambda>
target=lambda: asyncio_utils.run_sync(coro),
File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/asyncio_utils.py", line 36, in run_sync
return asyncio.run(coro)
File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/serialization/jax_array_handlers.py", line 373, in _serialize_without_dispatcher
await async_serialize_replica_slices_batch(
File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/serialization/jax_array_handlers.py", line 618, in _async_serialize_replica_slices
await asyncio.gather(*write_coros)
File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/serialization/serialization.py", line 265, in async_serialize_from_host
await asyncio.gather(*write_coros)
File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/serialization/serialization.py", line 253, in write_fragment
await t[fragment.index].write(
ValueError: NOT_FOUND: Error writing "params.params.decoder.decoder_norm.scale/c/0" in OCDBT database...
Steps to Reproduce
- Create a v5litepod-32 TPU
- Install MaxText from https://github.com/AI-Hypercomputer/maxtext:
pip install -e .
- Download a HuggingFace Llama model (e.g., deepseek-ai/DeepSeek-R1-Distill-Llama-8B)
- Run the checkpoint conversion:
JAX_PLATFORMS=cpu python3 llama_or_mistral_ckpt.py \
--base-model-path=/path/to/hf-model \
--maxtext-model-path=/path/to/output \
--model-size=llama3-8b \
--huggingface-checkpoint=True
Observed Behavior
- Layer conversion completes: layers: 100%|██████████| 32/32
- Sharding conversion completes
- Orbax save starts but fails during atomic rename
- Checkpoint directory contains ~11GB of data but is incomplete (missing tree structure metadata)
- Subsequent attempts to load the checkpoint fail with "No structure could be identified"
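The aborted commit is recognizable by the temporary-directory suffix left behind on disk. A minimal stdlib sketch for spotting such leftovers; the `.orbax-checkpoint-tmp` suffix is taken from the paths in the error message above and may differ across orbax-checkpoint versions, so treat it as an assumption:

```python
import os

# Orbax stages a checkpoint in "<step>.orbax-checkpoint-tmp" and renames it
# into place on success; a directory that still carries the suffix after the
# run means the save never completed. Suffix copied from this report's logs.
TMP_SUFFIX = ".orbax-checkpoint-tmp"

def incomplete_checkpoints(root: str) -> list[str]:
    """Return directories under `root` that still carry the temporary suffix."""
    leftovers = []
    for dirpath, dirnames, _ in os.walk(root):
        for d in dirnames:
            if d.endswith(TMP_SUFFIX):
                leftovers.append(os.path.join(dirpath, d))
    return sorted(leftovers)
```

Running this over the output directory after a failed conversion should flag the ~11GB partial checkpoint before a load attempt fails with "No structure could be identified".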
What We Tried
- Saving to local disk instead of GCS - same issue
- Using JAX_PLATFORMS=cpu - same issue
- Disabling OCDBT/Zarr3 with legacy format flags - same issue (save appears to succeed but 0 bytes written)
- Different JAX versions (0.5.0, 0.8.1, 0.9.0) - same issue
Analysis
The error appears to be a race condition or file descriptor issue in the tensorstore OCDBT layer:
- Orbax creates a .lock file during write
- Tries to atomically rename .lock → final file
- Rename fails with ENOENT despite the file existing moments before
This may be related to async coordination between multiple write operations.
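The failure reduces to plain POSIX semantics: rename(2) returns ENOENT when the source path no longer exists at rename time, which is exactly what you'd observe if a concurrent writer had already removed or renamed the .lock file. A stdlib sketch of the same syscall sequence (not the tensorstore code, just the pattern from the error message):

```python
import os
import tempfile

def commit_via_lock(data: bytes, final_path: str) -> None:
    """Write to `<final>.__lock`, then atomically rename over the final path.

    Mirrors the write/rename sequence in the reported error; if another
    process removes the lock file before the rename, os.rename raises
    FileNotFoundError (errno ENOENT), matching the failure above.
    """
    lock_path = final_path + ".__lock"
    with open(lock_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # Atomic on POSIX as long as source and destination share a filesystem.
    os.rename(lock_path, final_path)
```

If two async writers ever race on the same lock path, the loser's rename hits ENOENT, which would be consistent with the coordination issue suspected here.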
Expected Behavior
Checkpoint conversion should complete successfully and produce a loadable MaxText checkpoint.
Impact
This completely blocks the ability to convert HuggingFace models to MaxText format on multi-host TPUs, which is a primary use case for the MaxText + JetStream serving stack.