
Checkpoint save fails with ENOENT during OCDBT atomic rename on TPU v5litepod #2837

@jaisong123

Description


Environment

  • TPU: v5litepod-32 (32 chips, 8 hosts)
  • JAX: 0.9.0
  • jaxlib: 0.9.0
  • orbax-checkpoint: 0.11.32
  • Python: 3.12
  • OS: Ubuntu 22.04 (tpu-ubuntu2204-base)

Problem

When converting a HuggingFace Llama 8B checkpoint to MaxText format using llama_or_mistral_ckpt.py, the checkpoint save fails. The layer conversion completes successfully (32/32 layers), but the Orbax save operation then fails during the atomic file rename that commits an OCDBT chunk.

Error Message

ValueError: NOT_FOUND: Error writing "params.params.decoder.decoder_norm.scale/c/0" in OCDBT database at local file "/home/user/maxtext-checkpoints/r1-distill-llama-8b/0.orbax-checkpoint-tmp/items.orbax-checkpoint-tmp/ocdbt.process_0/": Failed to rename fd: 14 "/home/user/maxtext-checkpoints/r1-distill-llama-8b/0.orbax-checkpoint-tmp/items.orbax-checkpoint-tmp/ocdbt.process_0/d/8f39efd1e47e6a0ada381d99719b652c.__lock" to: "/home/user/maxtext-checkpoints/r1-distill-llama-8b/0.orbax-checkpoint-tmp/items.orbax-checkpoint-tmp/ocdbt.process_0/d/8f39efd1e47e6a0ada381d99719b652c" [OS error 2: ENOENT No such file or directory] [source locations='tensorstore/internal/os/file_util_posix.cc:406\ntensorstore/kvstore/file/file_key_value_store.cc:662\ntensorstore/kvstore/transaction.cc:133']

Full Stack Trace

File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/futures/future.py", line 309, in _target_setting_result
    self._result = target()
File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/futures/future.py", line 426, in <lambda>
    target=lambda: asyncio_utils.run_sync(coro),
File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/asyncio_utils.py", line 36, in run_sync
    return asyncio.run(coro)
File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/serialization/jax_array_handlers.py", line 373, in _serialize_without_dispatcher
    await async_serialize_replica_slices_batch(
File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/serialization/jax_array_handlers.py", line 618, in _async_serialize_replica_slices
    await asyncio.gather(*write_coros)
File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/serialization/serialization.py", line 265, in async_serialize_from_host
    await asyncio.gather(*write_coros)
File "/home/user/venv312/lib/python3.12/site-packages/orbax/checkpoint/_src/serialization/serialization.py", line 253, in write_fragment
    await t[fragment.index].write(
ValueError: NOT_FOUND: Error writing "params.params.decoder.decoder_norm.scale/c/0" in OCDBT database...

Steps to Reproduce

  1. Create a v5litepod-32 TPU
  2. Install MaxText: clone https://github.com/AI-Hypercomputer/maxtext and run pip install -e . in the repo root
  3. Download a HuggingFace Llama model (e.g., deepseek-ai/DeepSeek-R1-Distill-Llama-8B)
  4. Run the checkpoint conversion:
JAX_PLATFORMS=cpu python3 llama_or_mistral_ckpt.py \
    --base-model-path=/path/to/hf-model \
    --maxtext-model-path=/path/to/output \
    --model-size=llama3-8b \
    --huggingface-checkpoint=True

Observed Behavior

  1. Layer conversion completes: layers: 100%|██████████| 32/32
  2. Sharding conversion completes
  3. Orbax save starts but fails during atomic rename
  4. Checkpoint directory contains ~11GB of data but is incomplete (missing tree structure metadata)
  5. Subsequent attempts to load the checkpoint fail with "No structure could be identified"
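
The half-written state in step 4 can be detected before attempting a restore. A minimal sketch, assuming Orbax's atomic-commit convention of renaming a `*.orbax-checkpoint-tmp` directory to its final name on success (the suffix is taken from the paths in the traceback above; treating any leftover tmp directory as an aborted save is an assumption, not documented Orbax behavior):

```python
import os

TMP_SUFFIX = ".orbax-checkpoint-tmp"  # suffix seen in the traceback paths

def aborted_saves(root):
    """Return paths under `root` still carrying the temporary-save suffix.

    If the save had committed, the tmp directory would have been renamed
    to its final name, so any leftover tmp directory indicates an
    interrupted or failed save.
    """
    hits = []
    for dirpath, dirnames, _ in os.walk(root):
        for d in dirnames:
            if d.endswith(TMP_SUFFIX):
                hits.append(os.path.join(dirpath, d))
    return hits
```

Running this against the checkpoint directory from the failed run reports both the `0.orbax-checkpoint-tmp` and nested `items.orbax-checkpoint-tmp` directories, matching the "No structure could be identified" restore failure.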

What We Tried

  • Saving to local disk instead of GCS - same issue
  • Using JAX_PLATFORMS=cpu - same issue
  • Disabling OCDBT/Zarr3 with legacy format flags - same issue (save appears to succeed but 0 bytes written)
  • Different JAX versions (0.5.0, 0.8.1, 0.9.0) - same issue
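
One stopgap not yet tried: if the rename failure is a transient race, simply retrying the save may succeed. A generic stdlib retry wrapper, sketched under the assumption that the failing call can be re-invoked safely (`save_checkpoint` in the comment is a hypothetical stand-in, not a real MaxText or Orbax function):

```python
import time

def retry(fn, attempts=3, delay=2.0, exc=(ValueError, OSError)):
    """Call fn(), retrying on the given exception types.

    Waits `delay` seconds between attempts; the final failure is
    re-raised so the caller still sees the original error.
    """
    for i in range(attempts):
        try:
            return fn()
        except exc:
            if i == attempts - 1:
                raise
            time.sleep(delay)

# Usage (hypothetical): retry(lambda: save_checkpoint(state, path), attempts=3)
```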

Analysis

The error appears to be a race condition or file-descriptor issue in the tensorstore OCDBT layer:

  1. Tensorstore writes chunk data to a temporary .__lock file
  2. It then tries to atomically rename the .__lock file to its final name
  3. The rename fails with ENOENT, i.e. the source .__lock file no longer exists, even though it was created moments before

This may be related to async coordination between multiple write operations.
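
The failing pattern (write a temporary `.__lock` file, then rename it over the final name) can be reproduced in isolation to show what ENOENT on rename implies: the *source* path vanished between the write and the rename, e.g. via a concurrent cleanup of the tmp directory. A stdlib sketch with illustrative file names (not tensorstore's actual layout):

```python
import errno
import os
import tempfile

tmpdir = tempfile.mkdtemp()
lock_path = os.path.join(tmpdir, "chunk.__lock")
final_path = os.path.join(tmpdir, "chunk")

# Normal path: write the lock file, then atomically rename it into place.
with open(lock_path, "w") as f:
    f.write("payload")
os.rename(lock_path, final_path)  # succeeds
assert os.path.exists(final_path)

# Failure mode from the report: the lock file disappears (e.g. a concurrent
# cleanup of the tmp directory) before the rename runs.
with open(lock_path, "w") as f:
    f.write("payload")
os.remove(lock_path)  # simulates the race
try:
    os.rename(lock_path, final_path)
except OSError as e:
    print(e.errno == errno.ENOENT)  # → True: same "OS error 2" as above
```

This is consistent with the traceback, where two `asyncio.gather` layers fan out many write coroutines over the same tmp directory.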

Expected Behavior

Checkpoint conversion should complete successfully and produce a loadable MaxText checkpoint.

Impact

This completely blocks the ability to convert HuggingFace models to MaxText format on multi-host TPUs, which is a primary use case for the MaxText + JetStream serving stack.

Labels

type:bug (Something isn't working)