Skip to content

Missing model.safetensors.index.json in final hf checkpoint #2137

@legarwin

Description

@legarwin

Bug description

When using torchtitan with huggingface checkpoint for in and output, there seems to be a incosistent wrong behaviour for the save hf model.

If you provide a hf model with missing model.safetensors.index.json you get a fair warning

WARNING - model.safetensors.index.json not found at hf_assets_path: ./tests/assets/tokenizer/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format

and the saved final hf checkpoint behaves as expected. It saves a single hf safetensor (no matter if there were split safetensor files before) + index.json.

However, if the model.safetensors.index.json is not missing when loading the initial hf model, the saved final hf checkpoint seem to behave wrong. It keeps the safetensor splits for the output (based on the input model), but the model.safetensors.index.json is completely missing.

See the output here for a debug run with a single safetensor for the input and output model:
Image

I think the issue can be found in this if condition:
The first one skips the step of save the model.safetensors.index.json and also the call of method consolidate_safetensors_files_on_every_rank is not saving it.
Image

Can you have a look into that? Thank you! :)

Versions

  1. Tested on latest pytorch nightly: 2.10.0.dev20251210+cu128
  2. Use default debug model toml torchtitan/models/llama3/train_configs/debug_model.toml with checkpoint enabled and an adapted one where i pointed to a local downloaded model with tokenizer and model.safetensors.index.json, see
[model]
hf_assets_path = "<local-path>/llama3_debugmodel"

[checkpoint]
enable = true
folder = "checkpoint"
interval = 10
export_dtype = "float32"
async_mode = "disabled"  # ["disabled", "async", "async_with_pinned_mem"]
last_save_model_only = true
last_save_in_hf = true
initial_load_path = "<local-path>/llama3_debugmodel"
initial_load_in_hf = true
  1. Tested issue on torchtitan run_train.sh with no modifications

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions