
Conversation


@TomQunChao TomQunChao commented Nov 26, 2025

What does this PR do?

Problem statement: the safetensors.torch.serialize_file function holds the GIL for the entire save_file process, even while it is writing the file. Writing is time-consuming, and no other Python threads can execute in the meantime.

This PR adds a function safetensors.torch.serialize_file_threadable, which does the following:

  1. Releases the GIL while writing files. As long as the caller guarantees that the tensors being saved are not modified during the checkpoint write, the function can safely be used from Python threads.
  2. Python passes only a data pointer to the Rust layer, eliminating complex type conversion and type checking when writing bytes. The benchmark shows that when saving ~100 MB of data, serialize_file takes ~1400 ms, while serialize_file_threadable needs only ~30 ms.
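The async-checkpointing pattern this targets can be sketched with stdlib file I/O, which (like the proposed serialize_file_threadable) releases the GIL during the underlying write syscalls. This is a hedged sketch: blocking_save and checkpoint_async are hypothetical stand-ins, not part of safetensors.

```python
import os
import tempfile
import threading

def blocking_save(data: bytes, path: str) -> None:
    # Plain binary file writes release the GIL around the OS write call,
    # so other Python threads keep running while the checkpoint is flushed.
    with open(path, "wb") as f:
        f.write(data)

def checkpoint_async(data: bytes, path: str) -> threading.Thread:
    # Caller must guarantee `data` is not mutated until join() returns,
    # mirroring the immutability requirement stated above.
    t = threading.Thread(target=blocking_save, args=(data, path))
    t.start()
    return t

# usage: the training thread continues while the checkpoint is written
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "ckpt.bin")
    t = checkpoint_async(b"\x00" * 1024, path)
    # ... training step continues here ...
    t.join()
    assert os.path.getsize(path) == 1024
```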

benchmark result of safetensors.torch.save_file vs save_file_threadable

Name (time in ms)               Min        Max        Mean       StdDev    Median     IQR       Outliers  OPS     Rounds  Iterations
test_sf_threadable_save_cpu     53.4114    239.5388   114.8421   75.5853   100.2545   95.5060   1;0       8.7076  5       1
test_sf_save_cpu                2213.5468  3129.8307  2538.6077  354.9364  2437.3711  396.3095  1;0       0.3939  5       1


@danieldk (Member) commented Dec 1, 2025:


Thank you for submitting this! Could you please, before anyone reviews the PR:

  • Check that the PR is in a good shape for submission. E.g. here a large part of the README is deleted, one of the new docs contains an ASCII emoji, etc.
  • Make sure that the files are properly formatted with cargo fmt.
  • Provide some reproducible data/benchmarks to verify.
  • Split a change into multiple PRs if there are changes that can be separated into smaller units.
  • Check if extending the API is really necessary.

@TomQunChao (Author) replied:

Thank you very much for the suggestions!
I have reverted all unintended changes (e.g., the README deletions and the ASCII emoji in the doc) and run cargo fmt on every touched file.
I also added a benchmark that compares the performance of the two save_file variants; the numbers are pasted in the PR body.
Regarding the necessity of the new API:

  • In some training setups (at least in mine), checkpointing has to happen asynchronously.
    The current safetensors.torch.save_file never releases the GIL, so any background thread that calls it blocks every other Python thread for the entire serialization window.
    An async-friendly entry point is therefore useful.
  • The “thread-able” version is measurably faster on the benchmark, but it requires the user to guarantee that the CPU tensors being serialized will not be mutated during the call.
    That restriction makes it a non-drop-in replacement for the original function, so exposing it under a separate name keeps existing code safe while letting new code opt in to the extra performance.
  • I’m still new to the safetensors code base, so I don’t have a clear picture of the long-term maintenance cost.
    If the maintainers decide that the extra API surface is unacceptable, I’m happy to fold the functionality into the existing save_file behind an optional flag (or any other design you prefer).

Please let me know what direction you’d like me to take next.

@TomQunChao force-pushed the feat/gil_free_serialize_file branch from 473dd86 to 17c2404 on December 2, 2025 08:58
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker left a comment:


A small nit; we talked about this offline with @McPatate! It's a nice PR.

@McPatate (Member) left a comment:


These are the results I'm getting from running the benchmarks (python3 -m pytest benches/test_pt.py -k save_cpu) in your repo on the feat/gil_free_serialize_file branch (with my added no-zero-copy test):

-------------------------------------------------------------------------------------------- benchmark: 3 tests -------------------------------------------------------------------------------------------
Name (time in s)                                 Min                Max               Mean            StdDev             Median               IQR            Outliers     OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_sf_threadable_save_cpu                   4.1726 (1.0)       4.1866 (1.0)       4.1789 (1.0)      0.0053 (1.0)       4.1773 (1.0)      0.0073 (1.0)           2;0  0.2393 (1.0)           5           1
test_sf_threadable_save_cpu_no_zero_copy      4.1736 (1.00)      4.1895 (1.00)      4.1818 (1.00)     0.0067 (1.25)      4.1798 (1.00)     0.0110 (1.50)          2;0  0.2391 (1.00)          5           1
test_sf_save_cpu                             11.2521 (2.70)     12.0204 (2.87)     11.6914 (2.80)     0.3391 (63.56)    11.8754 (2.84)     0.5627 (77.08)         1;0  0.0855 (0.36)          5           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

These seem a bit more realistic compared to a 22-fold speedup 😄
Still very impressive results, kudos! Although I'm having trouble understanding why the no-zero-copy variant is the same speed as the zero-copy one. Any idea what is going on here?

I ran the CI, there are some lint errors, could you fix them?

I think in follow-up PRs we could enable this with different libraries, would you be keen to add this @TomQunChao?

}

fn data_len(&self) -> usize {
    self.data_len
Member comment:

Suggested change:

-    self.data_len
+    match &self.tensor_data {
+        Owned(data) => data.len(),
+        Pointer(ptr) => ptr.len,
+    }

Same here for the ref; you can run clippy to make sure it doesn't bark at you.

@TomQunChao (Author) replied:

ok, I'll check it.

Member comment:

Could you add this benchmark as well:

def test_sf_threadable_save_cpu_no_zero_copy(benchmark):
    # benchmark save_file_threadable vs save_file
    weights = create_gpt2(12)

    # Benchmark save_file_threadable
    with tempfile.NamedTemporaryFile(delete=False) as f_threadable:
        benchmark(save_file_threadable, weights, f_threadable.name, None, False)

    # Clean up files
    os.unlink(f_threadable.name)

Member comment:

you can ignore this!

@TomQunChao (Author) replied:

Thanks for the suggestions!
Just to confirm — are you referring to enabling this feature in other libraries such as transformers?
If so, I’d be very happy to work on that!

@McPatate (Member) commented:

> Although I'm having trouble understanding why the no zero copy is the same speed as the zero copy one. Any idea what is going on here?

Dug a little deeper and profiled the traces. It's not exactly the same speed: the mean differs in my bench by 29 ms, which makes sense for a copy from RAM (the to_vec call in the fn data() impl) where bandwidth is on the order of >20 GB/s (here we bench with weights for a gpt2-like model of ~500 MB). I still think having a zero-copy option should help shave off some cycles on larger files, so it's good to have it!
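The ~29 ms gap is consistent with a back-of-the-envelope copy-time estimate (figures taken from the comment above; both numbers are approximations):

```python
# Rough sanity check of the extra copy cost in the no-zero-copy path.
payload_bytes = 500e6   # ~500 MB of gpt2-like weights
bandwidth = 20e9        # ~20 GB/s RAM copy bandwidth
copy_ms = payload_bytes / bandwidth * 1e3
print(copy_ms)          # 25.0 ms, close to the ~29 ms mean difference observed
```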

I think I understand the 3x performance boost as well. From the flame graph I can see that a lot of time is spent in the prepare fn, with a lot of calls to core::fmt::Display fns (I assume to_string). I narrowed down the culprit to this piece of code:

        let pydata: PyBound<PyAny> = tensor_desc
            .get_item("data")?
            .ok_or_else(|| SafetensorError::new_err(format!("Missing `data` in {tensor_desc}")))?;
        // Make sure it's extractable first.
        let data: &[u8] = pydata.extract()?;
        let data_len = data.len();
        let data: PyBound<PyBytes> = pydata.extract()?;

It seems A LOT of copies happen there, and we write after that.

Attached the flamegraphs for reference (sorry, didn't manage to extract all the symbols): save_file, save_threadable_no_zc, save_threadable_zc.

@McPatate (Member) commented:

So! Now that I've thoroughly reviewed and understand what's going on, I think we can refactor things to have only one save_file function, which would map to save_file_threadable(zero_copy=True). I don't think we'll ever want a non-zero-copy version, and the old version is just bad, so this is good!

We can in this case, unless it breaks other things:

  • remove PyView
  • remove _flatten
  • remove _tobytes
  • move the save_file_threadable-related code into save_file and delete the separate entry point

@McPatate (Member) commented:

Soooo I have a clearer view on what's going on here. It seems I was wrong when saying the code with the pybuffer extraction is the culprit. In fact, we can see in the traces that everything is under the prepare_shape call. We took a look with @danieldk and we narrowed it down to the ok_or call in the prepare_shape fn. Changing it to ok_or_else does the following to perf:

------------------------------------------------------------------------------------------ benchmark: 3 tests -----------------------------------------------------------------------------------------
Name (time in s)                                Min               Max              Mean            StdDev            Median               IQR            Outliers     OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_sf_threadable_save_cpu_no_zero_copy     4.1422 (1.0)      4.2176 (1.01)     4.1800 (1.00)     0.0106 (1.55)     4.1793 (1.0)      0.0085 (1.0)           9;3  0.2392 (1.00)         50           1
test_sf_threadable_save_cpu                  4.1431 (1.00)     4.2373 (1.01)     4.1799 (1.0)      0.0179 (2.63)     4.1795 (1.00)     0.0110 (1.30)         13;9  0.2392 (1.0)          50           1
test_sf_save_cpu                             4.1591 (1.00)     4.1955 (1.0)      4.1801 (1.00)     0.0068 (1.0)      4.1793 (1.00)     0.0086 (1.01)         13;1  0.2392 (1.00)         50           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

So it seems a regression was introduced here: from what I can see, the code on main didn't have this eager ok_or in prepare for the shape extraction.
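The ok_or vs ok_or_else difference can be illustrated with a small Python analogue (illustrative only; ok_or and ok_or_else here are hypothetical helpers mimicking Rust's Option API). The eager variant evaluates its error argument on every call, even on the success path, which is exactly what made the formatted message in prepare_shape expensive.

```python
def ok_or(opt, err):
    # `err` was already evaluated at the call site — eager, like Rust's ok_or.
    if opt is None:
        raise ValueError(err)
    return opt

def ok_or_else(opt, err_fn):
    # `err_fn` only runs on the failure path — lazy, like Rust's ok_or_else.
    if opt is None:
        raise ValueError(err_fn())
    return opt

desc = {"shape": [2, 3], "data": b"\x00" * 24}

# Eager: the f-string renders the whole dict on every call, even on success.
shape = ok_or(desc.get("shape"), f"Missing `shape` in {desc}")

# Lazy: nothing is formatted unless `shape` is actually missing.
shape = ok_or_else(desc.get("shape"), lambda: f"Missing `shape` in {desc}")
```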

Btw all the results I shared are coming from a Linux machine. Digging some more it seems that the IO configuration for this machine isn't great, which might explain why results don't really differ between each method.

When testing on mac, these are the results I get:

--------------------------------------------------------------------------------------- benchmark: 2 tests --------------------------------------------------------------------------------------
Name (time in ms)                   Min                 Max               Mean             StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_sf_threadable_save_cpu     53.0379 (1.0)      135.7290 (1.0)      56.5965 (1.0)      12.4764 (1.0)      53.4873 (1.0)      0.5328 (1.26)         3;11  17.6689 (1.0)          50           1
test_sf_save_cpu                63.3573 (1.19)     166.1867 (1.22)     67.0720 (1.19)     15.4401 (1.24)     63.6666 (1.19)     0.4214 (1.0)          2;10  14.9094 (0.84)         50           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

and for comparison, save_cpu equivalent on main:

------------------------------------------------ benchmark: 1 tests -----------------------------------------------
Name (time in ms)           Min       Max     Mean   StdDev   Median     IQR  Outliers      OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------
test_pt_sf_save_cpu     65.5022  145.5401  70.4561  12.5811  66.4999  1.5409       4;9  14.1932      50           1
-------------------------------------------------------------------------------------------------------------------

I tried bumping the number of runs to try and make it as statistically relevant as possible. I still think we can go through with this to enable multithreaded writes with safetensors, it should probably yield perf improvements for large files as well when it comes to zero copy.

@McPatate (Member) commented:

So for the Linux benchmarks, I was facing issues wrt the IOPS configuration of my instance. Throughput is capped at 125 MB/s, which makes sense wrt the 4 s figure. Also, we wrote to an existing file, so opening with File::create will open with the O_TRUNC flag, which could hurt performance on some platforms.

Updating the tests to:

def test_sf_threadable_save_cpu(benchmark):
    weights = create_gpt2(12)

    filename = "tmp.safetensors"

    def setup():
        try:
            os.unlink(filename)
        except Exception:
            pass

    benchmark.pedantic(
        save_file_threadable,
        args=(weights, filename, None, True),
        setup=setup,
        rounds=5,
        iterations=1,
    )

    os.unlink(filename)

(note: I suspect I could still improve the IO performance of the machine by tuning params in the console, but that'll do)
and using a mounted nvme disk yields:

---------------------------------------------------------------------------------------------- benchmark: 3 tests ---------------------------------------------------------------------------------------------
Name (time in ms)                                 Min                 Max                Mean            StdDev              Median               IQR            Outliers     OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_sf_threadable_save_cpu                  328.7920 (1.0)      337.1023 (1.0)      330.7821 (1.0)      3.5487 (1.0)      329.3272 (1.0)      2.5724 (1.0)           1;1  3.0231 (1.0)           5           1
test_sf_threadable_save_cpu_no_zero_copy     625.3672 (1.90)     637.4008 (1.89)     629.7825 (1.90)     4.5028 (1.27)     628.8223 (1.91)     3.3922 (1.32)          1;1  1.5878 (0.53)          5           1
test_sf_save_cpu                             695.1437 (2.11)     704.9688 (2.09)     698.8713 (2.11)     3.6922 (1.04)     698.4955 (2.12)     3.6772 (1.43)          2;0  1.4309 (0.47)          5           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

and on macos:

---------------------------------------------------------------------------------------- benchmark: 2 tests ---------------------------------------------------------------------------------------
Name (time in ms)                   Min                 Max                Mean             StdDev             Median                IQR            Outliers      OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_sf_threadable_save_cpu     71.0147 (1.0)      121.6068 (1.0)       91.1449 (1.0)      21.4708 (1.0)      83.9422 (1.0)      35.2491 (1.98)          1;0  10.9715 (1.0)           5           1
test_sf_save_cpu                85.6395 (1.21)     152.1943 (1.25)     101.1548 (1.11)     28.5824 (1.33)     89.6425 (1.07)     17.8232 (1.0)           1;1   9.8858 (0.90)          5           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

So, strangely, macOS performance degrades when creating a new file; maybe APFS has optimisations for overwriting existing files.

@McPatate (Member) commented:

I'll go ahead and modify the code tomorrow if you haven't done it by then, @TomQunChao. Once this is done I'll take care of the regression script I mentioned earlier:

> I'll write a script tomorrow to serialize existing weights we have on the Hub to make sure there is no regression

@TomQunChao (Author) replied:

> I'll go ahead and modify the code tmrw if you haven't done it before @TomQunChao, once this is done I'll take care of the regression script I mentioned earlier:
>
> I'll write a script tmrw to serialize existing weights we have on the Hub to make sure there is no regression

Sounds good! I'll give it a try today, but if I run into any issues, I'd really appreciate your help. Thanks!


@TomQunChao (Author) commented:

@McPatate I've refactored the relevant parts in torch.py and lib.rs as you suggested — removing PyView, _flatten, and _tobytes, and replacing the old save_file with the GIL-free version.

However, I realized the workload is a bit larger than I expected, because the changes also need to be applied to NumPy/Paddle/TensorFlow backends. Unfortunately, I won’t be able to spend more time on this this week due to some other commitments.

Would you be able to help with the remaining changes?
If not, I’ll have time next Tuesday and can finish the rest then.

@McPatate force-pushed the feat/gil_free_serialize_file branch from 5fe0632 to 36622da, then from 36622da to 18d24e3, on December 15, 2025 21:35
@McPatate (Member) left a comment:


I'm personally good with the state of this PR, would appreciate a review if you have time @danieldk

Comment on lines +350 to +352
// XXX: On Windows, we write to a temporary file first and then rename it.
// This avoids "error 1224" (ERROR_USER_MAPPED_FILE) which occurs when
// trying to write to a file that has an active memory-mapped section.
Member comment:

Not 100% sure we want to include this; it mostly broke tests that read and wrote to the same file. I would personally err on the side of caution and leave it in, as I don't believe writing then renaming would hurt performance.
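The write-to-temp-then-rename strategy referenced here can be sketched in Python; atomic_write is a hypothetical helper for illustration, not part of safetensors:

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    # Write into a fresh temp file in the same directory, then atomically
    # rename over the target. The target is never opened with O_TRUNC, and
    # readers always see either the old or the new complete file. On Windows
    # this also sidesteps writing into a file with an active memory mapping.
    d = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=d)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, path)  # atomic on POSIX; replaces existing on Windows
    except BaseException:
        os.unlink(tmp)
        raise
```

Renaming within the same directory is a metadata operation, which is why it is unlikely to hurt write throughput.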
