
Conversation


@TomQunChao TomQunChao commented Nov 26, 2025

What does this PR do?

Problem statement: the safetensors.torch.serialize_file function holds the GIL for the entire save_file process, even while it is writing the file. Writing is time-consuming, and no other Python threads can execute in the meantime.

This PR adds a function safetensors.torch.serialize_file_threadable, which does the following:

  1. Releases the GIL while writing files. As long as the caller guarantees that the tensors being saved are not modified during the checkpoint write, the function can safely be used from Python threads.
  2. Python passes only a data pointer to the Rust layer, eliminating complex type conversion and type checking when writing bytes. The benchmark shows that when saving ~100 MB of data, serialize_file takes ~1400 ms, while serialize_file_threadable needs only ~30 ms.
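The async-checkpointing pattern this targets can be sketched with stdlib file I/O, which (like the proposed serialize_file_threadable) releases the GIL during the underlying write syscalls. This is a hedged sketch: blocking_save and checkpoint_async are hypothetical stand-ins, not part of safetensors.

```python
import os
import tempfile
import threading

def blocking_save(data: bytes, path: str) -> None:
    # Plain binary file writes release the GIL around the OS write call,
    # so other Python threads keep running while the checkpoint is flushed.
    with open(path, "wb") as f:
        f.write(data)

def checkpoint_async(data: bytes, path: str) -> threading.Thread:
    # Caller must guarantee `data` is not mutated until join() returns,
    # mirroring the immutability requirement stated above.
    t = threading.Thread(target=blocking_save, args=(data, path))
    t.start()
    return t

# usage: the training thread continues while the checkpoint is written
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "ckpt.bin")
    t = checkpoint_async(b"\x00" * 1024, path)
    # ... training step continues here ...
    t.join()
    assert os.path.getsize(path) == 1024
```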

benchmark result of safetensors.torch.save_file vs save_file_threadable

Name (time in ms)               Min        Max        Mean       StdDev    Median     IQR       Outliers  OPS     Rounds  Iterations
test_sf_threadable_save_cpu     53.4114    239.5388   114.8421   75.5853   100.2545   95.5060   1;0       8.7076  5       1
test_sf_save_cpu                2213.5468  3129.8307  2538.6077  354.9364  2437.3711  396.3095  1;0       0.3939  5       1


@danieldk (Member) commented Dec 1, 2025:


Thank you for submitting this! Could you please, before anyone reviews the PR:

  • Check that the PR is in a good shape for submission. E.g. here a large part of the README is deleted, one of the new docs contains an ASCII emoji, etc.
  • Make sure that the files are properly formatted with cargo fmt.
  • Provide some reproducible data/benchmarks to verify.
  • Split a change into multiple PRs if there are changes that can be separated into smaller units.
  • Check if extending the API is really necessary.

@TomQunChao (Author) replied:

Thank you very much for the suggestions!
I have reverted all unintended changes (e.g., the README deletions and the ASCII emoji in the doc) and run cargo fmt on every touched file.
I also added a benchmark that compares the performance of the two save_file variants; the numbers are pasted in the PR body.
Regarding the necessity of the new API:

  • In some training setups (at least in mine), checkpointing has to happen asynchronously.
    The current safetensors.torch.save_file never releases the GIL, so any background thread that calls it blocks every other Python thread for the entire serialization window.
    An async-friendly entry point is therefore useful.
  • The “thread-able” version is measurably faster on the benchmark, but it requires the user to guarantee that the CPU tensors being serialized will not be mutated during the call.
    That restriction makes it a non-drop-in replacement for the original function, so exposing it under a separate name keeps existing code safe while letting new code opt in to the extra performance.
  • I’m still new to the safetensors code base, so I don’t have a clear picture of the long-term maintenance cost.
    If the maintainers decide that the extra API surface is unacceptable, I’m happy to fold the functionality into the existing save_file behind an optional flag (or any other design you prefer).

Please let me know what direction you’d like me to take next.

@TomQunChao force-pushed the feat/gil_free_serialize_file branch from 473dd86 to 17c2404 on December 2, 2025 08:58
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker left a comment:


A small nit; we talked about this offline with @McPatate! It's a nice PR.

@McPatate (Member) left a comment:


These are the results I'm getting from running the benchmarks (python3 -m pytest benches/test_pt.py -k save_cpu) in your repo on the feat/gil_free_serialize_file branch (with my added no-zero-copy test):

-------------------------------------------------------------------------------------------- benchmark: 3 tests -------------------------------------------------------------------------------------------
Name (time in s)                                 Min                Max               Mean            StdDev             Median               IQR            Outliers     OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_sf_threadable_save_cpu                   4.1726 (1.0)       4.1866 (1.0)       4.1789 (1.0)      0.0053 (1.0)       4.1773 (1.0)      0.0073 (1.0)           2;0  0.2393 (1.0)           5           1
test_sf_threadable_save_cpu_no_zero_copy      4.1736 (1.00)      4.1895 (1.00)      4.1818 (1.00)     0.0067 (1.25)      4.1798 (1.00)     0.0110 (1.50)          2;0  0.2391 (1.00)          5           1
test_sf_save_cpu                             11.2521 (2.70)     12.0204 (2.87)     11.6914 (2.80)     0.3391 (63.56)    11.8754 (2.84)     0.5627 (77.08)         1;0  0.0855 (0.36)          5           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

These seem a bit more realistic compared to a 22-fold speedup 😄
Still very impressive results, kudos! Although I'm having trouble understanding why the no-zero-copy variant is the same speed as the zero-copy one. Any idea what is going on here?

I ran the CI, there are some lint errors, could you fix them?

I think in follow-up PRs we could enable this with different libraries, would you be keen to add this @TomQunChao?

}

fn data_len(&self) -> usize {
    self.data_len
Member comment:

Suggested change:

-    self.data_len
+    match &self.tensor_data {
+        Owned(data) => data.len(),
+        Pointer(ptr) => ptr.len,
+    }

Same here for the ref; you can run clippy to make sure it doesn't bark at you.

@TomQunChao (Author) replied:

ok, I'll check it.

Member comment:

Could you add this benchmark as well:

def test_sf_threadable_save_cpu_no_zero_copy(benchmark):
    # benchmark save_file_threadable vs save_file
    weights = create_gpt2(12)

    # Benchmark save_file_threadable
    with tempfile.NamedTemporaryFile(delete=False) as f_threadable:
        benchmark(save_file_threadable, weights, f_threadable.name, None, False)

    # Clean up files
    os.unlink(f_threadable.name)

Member comment:

you can ignore this!

@TomQunChao (Author) replied:

Thanks for the suggestions!
Just to confirm — are you referring to enabling this feature in other libraries such as transformers?
If so, I’d be very happy to work on that!

@McPatate (Member) commented:

> Although I'm having trouble understanding why the no zero copy is the same speed as the zero copy one. Any idea what is going on here?

Dug a little deeper and profiled the traces. It's not exactly the same speed: the mean differs in my bench by 29 ms, which makes sense for a copy from RAM (the to_vec call in the fn data() impl) where bandwidth is on the order of >20 GB/s (here we bench with weights for a gpt2-like model of ~500 MB). I still think having a zero-copy option should help shave off some cycles on larger files, so it's good to have it!
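The ~29 ms gap is consistent with a back-of-the-envelope copy-time estimate (figures taken from the comment above; both numbers are approximations):

```python
# Rough sanity check of the extra copy cost in the no-zero-copy path.
payload_bytes = 500e6   # ~500 MB of gpt2-like weights
bandwidth = 20e9        # ~20 GB/s RAM copy bandwidth
copy_ms = payload_bytes / bandwidth * 1e3
print(copy_ms)          # 25.0 ms, close to the ~29 ms mean difference observed
```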

I think I understand the 3x performance boost as well. From the flame graph I can see that a lot of time is spent in the prepare fn, with a lot of calls to core::fmt::Display fns (I assume to_string). I narrowed down the culprit to this piece of code:

        let pydata: PyBound<PyAny> = tensor_desc
            .get_item("data")?
            .ok_or_else(|| SafetensorError::new_err(format!("Missing `data` in {tensor_desc}")))?;
        // Make sure it's extractable first.
        let data: &[u8] = pydata.extract()?;
        let data_len = data.len();
        let data: PyBound<PyBytes> = pydata.extract()?;

It seems A LOT of copies happen there, and we write after that.

Attached the flamegraphs for reference (sorry, didn't manage to extract all the symbols): save_file, save_threadable_no_zc, save_threadable_zc.

@McPatate (Member) commented:

So! Now that I've thoroughly reviewed and understand what's going on, I think we can refactor things to have only one save_file function, which would map to save_file_threadable(zero_copy=True). I don't think we'll ever want a non-zero-copy version, and the old version is just bad, so this is good!

We can in this case, unless it breaks other things:

  • remove PyView
  • remove _flatten
  • remove _tobytes
  • move the save_file_threadable-related code into save_file and delete the separate entry point

@McPatate (Member) commented:

Soooo I have a clearer view on what's going on here. It seems I was wrong when saying the code with the pybuffer extraction is the culprit. In fact, we can see in the traces that everything is under the prepare_shape call. We took a look with @danieldk and we narrowed it down to the ok_or call in the prepare_shape fn. Changing it to ok_or_else does the following to perf:

------------------------------------------------------------------------------------------ benchmark: 3 tests -----------------------------------------------------------------------------------------
Name (time in s)                                Min               Max              Mean            StdDev            Median               IQR            Outliers     OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_sf_threadable_save_cpu_no_zero_copy     4.1422 (1.0)      4.2176 (1.01)     4.1800 (1.00)     0.0106 (1.55)     4.1793 (1.0)      0.0085 (1.0)           9;3  0.2392 (1.00)         50           1
test_sf_threadable_save_cpu                  4.1431 (1.00)     4.2373 (1.01)     4.1799 (1.0)      0.0179 (2.63)     4.1795 (1.00)     0.0110 (1.30)         13;9  0.2392 (1.0)          50           1
test_sf_save_cpu                             4.1591 (1.00)     4.1955 (1.0)      4.1801 (1.00)     0.0068 (1.0)      4.1793 (1.00)     0.0086 (1.01)         13;1  0.2392 (1.00)         50           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

So it seems a regression was introduced here: from what I can see, the code on main didn't have this eager ok_or in prepare for the shape extraction.
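The ok_or vs ok_or_else difference can be illustrated with a small Python analogue (illustrative only; ok_or and ok_or_else here are hypothetical helpers mimicking Rust's Option API). The eager variant evaluates its error argument on every call, even on the success path, which is exactly what made the formatted message in prepare_shape expensive.

```python
def ok_or(opt, err):
    # `err` was already evaluated at the call site — eager, like Rust's ok_or.
    if opt is None:
        raise ValueError(err)
    return opt

def ok_or_else(opt, err_fn):
    # `err_fn` only runs on the failure path — lazy, like Rust's ok_or_else.
    if opt is None:
        raise ValueError(err_fn())
    return opt

desc = {"shape": [2, 3], "data": b"\x00" * 24}

# Eager: the f-string renders the whole dict on every call, even on success.
shape = ok_or(desc.get("shape"), f"Missing `shape` in {desc}")

# Lazy: nothing is formatted unless `shape` is actually missing.
shape = ok_or_else(desc.get("shape"), lambda: f"Missing `shape` in {desc}")
```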

Btw all the results I shared are coming from a Linux machine. Digging some more it seems that the IO configuration for this machine isn't great, which might explain why results don't really differ between each method.

When testing on mac, these are the results I get:

--------------------------------------------------------------------------------------- benchmark: 2 tests --------------------------------------------------------------------------------------
Name (time in ms)                   Min                 Max               Mean             StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_sf_threadable_save_cpu     53.0379 (1.0)      135.7290 (1.0)      56.5965 (1.0)      12.4764 (1.0)      53.4873 (1.0)      0.5328 (1.26)         3;11  17.6689 (1.0)          50           1
test_sf_save_cpu                63.3573 (1.19)     166.1867 (1.22)     67.0720 (1.19)     15.4401 (1.24)     63.6666 (1.19)     0.4214 (1.0)          2;10  14.9094 (0.84)         50           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

and for comparison, save_cpu equivalent on main:

------------------------------------------------ benchmark: 1 tests -----------------------------------------------
Name (time in ms)           Min       Max     Mean   StdDev   Median     IQR  Outliers      OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------
test_pt_sf_save_cpu     65.5022  145.5401  70.4561  12.5811  66.4999  1.5409       4;9  14.1932      50           1
-------------------------------------------------------------------------------------------------------------------

I tried bumping the number of runs to try and make it as statistically relevant as possible. I still think we can go through with this to enable multithreaded writes with safetensors, it should probably yield perf improvements for large files as well when it comes to zero copy.

@McPatate (Member) commented:

So for the Linux benchmarks, I was facing issues wrt the IOPS configuration of my instance. Throughput is capped at 125 MB/s, which makes sense wrt the 4 s figure. Also, we wrote to an existing file, so opening with File::create will open with the O_TRUNC flag, which could hurt performance on some platforms.

Updating the tests to:

def test_sf_threadable_save_cpu(benchmark):
    weights = create_gpt2(12)

    filename = "tmp.safetensors"

    def setup():
        try:
            os.unlink(filename)
        except Exception:
            pass

    benchmark.pedantic(
        save_file_threadable,
        args=(weights, filename, None, True),
        setup=setup,
        rounds=5,
        iterations=1,
    )

    os.unlink(filename)

(note: I suspect I could still improve the IO performance of the machine by tuning params in the console, but that'll do)
and using a mounted nvme disk yields:

---------------------------------------------------------------------------------------------- benchmark: 3 tests ---------------------------------------------------------------------------------------------
Name (time in ms)                                 Min                 Max                Mean            StdDev              Median               IQR            Outliers     OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_sf_threadable_save_cpu                  328.7920 (1.0)      337.1023 (1.0)      330.7821 (1.0)      3.5487 (1.0)      329.3272 (1.0)      2.5724 (1.0)           1;1  3.0231 (1.0)           5           1
test_sf_threadable_save_cpu_no_zero_copy     625.3672 (1.90)     637.4008 (1.89)     629.7825 (1.90)     4.5028 (1.27)     628.8223 (1.91)     3.3922 (1.32)          1;1  1.5878 (0.53)          5           1
test_sf_save_cpu                             695.1437 (2.11)     704.9688 (2.09)     698.8713 (2.11)     3.6922 (1.04)     698.4955 (2.12)     3.6772 (1.43)          2;0  1.4309 (0.47)          5           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

and on macos:

---------------------------------------------------------------------------------------- benchmark: 2 tests ---------------------------------------------------------------------------------------
Name (time in ms)                   Min                 Max                Mean             StdDev             Median                IQR            Outliers      OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_sf_threadable_save_cpu     71.0147 (1.0)      121.6068 (1.0)       91.1449 (1.0)      21.4708 (1.0)      83.9422 (1.0)      35.2491 (1.98)          1;0  10.9715 (1.0)           5           1
test_sf_save_cpu                85.6395 (1.21)     152.1943 (1.25)     101.1548 (1.11)     28.5824 (1.33)     89.6425 (1.07)     17.8232 (1.0)           1;1   9.8858 (0.90)          5           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

So, strangely, macOS performance degrades when creating a new file; maybe APFS has optimisations for overwriting existing files.

@McPatate (Member) commented:

I'll go ahead and modify the code tomorrow if you haven't done it by then, @TomQunChao. Once this is done I'll take care of the regression script I mentioned earlier:

> I'll write a script tomorrow to serialize existing weights we have on the Hub to make sure there is no regression

@TomQunChao (Author) replied:

> I'll go ahead and modify the code tmrw if you haven't done it before @TomQunChao, once this is done I'll take care of the regression script I mentioned earlier:
>
> I'll write a script tmrw to serialize existing weights we have on the Hub to make sure there is no regression

Sounds good! I'll give it a try today, but if I run into any issues, I'd really appreciate your help. Thanks!


@TomQunChao (Author) commented:

@McPatate I've refactored the relevant parts in torch.py and lib.rs as you suggested — removing PyView, _flatten, and _tobytes, and replacing the old save_file with the GIL-free version.

However, I realized the workload is a bit larger than I expected, because the changes also need to be applied to NumPy/Paddle/TensorFlow backends. Unfortunately, I won’t be able to spend more time on this this week due to some other commitments.

Would you be able to help with the remaining changes?
If not, I’ll have time next Tuesday and can finish the rest then.

@McPatate force-pushed the feat/gil_free_serialize_file branch from 5fe0632 to 36622da, then from 36622da to 18d24e3, on December 15, 2025 21:35
@McPatate (Member) left a comment:


I'm personally good with the state of this PR, would appreciate a review if you have time @danieldk

Comment on lines +350 to +352
// XXX: On Windows, we write to a temporary file first and then rename it.
// This avoids "error 1224" (ERROR_USER_MAPPED_FILE) which occurs when
// trying to write to a file that has an active memory-mapped section.
Member comment:

Not 100% sure we want to include this; it mostly broke tests that read and wrote to the same file. I would personally err on the side of caution and leave it in, as I don't believe writing then renaming would hurt performance.
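The write-to-temp-then-rename strategy referenced here can be sketched in Python; atomic_write is a hypothetical helper for illustration, not part of safetensors:

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    # Write into a fresh temp file in the same directory, then atomically
    # rename over the target. The target is never opened with O_TRUNC, and
    # readers always see either the old or the new complete file. On Windows
    # this also sidesteps writing into a file with an active memory mapping.
    d = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=d)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, path)  # atomic on POSIX; replaces existing on Windows
    except BaseException:
        os.unlink(tmp)
        raise
```

Renaming within the same directory is a metadata operation, which is why it is unlikely to hurt write throughput.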
