add gil free serialize_file(serialize_file_threadable) #679
Conversation
README.md

```
# safetensors
## Safetensors
```
Thank you for submitting this! Could you please, before anyone reviews the PR:
- Check that the PR is in good shape for submission (e.g. here a large part of the README is deleted, one of the new docs contains an ASCII emoji, etc.).
- Make sure that the files are properly formatted with cargo fmt.
- Provide some reproducible data/benchmarks to verify.
- Split the change into multiple PRs if there are changes that can be separated into smaller units.
- Check if extending the API is really necessary.
Thank you very much for the suggestions!
I have reverted all unintended changes (e.g., the README deletions and the ASCII emoji in the doc) and run cargo fmt on every touched file.
I also added a benchmark that compares the performance of the two save_file variants; the numbers are pasted in the PR body.
Regarding the necessity of the new API:
- In some training setups (at least in mine), check-pointing has to happen asynchronously. The current safetensors.torch.save_file never releases the GIL, so any background thread that calls it blocks every other Python thread for the entire serialization window. An async-friendly entry point is therefore useful (see the usage sketch below).
- The "thread-able" version is measurably faster on the benchmark, but it requires the user to guarantee that the CPU tensors being serialized will not be mutated during the call. That restriction makes it a non-drop-in replacement for the original function, so exposing it under a separate name keeps existing code safe while letting new code opt in to the extra performance.
- I'm still new to the safetensors code base, so I don't have a clear picture of the long-term maintenance cost. If the maintainers decide that the extra API surface is unacceptable, I'm happy to fold the functionality into the existing save_file behind an optional flag (or any other design you prefer).

Please let me know what direction you'd like me to take next.
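A minimal sketch of the asynchronous check-pointing pattern described above, assuming the positional signature save_file_threadable(tensors, filename, metadata, zero_copy) that appears in the benchmark snippets later in this thread (the final API may differ):

```python
import threading

import torch
from safetensors.torch import save_file_threadable  # assumed import path

weights = {"w": torch.zeros(1024, 1024)}

# Serialize on a background thread. Because the call releases the GIL
# during serialization, the main thread keeps running; the caller must
# not mutate `weights` until the thread has joined.
t = threading.Thread(
    target=save_file_threadable,
    args=(weights, "checkpoint.safetensors", None, True),
)
t.start()
# ... training continues on the main thread ...
t.join()
```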
Force-pushed from 473dd86 to 17c2404.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
ArthurZucker left a comment:
A small nit; we talked about this offline with @McPatate! It's a nice PR.
McPatate left a comment:
These are the results I'm getting from running the benchmarks (python3 -m pytest benches/test_pt.py -k save_cpu) in your repo on the feat/gil_free_serialize_file branch (with my added no-zero-copy test):

```
------------------------------------------------------------------------------------------- benchmark: 3 tests -------------------------------------------------------------------------------------------
Name (time in s)                               Min              Max             Mean          StdDev           Median             IQR      Outliers       OPS        Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_sf_threadable_save_cpu                 4.1726 (1.0)     4.1866 (1.0)    4.1789 (1.0)   0.0053 (1.0)     4.1773 (1.0)    0.0073 (1.0)       2;0  0.2393 (1.0)        5           1
test_sf_threadable_save_cpu_no_zero_copy    4.1736 (1.00)    4.1895 (1.00)   4.1818 (1.00)  0.0067 (1.25)    4.1798 (1.00)   0.0110 (1.50)      2;0  0.2391 (1.00)       5           1
test_sf_save_cpu                           11.2521 (2.70)   12.0204 (2.87)  11.6914 (2.80)  0.3391 (63.56)  11.8754 (2.84)   0.5627 (77.08)     1;0  0.0855 (0.36)       5           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```
These seem a bit more realistic compared to a 22-fold speedup 😄
Still very impressive results, kudos! Although I'm having trouble understanding why the no-zero-copy variant is the same speed as the zero-copy one. Any idea what is going on here?
I ran the CI; there are some lint errors, could you fix them?
I think in follow-up PRs we could enable this with different libraries; would you be keen to add this, @TomQunChao?
bindings/python/src/lib.rs (Outdated)

```rust
    }

    fn data_len(&self) -> usize {
        self.data_len
```
Suggested change:

```diff
-        self.data_len
+        match &self.tensor_data {
+            Owned(data) => data.len(),
+            Pointer(ptr) => ptr.len,
+        }
```
Same here for the ref; you can run clippy to make sure it doesn't bark at you.
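For context, a hypothetical reconstruction of the data-holder shape the suggestion above assumes. Only the variant names (Owned, Pointer) and the ptr.len field come from the suggested diff; every other name here is guessed for illustration and is not the PR's actual code:

```rust
// Hypothetical sketch: a tensor's bytes are either owned (copied out
// of the Python buffer) or borrowed as a raw pointer into memory the
// caller must keep alive and unmutated during serialization.
struct RawPointer {
    ptr: *const u8,
    len: usize,
}

enum TensorData {
    Owned(Vec<u8>),
    Pointer(RawPointer),
}

struct PreparedTensor {
    tensor_data: TensorData,
}

impl PreparedTensor {
    // Derive the payload length from the data itself instead of
    // caching it in a separate `data_len` field, which is the
    // redundancy the review comment flags.
    fn data_len(&self) -> usize {
        match &self.tensor_data {
            TensorData::Owned(data) => data.len(),
            TensorData::Pointer(ptr) => ptr.len,
        }
    }
}
```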
ok, I'll check it.
Could you add this benchmark as well:

```python
def test_sf_threadable_save_cpu_no_zero_copy(benchmark):
    # benchmark save_file_threadable vs save_file
    weights = create_gpt2(12)
    # Benchmark save_file_threadable
    with tempfile.NamedTemporaryFile(delete=False) as f_threadable:
        benchmark(save_file_threadable, weights, f_threadable.name, None, False)
    # Clean up files
    os.unlink(f_threadable.name)
```
you can ignore this!
Thanks for the suggestions!
Just to confirm — are you referring to enabling this feature in other libraries such as transformers?
If so, I’d be very happy to work on that!
So! Now that I've thoroughly reviewed and understand what's going on, I think we can refactor things to only have one save_file. We can do that in this case, unless it breaks other things.
Soooo, I have a clearer view on what's going on here. It seems I was wrong when saying the code with the pybuffer extraction is the culprit. In fact, we can see in the traces that everything is under the

So it seems you introduced a regression, because from what I can see the code didn't contain an

Btw, all the results I shared are coming from a Linux machine. Digging some more, it seems that the IO configuration for this machine isn't great, which might explain why the results don't really differ between each method. When testing on Mac, these are the results I get: and for comparison, I tried bumping the number of runs to try and make it as statistically relevant as possible.

I still think we can go through with this to enable multithreaded writes with safetensors; it should probably yield perf improvements for large files as well when it comes to zero copy.
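For readers following the GIL discussion above, a minimal pyo3 sketch of the GIL-release pattern this thread revolves around. This is illustrative only, not the PR's code: extracting the bytes as Vec<u8> copies them while still holding the GIL, whereas the PR's zero-copy path keeps raw pointers into tensor memory, which is exactly why callers must not mutate tensors during the write.

```rust
use pyo3::prelude::*;

// Illustrative only: write `data` to `path` with the GIL released,
// so other Python threads can keep running during the blocking I/O.
#[pyfunction]
fn save_bytes_threadable(py: Python<'_>, data: Vec<u8>, path: String) -> PyResult<()> {
    // `data` was already copied out of the Python object while the GIL
    // was held; only the file write below runs GIL-free.
    py.allow_threads(move || {
        std::fs::write(&path, &data)
            .map_err(|e| pyo3::exceptions::PyIOError::new_err(e.to_string()))
    })
}
```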
So for the Linux benchmarks, I was facing issues wrt the IOPS configuration of my instance. Throughput is capped at 125 MB/s, which makes sense wrt the 4 s figure. Also, we wrote to an existing file rather than creating a fresh one each round. Updating the tests to:

```python
def test_sf_threadable_save_cpu(benchmark):
    weights = create_gpt2(12)
    filename = "tmp.safetensors"

    def setup():
        # Delete the file so every round measures creating a fresh file
        try:
            os.unlink(filename)
        except Exception:
            pass

    benchmark.pedantic(
        save_file_threadable,
        args=(weights, filename, None, True),
        setup=setup,
        rounds=5,
        iterations=1,
    )
    os.unlink(filename)
```

(Note: I suspect I could still improve the IO performance of the machine by tuning params in the console, but that'll do.)

And on macOS: strangely, macOS performance degrades when creating a new file, so maybe APFS has optimisations for writes to an existing file.
I'll go ahead and modify the code tomorrow if you haven't done it before then, @TomQunChao. Once this is done, I'll take care of the regression script I mentioned earlier.
Sounds good! I'll give it a try today, but if I run into any issues, I'd really appreciate your help. Thanks!
Co-authored-by: Arthur <[email protected]>
Co-authored-by: Luc Georges <[email protected]>
@McPatate I've refactored the relevant parts in torch.py and lib.rs as you suggested, removing PyView, _flatten, and _tobytes, and replacing the old save_file with the GIL-free version. However, I realized the workload is a bit larger than I expected, because the changes also need to be applied to the NumPy/Paddle/TensorFlow backends. Unfortunately, I won't be able to spend more time on it this week due to some other commitments. Would you be able to help with the remaining changes?
Force-pushed from 5fe0632 to 36622da.
Force-pushed from 36622da to 18d24e3.
McPatate left a comment:
I'm personally good with the state of this PR; would appreciate a review if you have time, @danieldk.
```rust
// XXX: On Windows, we write to a temporary file first and then rename it.
// This avoids "error 1224" (ERROR_USER_MAPPED_FILE) which occurs when
// trying to write to a file that has an active memory-mapped section.
```
Not 100% sure we want to include this; it mostly broke tests that read and wrote to the same file. I would err on the side of caution and leave it in, personally, as I don't believe writing then renaming would hurt performance.
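For illustration, a standalone sketch of the write-to-temp-then-rename pattern discussed above. This is not the PR's exact code; the temp-file naming and error handling are assumptions:

```rust
use std::fs;
use std::io;
use std::path::Path;

// Write the payload to a sibling temp file, then rename it over the
// target. On Windows this avoids ERROR_USER_MAPPED_FILE (error 1224),
// which is raised when writing directly into a file that still has an
// active memory-mapped section.
fn write_then_rename(target: &Path, data: &[u8]) -> io::Result<()> {
    let tmp = target.with_extension("safetensors.tmp"); // assumed naming scheme
    fs::write(&tmp, data)?;
    fs::rename(&tmp, target)?; // replaces the target if it already exists
    Ok(())
}
```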
What does this PR do?
This PR adds a function safetensors.torch.serialize_file_threadable, which does the following:
Benchmark results of safetensors.torch.save_file vs save_file_threadable: