Performance fix: add pin_memory option (default on for CUDA) to speed… #817

Motivation
On a GB10 (cc 12.1) machine, loading weights with `NunchakuFluxTransformer2dModel.from_pretrained(...)` takes approximately 37 seconds, while on a 3090 machine it takes approximately 3 seconds. The profiler shows the time is almost entirely consumed by numerous `cudaMemcpyAsync` operations (roughly 2,400+ small block copies). These are typically small blocks moved along the CPU-to-GPU memory path, which amplifies the fixed overhead of each copy.

Modifications

Solution (How to Locate and Fix)
Locating the bottleneck:
- H2D bandwidth testing confirmed the link itself was normal (pinned H2D throughput was very high).
- A "small copy" experiment showed a ~50× difference between pageable and pinned transfers (see the sketch after this list).
- The loading logic was traced to `nunchaku/models/transformers/transformer_flux.py::from_pretrained`: `load_file(...)` first reads the weights into CPU tensors, then triggers a large number of H2D copies.
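
For reference, a minimal sketch of the kind of "small copy" experiment described above. The copy count, tensor size, and device are illustrative assumptions, not the exact script used:

```python
import time
import torch

def time_h2d_copies(pin: bool, num_copies: int = 2400, numel: int = 64 * 1024) -> float:
    """Time many small host-to-device copies from pageable or pinned CPU tensors."""
    src = [torch.randn(numel, pin_memory=pin) for _ in range(num_copies)]
    dst = torch.empty(numel, device="cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    for t in src:
        dst.copy_(t, non_blocking=True)
    torch.cuda.synchronize()
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"pageable: {time_h2d_copies(pin=False):.3f}s")
    print(f"pinned:   {time_h2d_copies(pin=True):.3f}s")
```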
Fix:
- Added a `_pin_state_dict(...)` helper that calls `pin_memory()` on the CPU tensors in `quantized_part_sd` and `unquantized_part_sd`.
- Added a `pin_memory` switch (default `True`) to `from_pretrained()`, enabled only when `device.type == "cuda"`. A sketch of both changes follows this list.
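
A minimal sketch of what the helper and switch could look like. The names `_pin_state_dict`, `pin_memory`, `quantized_part_sd`, and `unquantized_part_sd` come from this PR; the surrounding `from_pretrained` structure is simplified and assumed:

```python
import torch

def _pin_state_dict(state_dict: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    """Pin CPU tensors so the subsequent H2D copies take the fast pinned-memory path."""
    return {
        k: v.pin_memory() if isinstance(v, torch.Tensor) and v.device.type == "cpu" else v
        for k, v in state_dict.items()
    }

# Inside from_pretrained(...), after the safetensors load (assumed, simplified flow):
# if pin_memory and device.type == "cuda":
#     quantized_part_sd = _pin_state_dict(quantized_part_sd)
#     unquantized_part_sd = _pin_state_dict(unquantized_part_sd)
```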
Results (actual test):
- Self CPU time for loading decreased from ~37.5s to ~3.9s in profiling.
- Self CUDA time decreased to the ~154ms range.
Conclusion: This patch is a performance fix; it significantly reduces the overhead of CPU→GPU small-block transfers during the weight-loading phase.
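
For reference, a minimal sketch of how such loading-time numbers can be collected, assuming `torch.profiler` was used; the import path and model id are illustrative assumptions, not the exact benchmark:

```python
import torch
from torch.profiler import profile, ProfilerActivity

from nunchaku import NunchakuFluxTransformer2dModel  # import path assumed

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    # Model id is illustrative; pin_memory defaults to True on CUDA per this PR.
    model = NunchakuFluxTransformer2dModel.from_pretrained(
        "mit-han-lab/svdq-int4-flux.1-dev", torch_dtype=torch.bfloat16
    )

# The Self CPU / Self CUDA time columns correspond to the figures quoted above.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
```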