Respect TORCH_CUDA_ARCH_LIST to speed up builds #100
Open
d3banjan wants to merge 1 commit into Dao-AILab:main from
Conversation
When TORCH_CUDA_ARCH_LIST is set, use it to generate -gencode flags instead of hardcoding all supported architectures. This is the standard PyTorch convention used by flash-attention, xformers, and PyTorch's own cpp_extension.py. When the env var is unset, behavior is unchanged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixes #39
Summary
When TORCH_CUDA_ARCH_LIST is set, parse it and generate only the requested -gencode flags instead of hardcoding all supported architectures, following the convention used by PyTorch's cpp_extension.py.

Motivation
Building from source currently compiles for all supported GPU architectures regardless of the target hardware. For users targeting a single architecture (e.g. TORCH_CUDA_ARCH_LIST="8.6"), this makes builds ~5–7x slower than necessary.

Test plan
- pip install -e . with TORCH_CUDA_ARCH_LIST unset — verify all existing gencode flags are emitted (unchanged behavior)
- TORCH_CUDA_ARCH_LIST="8.6" pip install -e . — verify only -gencode arch=compute_86,code=sm_86 appears in nvcc output
- TORCH_CUDA_ARCH_LIST="7.5;8.0" pip install -e . — verify both architectures are emitted
- TORCH_CUDA_ARCH_LIST="8.6+PTX" pip install -e . — verify the PTX suffix is stripped and compute_86 is used
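For reference, the parsing described in the summary and exercised by the test plan could look roughly like this. This is a minimal sketch, not the PR's actual diff: FALLBACK_ARCHES is a hypothetical stand-in for the hardcoded architecture list in setup.py, and the helper name gencode_flags is invented for illustration.

```python
import os

# Hypothetical stand-in for the hardcoded architecture list in setup.py.
FALLBACK_ARCHES = ["7.0", "7.5", "8.0", "8.6", "9.0"]

def gencode_flags():
    """Build nvcc -gencode flags, honoring TORCH_CUDA_ARCH_LIST if set."""
    env = os.environ.get("TORCH_CUDA_ARCH_LIST")
    if env:
        # Entries may be separated by semicolons or spaces, e.g. "7.5;8.0".
        arches = [a for a in env.replace(";", " ").split() if a]
    else:
        # Env var unset: fall back to the full list (unchanged behavior).
        arches = FALLBACK_ARCHES
    flags = []
    for arch in arches:
        # Strip an optional "+PTX" suffix and drop the dot: "8.6+PTX" -> "86".
        num = arch.replace("+PTX", "").replace(".", "")
        flags.extend(["-gencode", f"arch=compute_{num},code=sm_{num}"])
    return flags
```

Per the test plan, the "+PTX" suffix is simply stripped here; a fuller treatment could additionally emit a code=compute_NN entry to embed PTX, as PyTorch's own cpp_extension.py does.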