
Respect TORCH_CUDA_ARCH_LIST to speed up builds#100

Open
d3banjan wants to merge 1 commit into Dao-AILab:main from d3banjan:fix/respect-torch-cuda-arch-list

Conversation

@d3banjan

Fixes #39

Summary

  • When TORCH_CUDA_ARCH_LIST is set, parse it and generate only the requested -gencode flags instead of hardcoding all supported architectures
  • When unset, behavior is completely unchanged (existing hardcoded flags remain as fallback)
  • This is the standard PyTorch convention already used by flash-attention, xformers, and PyTorch's own cpp_extension.py
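The parsing described above can be sketched as follows. This is an illustrative reconstruction, not the PR's actual code: the function name and fallback handling are assumptions, and the `+PTX` suffix is simply stripped, matching the behavior described in the test plan below.

```python
# Sketch: turn TORCH_CUDA_ARCH_LIST (e.g. "7.5;8.0;8.6+PTX") into
# nvcc -gencode flags, falling back to a hardcoded list when unset.
def parse_arch_list(env_value, fallback_flags):
    if not env_value:
        # Env var unset: keep the existing hardcoded flags (unchanged behavior).
        return list(fallback_flags)
    flags = []
    # PyTorch convention allows ";" or space as separators.
    for arch in env_value.replace(" ", ";").split(";"):
        if not arch:
            continue
        # Strip a "+PTX" suffix and drop the dot: "8.6+PTX" -> "86".
        num = arch.removesuffix("+PTX").replace(".", "")
        flags.append(f"-gencode=arch=compute_{num},code=sm_{num}")
    return flags
```

In a `setup.py`, this would be called with `os.environ.get("TORCH_CUDA_ARCH_LIST")` and the existing hardcoded flag list as the fallback.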

Motivation

Building from source currently compiles for all supported GPU architectures regardless of the target hardware. For users targeting a single architecture (e.g. TORCH_CUDA_ARCH_LIST="8.6"), this makes builds ~5–7x slower than necessary.

Test plan

  • pip install -e . with TORCH_CUDA_ARCH_LIST unset — verify all existing gencode flags are emitted (unchanged behavior)
  • TORCH_CUDA_ARCH_LIST="8.6" pip install -e . — verify only -gencode arch=compute_86,code=sm_86 appears in nvcc output
  • TORCH_CUDA_ARCH_LIST="7.5;8.0" pip install -e . — verify both architectures are emitted
  • TORCH_CUDA_ARCH_LIST="8.6+PTX" pip install -e . — verify PTX suffix is stripped and compute_86 is used

When TORCH_CUDA_ARCH_LIST is set, use it to generate -gencode flags
instead of hardcoding all supported architectures. This is the standard
PyTorch convention used by flash-attention, xformers, and PyTorch's own
cpp_extension.py. When the env var is unset, behavior is unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

TORCH_CUDA_ARCH_LIST support
