Updated setup.py and pyproject.toml #279

Open

stevenwalton wants to merge 3 commits into SHI-Labs:main from stevenwalton:setup

Conversation

stevenwalton commented Nov 1, 2025

As discussed, we're splitting PR #273 up.

This PR is isolated to the changes in pyproject.toml and setup.py that do not involve uv (which is most of that PR).

Tested on EndeavourOS with 4080S (sm89)

Changes

- pyproject.toml:

  • Author/maintainer names changed
    • Added email since this is now supported
    • @alihassanijr: do we still want to set you as the maintainer? I think this is best?
  • Classifiers added
  • torch noted in the dependencies (not exactly needed, at least for now ¯\_(ツ)_/¯)
  • Bumped torch version (NEEDS CHECK)
  • Incorporated two line changes to support Blackwell Ultra (@alihassanijr, needs check)

- src/natten/profiler.py

  • Removed nested double quotes and switched to a distinct inner quote type (example)
    • Adds clarity and avoids runtime errors

- setup.py

  • Code completely refactored
  • Rewritten to be easier to edit and to add new architectures as needed.
    • Key variables and functions that I believe are most likely to need modification in the future are near the top of the file for added clarity.
      • env created to handle user environment variables.
    • You should only need to adjust the SUPPORTED_GPU_ARCH and CAT2ARCH globals at the top of the file.
      • SUPPORTED_GPU_ARCH: called by _arch_list_to_cmake_tags to define arch_list as before.
      • CAT2ARCH: used in autogen_kernel_instantiations, where we now loop through these keys. Previously the category options were hard-coded in that function. The new helper _category_generator takes the arch and generates the fna and fmha variations of the forward and backward functions with the associated architecture name. As long as this four-entry dict pattern holds, adding an architecture only requires editing CAT2ARCH (in setup.py, at least...). See the sketch after this list.
      • Similarly, NUM_SPLITS was modified to avoid redundancy, with a new function _tune_ag_policy to help generate it. The _AG_POLICY_TUNABLES globals are intended to be tuned; currently we either double or halve these values, giving AG_POLICY_FINE and AG_POLICY_COARSE respectively. The AG_POLICY_* dictionaries are combined with the union operator so the proper policy is applied and duplicates are avoided.
  • Removed asserts and changed to an if/raise pattern
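Roughly, the CAT2ARCH pattern is the following (a sketch of the idea only; the dict contents and return values here are illustrative, not the actual setup.py code):

```python
# Illustrative only: category names, arch numbers, and return values are made up.
CAT2ARCH: dict[str, int] = {
    "hopper": 90,
    "blackwell": 100,
}

def _category_generator(arch: str) -> list[str]:
    # The four-entry pattern: fna and fmha, forward and backward, tagged with the arch name.
    return [f"{arch}-fna", f"{arch}-fna-bwd", f"{arch}-fmha", f"{arch}-fmha-bwd"]

# autogen_kernel_instantiations now loops over CAT2ARCH instead of hard-coding categories:
categories: list[str] = []
for arch in CAT2ARCH:
    categories += _category_generator(arch)
# Supporting a new architecture then only means adding a CAT2ARCH entry.
```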

TODO:

- torch version

There are some inconsistencies that we need to check. In my setup.py we have a minimum version of 2.6, but in the current setup.py the version is 2.5. This also doesn't match pyproject.toml. Which version do we support?

- CUDA version & detection

We had discussed doing a different process for detecting this through cmake. Do we want that in this version? I think we should save that for a different PR, since we are probably ready to merge this now and that will help us avoid merge conflicts.

- Minor

I can do this in the next PR

  • We have tmp_dir near the top. I think we should place a context manager under BuildExtension.build_extensions to handle this; it will be a bit cleaner and will clarify the scope of this directory (see the sketch after this list).
    • EDIT: @alihassanijr let's check the logic on this. Can you clarify the original intent? See here and here. NATTEN_BUILD_DIR will never be None, so why don't we just wrap this in a context manager and get rid of the possibility of using self.build_lib as BUILD_DIR?
  • More comments/doc?
  • More cleanup?
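For reference, the kind of thing I have in mind for the tmp_dir item (a rough sketch, not what is in this PR; the class and attribute names are made up, only BuildExtension is torch's):

```python
import tempfile

from torch.utils.cpp_extension import BuildExtension

class NattenBuildExtension(BuildExtension):  # hypothetical name
    def build_extensions(self):
        # Scope the temporary build directory to the build itself; it is cleaned
        # up automatically when the context exits, which makes its lifetime obvious.
        with tempfile.TemporaryDirectory(prefix="natten-build-") as tmp_dir:
            self.natten_tmp_dir = tmp_dir
            super().build_extensions()
```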

Changing authors and maintainers to reflect the project.
Adding email now that domain supports forwarding.
Updating classifiers
Version matching torch and cmake
This is the setup.py file from PR SHI-Labs#273
Updated with the minor change needed for Blackwell Ultra

Minor change to src/natten/profiler:
f-strings used " inside the f-string expression (e.g. f"{", ".join(...)}.").
This is not supported in all Python versions and is bad practice, as it
is less readable.
Gotta fix those linting errors
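To illustrate the profiler.py quoting change (the snippet below is a made-up example of the pattern, not the actual line):

```python
backends = ["cutlass-fna", "flex-fna"]

# Before: double quotes nested inside a double-quoted f-string. Only Python 3.12+
# (PEP 701) accepts this, and it is harder to read:
#   f"Available backends: {", ".join(backends)}."

# After: a distinct quote type inside the replacement field works everywhere:
msg = f"Available backends: {', '.join(backends)}."
print(msg)
```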
```toml
readme = {file = "docs/README_pypi.md", content-type = "text/markdown"}

dependencies = [
    "torch>=2.8.0",
```
Collaborator

Suggested change:

```diff
-    "torch>=2.8.0",
+    "torch",
```

Can we relax this for now and just make it any version of torch?
Some pypi versions are stupid and will error out if the system constraints have locked the torch version, even when the version requirements are actually met.

But anyway, we do technically support 2.7, even 2.6.
If folks aren't using Flex, we can probably go even as far back as 2.0. They'd have to compile it themselves of course.

Author

Added to local stage and placing comments for additional clarity

"setuptools >= 64",
"torch >= 2.7",
"cmake >= 4.0",
"torch >= 2.8",
Collaborator

Suggested change:

```diff
-    "torch >= 2.8",
+    "torch",
```

Same here. Let's accept any version of torch, and just raise warnings/errors in setup/cmake if it's unsupported... I really don't want to trust pypi to do the right thing here.

```python
MIN_TORCH_VERSION : float = 2.6
MIN_CUDA_VERSION : float = 12.0
MIN_SM : int = 30
SUPPORTED_GPU_ARCH : list[int] = [90, 100, 103]
```
Collaborator

Can we rename SUPPORTED_GPU_ARCH to something like SMS_WITH_ARCH_SPECIFIC_FEATS?
I'm also very supportive of just using ARCH instead of SM in this entire file (SMS is hard to parse, even for me).

Collaborator

This is actually slightly more complicated.

SM90 and SM100/103 have backends that are specific to them.
But it is possible to have backends that not all of our supported architectures can run. For instance, #278 will add new backends that will run on SM80 and later.

The thing that's special about the SM90 and SM100/103 backends is that they can only support one architecture (or arch family, but we're not using that concept directly here), and therefore their arch tags will have an "a" appended to enable the arch-specific ISA.

I'm thinking maybe we could have a list of backends, with the architectures each one supports.
What gets tricky is that we don't want to list out all the arches for backends like fna/fmha, because it would be everything.

We could do a min and max arch... but I don't want to assume it's always contiguous.
I.e., it's unclear to me what SM101 and SM102 are, and whether they can support that backend; these numbers also changed between CTK 12.8 and 13.0....
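Something in this direction is what I have in mind (purely a sketch; none of these names or numbers are in the PR):

```python
# Map each backend to the arches it can target. None means "whatever we're building for",
# so generic backends like fna/fmha don't have to enumerate everything.
BACKEND_SUPPORTED_ARCH: dict[str, list[int] | None] = {
    "fna": None,
    "fmha": None,
    "hopper-fna": [90],          # arch-specific: tag gets the "a" suffix (e.g. 90a)
    "blackwell-fna": [100, 103],
}

def arches_for_backend(backend: str, requested: list[int]) -> list[int]:
    supported = BACKEND_SUPPORTED_ARCH[backend]
    if supported is None:
        return requested
    return [arch for arch in requested if arch in supported]
```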

Comment on lines +135 to +156
```diff
+_AG_POLICY_TUNABLES = {
+    "reference": 2,
+    "fna": 64,
+    "fmha": 6,
+    "hopper-fna": 8,
+    "hopper-fna-bwd": 4,
+    "blackwell-fna": 28,
+    "blackwell-fna-bwd": 14,
+}

-if not HAS_CUDA_ARCH:
-    HAS_CUDA = torch.cuda.is_available()
+_AG_POLICIES_CONSTS = {
+    "hopper-fmha": 1,
+    "hopper-fmha-bwd": 1,
+    "blackwell-fmha": 1,
+    "blackwell-fmha-bwd": 1,
+}

-    if HAS_CUDA:
-        cuda_device = torch.cuda.get_device_properties(torch.cuda.current_device())
-        sm = cuda_device.major + cuda_device.minor * 0.1
-        CUDA_ARCH = f"{sm}"
-        print(
-            "`NATTEN_CUDA_ARCH` not set, but detected CUDA driver with PyTorch. "
-            f"Building for {CUDA_ARCH=}."
-        )
+AG_POLICY_DEFAULT = _AG_POLICIES_CONSTS | _AG_POLICY_TUNABLES

-assert torch.version.cuda is not None
-TORCH_CUDA_VERSION = [x for x in torch.version.cuda.split(".")[:2]]
-CUDA_TAG = "".join([x for x in TORCH_CUDA_VERSION])
-CUDA_VERSION = [int(x) for x in TORCH_CUDA_VERSION]

-assert CUDA_VERSION >= [12, 0], "NATTEN only supports CUDA 12.0 and above."
-if CUDA_VERSION >= [12, 0] and IS_WINDOWS:
-    print(
-        "WARNING: Torch cmake will likely fail on Windows with CUDA 12.X. "
-        "Please refer to NATTEN documentation to read more about the issue "
-        "and how to get around it until the issue is fixed in torch."
-    )
+# Now this is more explicit
+AG_POLICY_FINE = AG_POLICY_DEFAULT | _tune_ag_policy(_AG_POLICY_TUNABLES, 2)
+AG_POLICY_COARSE = AG_POLICY_DEFAULT | _tune_ag_policy(_AG_POLICY_TUNABLES, 0.5)
```
Collaborator

I love this refactor, but can we eliminate the concept of consts? Today we're doing 1 file for the FMHAs, but this can change (and very soon will). Why don't we just merge them? I'm okay with them being set to 2 in the FINE.

Author

Yeah, honestly I think _tune_ag_policy can handle that entirely. Not just to get rid of AG_POLICY_{FINE,COARSE}; we can probably do this even better.
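Something in this direction, maybe (a sketch of the suggestion, not the current code; it just folds the FMHA entries into the tunables and derives every policy from _tune_ag_policy):

```python
_AG_POLICY_TUNABLES = {
    "reference": 2,
    "fna": 64,
    "fmha": 6,
    "hopper-fna": 8,
    "hopper-fna-bwd": 4,
    "hopper-fmha": 1,
    "hopper-fmha-bwd": 1,
    "blackwell-fna": 28,
    "blackwell-fna-bwd": 14,
    "blackwell-fmha": 1,
    "blackwell-fmha-bwd": 1,
}

def _tune_ag_policy(policy: dict, scale: float) -> dict:
    # Scale every entry, clipping at 1 so coarsening never drops a category to zero.
    return {key: max(1, int(value * scale)) for key, value in policy.items()}

AG_POLICY_DEFAULT = dict(_AG_POLICY_TUNABLES)
AG_POLICY_FINE = _tune_ag_policy(_AG_POLICY_TUNABLES, 2)    # FMHA entries become 2
AG_POLICY_COARSE = _tune_ag_policy(_AG_POLICY_TUNABLES, 0.5)
```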

```python
# Note: Union operator means last key wins.
def _tune_ag_policy(policy: dict, scale : float) -> dict:
    for key in policy:
        policy[key] = int(policy[key] * scale)
```
Collaborator

Suggested change:

```diff
-        policy[key] = int(policy[key] * scale)
+        policy[key] = max(1, int(policy[key] * scale))
```

Needs a clip so the 1s don't end up 0s when scale is < 1?
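Concretely, the failure mode is just int truncation (not code from the PR):

```python
int(1 * 0.5)          # 0: a single-split entry would be disabled entirely
max(1, int(1 * 0.5))  # 1: the clip keeps it at one split
```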

Author

Added to local stage. Good catch

```python
# Also because we want CMake to build everything elsewhere, otherwise pypi will package
# build files.
build_dir = self.build_lib if NATTEN_BUILD_DIR is None else NATTEN_BUILD_DIR
if env['BUILD_DIR'] is not None:
```
Author

Note: my current local stage has a context manager here. Commenting to remember to discuss: self.build_lib never actually gets called?

```python
so_path_final = f"{self.build_lib}/{output_binary_name}"
if not os.path.exists(so_dir):
    os.makedirs(so_dir)
so_dir_final = os.path.join(self.build_lib,
```
Author

I think so_dir_final should be removed, and then the os.makedirs line should just be os.makedirs(os.path.dirname(so_path_final)). I think this is clearer.
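Roughly (a sketch of the simplification, reusing the surrounding names like self.build_lib and output_binary_name):

```python
so_path_final = os.path.join(self.build_lib, output_binary_name)
# Create the parent directory of the final .so path directly; no so_dir_final needed.
if not os.path.exists(os.path.dirname(so_path_final)):
    os.makedirs(os.path.dirname(so_path_final))
```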

```python
##################
# Helper functions
##################
def _get_torch_cuda_version() -> float:
```
Author

Note: Handle via cmake? Currently we rely on torch but we don't need to.
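For context, the torch-based approach amounts to something like this (a sketch, not the PR's exact implementation):

```python
import torch

def _get_torch_cuda_version() -> float:
    # Relies on the CUDA toolkit version torch was built against, e.g. "12.4".
    if torch.version.cuda is None:
        raise RuntimeError("This torch build has no CUDA support.")
    major, minor = torch.version.cuda.split(".")[:2]
    return float(f"{major}.{minor}")
```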

"torch >= 2.7",
"cmake >= 4.0",
"torch >= 2.8",
"cmake >= 3.2",
Author

Note to review our versioning here
