What would you like to report?
Summary
The Triton-accelerated UMAS_FAST_GPU execution backend is currently hard-gated on lmax == 2 and mmax == 2.
Models trained with mmax == 1 cannot use these kernels: validation raises a ValueError, and in some paths the model silently falls back to the general backend.
This matters because mmax == 1 is a deliberate modeling choice in some smaller/faster SO(2)-based models, including eSEN-style configurations. These models therefore miss the documented ~30–40% speedup from the fast GPU backend.
Location
In:
fairchem/core/models/uma/nn/execution_backends.py
inside UMASFastGPUBackend.validate:
if lmax != 2 or mmax != 2:
    raise ValueError("umas_fast_gpu requires lmax==2 and mmax==2")
Separately, update_inference_settings_for_fast_gpu appears to catch this ValueError and silently leaves the model on the slower general backend.
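To make the distinction concrete, here is a minimal sketch of a fallback that stays visible to the user. Everything in it is hypothetical: validate_fast_gpu and select_backend are stand-in names, not fairchem functions, and the string "general" is only assumed as the fallback backend's name. It shows the pattern we would hope for, in which a rejected request emits a warning instead of being dropped silently.

```python
import warnings


def validate_fast_gpu(lmax: int, mmax: int) -> None:
    """Stand-in for the gate quoted above (not fairchem's actual code)."""
    if lmax != 2 or mmax != 2:
        raise ValueError("umas_fast_gpu requires lmax==2 and mmax==2")


def select_backend(requested: str, lmax: int, mmax: int) -> str:
    """Return the backend that will run, warning if it differs from the request."""
    if requested != "umas_fast_gpu":
        return requested
    try:
        validate_fast_gpu(lmax, mmax)
    except ValueError as err:
        # Surface the reason instead of swallowing it.
        warnings.warn(
            f"Falling back to the 'general' backend: {err}",
            RuntimeWarning,
            stacklevel=2,
        )
        return "general"  # assumed name of the fallback backend
    return requested
```

With this pattern, a long MD run started with an mmax=1 checkpoint would at least log once that the fast kernels are not in use.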
Why this matters
We are running long molecular dynamics trajectories with a custom-trained escn_md model using lmax=2 and mmax=1 (see Environment below).
Because the system composition varies during the simulation, merge_mole=True / the "turbo" preset is not usable for this workload.
As a result, we lose both:
- the MoLE/turbo path, and
- the Triton fast GPU kernels.
On a single NVIDIA GH200, this leaves us at roughly:
~3 ns/day for a 216-atom system
For campaigns measured in hundreds of nanoseconds, this is the difference between feasible and borderline.
Minimal reproduction
import warp as wp

if not hasattr(wp, "vec"):
    wp.vec = wp.types.vector  # workaround for wp.vec removal in recent warp

from fairchem.core.calculate.ase_calculator import FAIRChemCalculator
from fairchem.core.units.mlip_unit.api.inference import InferenceSettings

settings = InferenceSettings(
    tf32=True,
    activation_checkpointing=False,
    merge_mole=False,  # required: composition varies
    compile=True,
    internal_graph_gen_version=3,
    execution_mode="umas_fast_gpu",  # raises
)

calc = FAIRChemCalculator.from_model_checkpoint(
    "<path/to/escn_md_mmax1_checkpoint.pt>",
    task_name="<your_task>",
    device="cuda",
    inference_settings=settings,
)
This raises:
ValueError: umas_fast_gpu requires lmax==2 and mmax==2
To reproduce, use any escn_md checkpoint trained with mmax=1.
The relevant backbone configuration is logged in:
under:
backbone:
  lmax: 2
  mmax: 1
Request
Would it be possible to support mmax == 1 in umas_fast_gpu?
There seem to be two possible paths:
Option 1: Relax the validation gate
If the existing Triton kernels already work for mmax=1 with smaller block strides, allow mmax == 1 in UMASFastGPUBackend.validate.
The relevant kernels appear to include:
node_to_edge_wigner_permute
permute_wigner_inv_edge_to_node
edge_degree_scatter
For l <= 2, the mmax=1 case should have fewer m-channels per node than the mmax=2 case.
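The size difference can be checked with a few lines of arithmetic. This sketch assumes the usual eSCN-style truncation, in which each degree l keeps only the orders with |m| <= min(l, mmax); whether the fairchem kernels lay out their buffers exactly this way is an assumption on our side, and num_coefficients is a hypothetical helper, not a library function.

```python
def num_coefficients(lmax: int, mmax: int) -> int:
    """Count (l, m) pairs with 0 <= l <= lmax and |m| <= min(l, mmax)."""
    return sum(2 * min(l, mmax) + 1 for l in range(lmax + 1))


full = num_coefficients(lmax=2, mmax=2)     # 1 + 3 + 5 = 9
reduced = num_coefficients(lmax=2, mmax=1)  # 1 + 3 + 3 = 7
print(full, reduced)  # prints: 9 7
```

Under that assumption, mmax=1 only drops the two m = ±2 components of the l = 2 block, so the strides would shrink from 9 to 7 coefficients per channel.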
Option 2: Add an mmax=1 fast path
If the current Triton kernels assume mmax=2 shapes, a sibling kernel path for mmax=1 would be useful.
The eSEN line of models commonly uses this style of configuration, so this would likely benefit more than just this single custom model.
Expected behavior
A model with lmax=2 and mmax=1 should either:
- use the umas_fast_gpu backend when requested, or
- fail with a clear explanation that mmax=1 is not supported and why.
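Both outcomes could come from one gate, sketched here under stated assumptions. validate_lmax_mmax and its supported_mmax parameter are hypothetical names for illustration only, and the sketch says nothing about whether the Triton kernels actually handle mmax=1; it only shows a gate that either accepts the configuration or explains the rejection.

```python
def validate_lmax_mmax(lmax: int, mmax: int, supported_mmax=(2,)) -> None:
    """Accept the configuration or raise with an actionable explanation.

    supported_mmax is hypothetical: (2,) mirrors today's gate, while
    (1, 2) would be the relaxed gate requested in Option 1.
    """
    if lmax != 2:
        raise ValueError(f"umas_fast_gpu requires lmax==2, got lmax={lmax}")
    if mmax not in supported_mmax:
        raise ValueError(
            f"umas_fast_gpu does not support mmax={mmax}: the fast kernels "
            f"are specialised for mmax in {tuple(supported_mmax)}. "
            "Use the general backend or a checkpoint trained with a "
            "supported mmax."
        )


validate_lmax_mmax(2, 2)                         # accepted today
validate_lmax_mmax(2, 1, supported_mmax=(1, 2))  # accepted if the gate is relaxed
```

Called with the default gate, validate_lmax_mmax(2, 1) raises a ValueError whose message names the unsupported value and the alternatives.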
Actual behavior
The model cannot use umas_fast_gpu because of the hard gate:
umas_fast_gpu requires lmax==2 and mmax==2
In some paths, this failure is caught and the model silently remains on the slower general backend.
Environment
fairchem-core: 2.19.0
torch: 2.8.0+cu129
GPU: NVIDIA GH200 120GB
CUDA runtime: 13.1
NVIDIA driver: 590.48.01
Architecture: linux aarch64
Model: custom escn_md backbone, lmax=2, mmax=1, no MoLE
Additional context
A minimal answer such as “not planned, because the current kernels fundamentally assume mmax=2” would still be helpful. It would let us decide whether to invest in a downstream PyTorch-only fast path or retrain the model with mmax=2.