Performance Characterisation of Anemoi Training on Isambard-AI

Author: Tomas Lazauskas
Affiliation: The Alan Turing Institute
Date: 30 April 2026
Document type: Technical Report (version 1.0)

Abstract

This report characterises the training performance of the Anemoi weather model on Isambard-AI GH200 (Grace Hopper) nodes, working from a single GPU up to 100 nodes (400 GPUs). At single-GPU scale, the O96 workload is found to be memory-bandwidth bound: CUTLASS GEMM kernels reach 88–96% of peak HBM3e bandwidth but only 30–36% of peak compute throughput, placing them deep in the memory-bound region of the roofline. Software optimisations (torch.compile, FP8, batch size tuning) do not improve throughput because the bottleneck is the arithmetic intensity of the problem size, not software overhead. At multi-node scale, AllReduce gradient synchronisation remains fully pipelined within the backward pass at all tested node counts (up to 100 nodes, 400 GPUs), contributing no measurable critical-path overhead; efficiency degrades gradually from ~95% at 10 nodes to ~85% at 100 nodes, driven primarily by growth in forward-pass overhead.

Introduction

Anemoi is an open-source framework developed by ECMWF (The European Centre for Medium-Range Weather Forecasts) for training data-driven numerical weather prediction models [1], [2]. Its flagship models are graph-based neural networks that operate over irregular geographic meshes, combining a Graph Transformer encoder-processor-decoder architecture with domain-specific spherical harmonics kernels. Training these models at production resolution is computationally intensive: a single training step on the O96 dataset [3] — an octahedral reduced Gaussian grid with approximately 1° (≈111 km) horizontal resolution and ~40,320 grid points — requires ~187 TFLOPs of computation and generates ~95 GB of theoretical activation memory, necessitating both high-memory accelerators and efficient distributed training across many nodes. The N320 dataset (a higher-resolution octahedral grid, approximately 0.25°) is used for initial scaling comparisons alongside O96; both datasets reach the same wall-clock minimum at 100 nodes with the same setup-overhead growth pattern, though N320's heavier per-step compute delays the crossover point. All detailed profiling focuses on O96, as the bottleneck characterisation is expected to carry over to N320.

Isambard-AI [4] is a UK national AI research supercomputer hosted at the University of Bristol, based on NVIDIA GH200 Grace Hopper Superchips [5], [16]. Each node provides 4 GH200 GPUs with 96 GB HBM3e each, connected intra-node via NVLink, and inter-node via the HPE Slingshot 11 high-speed interconnect [17]. Isambard-AI is one of the first large-scale GH200 deployments available for open research, and its performance characteristics for distributed deep learning workloads — particularly for memory-bandwidth-bound models like Anemoi — are not yet well characterised.

This report documents a systematic investigation of Anemoi training performance on Isambard-AI, starting from a single GPU and scaling up to 100 nodes (400 GPUs) for detailed profiling. The scope is limited to computational performance characterisation — throughput, step time, scaling efficiency, and hardware utilisation. Model quality and training convergence are not assessed. The work is structured around three questions:

  1. What is the single-GPU performance ceiling on GH200, and what are the bottlenecks?
  2. How efficiently does Anemoi scale across 4 GPUs within a single node (NVLink)?
  3. How does multi-node scaling behave over Slingshot, and where does communication become the bottleneck?

The report is organised as follows. An Executive Summary immediately follows this introduction with the key findings and recommendations across all tiers. The Initial Scaling Tests section presents epoch-level strong scaling results for both O96 and N320 datasets, establishing the wall-clock optimum and identifying setup overhead as a growing cost at large node counts. The NCCL Benchmarking section establishes that the physical interconnect is not the source of the observed overhead, motivating the software-focused investigation that follows. The Single GPU section characterises the hardware utilisation and software bottleneck profile of a single GH200, working through a sequence of optimisation actions culminating in a clean hardware-bound baseline. The Single Node Multi-GPU Scaling section investigates intra-node DDP overhead and its node-to-node variability. The Multi-Node Scaling section quantifies per-step scaling efficiency from 2 to 100 nodes, characterises NCCL communication behaviour, and measures startup overhead growth.

Executive Summary

Anemoi training on Isambard-AI GH200 nodes was characterised across three tiers: single GPU, single node (4-GPU NVLink), and multi-node (Slingshot interconnect). The findings at each tier feed directly into the next, and together identify a clear set of bottlenecks and the configurations under which Anemoi scales well.

Single GPU

The O96 model on a single GH200 achieves ~0.97 s/step (7.93 samples/s) in eager mode. Profiling establishes that the workload is memory-bandwidth bound: GPU utilisation is 92.8%, but Tensor Core utilisation is only ~1.1% and Model FLOP Utilisation is ~20% of the GH200 dense BF16 peak. Direct hardware measurement with ncu confirms this: CUTLASS GEMM kernels reach 88–96% of peak HBM3e bandwidth but only 30–36% of peak compute throughput, placing them deep in the memory-bound region of the roofline. The GPU is continuously busy, but the dominant kernels do not have sufficient arithmetic intensity to exploit Tensor Cores.

The main software bottleneck identified was CPU dispatch overhead: ~3,130 kernel launches per step with frequent cudaStreamSynchronize blocking calls. torch.compile reduced kernel launches by 31% via Triton operator fusion and eliminated all cudaStreamSynchronize stalls, but did not produce a measurable throughput improvement — the workload is memory-bandwidth bound and kernel fusion alone cannot change that. The hardware ceiling is HBM3e memory bandwidth, which is a characteristic of the model's arithmetic intensity and cannot be removed without architectural changes.

Activation checkpointing (num_chunks: 2) is required to fit within 96 GB HBM3e (34.1 GB peak vs 95.1 GB theoretical). Disabling it does not change step time, confirming the bottleneck is not recompute overhead.

Single Node (4 GPUs, NVLink)

On a correctly configured node, 4-GPU scaling efficiency is 95.7% (44 ms overhead, 987 ms → 1,031 ms/step). The NVLink All-Reduce is fully overlapped with the backward pass and is not on the critical path.

Early runs showed 76.5% efficiency due to CUDA_LAUNCH_BLOCKING=1 present in the job environment, which forces every kernel launch to block until completion. With ~3,130 launches per step this produced up to 247 ms of overhead per step. Once identified and unset, efficiency recovered to 95.7%.

Multi-Node Scaling (Slingshot interconnect)

Multi-node scaling was characterised from 2 to 100 nodes (8 to 400 GPUs) on O96. The headline results:

Multi-Node Scaling Efficiency — Executive Summary Figure 0.1. Scaling efficiency at each node count. Green bars (≥ 93%) indicate efficient scaling; the drop at 50–100 nodes coincides with the NCCL RING_LL → TREE_LL algorithm switch and growth in forward-pass overhead.

Efficiency is excellent up to 10 nodes (~94–95%) and degrades gradually to ~85% at 50–100 nodes. The primary mechanism is growth in forward-pass overhead — the DDP _pre_forward buffer broadcast (ncclDevKernel_Broadcast_RING_LL) growing from 23.6 ms/step at 10 nodes to 101.6 ms/step at 100 nodes, plus an unexplained 64 ms spike at 50 nodes. AllReduce backward wall time remains stable (709–765 ms across all node counts) despite total AllReduce kernel time reaching 621 ms/step at 50 nodes, indicating AllReduce continues to pipeline within the backward pass. The NCCL algorithm switch from RING_LL to TREE_LL at 50 nodes raises AllReduce kernel time but does not measurably extend the backward wall time.

Wall-clock optimum for O96 is 100 nodes (82 s/epoch); for N320 also ~100 nodes (669 s/epoch). Scaling beyond 100 nodes offers no wall-clock benefit and degrades cost efficiency sharply.

Startup overhead becomes a significant fraction of total job time at scale — 52 s at 50 nodes, 79 s at 100 nodes — driven by the DDP weight broadcast (36.8 s at 100 nodes) and NCCL first-batch warmup (16.9 s at 10 nodes). These are one-time per-job costs that amortise quickly over a full training run.

Where to Look for Performance Improvements

For readers focused on improving training throughput or reducing job turnaround time:

  • Single-GPU throughput — the dominant kernel classes (GEMMs, element-wise operations) are hardware-bound at the HBM3e memory-bandwidth ceiling; no software change can address this without increasing arithmetic intensity. The one actionable cost centre is sparse routing (indexSelectLargeIndex + indexFuncLargeIndex, ~13% of runtime), which is latency/cache-bound due to irregular sparse access and could be reduced by pre-computing graph indices. nvjet_hsh (~36% of runtime) is already near the ridge point and is not a target. Details are in Optimisation Actions.
  • Single-node efficiency — 95.7% at 4 GPUs relative to 1 GPU; there is limited scope for further improvement. The residual forward-pass overhead is characterised in Action 8: Root Cause — CUDA_LAUNCH_BLOCKING.
  • Multi-node step time — at 50+ nodes, forward-pass overhead grows substantially (DDP Broadcast + unexplained overhead) and is the primary driver of efficiency loss. Potential levers are discussed in Baseline Multi-Node Training Runs under Performance improvement opportunities.
  • Multi-node startup time — at 100 nodes startup overhead accounts for ~79 s, dominated by the DDP weight broadcast and NCCL warmup. Analysis is in Startup Overhead Analysis.

Initial Scaling Tests

O96 Strong Scaling

Initial strong scaling experiments were run for the O96 dataset, training for 2 epochs across node counts of 1, 10, 50, 100, 200, and 500. For each run, two metrics were recorded: Slurm Total Time (wall-clock duration from job start to finish, measuring how fast the training completes) and Total Node Hours (the product of node count and wall-clock time, measuring total compute consumed — a proxy for cost). Both are plotted below on a log-log scale.

O96 Strong Scaling Performance Figure 1. O96 Strong Scaling Performance.

  • Wall-clock time falls from 4,239 s (1 node) to 244 s (100 nodes), then reverses: 420 s at 200 nodes, 1,170 s at 500 nodes.
  • Total node hours increase monotonically throughout (1.18 h → 162.5 h), so beyond 100 nodes both time and cost worsen — further scaling is counterproductive for O96.
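
As a quick check of the cost metric, total node hours can be reproduced directly from the definition above (a minimal sketch using the wall-clock endpoints reported in Figure 1):

```python
def total_node_hours(nodes: int, wall_clock_s: float) -> float:
    """Node hours = node count x wall-clock time; a proxy for compute cost."""
    return nodes * wall_clock_s / 3600.0

print(total_node_hours(1, 4239))     # ~1.18 node hours at 1 node
print(total_node_hours(500, 1170))   # ~162.5 node hours at 500 nodes
```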

In addition to the strong scaling analysis, the total job time is decomposed into two components: training time (the time spent executing forward and backward passes) and setup time (the overhead before training begins, covering model initialisation, dataset loading, and distributed environment setup). Note that training + setup does not exactly equal the Slurm Total Time shown in Figure 1 — the small residual (~30 s) reflects Slurm scheduling and node allocation overhead not captured by either timer. The following plot illustrates this breakdown:

O96 Training Time Analysis Figure 2. O96 Training Time Analysis.

  • Training time drops from 4,189 s (1 node) to 82 s (100 nodes), while setup time grows from 23 s (1 node) to 1,000 s (500 nodes).

  • Beyond 100 nodes the crossover makes scaling counterproductive: at 200 nodes setup time (275 s) is already more than double the training time (117 s), and at 500 nodes nearly eight times longer (1,000 s vs 129 s).

N320 Strong Scaling

The O96 results identified 100 nodes as the wall-clock minimum and setup overhead as the dominant cost beyond it. The N320 dataset — a significantly higher-resolution workload — tests whether heavier per-step compute shifts this picture. Greater computational intensity per GPU means more useful work per synchronisation step, which should extend the range over which scaling remains efficient.

The model was trained for 2 epochs across node counts of 1, 2, 8, 10, 25, 50, 100, and 200 nodes. Testing beyond 200 nodes was not performed given resource constraints and the trends already established with O96.

N320 Strong Scaling Performance Figure 3. N320 Strong Scaling Performance.

  • Wall-clock time falls steadily from 33,444 s (1 node) to 669 s (100 nodes) — a wider effective scaling range than O96. N320's ~5× larger grid (~204,800 vs ~40,320 grid points) produces larger GEMM dimensions and higher arithmetic intensity per step, making communication a smaller fraction of total step time and sustaining efficient scaling further. Cost also grows more slowly: total node hours remain relatively stable up to 25 nodes (9.29 h → 13.49 h), unlike O96 where cost rose steeply from the outset.

  • At 200 nodes the wall-clock gain is negligible (669 s → 642 s) while total node hours nearly double (18.58 h → 35.67 h), confirming ~100 nodes as the practical wall-clock optimum for N320 as well.

The total job time is again decomposed into training time and setup time to understand the plateau at 200 nodes.

N320 Training Time Analysis Figure 4. N320 Training Time Analysis.

  • Training time falls smoothly from 33,384 s (1 node) to 312 s (200 nodes). Setup time rises from 32 s to 289 s — the same growth pattern seen in O96, but the heavier workload keeps training dominant for longer.

  • At 200 nodes training (312 s) and setup (289 s) are nearly equal, each accounting for ~50% of total job time. This explains the plateau: as the GPUs compute faster with more nodes, the growing initialisation cost offsets the gain, preventing any further reduction in wall-clock time.

NCCL Benchmarking

Before undertaking the detailed per-tier investigation — from single GPU through single node to multi-node — a hardware sanity check was performed to rule out the physical network as the source of the scaling overhead observed in the initial tests.

NCCL (NVIDIA Collective Communications Library) [6] is the communication backend used by PyTorch for gradient synchronisation in distributed training. It implements collective operations such as All-Reduce — the operation that averages gradients across all GPUs at the end of each backward pass — and is optimised for NVIDIA interconnects including NVLink (intra-node) and high-speed fabrics such as Slingshot (inter-node). The NCCL All-Reduce benchmark measures the raw bandwidth of this operation using synthetic data, isolating the interconnect from any framework or training overhead. This provides a hardware speed limit against which software-level bottlenecks can be judged.

NCCL All-Reduce benchmarks were carried out on Isambard-AI across 1, 10, 50, and 200 nodes.
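
The benchmarks use synthetic buffers and report bus bandwidth in the same way as the standard NCCL tests. A minimal PyTorch equivalent of the measurement is sketched below as an illustration only, assuming a torch.distributed NCCL process group has already been initialised (e.g. via torchrun/srun); the 2(n−1)/n factor converts algorithm bandwidth to the bus bandwidth reported by the benchmark.

```python
import time
import torch
import torch.distributed as dist

def allreduce_bus_bandwidth(size_bytes: int = 512 * 1024**2, iters: int = 20) -> float:
    """Return NCCL All-Reduce bus bandwidth (GB/s) for one message size."""
    world = dist.get_world_size()
    x = torch.empty(size_bytes // 4, dtype=torch.float32, device="cuda")
    for _ in range(5):                       # warmup: communicator setup, topology negotiation
        dist.all_reduce(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    per_call = (time.perf_counter() - start) / iters
    alg_bw = size_bytes / per_call / 1e9     # algorithm bandwidth
    return alg_bw * 2 * (world - 1) / world  # bus bandwidth (nccl-tests convention)
```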

NCCL All-Reduce Benchmark Bandwidth Figure 5. Peak bus bandwidth of NCCL All-Reduce as a function of node count. NVLink (1 node) provides 342.5 GB/s against a theoretical peak of 450 GB/s (NVLink 4.0 unidirectional, 76% utilisation). Slingshot bandwidth is stable at 91–93 GB/s from 10 to 50 nodes against a theoretical ceiling of 100 GB/s (4 NICs × 25 GB/s per node, 91–93% utilisation), reducing to 70.8 GB/s (71% of theoretical) at 200 nodes.

Bandwidth is stable between 10 and 50 nodes (92.7 → 91.2 GB/s), confirming that the scaling degradation seen in the initial tests is not caused by network bandwidth. The gradient tensor size is fixed by model parameters and does not grow with node count, so the volume of data to synchronise is also not the primary cause. What does grow with node count is the number of participating ranks, which increases All-Reduce latency and can affect NCCL algorithm selection and collective coordination overhead. At 200 nodes bandwidth reduces to 70.8 GB/s, suggesting network bandwidth may become a contributing factor at very large node counts — though this range was not profiled in detail. The following sections investigate the source of overhead tier by tier — beginning with single-GPU performance characterisation, then single-node multi-GPU communication overhead, and finally multi-node scaling behaviour.

Single GPU

Five profiling tools were used in sequence to characterise performance, each answering a different question:

| Tool | What it measures | Key question answered |
| --- | --- | --- |
| Anemoi simple profiler | Step time, throughput, forward/backward/optimizer breakdown | What are the baseline throughput and performance characteristics? |
| Anemoi detailed profiler | Model characteristics: parameter count, TMACs, theoretical activation memory, peak measured memory | What are the model's compute and memory demands? |
| PyTorch Profiler / TensorBoard | Operator host time, GPU utilisation, Tensor Core utilisation, kernel occupancy | Which operations are slow, and what do indirect hardware metrics indicate? |
| nsys (Nsight Systems) [14] | CPU–GPU timeline, CUDA API time, kernel launch counts, kernel time by type | Is the GPU busy, and what does the kernel structure look like? |
| ncu (Nsight Compute) [13] | Per-kernel memory and compute throughput as % of hardware peak (Speed-of-Light) | Are kernels actually memory-bound or compute-bound at the hardware level? |

Together they form a funnel — from throughput at the top down to direct hardware measurement. The first three tools establish the baseline and evaluate optimisation actions; nsys is used alongside torch.compile to track structural CPU–GPU changes, and then with ncu provides hardware-level roofline analysis confirming the workload is memory-bandwidth bound.

Baseline Characterisation

A baseline profiling run on a single NVIDIA GH200 GPU for 40 training steps on the O96 dataset indicates that the workload is memory-bandwidth bound: GPU utilisation is 92.81% but Tensor Core utilisation is only ~1.1%, achieved occupancy is 41.92%, and Model FLOP Utilisation is ~20% of the GH200 dense BF16 peak — the GPU is continuously busy, but on memory-bound work rather than the dense matrix operations that Tensor Cores accelerate. The detailed profiler adds ~10% step-time overhead versus simple (concentrated in CPU-side instrumentation); simple profiling is used for all throughput comparisons throughout this report.

The detailed profiler reports the following model characteristics:

| Metric | Value | Note |
| --- | --- | --- |
| Model Size | 231 M params (462 MB) | Small by parameter count |
| Compute Load | 23.42 TMACs / 46.84 TFLOPs per forward pass | High compute density relative to model size |
| Theoretical Activation Memory | 95.1 GB | Estimated peak activation volume (pre-checkpointing); exceeds usable HBM3e, motivating num_chunks checkpointing |
| Measured Peak Memory | 34.1 GB (with num_chunks: 2) | 61 GB with num_chunks: 1 (still checkpointed, but all chunks recomputed together) |
| Architecture | Graph Transformer | Encoder-Processor-Decoder |
| Scale | 322k input / 87k latent nodes | Large input graph drives high activation volume |

Despite having only 462 MB of weights, the graph-based architecture generates disproportionately large activations (~205 bytes of theoretical activation per byte of model parameters). Activation checkpointing (num_chunks: 2) is required to fit within 96 GB HBM3e. Varying num_chunks controls the memory–compute trade-off: num_chunks: 1 raises peak to 61 GB; num_chunks: 16 lowers it to 33 GB. Crucially, step time is insensitive to this setting — the bottleneck is not activation memory.

Model FLOP Utilisation (MFU). With num_chunks: 2, activation checkpointing adds one extra forward recomputation, making the total per-step cost equivalent to 4 forward passes:

4 × 23.42 TMACs × 2 FLOPs/MAC = 187.4 TFLOPs per step

At an avg batch time of 0.97 s (simple profile), this yields ~193 TFLOP/s — approximately 20% of the GH200’s 989 TFLOP/s dense BF16 peak. A ~20% MFU is consistent with a memory-bandwidth-bound workload.
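
Reproducing the arithmetic as a short sketch (values as reported above):

```python
tmacs_per_forward = 23.42                    # TMACs per forward pass (detailed profiler)
flops_per_step = 4 * tmacs_per_forward * 2   # 4 forward-equivalents x 2 FLOPs/MAC = ~187.4 TFLOPs
step_time_s = 0.97                           # avg batch time, simple profile

achieved_tflops = flops_per_step / step_time_s   # ~193 TFLOP/s
mfu = achieved_tflops / 989                      # ~0.20 of the GH200 dense BF16 peak
print(f"{achieved_tflops:.0f} TFLOP/s, MFU = {mfu:.0%}")
```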

Optimisation Actions

The baseline identified three concrete observations: (1) ~60 GB of unused VRAM, (2) heavy element-wise kernel fragmentation with CPU–GPU synchronisation stalls, and (3) only ~1.1% Tensor Core utilisation. Four software actions target these observations independently, plus a targeted test of fused AdamW (they are not stacked). nsys is used alongside Action 3 to verify what torch.compile changed structurally; ncu roofline profiling is run last on the compiled baseline to narrow the hardware-level investigation to the most relevant configuration.

| Action | Change | Hypothesis |
| --- | --- | --- |
| 1 — Batch Size | 8 → 16 | More data per step saturates memory bandwidth and improves GPU utilisation |
| 2 — DataLoader Workers | 8 → 16/32 | More prefetch workers eliminate any residual data starvation |
| 3 — torch.compile [15] | Eager → compiled | Kernel fusion via Triton reduces element-wise fragmentation and CPU dispatch overhead |
| 4 — FP8 Precision | BF16 → FP8 | Halving weight precision reduces data movement, potentially closing the memory-bandwidth gap |

  • Action 1 — Batch Size 16: ❌ No throughput gain (−1.8%, simple profiler). Step time doubled with 2× data; peak memory doubled to ~72% of HBM3e. The bottleneck is not data supply.
  • Action 2 — DataLoader Workers (16/32): ❌ No effect (<3% spread across 8, 16, 32 workers, within noise). Data loading is not the bottleneck.
  • Action 3 — torch.compile: ❌ No throughput benefit (avg batch time +7.5% over 200 steps, including recompilation overhead). Operator fusion reduced kernel launches by 31% and peak memory by 10% (34.2 → 30.7 GB). Tensor Core utilisation remained ~1.1% (baseline) / ~1.2% (compiled, different profiler run) — the memory-bandwidth bound character of the workload is unchanged by fusion.
  • Action 4 — FP8 Precision: ❌ No meaningful improvement in avg batch time (+0.8% over 200 steps). End-to-end throughput regresses (~20%) due to AMAX scaling overhead adding CPU contention. FP8 offers no advantage when the bottleneck is HBM3e bandwidth, not arithmetic throughput. BF16 is recommended.

Detailed data tables for each action are in Supplementary Material: Single GPU Profiling Detail.

nsys Deep-Dive

NVIDIA Nsight Systems (nsys) [14] is a system-level profiler that records a timeline of CPU and GPU activity — API calls, kernel launches, and memory transfers — allowing CPU–GPU interaction patterns to be inspected directly. nsys profiling at three stages of optimisation (baseline eager, compiled, compiled with further changes) tracks how this interaction changes and confirms that removing software inefficiencies does not shift the hardware ceiling.

At baseline, 625,957 CUDA kernel launches (~3,130/step) generated heavy CPU–GPU synchronisation: cudaStreamSynchronize — a blocking call where the CPU waits for the GPU to finish queued work — accounted for 91.0% of CUDA API time (152.9 s total, 20,982 calls over 200 steps). Despite this, GPU utilisation remained 92.81%, indicating the GPU had sufficient work queued to stay busy between sync points. After torch.compile, cudaStreamSynchronize dropped to 0.1% of CUDA API time (0.13 s, 21,011 calls) — stalls were effectively eliminated, confirmed directly by the compiled nsys profile. Kernel launches fell by 31% to ~429,000, and Triton kernels appeared in the compiled profile, confirming operator fusion. As a side effect, device-to-device memory movement increased ~2.7× (398 GB → 1,087 GB), reflecting Triton workspace buffers. Despite these structural changes, throughput did not improve.

With CPU-side stalls eliminated, the remaining GPU kernel time for 200 steps breaks down as:

GPU Kernel Time Breakdown Figure 6. GPU kernel time breakdown by type (200 steps, compiled BF16, rank 0). nvjet_hsh dominates at ~36%; FlashAttention contributes ~19%; sparse routing (indexSelectLargeIndex + indexFuncLargeIndex) accounts for ~13%.

Sparse routing (indexSelectLargeIndex, 13%) warrants further investigation: edge indices appear to be re-expanded and re-sorted every forward pass despite being deterministic under fixed batch size and sharding, suggesting potential for caching. flash_fwd_kernel is called 2× more often than flash_bwd_kernel, confirming activation checkpointing is active. Fused AdamW showed no improvement (+0.2% avg batch time) — the optimizer update is not a meaningful cost centre.

Conclusion: torch.compile eliminated all cudaStreamSynchronize stalls and reduced kernel launches by 31%. However, since the GPU was already memory-bandwidth bound at baseline, removing the CPU-side stalls did not improve throughput. The hardware ceiling is HBM3e memory bandwidth. Compiled BF16 is used as the starting point for multi-node scaling experiments.

ncu Hardware Measurement

nsys shows when the GPU is busy; ncu (Nsight Compute) [13] measures how efficiently each kernel uses the hardware. By replaying each CUDA kernel with hardware performance counters, ncu reports Speed-of-Light (SOL) metrics — memory bandwidth and compute throughput as a percentage of theoretical peak. GH200’s ridge point is ~247 FLOP/Byte (989 TFLOP/s peak dense BF16 ÷ 4.0 TB/s peak HBM3e bandwidth [5], [16]); kernels below this arithmetic intensity are memory-bound regardless of GPU utilisation [10]. ncu was run on the baseline (eager BF16) configuration using --set roofline, capturing 500 kernels after skipping one warmup step (~3,130 kernel launches), covering all distinct kernel types.

The per-kernel SOL metrics reveal three distinct performance regimes:

ncu Roofline: Memory SOL vs Compute SOL per Kernel Type Figure 7. Roofline scatter plot of Memory SOL (x-axis) vs Compute SOL (y-axis) for the dominant kernel types. Points are the mean SOL across 500 captured kernels; error bars show the observed min–max range. Kernels in the upper-right are near the ridge point; kernels shifted left are memory-bound. The GH200 ridge point (~247 FLOP/Byte, dense BF16) separates the memory-bound and compute-bound regions.

See ncu Speed-of-Light Values per Kernel in Supplementary Material for the numerical breakdown.

GEMM kernels are memory-bound. Linear projections — which should saturate Tensor Cores on large matrices — are instead bottlenecked by HBM3e bandwidth. This is the direct hardware confirmation of low Tensor Core utilisation observed via TensorBoard. O96's matrix dimensions are determined by the number of grid points (~40,320) and the batch size; the resulting arithmetic intensity falls well below GH200's dense BF16 ridge point of ~247 FLOP/Byte, placing every GEMM in the memory-bound region of the roofline.
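
The same conclusion follows from a back-of-envelope check on the measured Speed-of-Light figures: the achieved arithmetic intensity implied by the reported SOL ranges sits far below the ridge point (a sketch; peak figures as listed in the Roofline Model background of the Supplementary Material):

```python
PEAK_BF16_TFLOPS = 989.0   # GH200 dense BF16 peak
PEAK_HBM_TBPS = 4.0        # GH200 peak HBM3e bandwidth
ridge = PEAK_BF16_TFLOPS / PEAK_HBM_TBPS             # ~247 FLOP/Byte

# Reported CUTLASS GEMM Speed-of-Light: 30-36% compute, 88-96% memory (midpoints used here).
achieved_tflops = 0.33 * PEAK_BF16_TFLOPS            # ~326 TFLOP/s
achieved_tbps = 0.92 * PEAK_HBM_TBPS                 # ~3.7 TB/s
achieved_intensity = achieved_tflops / achieved_tbps  # ~89 FLOP/Byte, deep in the memory-bound region
print(f"ridge = {ridge:.0f} FLOP/Byte, achieved = {achieved_intensity:.0f} FLOP/Byte")
```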

nvjet_hsh is near the ridge point. Both memory and compute SOL are high simultaneously, meaning these cuDNN kernels (graph message-passing) are well-optimised and are not the limiting bottleneck. FlashAttention [11] (flash_fwd_kernel, flash_bwd_*) accounts for ~19% of GPU kernel time (nsys) but was not captured within the 500-kernel ncu window; based on its tiled SRAM design — avoiding repeated HBM reads for keys and values — it is expected to be near the ridge point, consistent with the nvjet_hsh measurements.

Sparse routing is latency-bound. indexFuncLargeIndex shows low SOL on both axes — it is bottlenecked by irregular memory access patterns from Anemoi's geographic mesh connectivity, not by bandwidth or compute capacity.

Conclusion: Direct hardware measurement confirms that the dominant kernel classes (GEMMs and element-wise operations) are operating deep in the memory-bound region of the roofline, saturating HBM3e bandwidth while leaving Tensor Core capacity largely idle. The ~1.1% Tensor Core utilisation figure from TensorBoard reflects the substantial fraction of GPU time spent in element-wise and norm kernels with near-zero Tensor Core usage (31% in the compiled profile; the eager baseline has a similar split). Software optimisation cannot resolve this — the arithmetic intensity of the O96 problem size is the fundamental constraint.

Summary

Different step-time figures appear across sections because they use different tools and scopes:

| Step time | Source | Steps | What it includes |
| --- | --- | --- | --- |
| ~0.77 s | nsys GPU kernel time | 200 | CUDA kernel execution only (total GPU kernel time ÷ 200 steps; excludes CPU overhead and inter-step gaps) |
| 0.97 s | Anemoi simple profiler (run_training_batch) | 40 | Forward + backward + optimizer; excludes inter-step overhead |
| 0.98 s | Anemoi simple profiler | 200 | Same scope; slight run-to-run variance |
| ~0.96 s | Anemoi simple profiler | 200 | Consistent across nodes; used as the single-node reference |
| 0.954–0.987 s | Anemoi simple profiler (NVTX runs) | 200 | Node-specific; used in single-node DDP experiments |

All throughput and scaling comparisons use the simple profiler (run_training_batch) unless explicitly stated otherwise. Full per-action timing and memory figures are in Supplementary Material: Single GPU Profiling Detail.

The single-GPU investigation establishes that the dominant kernel classes are hardware-bound at the HBM3e memory-bandwidth ceiling. The eager BF16, batch size 8 configuration is carried forward as the 1-GPU reference baseline for all multi-GPU experiments — compiled BF16 is reserved for direct comparison within those experiments.

Single Node Multi-GPU Scaling

Each Isambard-AI node hosts 4 GH200 GPUs connected via NVLink. Moving from 1 to 4 GPUs introduces the first layer of distributed communication: intra-node NCCL All-Reduce over NVLink, which synchronises gradients across GPUs at the end of each backward pass.

Intra-node scaling result. On a correctly configured node, 4-GPU scaling efficiency is 95.7% — approximately 1,031 ms/step at 4 GPUs vs 987 ms/step at 1 GPU, a 44 ms (4.3%) overhead. This is within the expected range for a graph model communicating over NVLink.

Background. Early single node/4-GPU runs showed 76.5% efficiency (step times ranging from ~1,185 ms to ~1,234 ms across different nodes and profiling configurations). CUDA_LAUNCH_BLOCKING=1 was present in the SLURM job environment — carried over from a prior debugging session — but was not recognised as the cause, triggering a seven-action investigation before the root cause was found. The key lesson: verify the job environment before beginning any performance investigation. A misconfigured environment variable invalidated the initial baseline and drove a substantial profiling campaign that could have been avoided.

CUDA_LAUNCH_BLOCKING=1 forces every CUDA kernel launch to be synchronous, turning ~11 µs async dispatches into blocking waits. With ~625,000 kernel launches over 200 steps (~3,130 per step), the cumulative cost is ~220 ms. PyTorch DDP [12] amplifies the effect further through additional cudaStreamSynchronize calls for NCCL bucket coordination.
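
A lightweight check of the job environment at startup would have caught this immediately. A minimal sketch is shown below; the two variables are the ones this report later recommends leaving explicitly unset.

```python
import os

SUSPECT_VARS = ("CUDA_LAUNCH_BLOCKING", "TORCH_NCCL_BLOCKING_WAIT")

def check_job_environment() -> None:
    """Warn if debugging variables that serialise kernel launches or collectives are set."""
    for var in SUSPECT_VARS:
        value = os.environ.get(var)
        if value not in (None, "", "0"):
            print(f"WARNING: {var}={value} is set; expect hundreds of ms of extra "
                  "dispatch latency per step (see Action 8).")

check_job_environment()
```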

Despite being triggered by a misconfiguration, the investigation is retained in this report rather than removed. It covers NCCL overlap profiling, forward/backward isolation, DDP configuration, I/O and thermal ruling-out, and kernel dispatch analysis — the natural sequence of checks for any intra-node scaling regression — and serves as a practical diagnostic reference for future work.

Investigation Summary

The table below summarises each investigative action, the hypothesis tested, and the outcome. Full data tables for each action are in Supplementary Material: Single Node Profiling Detail.

| Action | Hypothesis | Outcome |
| --- | --- | --- |
| 1 | Establish baseline | 76.5% efficiency observed; later identified as CUDA_LAUNCH_BLOCKING=1 artefact |
| 2 | NCCL All-Reduce not overlapping with backward | Ruled out — fully overlapped, 22–45 ms/step (2.5% of backward window) |
| 3 | Forward overhead is a profiler artefact; torch.compile addresses it | Negative — proportional overhead on both phases (+29% fwd, +25% bwd); compile gives only 2.9% step benefit |
| 4 | DDP bucket size or gradient layout causing overhead | Ruled out — both alternatives marginally worse than default |
| 5 | Dataloader I/O contention starving the GPU | Ruled out — 9.8× dataloader headroom at 4 GPUs |
| 6 | Node heterogeneity or thermal throttling | Both ruled out — same-node test and dummy-load test |
| 7 | Multi-process resource contention (non-DDP) | Ruled out — 4× independent training processes matched 1-GPU baseline |
| 8 | Fine-grained NVTX + kernel dispatch analysis | Root cause found — CUDA_LAUNCH_BLOCKING=1 causing 215 µs dispatch latency (vs 11 µs normal) |

Action 1: Initial 4-GPU Baseline

Observed 76.5% scaling efficiency (1.22 s/step vs 0.97 s/step at 1 GPU; 8.23 → 6.30 samples/s per GPU). At this point CUDA_LAUNCH_BLOCKING=1 was present in the environment and undetected. The apparent 26% step overhead triggered the investigation.

Action 2: NCCL Communication Overlap

NCCL All-Reduce is fully overlapped with the backward pass: 22–45 ms/step (2.5% of the 882 ms backward window) across 31 buckets. Implied NVLink bandwidth is ~31 GB/s — 9% of the 342.5 GB/s NVLink peak. Load across all four ranks is balanced to <1 ms spread on the backward phase. NCCL is not the bottleneck.
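
The implied-bandwidth figure can be reconstructed from the gradient volume and the measured All-Reduce window. This is a hedged reconstruction, assuming gradients are communicated at the 462 MB BF16 model size and using the ring bus-bandwidth convention:

```python
grad_bytes = 462e6         # 231 M parameters x 2 bytes (BF16)
allreduce_s = 0.022        # lower end of the measured 22-45 ms per-step window
world_size = 4

alg_bw = grad_bytes / allreduce_s / 1e9              # ~21 GB/s algorithm bandwidth
bus_bw = alg_bw * 2 * (world_size - 1) / world_size  # ~31 GB/s bus bandwidth
print(f"{bus_bw:.0f} GB/s of the 342.5 GB/s NVLink peak ({bus_bw / 342.5:.0%})")
```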

Action 3: Isolating the Overhead

An apples-to-apples comparison (both runs: simple profiler, no NVTX, no compile, 200 steps) showed the forward pass is 29% slower at 4 GPUs — DDP does no communication during the forward, so this cannot be a DDP artefact. Overhead was near-proportional across both phases (+29% forward, +25% backward), suggesting a node-level effect rather than DDP-intrinsic overhead. torch.compile gave only a 2.9% net step improvement at 4 GPUs.

Action 4: DDP Configuration

Larger gradient buckets (bucket_cap_mb=100) and gradient_as_bucket_view=True both made performance worse (+1.7% and +1.2% step time respectively). The latter also collapsed dataloader throughput by 85% due to contention with the pinned-memory transfer pipeline. DDP configuration is not the cause.

Action 5: Data Loading

Per-process dataloader throughput drops 38× under 4-GPU I/O contention, but retains 9.8× headroom over training consumption. The GPU never stalls waiting for data. Data loading is not the bottleneck.

Action 6: Node Heterogeneity and Thermal Throttling

A same-node 1-GPU vs 4-GPU test confirmed the overhead is real and not a node-comparison artefact (965 ms vs 1,185 ms on the same node). A throttle test — 1-GPU training alongside 3 compute-saturating dummy GPU loads — showed <0.5% step-time difference. Thermal and power-cap throttling are both ruled out.

Action 7: Multi-Process vs Multi-Rank

Four independent 1-GPU training processes running simultaneously (no DDP) produced 970 ms/step — identical to the single-GPU baseline. The ~220 ms overhead is therefore specific to the multi-rank DDP configuration, not generic multi-process load.

Action 8: Root Cause — CUDA_LAUNCH_BLOCKING

NVTX phase breakdowns across two nodes revealed dramatic variability:

| Phase (NVTX avg) | 1-GPU (nid010659) | 4-GPU (nid010706) | 4-GPU (nid010881) |
| --- | --- | --- | --- |
| Forward | 266 ms | 285 ms | 350 ms |
| Backward | 714 ms | 737 ms | 883 ms |
| Optimizer | 6.6 ms | 9.7 ms | 1.5 ms |
| Step | 987 ms | 1,031 ms | 1,234 ms |
| Overhead vs 1-GPU | — | +44 ms (+4.4%) | +247 ms (+25%) |

cudaLaunchKernel dispatch latency identifies the root cause:

| Profile | Avg cudaLaunchKernel latency | Total kernel launches |
| --- | --- | --- |
| 1-GPU baseline (nid010659) | 11.8 µs | 625,920 |
| 4-GPU best (nid010706) | 10.6 µs | 625,691 |
| 4-GPU worst (nid010881) | 215.3 µs | 625,691 |

Kernel launch counts are identical across configurations — multi-rank training introduces no extra launches. On nid010881, the 20× increase in dispatch latency (11 µs → 215 µs) is consistent with CUDA_LAUNCH_BLOCKING=1 in the job environment, which forces kernel launches to block until completion. With ~3,130 launches per step the cumulative cost is ~220 ms. NCCL's higher CPU wake frequency amplifies this into disproportionate overhead. With a clean job environment (nid010706), the remaining 44 ms overhead includes GPU stream fragmentation (~23 ms backward overhead) and a forward-pass buffer broadcast stall (~19 ms forward overhead), with the small residual in optimizer and inter-phase gaps.

Verdict. With a clean job environment, 4-GPU scaling efficiency is 95.7% (987 ms → 1,031 ms/step). CUDA_LAUNCH_BLOCKING=1 in the job environment is the sole cause of the degraded 76.5% efficiency seen in early runs. Verify the job environment before any performance investigation. The forward-pass buffer broadcast should be monitored at multi-node scale where it runs over Slingshot.

Multi Node Scaling

With single-GPU and single-node behaviour established, this section characterises how Anemoi scales across multiple nodes connected via the HPE Slingshot 11 interconnect. The key questions are: how efficiently does gradient synchronisation scale from 2 to 100 nodes, where does NCCL communication become the critical-path bottleneck, and how large is the startup overhead relative to training time at scale? All runs use the O96 dataset, eager BF16, batch size 8, and the same job environment controls established in the single-node section (CUDA_LAUNCH_BLOCKING and TORCH_NCCL_BLOCKING_WAIT explicitly unset).

Baseline Multi-Node Training Runs (2–100 Nodes)

Goal: Establish baseline step time and startup time from 2 to 100 nodes to quantify scaling efficiency and startup overhead growth beyond 1 node.

For the 1-GPU, 1-node, 2-node, and 10-node configurations, 200 steps were run with the simple profiler, NVTX markers, and nsys profile capture. The 25-node run completed only 80 steps — the epoch ended early due to dataset size. The 50-node and 100-node runs were limited to 40 and 24 steps respectively for the same reason. Since 24–80 steps is still sufficient to obtain a stable median step time, this should not affect the validity of the scaling efficiency calculation, especially when comparing median times across runs.

Scaling efficiency is calculated as:

Scaling Efficiency = T(1-GPU) / T(N-GPU) × 100%

where T(1-GPU) is the median step time on 1 GPU and T(N-GPU) is the median step time with N GPUs. This is equivalent to the throughput-ratio formulation used in the Single Node section (N-GPU total throughput / (N × 1-GPU throughput)); step time and throughput are reciprocals, so the two expressions are identical. Each step processes N times more data in parallel (one local batch per GPU), so the global batch size grows with GPU count and fewer steps are needed per epoch. A step that takes the same wall-clock time as the 1-GPU baseline therefore represents a perfect N× throughput improvement, and 100% efficiency means no overhead from parallelisation.
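
In code form, the efficiency calculation is simply (a sketch; the 4-GPU figures from the Single Node section are used as an example):

```python
def scaling_efficiency(median_step_1gpu_ms: float, median_step_ngpu_ms: float) -> float:
    """Per-step scaling efficiency relative to the 1-GPU baseline, in percent."""
    return median_step_1gpu_ms / median_step_ngpu_ms * 100.0

print(scaling_efficiency(987.0, 1031.0))   # ~95.7% (4 GPUs, single node)
```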

Per-step scaling (Simple profiler, NVTX, nsys profile, rank 0):

Multi-Node Scaling Efficiency Figure 9. Scaling efficiency vs node count. Efficiency is flat at ~94–96% up to 10 nodes, drops to 90.8% at 25 nodes, then to ~85% at 50 nodes — slightly below the trend, coinciding with the NCCL RING_LL → TREE_LL switch — and holds at 85.6% at 100 nodes.

Multi-Node Step Time Phase Breakdown Figure 10. Median step time decomposed into backward, forward (derived), and optimizer phases by node count. Backward is relatively stable; the forward residual grows sharply at 50 nodes before partially recovering at 100 nodes, with the 50-node spike likely reflecting a transient effect rather than pure DDP broadcast scaling.

See Full Per-Step Timing Statistics (Action 1) in Supplementary Material for the numerical breakdown.

Note

Each configuration is based on a single experiment. The reported values should be treated as indicative rather than statistically robust: run-to-run variance in step time, NCCL behaviour, and job scheduling noise are not accounted for. All timing statistics are collected from rank 0; in synchronous DDP training the effective step time is bounded by the slowest rank, so inter-rank variance is not captured and rank 0 may underestimate the true wall-clock step time.

Important

Median is the correct central measure for step time in these runs. Mean-based metrics are likely to be heavily distorted by the first-batch NCCL warmup and should not be used to compare scaling performance across node counts.

  • Scaling efficiency declines gradually. It is flat up to 10 nodes (~94–96%), drops to 90.8% at 25 nodes, then to ~85% at 50 nodes — slightly below the trend, coinciding with the NCCL RING_LL → TREE_LL switch — and holds at 85.6% at 100 nodes.

  • Backward peaks at 25 nodes (+7.9% vs 1-GPU) and eases at higher counts; forward peaks at 50 nodes (+48% vs 1-GPU) before easing slightly; NCCL All-Reduce remains overlapped with the backward at all scales tested — backward wall time is stable (709–765 ms) across all node counts despite AllReduce kernel time growing from 43 ms (1 node) to 621 ms (50 nodes), indicating AllReduce pipelines within the backward pass throughout.

  • cudaLaunchKernel median is flat (8.2 → 7.4 µs across all scales) — CPU dispatch is not a bottleneck at any scale tested.

Backward Pass and AllReduce Analysis

To identify how much time NCCL communication takes relative to available overlap, the total GPU kernel time for f32 AllReduce (ncclDevKernel_AllReduce_Sum_f32_*) is compared to the backward NVTX wall time. The backward pass is the natural overlap window because DDP launches AllReduce on each gradient bucket as it becomes available during the backward, allowing communication and compute to run concurrently. If total AllReduce kernel time stays well below the backward wall time, AllReduce completes before the backward finishes and adds nothing to step time. If it approaches or exceeds the backward wall time, some AllReduce work may spill past the backward and delay the optimizer step.
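
The saturation metric plotted below is just this ratio. A sketch, using the 50-node figures reported later in this section (backward wall time approximated here as the 714 ms 1-GPU baseline plus the reported 39 ms rise):

```python
def allreduce_saturation(allreduce_kernel_ms: float, backward_wall_ms: float) -> float:
    """Fraction of the backward window occupied by AllReduce kernels; >1 implies spill-over."""
    return allreduce_kernel_ms / backward_wall_ms

print(allreduce_saturation(621.0, 714.0 + 39.0))   # ~0.82, consistent with the ~83% reported at 50 nodes
```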

NCCL AllReduce Saturation vs Backward Window Figure 11. NCCL AllReduce kernel time (RING_LL + TREE_LL) as a fraction of the backward NVTX window at each scale. Saturation remains below 50% up to 25 nodes (AllReduce fully overlapped), jumps to 83% at 50 nodes when NCCL switches to TREE_LL, then eases to 70% at 100 nodes.

See NCCL AllReduce Kernel Time per Scale in Supplementary Material for the numerical breakdown.

AllReduce kernel time grows from 42.6 ms (1 node) to 329.6 ms (10 nodes, 45% of the 737 ms backward window) and 377.1 ms (25 nodes, 49% of 764.9 ms) — in both cases well within the backward, so AllReduce is fully overlapped. At 25 nodes the backward nevertheless peaks (+7.9% vs 1-GPU), suggesting the mixed RING/TREE transitional regime adds overhead the saturation metric does not capture.

At 50 nodes NCCL switches to predominantly TREE_LL (621 ms total AllReduce; 615 ms TREE_LL, 6 ms RING_LL; 83% saturation), yet backward wall time rises only 39 ms above the 1-GPU baseline — AllReduce continues to pipeline within the backward. At 100 nodes AllReduce drops to 519 ms/step (70% saturation); TREE_LL launch count falls 34 → 29 per step at similar per-launch cost (a count effect, cause unknown), and backward eases by 10 ms — though the 100-node backward StdDev (169 ms) dwarfs this improvement.

Forward Residual Decomposition

The derived forward is a residual (step − backward − optimizer) and includes all untagged overhead; it cannot be interpreted in isolation.

It is visible in Figure 10 and the Full Per-Step Timing Statistics table: stable from 1-GPU to 10 nodes (261.8 → 284.8 ms, +23 ms total), rises moderately at 25 nodes (+17 ms), then jumps sharply at 50 nodes (+86 ms), then falls back slightly at 100 nodes (−19 ms).

Forward Residual Decomposition Figure 12. Forward residual decomposed into baseline forward compute (1-GPU floor), ncclDevKernel_Broadcast_RING_LL, and unexplained overhead. The 50-node bar shows a 64 ms unexplained spike with no identifiable kernel source; at 100 nodes Broadcast dominates (28% of forward residual) but the unexplained component collapses back to ~6 ms.

ncclDevKernel_Broadcast_RING_LL (DDP _pre_forward buffer sync) is one attributable contributor, measured via nsys gpukernsum (total kernel time ÷ steps): 23.6 ms → 37.1 ms → 62.1 ms → 101.6 ms at 10/25/50/100 nodes. Unlike AllReduce, Broadcast uses RING_LL at all node counts. From 10 to 50 nodes it accounts for +38.5 ms (~37%) of the +103 ms forward jump; the remaining ~65 ms has no identifiable kernel source. At 100 nodes Broadcast grows +39.5 ms yet the derived forward drops 18.6 ms, implying ~58 ms of other residual components improved.

Important

The 50-node run is a likely outlier: the unexplained 64 ms forward spike is non-monotonic (the 100-node forward is 18 ms lower), suggesting a transient hardware or network effect rather than a systematic software scaling issue. Full decomposition would require a per-kernel GPU trace at 50 nodes, which is beyond the scope of this work; if the anomaly persists in future runs at this scale it warrants further investigation. Outside this outlier, scaling is gradual and the efficiency loss is consistent with expected DDP overhead at increasing node counts.

Performance improvement opportunities:

  • Set broadcast_buffers=False in DDP (see the sketch following this list). The ncclDevKernel_Broadcast_RING_LL kernel grows from 23.6 ms/step at 10 nodes to 62.1 ms/step at 50 nodes to 101.6 ms/step at 100 nodes (8.9% of total step time). The O96 model uses Layer Norm, not Batch Norm, so this cross-rank buffer sync is unnecessary. Disabling it could potentially recover ~38 ms of forward overhead at 50 nodes and ~62 ms at 100 nodes, partially restoring scaling efficiency at both scales.

  • Mitigate NCCL first-batch warmup (16.9 s at 10 nodes). This is the dominant cost for short/debug runs. The warmup can be eliminated by adding a dummy forward/backward pass before the profiled window, or by pre-initialising NCCL communicators with a no-op collective before training begins (also shown in the sketch below).

  • Profile rank heterogeneity. All timing data is from rank 0. The step max values (16,934 ms at 10 nodes) suggest at least one rank is significantly slower. Collecting profiles across all ranks — or at minimum the slowest rank — would confirm whether the efficiency loss at 50 nodes is uniform or driven by a single straggler.
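
The first two items map onto standard PyTorch APIs. A minimal sketch of both changes is given below; it is an illustration only, since in Anemoi the DDP wrapping is performed by Lightning, where the equivalent setting would be passed through the DDP strategy's keyword arguments rather than by wrapping the model manually.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes the NCCL process group is already initialised (e.g. via torchrun/srun).
model = torch.nn.Linear(8, 8).cuda()   # stand-in module, for illustration only

# 1. Skip the per-step _pre_forward buffer broadcast. Safe when no buffers need to stay
#    synchronised across ranks (the O96 model uses LayerNorm, not BatchNorm).
ddp_model = DDP(model, broadcast_buffers=False)

# 2. Pre-warm NCCL communicators with a no-op collective before the profiled window,
#    so the first real batch does not pay the topology-negotiation cost.
warmup = torch.zeros(1, device="cuda")
dist.all_reduce(warmup)
torch.cuda.synchronize()
```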

If the 50-node performance degradation persists in future runs, the following could be investigated:

  • Run a full per-kernel GPU trace at 50 nodes to identify the source of the unexplained ~65 ms forward overhead. The DDP Broadcast (+38 ms) accounts for only ~37% of the forward jump; the remainder has no identifiable kernel source in the available data.

  • Investigate forcing RING_LL or increasing the gradient bucket size. NCCL switches to predominantly TREE_LL at 50 nodes (621 ms total, 615 ms TREE_LL). Backward wall time remains stable despite this, but forcing RING_LL via NCCL_ALGO=RING or increasing the DDP bucket size beyond the default 25 MB would reduce the number of AllReduce calls per step (~34 at 50 nodes, 29 at 100 nodes) and may help at the transitional 25-node regime.

See Supplementary Material: Multi-Node Profiling Detail for simple profiler cross-validation and statistical caveats on step-max and optimizer skew.

Startup Overhead Analysis

At large node counts, startup time can rival or exceed training time for short runs — at 100 nodes, the 79.1 s startup is roughly 3× the 27 s of actual training (24 steps). On a shared HPC cluster where allocation time is scarce, startup overhead directly reduces the fraction of walltime spent doing useful work. Understanding which phase dominates at each scale is necessary to prioritise optimisation and to set realistic step-count minimums for future profiling runs.

Method. A lightweight Lightning callback (experiments/diagnostics/callbacks/startup_timer.py) emits a timestamped log line at each key Lightning hook from rank 0 only. T0 is set at callback instantiation — after Python imports and Hydra config loading, but before model initialisation and Lightning setup. The five phases map to the following operations:

  • T0 → setup: model and graph construction, dataset open, weight initialisation.
  • setup → on_fit_start: DDP model wrapping and weight broadcast from rank 0 to all ranks (462 MB over NVLink intra-node, Slingshot inter-node). The dominant cost at 50 nodes (+17.6 s).
  • on_fit_start → on_train_start: NCCL process group initialisation and communicator setup.
  • on_train_start → first batch start: gradient bucket allocation and data prefetch.
  • First batch: forward + backward + first AllReduce, including NCCL topology negotiation warmup. The dominant cost at 10 nodes (+16.9 s).
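
A minimal sketch of such a callback is shown below. It illustrates the approach rather than reproducing startup_timer.py; the hook names are the standard Lightning ones, and the import path may differ between pytorch_lightning and lightning.pytorch depending on the installed version.

```python
import time
import lightning.pytorch as pl

class StartupTimer(pl.Callback):
    """Print elapsed wall-clock time at key Lightning hooks, from rank 0 only."""

    def __init__(self):
        # T0: callback instantiation, i.e. after imports and config loading.
        self.t0 = time.time()

    def _log(self, trainer, phase):
        if trainer.is_global_zero:
            print(f"[startup-timer] {phase}: +{time.time() - self.t0:.1f} s")

    def setup(self, trainer, pl_module, stage):
        self._log(trainer, "setup")

    def on_fit_start(self, trainer, pl_module):
        self._log(trainer, "on_fit_start")

    def on_train_start(self, trainer, pl_module):
        self._log(trainer, "on_train_start")

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        if batch_idx == 0:
            self._log(trainer, "first_batch_start")

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if batch_idx == 0:
            self._log(trainer, "first_batch_end")
```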

Startup overhead (wall-clock from T0 to end of first batch, rank 0; full numerical table in Startup Phase Definitions and Raw Timings):

Startup Overhead by Phase and Node Count Figure 13. Startup overhead decomposed by phase at each node count. The dominant cost shifts from NCCL first-batch warmup at 10 nodes (16.9 s) to DDP weight broadcast at 50–100 nodes (17.6–36.8 s). The 25-node run is excluded — its T0→setup phase (164.4 s) is a single-run outlier with no confirmed cause that would compress the other bars.

  • The dominant bottleneck shifts with scale. At 2 nodes the first batch accounts for most of the added startup cost (+1.5 s, first inter-node NCCL allreduce). At 10 nodes the first batch explodes to 16.9 s (NCCL topology warmup at 40 ranks). At 50 nodes the bottleneck moves to setup → on_fit_start (+17.1 s over the 1-GPU baseline), covering DDP model wrapping and the 462 MB weight broadcast to 200 ranks over Slingshot. At 100 nodes this phase doubles to 36.8 s (+36.3 s over baseline), consistent with the broadcast cost scaling linearly with node count.

  • First batch warmup is cheapest at the extremes. At 10 nodes (40 ranks, RING_LL) it is 16.9 s; at 25, 50, and 100 nodes it is 1.4–4.2 s, consistent with the TREE_LL switch reducing the warmup cost for the u32 scalar collective (u32_TREE_LL was 11.07 s at 10 nodes and only 1.63 s at 50 nodes).

  • NCCL process group init (on_fit_start → on_train_start) grows only modestly, from 4.6 s to 9.1 s across the full range. Communicator creation scales well; the cost is in the first data movement, not the setup itself.

  • At 50 and 100 nodes, startup time far exceeds training time for these short runs. At 50 nodes: 52.0 s startup vs ~46 s training (40 steps × ~1.15 s/step). At 100 nodes: 79.1 s startup vs ~27 s training (24 steps × ~1.14 s/step) — startup is 3× longer than training. This reinforces the recommendation to run at least 200 steps at these node counts where dataset size permits.

  • T0 → setup grows modestly (11.2 s at 1-GPU → 28.8 s at 100 nodes, +17.6 s). The 25-node spike (164.4 s) is a single-run anomaly with no confirmed cause — if it recurs systematically at that scale it warrants investigation. At 100 nodes this phase is no longer the primary bottleneck — setup → on_fit_start (36.8 s) is.

  • All startup costs are one-time per-job and amortise quickly. Both the weight broadcast (up to 36.8 s at 100 nodes) and the NCCL first-batch warmup (16.9 s at 10 nodes) are fixed overheads. Over a real training run of ~1000 steps they represent 3% and 1.5% of total walltime respectively — negligible. The practical recommendation is to run at least 200 steps at all node counts where dataset size permits.

Further Work

Each profiling tier concludes with a set of improvement opportunities and open questions that were identified but not pursued within the scope of this work. These are documented inline at the end of each section and can be picked up independently as follow-on investigations.


Supplementary Material: Single GPU Profiling Detail

This section contains the detailed data tables supporting the condensed findings in the Single GPU section.

Profiler Overhead: Simple vs Detailed

The detailed configuration adds ~10% overhead versus simple, concentrated in CPU-side optimizer instrumentation rather than CUDA kernels. GPU-heavy operations (forward/backward passes) are barely affected (<2%).

Metric definitions. Avg Batch Time refers to the run_training_batch timer — the per-step time covering forward pass, backward pass, and optimizer update, excluding inter-step overhead. Training Throughput (samples/s) is derived from training_avg_throughput × batch_size and reflects end-to-end wall-clock speed including dataloader and framework overhead.

| Metric | Simple Profile | Detailed Profile | Delta (%) |
| --- | --- | --- | --- |
| Total Epoch (40 steps) Time | 39.22 s | 43.35 s | +10.5% |
| Avg Batch Time | 0.97 s | 1.06 s | +8.8% |
| Training Throughput | 7.93 samples/s | 7.01 samples/s | −11.6% |
| Backward Pass (Total) | 28.27 s | 28.39 s | +0.4% |
| Forward Pass (Total) | 10.18 s | 10.37 s | +1.9% |
| Optimizer Step (Total) | 38.80 s | 42.20 s | +8.8% |
| DataLoader Next (Total) | 0.11 s | 0.30 s | +173% |

Note: The Optimizer Step timer spans the entire training step (including backward pass) and should not be interpreted as measuring optimizer-only cost.

The backward pass takes 28.27 s versus 10.18 s for the forward pass (2.8:1 ratio). With num_chunks: 2 activation checkpointing, the backward pass requires one additional forward recomputation, raising its cost from the standard 2× to ~3× the forward — consistent with the observed ratio.
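
The expected ratio can be checked with the conventional cost model (a sketch; backward is typically estimated at roughly twice the forward cost, and chunked checkpointing re-runs the forward once during the backward):

```python
forward = 1.0
backward_plain = 2.0 * forward                       # conventional backward = ~2x forward
recompute = 1.0 * forward                            # one extra forward recomputation (num_chunks: 2)
backward_checkpointed = backward_plain + recompute   # ~3x forward

observed_ratio = 28.27 / 10.18                       # ~2.8, from the table above
print(backward_checkpointed, round(observed_ratio, 1))
```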

TensorBoard Trace Detail

Note: The TensorBoard PyTorch Profiler plugin (torch-tb-profiler) used for this analysis has since been deprecated and is scheduled for permanent removal on 03/05/2026. This work was completed before decommission. For future profiling, the recommended replacements are HTA (Holistic Trace Analysis) [8] for programmatic GPU utilisation, kernel breakdown, and memory analysis, and Perfetto UI [9] for interactive kernel-level timeline inspection.

The detailed profiler produces a TensorBoard trace. The four trace views collectively confirm the memory-bound characterisation:

  • GPU and Execution Summary: GPU utilisation is 92.81% and SM Efficiency is 90.84%, ruling out data starvation as the bottleneck — the GPU is never idle. CPU-side synchronisation stalls were present (91% of CUDA API time, confirmed by nsys Phase 1) but did not limit GPU throughput. Achieved occupancy is only 41.92%, indicating memory stalls prevent full warp utilisation. The TensorBoard step time (1.29 s) is higher than the Anemoi run_training_batch timers because it includes trace-capture overhead; these measures are not interchangeable.
  • Memory View: Peak memory usage is 34.1 GB (~36% of 95 GB usable HBM3e). The trace shows a characteristic sawtooth pattern — memory spikes to 34 GB and drops as each activation chunk is processed then freed. The 60 GB of unused VRAM headroom does not translate to faster training.
  • Operator View: Host Self Time is dominated by aten::copy_ (58.5%) and aten::nonzero (26.7%). Dynamic sparse indexing causes CPU–GPU synchronisation stalls; heavy aten::to and aten::copy_ traffic indicates tensor casts inside the training loop. torch.compile fused over 50,000 of these element-wise operations and eliminated the cudaStreamSynchronize stall, though this did not translate to a measurable throughput improvement.
  • Kernel View: Tensor Core utilisation is only 1.1%, with 98.9% of GPU time on non-Tensor-Core work — directly confirming the workload is memory-bandwidth bound. NVIDIA nvjet kernels account for 40–50% of kernel time; FlashAttention for ~25% (TensorBoard host-side accounting; nsys GPU-time breakdown gives slightly different figures). flash_fwd_kernel is called 2× more often than flash_bwd_kernel, confirming activation checkpointing is active.

The five GPU efficiency metrics are mutually consistent:

| Metric | Value | What it measures |
| --- | --- | --- |
| GPU Utilisation | 92.81% | Fraction of step time the GPU is executing any kernel — confirms no data starvation. |
| Est. SM Efficiency | 90.84% | Fraction of scheduled SM time where at least one warp is active — confirms SMs are rarely idle. |
| Est. Achieved Occupancy | 41.92% | Fraction of the theoretical maximum concurrent warps active — less than half, indicating memory pressure limits warp parallelism. |
| Tensor Core Utilisation | ~1.1% | Fraction of kernel execution time in Tensor Core operations — 98.9% is spent on memory-bound element-wise work instead. |
| Model FLOP Utilisation (MFU) | ~20% | Achieved TFLOP/s (193) vs. GH200 dense BF16 peak (989 TFLOP/s) — consistent with a memory-bandwidth bound regime. |

Optimisation Action Data

Action 1: Batch Size Increase

dataloader.batch_size.training was increased from 8 to 16 over 40 training steps.

simple profiling:

| Metric | Batch Size 8 | Batch Size 16 | Change |
| --- | --- | --- | --- |
| Avg Batch Time | 0.97 s | 1.91 s | +1.97× |
| Training Throughput | 7.93 samples/s | 7.79 samples/s | −1.8% |

detailed profiling:

| Metric | Batch Size 8 | Batch Size 16 | Change |
| --- | --- | --- | --- |
| Avg Batch Time | 1.06 s | 1.99 s | +1.88× |
| Training Throughput | 7.01 samples/s | 7.71 samples/s | +10% |
| Peak Memory | 34.1 GB (36%) | ~68 GB (~72%) | +2× |

The simple profiler's −1.8% is the reliable indicator — the detailed profiler's +10% is inflated by its fixed overhead being proportionally smaller at larger batch size.

Action 2: DataLoader Workers

dataloader.num_workers.training varied across 8, 16, and 32 workers (batch size 16, simple profiler, 40 steps):

| Metric | 8 Workers | 16 Workers | 32 Workers |
| --- | --- | --- | --- |
| Avg Batch Time | 1.91 s | 1.92 s | 1.95 s |
| Training Throughput | 7.79 samples/s | 7.95 samples/s | 7.72 samples/s |
| vs. 8 Workers Baseline | — | +2.1% | −0.8% |

Action 3: torch.compile

Compilation is scoped to the inner model (model.model = torch.compile(model.model)) — compiling the full Lightning module causes a Triton crash in the validation loop ("Triton installation not found"). The eager baseline here (0.954 s) differs slightly from the section baseline (0.97 s) due to a different profiler run; see the step-time source table in the Summary for context.
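
A minimal sketch of this scoping, using a stand-in module in place of the Anemoi Lightning forecaster; only the pattern (compiling the inner `model` attribute rather than the wrapper) reflects what was done here.

```python
import torch
import torch.nn as nn

# Stand-in for the Lightning module that owns the graph network as `self.model`.
class Wrapper(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))

    def forward(self, x):
        return self.model(x)

wrapper = Wrapper()

# Compile only the inner model, not the full (Lightning) module.
wrapper.model = torch.compile(wrapper.model)

# Compiling the whole module instead (torch.compile(wrapper)) is the pattern
# that triggered the "Triton installation not found" crash in the validation loop.
out = wrapper(torch.randn(8, 256))
```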

200-step simple profiler (includes recompilation overhead):

| Metric | Eager Mode | Compiled | Change |
|---|---|---|---|
| Avg Batch Time | 0.954 s | 1.026 s | +7.5% |
| Backward Pass | 0.694 s | 0.705 s | +1.5% |
| Forward Pass | 0.253 s | 0.314 s | Inconclusive (recompilation noise) |
| Validation Step | 0.321 s | 3.248 s | +913% (recompilation) |
| Training Throughput | 8.23 samples/s | 6.27 samples/s | −23.9% |
| Total Wall Time | 236 s | 274 s | +16% |

Training Throughput drops more sharply than Avg Batch Time (−23.9% vs +7.5%) because it is computed over total wall-clock time including validation — 6 validation recompilation events (~18 s extra vs eager) inflate the denominator. Compiled artefacts can be cached via torch._dynamo.config to eliminate validation recompilation, but this does not address the batch time regression.

40-step detailed profile (structural effects):

| Change | Detail |
|---|---|
| Occupancy | 41.9% → 37.1% (GPU utilisation essentially unchanged: 92.81% → 91.75%) |
| aten::copy_ | −54% |
| aten::empty_strided | −57% |
| aten::to | −70% |
| Peak memory | 34.2 GB → 30.7 GB (−10%) |
| Tensor Core utilisation | ~1.2% (unchanged) |

Action 4: FP8 Precision

Both runs use torch.compile; compiled BF16 is the baseline to hold compilation constant and isolate precision.

| Metric | BF16 (compiled) | FP8 (Transformer Engine) | Change |
|---|---|---|---|
| Avg Batch Time | 1.026 s | 0.997 s | −2.8% |
| Forward Pass | 0.314 s | 0.316 s | ~0% |
| Backward Pass | 0.705 s | 0.676 s | −4.1% |
| Training Throughput | 6.27 samples/s | 6.32 samples/s | +0.8% |
| Dataloader Throughput | 8,899 samples/s | 1,426 samples/s | −84% |
| Total Wall Time | 264 s | 273 s | +3.4% |

AMAX scaling collapses dataloader throughput by 84% (8,899 → 1,426 samples/s), though training is unaffected since 1,426 samples/s far exceeds the ~6.3 samples/s training throughput.
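
For context, FP8 execution in Transformer Engine is driven by an AMAX-history scaling recipe applied inside an fp8_autocast region. The sketch below shows the bare mechanism with a single te.Linear layer and illustrative recipe parameters; it is not the Anemoi integration itself and assumes an FP8-capable GPU.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Illustrative AMAX (delayed-scaling) recipe; history length and format are assumptions.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16, amax_compute_algo="max")

layer = te.Linear(1024, 1024, bias=True).cuda()
x = torch.randn(32, 1024, device="cuda", dtype=torch.bfloat16)

# Forward pass runs the GEMMs in FP8 using the recipe's scaling factors.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

# Backward is called outside the autocast region, as in the usual TE pattern.
y.sum().backward()
```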

nsys: Phases 1 and 2

Phase 1: Baseline — CPU Dispatch Activity

  • 625,957 CUDA kernel launches for 200 steps (~3,130/step) — consistent with aten::copy_ and aten::nonzero fragmentation in the TensorBoard Operator View.
  • cudaStreamSynchronize accounted for 91% of CUDA API time (~147 s) — the CPU repeatedly waited for the GPU rather than issuing new work.

GPU utilisation remained 92.81% — the GPU was not starved. The stall activity was entirely CPU-side; the GPU remained busy throughout.

Phase 2: torch.compile — Kernel Fusion

| Metric | Baseline (Eager) | Compiled | Change |
|---|---|---|---|
| cudaLaunchKernel calls | 625,957 | ~429,000 | −31% |
| Fused element-wise ops | ~0 | >50,000 | Triton fusion active |
| D2D Memory Movement | 398 GB | 1,087 GB | ~2.7× (expected) |
| cudaStreamSynchronize share | ~91% | Negligible | CPU stall removed |

The ~2.7× D2D increase is expected — Triton kernels allocate workspace buffers in HBM3e, trading bandwidth for compute locality.

Fused AdamW

| Metric | Compiled (BF16) | Fused AdamW | Change |
|---|---|---|---|
| Avg Batch Time | 1.026 s | 1.028 s | +0.2% |
| Training Throughput | 6.27 samples/s | 6.18 samples/s | −1.4% |
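
For reference, the fused optimizer path is enabled with a single constructor flag; the parameters below are placeholders.

```python
import torch

# Minimal sketch: torch.optim.AdamW with the fused CUDA kernel path enabled.
# The fused implementation requires CUDA tensors in floating-point dtypes.
params = [torch.nn.Parameter(torch.randn(1024, 1024, device="cuda"))]
optimizer = torch.optim.AdamW(params, lr=1e-4, weight_decay=0.01, fused=True)
```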

ncu: Roofline Background and Measurement

Roofline Model

GH200 has two performance ceilings, which meet at the ridge point:

  • Memory ceiling: 4.0 TB/s peak HBM3e bandwidth [5], [16]
  • Compute ceiling: ~989 TFLOP/s peak dense BF16 (Tensor Core; 1,979 TFLOP/s with structured sparsity) [5], [16]
  • Ridge point: ~247 FLOP/Byte (dense) — the arithmetic intensity at which a kernel transitions from memory-bound to compute-bound [10]

A kernel operating below the ridge point is constrained by how fast data can be loaded from HBM3e, not by how fast the GPU can compute. Adding compute capability therefore does nothing; the only ways to improve throughput are to reduce data movement or to increase data reuse.
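
The ridge point follows directly from the two ceilings. The short calculation below reproduces the ~247 FLOP/Byte figure and shows how attainable throughput is capped for a kernel below it; the 60 FLOP/Byte intensity is purely illustrative.

```python
# Worked example of the ridge-point arithmetic, using the peak figures from [5], [16].
peak_bf16_tflops = 989.0   # dense BF16 Tensor Core peak, TFLOP/s
peak_hbm_bw_tbs = 4.0      # HBM3e peak bandwidth, TB/s

ridge_point = peak_bf16_tflops / peak_hbm_bw_tbs   # FLOP per byte
print(f"ridge point ~ {ridge_point:.0f} FLOP/Byte")  # ~247

def attainable_tflops(arithmetic_intensity: float) -> float:
    """Roofline cap: min(compute ceiling, arithmetic intensity x bandwidth ceiling)."""
    return min(peak_bf16_tflops, arithmetic_intensity * peak_hbm_bw_tbs)

ai = 60.0  # illustrative arithmetic intensity (FLOP/Byte), well below the ridge
regime = "compute-bound" if ai >= ridge_point else "memory-bound"
print(regime, f"~ {attainable_tflops(ai):.0f} TFLOP/s attainable")  # memory-bound, ~240 TFLOP/s
```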

Measurement Methodology

ncu was run on the baseline (eager BF16) configuration using --set roofline to collect Speed-of-Light (SOL) metrics — the percentage of peak memory bandwidth and peak SM compute throughput reached by each kernel. A launch-skip of 3,130 kernels (one warmup step) was applied before capturing 500 kernels, covering the major kernel types in a training step. Default kernel replay mode (--replay-mode kernel) was used; application replay was not viable because Anemoi is non-deterministic across runs.

ncu Speed-of-Light Values per Kernel

Numerical SOL values from the 500-kernel ncu capture (eager BF16, job 4263705). Each row gives the mid-point and the observed range across the captured kernel instances. FlashAttention kernels (flash_fwd_kernel, flash_bwd_*) were not captured in this window and are excluded.

| Kernel type | Memory SOL mid (%) | Memory SOL range (%) | Compute SOL mid (%) | Compute SOL range (%) | Regime |
|---|---|---|---|---|---|
| CUTLASS GEMM (linear projections) | 92 | 88–96 | 33 | 30–36 | Memory-bound |
| Element-wise kernels (add, mul, copy) | 92 | 90–93 | 21 | 13–29 | Memory-bound |
| Layer norm backward | 90 | — | 53 | — | Memory-bound |
| nvjet_hsh (graph message-passing) | ~70 | 65–75 | ~88 | 80–95 | Near ridge point |
| indexFuncLargeIndex (sparse routing backward) | 14 | — | 56 | — | Latency/cache-bound |

The GH200 ridge point is ~247 FLOP/Byte (dense BF16). Kernels with Memory SOL >> Compute SOL lie in the memory-bound region; nvjet_hsh is the only kernel class near the ridge. See Figure 7 in the main text for the roofline scatter plot.


Supplementary Material: Single Node Profiling Detail

This section contains the detailed data tables supporting the condensed findings in the Single Node Multi-GPU Scaling section.

Action 1: Initial 4-GPU Baseline

Scaling efficiency = 4-GPU total throughput / (4 × 1-GPU throughput) × 100%

| Metric | 1 GPU | 4 GPUs (1 node) | Change |
|---|---|---|---|
| Avg Batch Time (run_training_batch) | 0.97 s | 1.22 s | +26% |
| Throughput (per GPU, wall-clock) | 8.23 samples/s | 6.30 samples/s | −23% |
| Throughput (total, wall-clock) | 8.23 samples/s | 25.20 samples/s | 3.06× |
| Scaling Efficiency | 100% | 76.5% | — |

Action 2: NCCL Communication Overlap

Step time decomposition (NVTX markers, 200 steps):

| Phase | Avg (ms) | % of step |
|---|---|---|
| Forward (derived) | 336 | 27.2% |
| Backward | 882 | 71.5% |
| Optimizer | 15.6 | 1.3% |
| Step total | 1,234 | 100% |
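
A decomposition like this relies on NVTX ranges around each phase. The sketch below shows the general pattern with torch.cuda.nvtx, using placeholder phase names that may not match the markers emitted by Anemoi and Lightning.

```python
import torch

def training_step_with_nvtx(batch, model, optimizer, loss_fn):
    """Annotate forward / backward / optimizer phases so nsys can attribute GPU time."""
    torch.cuda.nvtx.range_push("forward")
    out = model(batch["x"])
    loss = loss_fn(out, batch["y"])
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    loss.backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer")
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.nvtx.range_pop()
    return loss
```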

Cross-rank backward comparison:

| Rank | Step med (ms) | Backward med (ms) | Optimizer med (ms) | NCCL total/step (ms) |
|---|---|---|---|---|
| 0 | 1,224.8 | 876.2 | 15.0 | 22.3 |
| 1 | 1,227.3 | 876.5 | 15.3 | 35.5 |
| 2 | 1,224.3 | 876.6 | 15.4 | 44.8 |
| 3 | 1,224.4 | 876.9 | 15.3 | 38.7 |
| Spread | 3.0 | 0.7 | 0.4 | 22.5 |

Total NCCL data volume: 2 × ¾ × 462 MB = 693 MB/step. At 22.3 ms NCCL time/step, implied NVLink bandwidth ≈ 31 GB/s (9% of 342.5 GB/s practical peak). NCCL selected RING_LL (low-latency, bandwidth-inefficient) for all 31 per-step transfers.
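
The data-volume and bandwidth figures follow from the standard ring AllReduce cost model; the short calculation below reproduces them.

```python
# Ring AllReduce moves 2*(N-1)/N times the gradient payload per GPU per step.
grad_payload_mb = 462.0   # model gradient size from the report
n_gpus = 4

volume_mb = 2 * (n_gpus - 1) / n_gpus * grad_payload_mb
print(f"data moved per GPU per step ~ {volume_mb:.0f} MB")          # ~693 MB

nccl_time_s = 22.3e-3     # rank-0 NCCL kernel time per step
bandwidth_gb_s = (volume_mb / 1000) / nccl_time_s
print(f"implied bandwidth ~ {bandwidth_gb_s:.0f} GB/s")             # ~31 GB/s
print(f"fraction of 342.5 GB/s practical peak ~ {bandwidth_gb_s / 342.5:.0%}")  # ~9%
```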

Action 3: Isolating the Overhead

Phase-level 1-GPU vs 4-GPU comparison (same profiler, no NVTX, no compile, 200 steps):

| Phase | 1-GPU (nid011290) | 4-GPU (nid011197) | Overhead |
|---|---|---|---|
| Forward | 253 ms | 326 ms | +73 ms (+29%) |
| Backward | 694 ms | 870 ms | +176 ms (+25%) |
| Step total | 954 ms | 1,217 ms | +263 ms (+28%) |

Tool comparability note. nsys GPU kernel execution time and wall-clock profiler time must not be compared directly — nsys excludes Python dispatch, data loading, and CPU-side costs. The same-tool comparison above gives the correct overhead figure.

Effect of torch.compile at 4 GPUs:

| Phase | Non-compiled 4-GPU (ms) | Compiled 4-GPU (ms) | Change |
|---|---|---|---|
| Forward | 326 | 374 | +48 ms (+15%) |
| Backward | 870 | 790 | −80 ms (−9%) |
| Step total | 1,217 | 1,182 | −35 ms (−2.9%) |

Action 4: DDP Configuration

Experiment 1: Gradient bucket size (25 MB vs 100 MB):

| Metric | Baseline (25 MB) | 100 MB buckets | Change |
|---|---|---|---|
| Step avg | 1,182 ms | 1,202 ms | +20 ms (+1.7%) |
| Forward | 374 ms | 387 ms | +13 ms (+3.6%) |
| Backward | 790 ms | 796 ms | +6 ms (+0.8%) |
| Throughput (batches/s) | 0.670 | 0.656 | −2.2% |

Experiment 2: gradient_as_bucket_view=True:

| Metric | Baseline | gradient_as_bucket_view | Change |
|---|---|---|---|
| Step avg | 1,182 ms | 1,196 ms | +14 ms (+1.2%) |
| Forward | 374 ms | 380 ms | +6 ms (+1.8%) |
| Backward | 790 ms | 798 ms | +8 ms (+1.0%) |
| Throughput (batches/s) | 0.670 | 0.645 | −3.8% |
| Dataloader throughput | 341.7 samples/s | 51.9 samples/s | −85% |
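
A minimal sketch of the two settings tested in this action, passed directly to the DistributedDataParallel constructor. In the actual runs they are forwarded through Lightning's DDP strategy; the single-process group initialisation below exists only to make the snippet self-contained.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process NCCL group so the example can run standalone on one GPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("nccl", rank=0, world_size=1)

model = torch.nn.Linear(1024, 1024).cuda()   # placeholder module

ddp_model = DDP(
    model,
    device_ids=[0],
    bucket_cap_mb=100,              # Experiment 1: 100 MB gradient buckets (default is 25 MB)
    gradient_as_bucket_view=True,   # Experiment 2: gradients alias bucket storage, avoiding a copy
)
```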

Action 5: Data Loading

| Metric | 1-GPU | 4-GPU |
|---|---|---|
| avg_training_dataloader_throughput (samples/s) | 2,505 | 65.8 |
| Training consumption rate (samples/s) | ~8.2 | ~6.7 |
| Dataloader headroom | 305× | 9.8× |

Action 6: Node Heterogeneity and Thermal Throttling

Experiment 1: Same-node 1-GPU vs 4-GPU (nid011191):

| Phase | 1-GPU nid011290 (original) | 1-GPU nid011191 | 4-GPU nid011191 | Same-node overhead |
|---|---|---|---|---|
| Forward | 253 ms | 255 ms | 321 ms | +66 ms (+26%) |
| Backward | 694 ms | 702 ms | 846 ms | +144 ms (+21%) |
| Step total | 954 ms | 965 ms | 1,185 ms | +220 ms (+23%) |
| Throughput/GPU (samples/s) | 8.23 | 8.17 | 6.27 | −23% |
| Scaling efficiency | — | 100% | 76.8% | — |

Experiment 2: Throttle test (1-GPU training + 3 dummy GPU loads, nid011191):

| Configuration | Forward | Backward | Step |
|---|---|---|---|
| 1-GPU nid011191 (no load) | 255 ms | 702 ms | 965 ms |
| 1-GPU nid011191 (3 dummy GPU loads) | 256 ms | 705 ms | 969 ms |
| 4-GPU nid011191 (DDP training) | 321 ms | 846 ms | 1,185 ms |

Action 7: Multi-Process vs Multi-Rank

| Phase | 1-GPU baseline | 4× non-DDP | 4-GPU DDP |
|---|---|---|---|
| Forward | 256 ms | 257 ms | 321 ms |
| Backward | 705 ms | 704 ms | 846 ms |
| Step | 965 ms | 970 ms | 1,185 ms |

Supplementary Material: Multi-Node Profiling Detail

This section contains supporting data and statistical caveats for the condensed findings in the Multi Node Scaling section.

Full Per-Step Timing Statistics (Action 1)

The condensed scaling summary in the main section omits per-phase min/max/stddev. Full statistics are below.

| Phase | 1-GPU | 4-GPU (1 node) | 8-GPU (2 nodes) | 40-GPU (10 nodes) | 100-GPU (25 nodes) | 200-GPU (50 nodes) | 400-GPU (100 nodes) |
|---|---|---|---|---|---|---|---|
| Step Med (ms) | 977.0 | 1016.8 | 1037.1 | 1032.7 | 1076.5 | 1154.8 | 1141.3 |
| Step Min (ms) | 966.1 | 996.2 | 1016.2 | 1003.2 | 1034.8 | 1024.8 | 562.8 |
| Step Max (ms) | 1189.3 | 1511.6 | 1563.8 | 16934.2 | 1555.4 | 4183.9 | 2806.9 |
| Step StdDev (ms) | 22.3 | 71.0 | 58.0 | 1180.8 | 114.3 | 502.4 | 470.1 |
| Backward Med (ms) | 708.9 | 734.9 | 744.2 | 737.2 | 764.9 | 748.2 | 738.4 |
| Backward Min (ms) | 701.6 | 723.7 | 714.6 | 686.2 | 741.3 | 714.5 | 384.9 |
| Backward Max (ms) | 921.9 | 992.6 | 914.1 | 958.3 | 837.2 | 867.7 | 823.5 |
| Backward StdDev (ms) | 17.0 | 22.1 | 16.6 | 36.4 | 17.6 | 30.7 | 169.4 |
| Optimizer Med (ms) | 6.3 | 8.9 | 8.6 | 10.7 | 9.6 | 18.6 | 33.6 |
| Optimizer Min (ms) | 5.4 | 7.3 | 7.3 | 6.3 | 5.9 | 7.8 | 7.7 |
| Optimizer Max (ms) | 62.7 | 346.4 | 79.7 | 3602.0 | 393.8 | 409.1 | 338.1 |
| Optimizer StdDev (ms) | 4.0 | 31.6 | 5.4 | 323.9 | 61.2 | 110.1 | 85.9 |
| Forward Med (derived, ms) | 261.8 | 272.9 | 284.3 | 284.8 | 302.0 | 387.9 | 369.3 |
| cudaLaunchKernel Med (µs) | 8.224 | 8.736 | 8.128 | 7.712 | 7.488 | 7.392 | 7.712 |
| Scaling efficiency | 100% | 96.1% | 94.2% | 94.6% | 90.8% | 84.6% | 85.6% |
| Effective GPU count | 1.0 | 3.8 | 7.5 | 37.8 | 90.8 | 169.2 | 342.4 |
| Wasted GPUs | 0 | 0.2 | 0.5 | 2.2 | 9.2 | 30.8 | 57.6 |
| Step overhead vs 1-GPU (ms) | 0 | +39.7 | +60.1 | +55.7 | +99.5 | +177.7 | +164.3 |
| Overhead per node (ms) | — | 39.7 | 30.0 | 5.6 | 4.0 | 3.6 | 1.6 |

Simple Profiler Cross-Validation

The simple profiler provides per-rank averages complementary to the nsys rank-0 medians. All values are per-rank averages.

| Metric | 1-GPU | 4-GPU (1 node) | 8-GPU (2 nodes) | 40-GPU (10 nodes) | 100-GPU (25 nodes) | 200-GPU (50 nodes) | 400-GPU (100 nodes) |
|---|---|---|---|---|---|---|---|
| run_training_batch avg (ms) | 980.0 | 1,027 | 1,046 | 1,197 | 1,113 | 1,286 | 1,129 |
| backward avg (ms) | 710.8 | 736.5 | 746.2 | 747.2 | 765.8 | 750.8 | 634.3 |
| training_step avg (ms) | 260.9 | 276.1 | 287.5 | 317.1 | 317.7 | 407.9 | 417.4 |
| Total throughput (samples/s) | 8.1 | 30.5 | 60.3 | 230.0 | 692.1 | 1,059 | 2,212 |
| Dataloader throughput (batches/s) | 9,364 | 4,548 | 7,152 | 7,697 | 7,888 | 7,555 | 8,242 |

  • run_training_batch avg tracks nsys step median closely at low node counts but diverges at scale — the mean is sensitive to warmup outliers while the median is not. At 1-GPU to 2 nodes the gap is 3–9 ms (Lightning framework overhead: device transfer, callback hooks). At 10 nodes the gap widens to 164 ms and at 50 nodes to 131 ms, likely driven by the first-batch NCCL warmup inflating the mean. At 100 nodes the avg (1,129 ms) falls below the nsys median (1,141 ms) — with only 24 steps, the anomalously fast first step pulls the mean below the median. This is a further reason to use median, not mean, for step-time comparisons.
  • training_step avg is consistently wider than the nsys derived forward — it wraps forward + loss computation. The gap grows with node count: ~0 ms at 1-GPU, +32 ms at 10 nodes, +16 ms at 25 nodes, +20 ms at 50 nodes, +48 ms at 100 nodes, broadly consistent with the loss All-Reduce scaling with world size. The +48 ms gap at 100 nodes is disproportionately large relative to the 25- and 50-node values — likely an artefact of the short 24-step run rather than a true scaling effect.
  • backward avg is consistent with the nsys median up to 50 nodes (within 1.4%), confirming the two profilers agree. At 100 nodes the avg (634.3 ms) is 14% below the nsys median (738.4 ms) — caused by the anomalously short backward in the first of 24 steps pulling the mean down, the same artefact seen in the step min (384.9 ms).
  • Total throughput grows strongly in absolute terms (8.1 → 2,212 samples/s, a 273× increase at 100 nodes), as expected under weak scaling — each additional GPU processes a full local batch of its own.
  • Dataloader is not a bottleneck at any scale. Throughput (4,500–9,400 batches/s) is far above the per-rank training consumption rate (0.69–1.01 batches/s), with ample headroom at all scales tested.

NCCL AllReduce Kernel Time per Scale

f32 AllReduce GPU kernel time per step from nsys stats (ncclDevKernel_AllReduce_Sum_f32_RING_LL + ncclDevKernel_AllReduce_Sum_f32_TREE_LL), compared against the backward NVTX wall time. See Figure 11 in Multi Node Scaling for the saturation plot.

| Node count | GPUs | RING_LL (ms/step) | TREE_LL (ms/step) | Total AllReduce (ms/step) | Backward window (ms) | Saturation (%) | Algorithm regime |
|---|---|---|---|---|---|---|---|
| 1 (intra-node) | 4 | 42.6 | — | 42.6 | ~735 | ~6% | RING_LL (NVLink) |
| 2 | 8 | 145.6 | — | 145.6 | ~744 | ~20% | RING_LL |
| 10 | 40 | 329.6 | — | 329.6 | 737.2 | 45% | RING_LL |
| 25 | 100 | 317.3 | 59.8 | 377.1 | 764.9 | 49% | Mixed RING/TREE |
| 50 | 200 | 5.5 | 615.0 | 621.0 | 748.2 | 83% | TREE_LL dominant |
| 100 | 400 | 15.0 | 504.0 | 519.0 | 738.4 | 70% | TREE_LL dominant |

  • Up to 10 nodes: AllReduce total kernel time is ≤45% of the backward window; backward wall time is close to the 1-GPU baseline.
  • 25 nodes: Transitional regime — both RING_LL and TREE_LL active simultaneously. Backward peaks at 764.9 ms (+7.9% vs 1-GPU).
  • 50 nodes: NCCL switches predominantly to TREE_LL. AllReduce kernel time reaches 621 ms (83% of backward window), but backward wall time (748 ms) is only 39 ms above 1-GPU — AllReduce continues to pipeline within the backward pass.
  • 100 nodes: TREE_LL launch count falls from 34 to 29 per step at similar per-launch cost; saturation eases to 70%, consistent with the small improvement in backward median (748 → 738 ms).

Statistical Caveats (Action 1)

  • Step max and StdDev are elevated above steady-state at all multi-node scales and cannot be fully attributed from aggregate profiling data alone. Step max excess above median ranges from 479 ms (25 nodes) to 15,901 ms (10 nodes). The NVTX summary does not record which step produced the maximum — only the aggregate min/max across all steps. The most likely contributor is a cold-start NCCL communicator on the first step: at 10 and 50 nodes, the single ncclDevKernel_AllReduce_Sum_u32_TREE_LL instance (11.07 s and 1.63 s respectively) is large enough that a first-step origin is certain. At 25 and 100 nodes the same kernel is negligible, so the step max excess could reflect a cold-start effect on a different collective, an intermittent NCCL stall, or scheduler-induced jitter on any step. A step-level kernel trace is required to distinguish these cases.

  • Optimizer max is heavily skewed at all multi-node scales while the median remains stable. Optimizer NVTX max vs median (from :optimizer NVTX ranges, single-run): 3,602 ms vs 10.7 ms (10 nodes), 394 ms vs 9.6 ms (25 nodes), 409 ms vs 18.6 ms (50 nodes), 338 ms vs 33.6 ms (100 nodes). The optimizer NVTX range covers clip_grad_norm_ — a scalar All-Reduce separate from the gradient buckets — which is a plausible source of a cold-start spike, but as with the step max, the aggregate summary does not identify which step produced the outlier. Steady-state optimizer median grows 6.3 ms (1-GPU) → 33.6 ms (100 nodes), consistent with normal gradient norm sync scaling with world size.

  • Backward minimum decreases at 10 nodes (686.2 ms vs 701.6 ms at 1-GPU) and falls anomalously low at 100 nodes (384.9 ms). The 10-node dip suggests NCCL async overlap hides part of the compute latency in the best case. The 100-node figure is an artefact of the small 24-step run — a single unusually fast step pulls the minimum well below any plausible compute floor.

  • All figures are rank 0 only — the true step time is gated by the slowest rank. Median and minimum values reflect rank 0 behaviour; in practice the job cannot advance until all ranks complete. The step max values (16,934 ms at 10 nodes, 1,555 ms at 25 nodes, 4,184 ms at 50 nodes, 2,807 ms at 100 nodes) are the better bound on worst-case job duration per step.

Startup Phase Definitions and Raw Timings

The startup timer callback fires on five Lightning hooks from rank 0. T0 is set at callback instantiation — after Python imports and Hydra config loading, but before model initialisation. Comparing each phase across scales identifies where the additional startup cost appears.

| Phase | Operation |
|---|---|
| T0 → setup | Model and graph construction, dataset open, weight initialisation |
| setup → on_fit_start | DDP model wrapping and weight broadcast from rank 0 to all ranks (462 MB over NVLink intra-node, Slingshot inter-node) |
| on_fit_start → on_train_start | NCCL process group initialisation and communicator setup |
| on_train_start → first batch start | Gradient bucket allocation and data prefetch |
| First batch | Forward + backward + first AllReduce, including NCCL topology negotiation warmup |

Raw timings per scale (rank 0):

| Phase | 1-GPU | 4-GPU (1 node) | 8-GPU (2 nodes) | 40-GPU (10 nodes) | 100-GPU (25 nodes) | 200-GPU (50 nodes) | 400-GPU (100 nodes) |
|---|---|---|---|---|---|---|---|
| T0 → setup | 11.2 s | 12.7 s | 12.0 s | 18.0 s | 164.4 s † | 20.6 s | 28.8 s |
| setup → on_fit_start | 0.5 s | 0.3 s | 0.5 s | 2.8 s | 5.8 s | 17.6 s | 36.8 s |
| on_fit_start → on_train_start | 4.6 s | 4.6 s | 4.7 s | 6.8 s | 3.9 s | 6.9 s | 9.1 s |
| on_train_start → first batch start | 1.4 s | 3.6 s | 4.1 s | 1.8 s | 1.8 s | 2.7 s | 1.7 s |
| First batch | 1.2 s | 1.2 s | 2.7 s | 16.9 s | 1.4 s | 4.2 s | 2.8 s |
| Total | 18.9 s | 22.5 s | 24.0 s | 46.2 s | 177.3 s † | 52.0 s | 79.1 s |
| vs 1-GPU | — | +3.6 s | +5.1 s | +27.3 s | — † | +33.1 s | +60.2 s |

† The 25-node T0 → setup phase (164.4 s) is an anomalous outlier — 8× above any other case at comparable scale — consistent with a Lustre contention spike or a slow node assignment on this single run. All other 25-node phases are in range with surrounding cases. The total and vs-1-GPU values for 25 nodes are dominated by this artefact and are not comparable to the other entries.
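
A minimal sketch of a startup-timer callback of this kind, assuming the lightning.pytorch Callback hook names; the exact hook set and phase labels used in the Anemoi runs may differ.

```python
import time
import lightning.pytorch as pl

class StartupTimer(pl.Callback):
    """Record elapsed time since instantiation (T0) at each Lightning startup hook."""

    def __init__(self):
        self.t0 = time.perf_counter()   # T0: callback instantiation
        self.marks = {}

    def _mark(self, name):
        self.marks[name] = time.perf_counter() - self.t0

    def setup(self, trainer, pl_module, stage):
        self._mark("setup")

    def on_fit_start(self, trainer, pl_module):
        self._mark("on_fit_start")

    def on_train_start(self, trainer, pl_module):
        self._mark("on_train_start")

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        if batch_idx == 0:
            self._mark("first_batch_start")

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if batch_idx == 0 and trainer.global_rank == 0:
            self._mark("first_batch_end")
            print(self.marks)           # cumulative per-phase times on rank 0
```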


References

[1] ECMWF. "Anemoi: European framework for AI weather forecasting." ECMWF AIFS Blog, 2026. https://www.ecmwf.int/en/about/media-centre/aifs-blog/2026/anemoi-european-framework-ai

[2] ECMWF. anemoi-core. GitHub, 2024. https://github.com/ecmwf/anemoi-core

[3] ECMWF. "ERA5 O96." Anemoi Training Documentation, 2024. https://anemoi.readthedocs.io/projects/training/en/latest/user-guide/download-era5-o96.html

[4] University of Bristol. Isambard-AI Documentation. https://docs.isambard.ac.uk/

[5] NVIDIA. "GH200 Grace Hopper Superchip." https://www.nvidia.com/en-gb/data-center/grace-hopper-superchip/

[6] NVIDIA. NCCL: NVIDIA Collective Communications Library. https://developer.nvidia.com/nccl

[7] PyTorch. torch.utils.checkpoint — Activation Checkpointing. https://pytorch.org/docs/stable/checkpoint.html

[8] Meta Research. HTA: Holistic Trace Analysis. GitHub, 2023. https://github.com/facebookresearch/HolisticTraceAnalysis

[9] Google. Perfetto UI — System Profiling, App Tracing and Trace Analysis. https://ui.perfetto.dev/

[10] S. Williams, A. Waterman, and D. Patterson. "Roofline: An Insightful Visual Performance Model for Multicore Architectures." Communications of the ACM, 52(4):65–76, 2009. https://doi.org/10.1145/1498765.1498785

[11] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." Advances in Neural Information Processing Systems, 2022. https://arxiv.org/abs/2205.14135

[12] S. Li, Y. Zhao, R. Varma, O. Salpekar, P. Noordhuis, T. Li, A. Paszke, J. Smith, B. Vaughan, P. Damania, and S. Chintala. "PyTorch Distributed: Experiences on Accelerating Data Parallel Training." Proceedings of the VLDB Endowment, 13(12), 2020. https://arxiv.org/abs/2006.15704

[13] NVIDIA. Nsight Compute Documentation. https://docs.nvidia.com/nsight-compute/

[14] NVIDIA. Nsight Systems Documentation. https://docs.nvidia.com/nsight-systems/

[15] J. Ansel et al. "PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation." Proceedings of the 29th ACM ASPLOS, 2024. https://dl.acm.org/doi/10.1145/3620665.3640366

[16] NVIDIA. "NVIDIA Hopper Architecture In-Depth." NVIDIA Technical Blog, 2022. https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/

[17] HPE. HPE Slingshot Interconnect. https://www.hpe.com/us/en/compute/hpc/slingshot-interconnect.html