
Commit 1783134

added section on estimating node-hours
1 parent 04bef42 commit 1783134

File tree

2 files changed: +85 -0 lines changed

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -24,5 +24,6 @@ discussion category on the companion HPC Fund `GitHub site
   access
   jobs
   software
+  nodehours
   help
   disclaimer

docs/nodehours.md

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
# Node-Hour Estimates

The AMD University Program (AUP) AI & HPC Cluster supports **academic AI and HPC research and education** through year-long node-hour allocations, awarded quarterly to university faculty leading innovative, impactful, open-source projects.

When submitting a proposal, you will estimate the node-hours your project needs **in each partition** (i.e., node type; see Table 1) over a **1-year period**; these per-partition estimates are then normalized into a single node-hour allocation.

This guide explains how to prepare those estimates when completing the proposal (Step 3 of our 3-Step process):
https://www.amd.com/en/corporate/university-program/ai-hpc-cluster.html#apply

---
## Available Partitions and Capacity

**Table 1. Partitions, charge factors, and approximate annual and quarterly capacity**

| Partition | GPUs<br>per node | GPU type | Number<br>of nodes | Charge<br>factor | ~Annual<br>node-hours |
| --------- | ---------------- | -------- | ------------------ | ---------------- | --------------------- |
| mi3508x | 8 | AMD Instinct MI350X | 4 | 1.40 | ~33,000 |
| mi3501x | 1 | AMD Instinct MI350X | 8 | 0.175 | ~67,000 |
| mi3258x | 8 | AMD Instinct MI325X | 1 | 1.20 | ~8,000 |
| mi3008x | 8 | AMD Instinct MI300X | 2 | 1.00 | ~17,000 |
| mi3001x | 1 | AMD Instinct MI300X | 8 | 0.125 | ~67,000 |
| mi2508x | 8 | AMD Instinct MI250 | 10 | 0.80 | ~83,000 |
| mi2104x | 4 | AMD Instinct MI210 | 11 | 0.40 | ~92,000 |
| mi2101x | 1 | AMD Instinct MI210 | 28 | 0.10 | ~233,000 |

- Approximate annual node-hours in this table reflect total usable capacity across the entire cluster and are shared among all projects (a rough cross-check of these values appears after this list).
- The [1x] partitions represent virtual “slice” nodes that use a fraction (1/8 or 1/4) of a physical node’s resources; charge factors reflect this proportional usage.
- Each quarter, we allocate ~1/4 of our total annual node-hour capacity to multiple new projects.
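
The "~Annual node-hours" column is close to the number of nodes multiplied by the hours in a year, reduced for downtime. The sketch below illustrates this; the ~95% usable-time factor is an assumption for illustration, not a published figure.

```python
# Rough cross-check of Table 1's "~Annual node-hours" column.
# ASSUMPTION (not from the docs): usable time is ~95% of wall-clock time.
HOURS_PER_YEAR = 8760
USABLE_FRACTION = 0.95  # hypothetical allowance for maintenance and downtime

nodes_per_partition = {"mi3508x": 4, "mi3008x": 2, "mi2104x": 11, "mi2101x": 28}

for partition, nodes in nodes_per_partition.items():
    estimate = nodes * HOURS_PER_YEAR * USABLE_FRACTION
    print(f"{partition}: ~{estimate:,.0f} node-hours/year")

# Prints roughly 33,288 / 16,644 / 91,542 / 233,016, close to the
# ~33,000 / ~17,000 / ~92,000 / ~233,000 values listed in Table 1.
```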
---
## What to Provide in the Proposal

Include an **estimate of the node-hours needed by partition** (i.e., by node type; see Table 1). Enter zeros for partitions you do not plan to use.

- A **node-hour** is the accounting unit used to allocate compute usage. Node-hours are consumed based on the number of nodes used and the run time (e.g., 4 nodes × 25 hours = 100 node-hours); see the sketch after this list.
- These are planning estimates for the full allocation period of 1 year.
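
The node-hour arithmetic above can be sketched in a few lines of Python. The run list and partition choices below are hypothetical placeholders, not a required format.

```python
# Node-hour arithmetic for planning (illustrative numbers only).
# Each planned run is (partition, nodes, wall-clock hours).
planned_runs = [
    ("mi2104x", 4, 25),   # 4 nodes x 25 hours = 100 node-hours
    ("mi2104x", 2, 50),   # repeated runs on the same partition accumulate
    ("mi3008x", 1, 120),  # a single node consumes 1 node-hour per hour
]

node_hours = {}
for partition, nodes, hours in planned_runs:
    node_hours[partition] = node_hours.get(partition, 0) + nodes * hours

for partition, total in sorted(node_hours.items()):
    print(f"{partition}: {total} node-hours")
# mi2104x: 200 node-hours
# mi3008x: 120 node-hours
```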
---
## How Allocations Are Calculated

Each partition has a charge factor reflecting relative capability and availability. Your per-partition requests will be converted into a single normalized node-hour allocation using these factors (a sketch of this conversion follows the list below).

**PIs do not need to perform normalization**; simply provide the actual node-hours needed per partition.

Key points:
- You receive one total allocation, not fixed per-partition limits.
- Higher-capability partitions consume the total allocation more quickly.
- You may shift usage across partitions during the year, as long as the total usage remains within the awarded allocation.
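
As a hedged illustration of how per-partition requests might fold into one normalized total, the sketch below assumes normalization simply weights each partition's node-hours by its charge factor from Table 1 and sums them; the text implies this but does not spell out the exact formula, and the request numbers are illustrative.

```python
# Hedged sketch of normalization, ASSUMING each partition's requested node-hours
# are weighted by its charge factor (Table 1) and summed into a single total.
charge_factor = {
    "mi3508x": 1.40, "mi3501x": 0.175,
    "mi3258x": 1.20,
    "mi3008x": 1.00, "mi3001x": 0.125,
    "mi2508x": 0.80,
    "mi2104x": 0.40, "mi2101x": 0.10,
}

# Illustrative per-partition requests (node-hours over the 1-year period).
requested = {"mi3008x": 300, "mi2104x": 2000, "mi2101x": 1000}

normalized_total = sum(hours * charge_factor[p] for p, hours in requested.items())
print(f"Normalized request: {normalized_total:.0f} node-hours")
# 300*1.00 + 2000*0.40 + 1000*0.10 = 1200
```

This also shows why higher-capability partitions consume the total more quickly: the same wall-clock usage on mi3508x is weighted 14× more heavily than on mi2101x.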
---
## Hardware Selection Guidance

General guidance on hardware capabilities:
- **MI350X / MI325X** – largest memory capacity; suited for very large or memory-constrained models.
- **MI300X** – high-performance general-purpose accelerator for modern ML workloads.
- **MI250 / MI210** – well-suited for development, testing, and compute-intensive workloads with smaller memory footprints.

Projects that effectively combine MI200- and MI300-series usage are generally able to **receive larger total allocations**.
---
## MI3XX Usage Considerations

MI3XX-class nodes (MI300X, MI325X, MI350X) are a limited shared resource.

During allocation review, the following guidelines are considered (in addition to scientific merit and impact); a worked example follows this list:
- Requests **≤2%** of a partition's **annual capacity** are typically straightforward to support.
- Requests between **3–9%** of a partition's **annual capacity** require a clear technical rationale and strong expected impact.
- Requests **≥10%** of a partition's **annual capacity** are typically not considered.
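
For a rough sense of what these percentages mean in node-hours, the sketch below applies the 2% and 10% thresholds to the approximate annual capacities of the 8-GPU MI3XX partitions from Table 1; the resulting figures are approximate and illustrative only.

```python
# Illustrative thresholds for the 8-GPU MI3XX partitions, using the
# approximate annual capacities from Table 1 (values are approximate).
annual_capacity = {"mi3508x": 33_000, "mi3258x": 8_000, "mi3008x": 17_000}

for partition, capacity in annual_capacity.items():
    print(f"{partition}: <=2% is ~{0.02 * capacity:,.0f} node-hours; "
          f">=10% is ~{0.10 * capacity:,.0f} node-hours")
# mi3508x: ~660 / ~3,300
# mi3258x: ~160 / ~800
# mi3008x: ~340 / ~1,700
```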
---
## Justification

Provide a brief justification of your node-hour estimates in the proposal form. Focus on technical requirements (e.g., memory footprint, model size, precision needs) and prior experience with comparable workloads.

---
## Summary (quick reference)

- Provide annual node-hour estimates by partition.
- MI3XX capacity is limited and shared.
- Mixed MI200/MI300 usage enables larger total allocations.
- Hardware usage may evolve within the awarded total.