OCI GPU Quick Start: NVIDIA H100

This document provides hardware specifications, supported OS images, onboarding verification, sample benchmarks, and best practices for OCI deployments using the NVIDIA H100 GPU shape.

BM.GPU.H100.8 is a high-bandwidth NVIDIA H100 bare-metal shape intended for large-scale AI and HPC workloads, with eight 80 GB GPUs, dual Intel Xeon Platinum 8480+ processors, and RoCE-capable scale-out networking.

At a Glance

  • Shape: BM.GPU.H100.8
  • GPU configuration: 8 x NVIDIA H100 80 GB
  • Recommended OS baseline: Oracle Linux 8 or 9, Ubuntu Linux 22.04, or Ubuntu Linux 24.04
  • Recommended software baseline: DOCA OFED 3.2.1, NVIDIA Driver 580/590 (Open), CUDA 13.0/13.1
  • Primary verification command: nvidia-smi
  • Operational profile: scale-out AI and HPC with NCCL topology considerations

Hardware Specifications

  • Shape name: BM.GPU.H100.8
  • GPU model: H100
  • GPUs per node: 8
  • GPU memory per GPU: 80 GB
  • GPU memory total: 640 GB
  • CPU: 2 x Intel Xeon Platinum 8480+ @ 2.0 GHz
  • CPU cores: 112
  • System memory: 2 TB DDR5
  • Local storage: 16 x 3.5 TB NVMe (~54 TB usable)
  • Host NIC: 100 Gb/s
  • RDMA (RoCE) NICs: 8 x 2 x 200 Gb/s = 3.2 Tb/s

See the OCI Compute Shapes Docs for up-to-date details.

Recommended Operating Systems

  • Oracle Linux 8
  • Oracle Linux 9
  • Ubuntu Linux 22.04
  • Ubuntu Linux 24.04

Recommended Software Versions

  • DOCA OFED 3.2.1
  • NVIDIA Driver 580 or 590 (Open)
  • CUDA 13.0 or 13.1
  • Oracle Cloud Agent 1.57.0
  • Use the Provided Images table below for the current validated OCI image combinations

Custom OS Image Creation with Packer

To build your own images with Packer, clone the OCI HPC Images GitHub repo and run the commands documented there.

Provided Images

| OS Version | Image Packer Build Details | OCI Platform Image Link | Driver Versions | Build & Dependency Status |
|---|---|---|---|---|
| OCI GPU AI Image with Ubuntu Linux 22.04 | Canonical-Ubuntu-22.04-DOCA-OFED-3.2.1-GPU-580-OPEN-CUDA-13.0 | PAR Link | NVIDIA OPEN 580, DOCA OFED 3.2.1, CUDA 13.0, OCA 1.57.0 | Build Build |
| OCI GPU AI Image with Ubuntu Linux 22.04 | Canonical-Ubuntu-22.04-DOCA-OFED-3.2.1-GPU-590-OPEN-CUDA-13.1 | PAR Link | NVIDIA OPEN 590, DOCA OFED 3.2.1, CUDA 13.1, OCA 1.57.0 | Build Build |
| OCI GPU AI Image with Ubuntu Linux 24.04 | Canonical-Ubuntu-24.04-6.8-DOCA-OFED-3.2.1-GPU-580-OPEN-CUDA-13.0 | PAR Link | NVIDIA OPEN 580, DOCA OFED 3.2.1, CUDA 13.0, Kernel 6.8, OCA 1.57.0 | Build Build |
| OCI GPU AI Image with Ubuntu Linux 24.04 | Canonical-Ubuntu-24.04-6.8-DOCA-OFED-3.2.1-GPU-590-OPEN-CUDA-13.1 | PAR Link | NVIDIA OPEN 590, DOCA OFED 3.2.1, CUDA 13.1, Kernel 6.8, OCA 1.57.0 | Build Build |
| OCI GPU AI Image with Ubuntu Linux 24.04 | Canonical-Ubuntu-24.04-6.14-DOCA-OFED-3.2.1-GPU-590-OPEN-CUDA-13.1 | PAR Link | NVIDIA OPEN 590, DOCA OFED 3.2.1, CUDA 13.1, Kernel 6.14, OCA 1.57.0 | Build Build |
| OCI GPU AI Image with Oracle Linux 8 | Oracle-Linux-8.10-RHCK-DOCA-OFED-3.2.1-GPU-580-OPEN-CUDA-13.0 | PAR Link | NVIDIA OPEN 580, DOCA OFED 3.2.1, CUDA 13.0, OCA 1.57.0 | Build Build |
| OCI GPU AI Image with Oracle Linux 8 | Oracle-Linux-8.10-RHCK-DOCA-OFED-3.2.1-GPU-590-OPEN-CUDA-13.1 | PAR Link | NVIDIA OPEN 590, DOCA OFED 3.2.1, CUDA 13.1, OCA 1.57.0 | Build Build |
| OCI GPU AI Image with Oracle Linux 9 | Oracle-Linux-9.7-RHCK-DOCA-OFED-3.2.1-GPU-580-OPEN-CUDA-13.0 | PAR Link | NVIDIA OPEN 580, DOCA OFED 3.2.1, CUDA 13.0, OCA 1.57.0 | Build Build |
| OCI GPU AI Image with Oracle Linux 9 | Oracle-Linux-9.7-RHCK-DOCA-OFED-3.2.1-GPU-590-OPEN-CUDA-13.1 | PAR Link | NVIDIA OPEN 590, DOCA OFED 3.2.1, CUDA 13.1, OCA 1.57.0 | Build Build |
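
The image names listed above appear to follow a regular pattern: distribution, OS version, an optional kernel tag, then the DOCA OFED, driver, and CUDA versions. The Python sketch below splits a name into those fields so automation can select a matching image. The pattern is inferred only from the names in this table, and the function name `parse_image_name` is a hypothetical helper, not part of any OCI tooling.

```python
import re

# Pattern inferred from the image names in the table above (an assumption,
# not a documented naming contract).
IMAGE_NAME = re.compile(
    r"^(?P<distro>Canonical-Ubuntu|Oracle-Linux)"
    r"-(?P<os_version>\d+\.\d+)"
    r"(?:-(?P<kernel>[^-]+))?"          # e.g. 6.8, 6.14, RHCK; absent for 22.04
    r"-DOCA-OFED-(?P<doca>[\d.]+)"
    r"-GPU-(?P<driver>\d+)-OPEN"
    r"-CUDA-(?P<cuda>[\d.]+)$"
)


def parse_image_name(name):
    """Return a dict of name components, or None if the name does not match."""
    match = IMAGE_NAME.match(name)
    return match.groupdict() if match else None


info = parse_image_name(
    "Canonical-Ubuntu-24.04-6.14-DOCA-OFED-3.2.1-GPU-590-OPEN-CUDA-13.1"
)
print(info)
```

A name without a kernel tag, such as the Ubuntu 22.04 images, yields `kernel` set to `None`.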

Hello World Verification

Run nvidia-smi to verify that all eight GPUs are visible and healthy:

nvidia-smi

You should see all eight H100 GPUs listed with a healthy driver stack and no obvious ECC errors.
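
For unattended onboarding, the same check can be scripted. The Python sketch below queries `nvidia-smi` in CSV mode and confirms that eight H100 devices are reported. The helper names `query_gpus` and `check_gpus` are hypothetical, and the check covers GPU count and model only; it does not inspect ECC counters.

```python
import subprocess

EXPECTED_GPUS = 8  # BM.GPU.H100.8 exposes eight GPUs


def query_gpus():
    """Return nvidia-smi CSV output, one line per GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def check_gpus(csv_text, expected=EXPECTED_GPUS):
    """Parse the CSV text and return a list of problems (empty means OK)."""
    rows = [
        [field.strip() for field in line.split(",")]
        for line in csv_text.splitlines() if line.strip()
    ]
    problems = []
    if len(rows) != expected:
        problems.append(f"expected {expected} GPUs, found {len(rows)}")
    for row in rows:
        index, name = row[0], row[1]
        if "H100" not in name:
            problems.append(f"GPU {index} reports unexpected model: {name}")
    return problems
```

On a node, `check_gpus(query_gpus())` returns an empty list when all eight GPUs are visible.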

Performance Benchmarks

NCCL is NVIDIA's primary collective communication library for multi-GPU AI and HPC workloads. This section covers representative single-node and multi-node NCCL workflows, plus the supporting topology and network guidance needed for scale-out runs.

All Reduce - Single Node

./build/all_reduce_perf -b 8 -e 8G -f 2 -g 8
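
In the command above, `-b` and `-e` set the smallest and largest message sizes, `-f` is the multiplication factor between steps, and `-g` is the number of GPUs. The Python sketch below reproduces the resulting size sweep and the all-reduce bus-bandwidth correction factor 2(n-1)/n described in the nccl-tests performance notes. It assumes size suffixes are powers of 1024, and the function names are illustrative only.

```python
def parse_size(text):
    """Convert a size such as '8' or '8G' to bytes (suffixes assumed 1024-based)."""
    units = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
    suffix = text[-1].upper()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)


def sweep_sizes(min_bytes, max_bytes, factor):
    """List the message sizes visited when stepping by a multiplicative factor."""
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= factor
    return sizes


def allreduce_bus_bandwidth(alg_bw, ranks):
    """Scale algorithm bandwidth by 2(n-1)/n to get all-reduce bus bandwidth."""
    return alg_bw * 2 * (ranks - 1) / ranks


sizes = sweep_sizes(parse_size("8"), parse_size("8G"), 2)
print(len(sizes), sizes[0], sizes[-1])
print(allreduce_bus_bandwidth(100.0, 8))
```

With eight ranks the correction factor is 1.75, so bus bandwidth reads higher than algorithm bandwidth for the same run.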

Multi-node Guidance

For H100 multi-node NCCL jobs, the source material calls out a required topology file. If it is not already present in the image, use one of the following:

For guidance on running additional NCCL collective benchmarks on this GPU family, see the NCCL user guide.
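
NCCL reads a topology file path from the `NCCL_TOPO_FILE` environment variable. As a minimal sketch, the Python helper below adds that variable to a job environment only when the file exists on the node. The helper name `nccl_env` is hypothetical, and the caller must supply the path, since this guide does not specify one.

```python
import os


def nccl_env(topo_file, base_env=None):
    """Return a copy of the environment with NCCL_TOPO_FILE set if the file exists."""
    env = dict(os.environ if base_env is None else base_env)
    if os.path.isfile(topo_file):
        env["NCCL_TOPO_FILE"] = topo_file
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching the benchmark.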

Model Inference Performance

The H100 source page in this workflow is oriented more toward system validation and NCCL guidance than a clean standalone inference table, so no normalized model-performance table is included here.

OKE GPU Getting Started

Information on getting up and running on OKE can be found here.

Useful H100-specific OKE starting points in oci-hpc-oke:

Troubleshooting

This guide includes a broad health-check set covering GPU visibility, NUMA topology, RDMA connectivity, DCGM diagnostics, PCIe bandwidth, and NVLink validation.

GPU Visibility

nvidia-smi

NUMA Layout

numactl --hardware

For H100, the source material expects a layout comparable to a 112-core dual-socket system, or fewer visible cores when hyperthreading is disabled.
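
To compare a node against that expectation automatically, the CPU lists in the `numactl --hardware` output can be counted per NUMA node. The Python sketch below assumes the usual `node N cpus: ...` line format. The sample text is illustrative and shortened, not captured from an H100 node.

```python
import re

CPU_LINE = re.compile(r"^node (\d+) cpus:(.*)$")


def cpus_per_node(numactl_output):
    """Map each NUMA node id to the number of logical CPUs listed for it."""
    counts = {}
    for line in numactl_output.splitlines():
        match = CPU_LINE.match(line.strip())
        if match:
            counts[int(match.group(1))] = len(match.group(2).split())
    return counts


# Illustrative, shortened sample (not real H100 output).
sample = """available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 1031000 MB
node 1 cpus: 4 5 6 7
node 1 size: 1032000 MB
"""
counts = cpus_per_node(sample)
print(counts, sum(counts.values()))
```

Summing the per-node counts gives the total number of logical CPUs that the operating system sees.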

RDMA Interface State

rdma link

The source material identifies the front-end network as ens1200 on mlx5_2, with RoCE RDMA interfaces expected to report ACTIVE and LINK_UP.
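
The state check can be automated by parsing each `link` line of the `rdma link` output. The Python sketch below reports any link that is not both ACTIVE and LINK_UP. The sample lines and netdev names are illustrative, not captured from an H100 node, and the helper `unhealthy_links` is hypothetical.

```python
import re

LINK_LINE = re.compile(
    r"^link (?P<device>\S+)/(?P<port>\d+) state (?P<state>\S+) "
    r"physical_state (?P<physical>\S+)(?: netdev (?P<netdev>\S+))?"
)


def unhealthy_links(rdma_output):
    """Return device names whose link is not ACTIVE and LINK_UP."""
    bad = []
    for line in rdma_output.splitlines():
        match = LINK_LINE.match(line.strip())
        if not match:
            continue
        if match["state"] != "ACTIVE" or match["physical"] != "LINK_UP":
            bad.append(match["device"])
    return bad


# Illustrative sample lines (device and netdev names are made up).
sample = """link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev rdma0
link mlx5_1/1 state DOWN physical_state DISABLED netdev rdma1
link mlx5_2/1 state ACTIVE physical_state LINK_UP netdev ens1200
"""
print(unhealthy_links(sample))
```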

DCGM Diagnostics

dcgmi diag -r 1
dcgmi diag -r 2
dcgmi diag -r 3

The source material describes:

  • r1 as a quick metadata and deployment check
  • r2 as a medium-depth integration and hardware check
  • r3 as a fuller stress-oriented validation pass
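
Because the three levels grow in depth and runtime, one reasonable approach is to run them in order and stop at the first failure. The Python sketch below does that, assuming `dcgmi diag` signals failure through a nonzero exit code. The `runner` parameter is a hypothetical hook that lets the logic be exercised without DCGM installed.

```python
import subprocess


def run_diag_levels(levels=(1, 2, 3), runner=subprocess.run):
    """Run dcgmi diag at each level in order; stop after the first nonzero exit.

    Returns a dict mapping each level that ran to its exit code.
    """
    results = {}
    for level in levels:
        completed = runner(["dcgmi", "diag", "-r", str(level)])
        results[level] = completed.returncode
        if completed.returncode != 0:
            break  # skip deeper levels once a shallower one fails
    return results
```

Calling `run_diag_levels()` on a node with DCGM installed runs levels 1 through 3 and returns the exit code of each level that ran.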

PCIe and NVLink Validation

Additional validation tools referenced in the source material:

  • bandwidthTest from NVIDIA cuda-samples for PCIe bandwidth
  • nvbandwidth for NVLink bandwidth validation

Further Reading & Support

Additional references: