# OCI GPU Scanner

Active GPU performance health check and monitoring tool for OCI GPU compute resources.
- Overview
- Key Features
- Quick Start
- Health Checks
- Architecture
- Dashboards & Monitoring
- Dependencies
- Roadmap
- Limitations
- Support & Contact
## Overview

OCI GPU Scanner is a cloud-native monitoring solution for OCI GPU clusters, powered by Prometheus and Grafana. It runs directly in your OCI tenancy, giving you complete control and privacy: your data never leaves your environment.
- ✅ Monitors GPU health and performance (NVIDIA & AMD)
- ✅ Validates RDMA cluster connectivity
- ✅ Provides real-time dashboards and metrics
- ✅ Executes active performance checks and passive health monitoring
- ✅ Integrates with your existing infrastructure
- Free: Available to all OCI customers at no cost
- Private: Deployed in your tenancy; you control the data
- Comprehensive: Covers both hardware and application metrics
- Extensible: Integrate PyTorch, vLLM, or custom metrics
## Key Features

- Tenancy-wide monitoring: no region or compartment restrictions
- Multi-GPU support: NVIDIA (A100, H100, B200, H200) and AMD (MI300X)
- Flexible deployment: Native OKE integration (DaemonSet) or system service (bare metal/VMs)
- On-demand checks: Trigger active health checks via REST API when needed
- Complete privacy: All data stays within your tenancy boundary
- GPU metrics collection via NVIDIA DCGM Exporter and AMD SMI Exporter
- Custom RDMA cluster metrics for network performance
- Active checks (GPU-occupying): PyTorch-based benchmarks with baseline thresholds
- Passive checks (non-intrusive): Periodic monitoring without disrupting workloads
- Pre-configured Grafana dashboards with cluster, node, and GPU-level views
- Integrates with OKE Node Problem Detector for auto-tagging failed nodes
- Supports bring-your-own Prometheus/Grafana instances
- Easily extend with custom metrics (PyTorch, vLLM, etc.)
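One way to extend the stack is to expose a custom application metric in the Prometheus text exposition format and have Prometheus scrape it alongside the built-in exporters. Below is a minimal, stdlib-only sketch; the metric name and port are illustrative and not part of OCI GPU Scanner:

```python
# Minimal custom-metric exporter sketch (illustrative; not part of OCI GPU Scanner).
# Serves a single gauge in Prometheus text exposition format on /metrics.
from http.server import BaseHTTPRequestHandler, HTTPServer

def render_metrics(value: float) -> str:
    """Render one gauge in Prometheus exposition format."""
    return (
        "# HELP my_app_tokens_per_second Hypothetical application throughput.\n"
        "# TYPE my_app_tokens_per_second gauge\n"
        f"my_app_tokens_per_second {value}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(123.4).encode()  # replace 123.4 with a real reading
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Port 9400 is an arbitrary choice; add a matching scrape job to Prometheus.
    HTTPServer(("0.0.0.0", 9400), MetricsHandler).serve_forever()
```

A PyTorch or vLLM workload would follow the same pattern, updating the gauge from its training or serving loop.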
## Quick Start

Choose your preferred installation method:
| Method | Best For | Link |
|---|---|---|
| Terraform & OCI Resource Manager | New deployments (includes OKE cluster) | Getting Started Guide |
| OKE Managed Add-On | Existing OKE clusters (via Console) | Console Deploy Guide |
| Helm | Existing OKE clusters (CLI) | Helm Deploy Guide |
## Health Checks

OCI GPU Scanner performs two types of health checks:
### Active Health Checks

Note: These checks occupy GPUs during execution.
Active checks run when the plugin is installed or triggered on-demand via the portal/REST API. They use PyTorch operations (Matmul, Linear Regression) to generate performance scores, and leverage MPI for multi-node cluster testing.
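Triggering a check over REST might look like the sketch below. The endpoint path and payload fields here are hypothetical placeholders; consult the portal and deploy guides for the actual API schema:

```python
# Sketch of triggering an active check via the control-plane REST API.
# The endpoint path and payload fields are hypothetical; see the deploy
# guides for the real API schema.
import json
import urllib.request

def build_check_request(base_url: str, node: str, check: str) -> urllib.request.Request:
    payload = {"node": node, "check": check}  # hypothetical fields
    return urllib.request.Request(
        f"{base_url}/api/checks",             # hypothetical endpoint
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    req = build_check_request("http://scanner.example.internal", "gpu-node-0", "matmul")
    with urllib.request.urlopen(req) as resp:  # needs network access to the portal
        print(resp.status, resp.read().decode())
```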
Sample checks and outputs:
- Model MFU (Model FLOPs Utilization)
- Background computation
- Compute throughput
- Memory bandwidth
- Error detection
- Tensor core utilization
- Sustained workload
- Mixed precision testing
- GPU power, temperature, and utilization
- GPU topology and XID errors
- RDMA MPI multi-node tests (all2all, allgather, allreduce, broadcast)
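The MFU score above is, in essence, achieved FLOP/s divided by the hardware's peak FLOP/s. A back-of-the-envelope sketch for a matmul-based check (the peak figure below is an illustrative assumption; use the datasheet number for your GPU model):

```python
# Back-of-the-envelope MFU calculation for a matmul benchmark (sketch).
def matmul_flops(m: int, n: int, k: int) -> float:
    """FLOPs for one (m x k) @ (k x n) matmul: one multiply + one add per term."""
    return 2.0 * m * n * k

def mfu(m: int, n: int, k: int, iters: int, seconds: float, peak_flops: float) -> float:
    """Model FLOPs Utilization: achieved FLOP/s over peak FLOP/s."""
    achieved = matmul_flops(m, n, k) * iters / seconds
    return achieved / peak_flops

# Example: 8192^3 matmul, 100 iterations in 0.2 s, against a 1e15 FLOP/s
# peak (illustrative value, not any specific GPU's datasheet figure).
score = mfu(8192, 8192, 8192, iters=100, seconds=0.2, peak_flops=1e15)  # ≈ 0.55
```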
### Passive Health Checks

Note: These do NOT occupy GPUs.
Passive checks run every minute by default (configurable during installation). They're maintained by the OCI GPU Core Compute team and included in Dr.HPC V2 binaries.
- GPU count verification
- PCIe error, speed, and width checks
- RDMA NIC count and link validation
- Network RX discards and GID index
- Ethernet link state (100GbE RoCE)
- Authentication status (wpa_cli)
- SRAM error detection
- GPU driver version compatibility
- GPU clock speeds
- eth0 interface presence
- HCA fatal errors
- Thermal throttling monitoring
- Source-based routing configuration
- Oracle Cloud Agent version
- RDMA link flap detection
- PCIe hierarchy validation
- Row remap errors (nvidia-smi)
- RTTCC status (H100)
GPU-specific checks: XGMI (AMD); NVLink and fabric manager (NVIDIA).
Failures include recommended remediation actions.
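To give a flavor of one such check, the GPU count verification boils down to counting visible devices and comparing against the shape's expected count. This is only an illustrative sketch with hypothetical function names; the real scripts are maintained by the OCI GPU Core Compute team:

```python
# Illustrative sketch of a GPU count check (not the actual Dr.HPC script).
import subprocess

def count_gpus(smi_listing: str) -> int:
    """Count GPUs in `nvidia-smi -L` output (one 'GPU <n>: ...' line per device)."""
    return sum(1 for line in smi_listing.splitlines() if line.startswith("GPU "))

def gpu_count_check(expected: int, smi_listing: str) -> tuple[bool, str]:
    found = count_gpus(smi_listing)
    if found == expected:
        return True, f"OK: {found}/{expected} GPUs visible"
    return False, f"FAIL: {found}/{expected} GPUs visible; check driver and PCIe state"

if __name__ == "__main__":
    # Expected count depends on the compute shape (e.g. 8 for BM.GPU.H100.8).
    listing = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout
    print(gpu_count_check(8, listing)[1])
```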
## Architecture

| Component | Technology | Port | Access |
|---|---|---|---|
| Frontend (Portal) | React/Node.js | 3000 | Internal/External |
| Backend (Control Plane) | Django | 5000 (container), 80 (service) | External (LoadBalancer) |
| Database | PostgreSQL | - | Internal (StatefulSet/PVC) |
| Configuration | ConfigMaps & Secrets | - | - |
- NVIDIA DCGM Exporter
- AMD SMI Exporter
- Prometheus Node Exporter
- OCI GPU Scanner Active & Passive Health Check Scripts
- Prometheus Schema Converters
- Control Plane Connectors
## Dashboards & Monitoring

After deployment, you'll have access to:
- Grafana: Real-time metrics and health check visualization
- Prometheus: Metrics storage and querying
- Portal: Resource management and on-demand checks
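The collected metrics can also be queried programmatically through Prometheus' standard HTTP API (`/api/v1/query`). `DCGM_FI_DEV_GPU_UTIL` is a standard DCGM exporter metric; the Prometheus URL below is an assumption for your deployment:

```python
# Querying GPU utilization via Prometheus' instant-query HTTP API (sketch).
import json
import urllib.parse
import urllib.request

def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL (/api/v1/query)."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})

if __name__ == "__main__":
    url = instant_query_url("http://prometheus.example.internal:9090",
                            "avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)")
    with urllib.request.urlopen(url) as resp:  # requires access to Prometheus
        for r in json.load(resp)["data"]["result"]:
            print(r["metric"].get("gpu"), r["value"][1])
```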
## Dependencies

This solution uses the following open-source projects:
- Grafana
- Prometheus
- NGINX Ingress Controller for Kubernetes
- PostgreSQL
- NVIDIA DCGM Exporter
- AMD SMI Exporter
- Prometheus Node Exporter
## Roadmap

We're actively developing new features. Want something specific? Open an issue or contact us below!
- Multi-Node NCCL/RCCL testing
- PyTorch FSDP multi-node training with RDMA
- Low-priority Kubernetes job auto-scheduling for active checks
- B200 NVLink & InfiniBand MPI validations
- Public and private domain access for ingress controller
- Simplified OCI tenancy policy options
- Advanced Grafana boards with K8s job filtering
- Deployment via OCI Console
- OKE Node Problem Detector integration for taints
- Auto-remediation controller for self-healing
- GB200 & ARM64 runtime support
- AMD MI355X support
## Limitations

- OS Support: Only Ubuntu Linux-based GPU nodes are supported
- Control Plane: Requires x86 CPU nodes
- Active Checks: Do not run as low-priority jobs; they will disrupt existing GPU workloads
This solution is provided without SLOs or SLAs from Oracle. However, a dedicated team manages the product and will support issues on a best-effort basis. The OCI Compute engineering team maintains the health check scripts.
## Support & Contact

Questions, issues, or feedback?
📧 Contact: amar.gowda@oracle.com