Releases · oracle-quickstart/oci-hpc-oke · GitHub

19 Feb 23:51

OKE RDMA Quickstart Resource Manager template v26.2.0 Latest

What's Changed

Update readme by @OguzPastirmaci in #87
Update path for OCI CLI in helm-deployment.tf by @OguzPastirmaci in #89
Fix ubuntu repo by @robo-cap in #92
Issue number: 90 - FSS mount on all worker nodes by @subburamoracle in #91
Install OKE node client packages from local repo if it exists by @OguzPastirmaci in #93
Improve ons-webhook resiliency by @robo-cap in #94
Add retry function to cloud init by @OguzPastirmaci in #95
Module fixes and improvements by @robo-cap in #96
Use NSGs instead of SLs for Lustre Service by @robo-cap in #100
Update NPD values file by @OguzPastirmaci in #102
Add NCCL tests manifest for BM.GPU.GB200-v3.4 and update the other manifests to use NCCL 2.29 by @OguzPastirmaci in #103
Add Terratest tests by @OguzPastirmaci in #101
Add the document for replacing the boot volumes of self-managed nodes by @OguzPastirmaci in #106
Update NCCL/RCCL images by @OguzPastirmaci in #107
Add check to wait until kubeconfig exists by @OguzPastirmaci in #108
Add MI355 manifest and update other manifests by @OguzPastirmaci in #109
Move GPU Fryer active health checks to Python by @OguzPastirmaci in #110
Update BM.GPU.MI355X-v1.8.yaml by @OguzPastirmaci in #111
Added support for VM.DenseIO shapes by @shethdhvani in #114
Update replacing node using BVR guide by @OguzPastirmaci in #115
Fix pod logs mount by @robo-cap in #118
Replace Nginx Ingress controller with Contour by @robo-cap in #117
Fix: Set to retentionSize for Prometheus by @sam-andaluri in #119
Update contour helm values by @robo-cap in #120
Add NCCL tests manifest for BM.GPU.GB300.4 by @OguzPastirmaci in #121
Update BM.GPU.GB300.4.yaml by @OguzPastirmaci in #122
Add cloud-shell support to the BVR script by @robo-cap in #123
Remove BV high storage class by @OguzPastirmaci in #126
Add option to change services CIDR by @OguzPastirmaci in #127
Add NCCL tests 2.29.3 images by @OguzPastirmaci in #124
Update Node Problem Detector checks by @OguzPastirmaci in #130
Add an option to the OKE stack to use an existing Dynamic Group by @subburamoracle in #105
Bump chart versions by @OguzPastirmaci in #131
Add per-pool kubernetes version, max pods, and node cycling by @OguzPastirmaci in #128
Larger CIDR to accommodate more nodes by @OguzPastirmaci in #129
BugFix: Fix alert webhook to reduce chances of duplicate alerts by @sam-andaluri in #133
Set kube-proxy to use IPVS and several small tweaks by @robo-cap in #132
Increase DCGM Exporter memory limits by @OguzPastirmaci in #134
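
Several entries above deal with tolerating transient failures during node bootstrap (the cloud-init retry function in #95, the kubeconfig wait in #108). As a rough illustration of the general retry-with-backoff pattern only, and not the code from those pull requests, a minimal Python sketch might look like the following. The function name `retry`, its parameters, and the injectable `sleep` argument are all hypothetical.

```python
import time


def retry(func, attempts=5, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call func() until it succeeds or the attempts are exhausted.

    Hypothetical helper for illustration; not taken from the repository.
    The wait between attempts starts at `delay` seconds and is multiplied
    by `backoff` after every failed attempt.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:  # a real script would catch narrower errors
            last_error = exc
            if attempt < attempts:
                sleep(delay)      # wait before the next attempt
                delay *= backoff  # grow the wait for the following one
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

A caller would wrap a flaky step, for example `retry(lambda: download_package(url))`, where `download_package` stands in for whatever operation may fail transiently. Passing `sleep` as a parameter keeps the helper easy to test without real delays.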

New Contributors

@subburamoracle made their first contribution in #91
@shethdhvani made their first contribution in #114
@sam-andaluri made their first contribution in #119

Full Changelog: v25.11.0...v26.2.0

05 Nov 08:00

OKE RDMA Quickstart Resource Manager template v25.11.0

Add option to install OCIR credential helper
Fix for Metrics Server
Add support to use image URIs

Full Changelog: v25.10.0...v25.11.0

30 Oct 19:01

OKE RDMA Quickstart Resource Manager template v25.10.0

Kubernetes upgrade: Added support for Kubernetes v1.34
Documentation: New guide — Deploying Prometheus & Grafana Stack with Dashboards and Alerts manually
Health checks:
- Added RCCL tests
- Added ROCm Validation Suite (RVS) gst_single for AMD validation
Grafana access link: Default domain updated to endpoint.oci-hpc.ai, configurable for custom domains
Component updates: Refreshed dependencies and minor fixes across the stack

Full Changelog: v25.9.0...v25.10.0

25 Sep 19:33

OKE RDMA Quickstart Resource Manager template v25.9.0

Option to provision a shared Lustre file system and a PV backed by the Lustre file system
Fully private clusters using Resource Manager Private Endpoint for deployment
Same dashboards and notifications as the Slurm stack
Option to use Oracle Linux for non-RDMA pools
Component updates

18 Jun 23:13

OKE RDMA Quickstart Resource Manager template v25.5.1

This is a hotfix release that addresses the breaking change in the Helm provider.

More info about the change here: hashicorp/terraform-provider-helm#1637

16 May 05:32

OKE RDMA Quickstart Resource Manager template v25.5.0

Added AMD Device Metrics Exporter
Added AMD dashboards

22 Apr 04:20

OKE RDMA Quickstart Resource Manager template v25.4.0

Added Kubernetes v1.32
Changed the default number of maximum pods per node to 110
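
When VCN-native pod networking is in use, each pod takes an IP address from a VCN subnet, so the maximum pods per node setting bears directly on how many nodes a pod subnet can hold. The arithmetic can be sketched as below. This is a rough upper bound under stated assumptions: the CIDRs are made-up examples, the function name is hypothetical, the default of 3 reserved addresses per subnet is an assumption about OCI subnets, and per-shape VNIC limits are ignored.

```python
import ipaddress


def max_nodes_for_pod_subnet(cidr, max_pods_per_node=110, reserved=3):
    """Rough upper bound on nodes whose pods fit into one pod subnet.

    Hypothetical helper for illustration. `reserved` is the number of
    addresses assumed to be unavailable in the subnet.
    """
    subnet = ipaddress.ip_network(cidr)
    usable = max(subnet.num_addresses - reserved, 0)
    # Each node is assumed to need one address per potential pod.
    return usable // max_pods_per_node


if __name__ == "__main__":
    for cidr in ("10.0.32.0/19", "10.0.0.0/16"):  # example CIDRs only
        print(cidr, max_nodes_for_pod_subnet(cidr))
```

Under these assumptions a /19 pod subnet fits at most 74 nodes at 110 pods each, while a /16 fits 595, which is why subnet sizing and the per-node pod limit are best chosen together.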

31 Mar 04:54

OKE RDMA Quickstart Resource Manager template v25.3.1

OKE AMD GPU device plugin is enabled for BM.GPU.MI300X.8 shape
OKE DCGM Exporter is disabled (upstream DCGM Exporter is deployed)
Helm fix for Grafana load balancer not being deleted properly on Terraform destroy
Updated the health checks for Node Problem Detector
Updated Grafana dashboards
Added the required policies for Oracle Cloud Agent GPU/RDMA monitoring

18 Mar 20:49

OKE RDMA Quickstart Resource Manager template v25.3.0

VCN-native pod networking is now the default option for pod networking instead of Flannel.
Node Problem Detector is now deployed as part of the stack and integrated with the Prometheus/Grafana stack for alerting.
Switched to using the upstream OKE Terraform module.

03 Mar 18:14

OKE RDMA Quickstart Resource Manager template v25.3.0-beta Pre-release

VCN-native pod networking is now the default option for pod networking instead of Flannel.
Node Problem Detector is now deployed as part of the stack.
Fixed a Node Exporter issue preventing metrics from being streamed from bare metal GPU nodes.
