Releases: oracle-quickstart/oci-hpc-oke
OKE RDMA Quickstart Resource Manager template v26.2.0
What's Changed
- Update readme by @OguzPastirmaci in #87
- Update path for OCI CLI in helm-deployment.tf by @OguzPastirmaci in #89
- Fix ubuntu repo by @robo-cap in #92
- Issue number: 90 - fss mount on all worker nodes by @subburamoracle in #91
- Install OKE node client packages from local repo if it exists by @OguzPastirmaci in #93
- Improve ons-webhook resiliency by @robo-cap in #94
- Add retry function to cloud init by @OguzPastirmaci in #95
- Module fixes and improvements by @robo-cap in #96
- Use NSGs instead of SLs for Lustre Service by @robo-cap in #100
- Update NPD values file by @OguzPastirmaci in #102
- Add NCCL tests manifest for BM.GPU.GB200-v3.4 and update the other manifests to use NCCL 2.29 by @OguzPastirmaci in #103
- Add Terratest tests by @OguzPastirmaci in #101
- Add the document for replacing the boot volumes of self-managed nodes by @OguzPastirmaci in #106
- Update NCCL/RCCL images by @OguzPastirmaci in #107
- Add check to wait until kubeconfig exists by @OguzPastirmaci in #108
- Add MI355 manifest and update other manifests by @OguzPastirmaci in #109
- Move GPU Fryer active health checks to Python by @OguzPastirmaci in #110
- Update BM.GPU.MI355X-v1.8.yaml by @OguzPastirmaci in #111
- added support for VM.DenseIO shapes by @shethdhvani in #114
- Update replacing node using BVR guide by @OguzPastirmaci in #115
- Fix pod logs mount by @robo-cap in #118
- Replace Nginx Ingress controller with Contour by @robo-cap in #117
- Fix: Set to retentionSize for Prometheus by @sam-andaluri in #119
- Update contour helm values by @robo-cap in #120
- Add NCCL tests manifest for BM.GPU.GB300.4 by @OguzPastirmaci in #121
- Update BM.GPU.GB300.4.yaml by @OguzPastirmaci in #122
- Add cloud-shell support to the BVR script by @robo-cap in #123
- Remove BV high storage class by @OguzPastirmaci in #126
- Add option to change services CIDR by @OguzPastirmaci in #127
- Add NCCL tests 2.29.3 images by @OguzPastirmaci in #124
- Update Node Problem Detector checks by @OguzPastirmaci in #130
- Add an option to the OKE stack to use an existing Dynamic Group by @subburamoracle in #105
- Bump chart versions by @OguzPastirmaci in #131
- Add per-pool kubernetes version, max pods, and node cycling by @OguzPastirmaci in #128
- Larger CIDR to accommodate more nodes by @OguzPastirmaci in #129
- BugFix: Fix alert webhook to reduce chances of duplicate alerts by @sam-andaluri in #133
- set kubeproxy to use ipvs & several small tweaks by @robo-cap in #132
- Increase DCGM Exporter memory limits by @OguzPastirmaci in #134
New Contributors
- @subburamoracle made their first contribution in #91
- @shethdhvani made their first contribution in #114
- @sam-andaluri made their first contribution in #119
Full Changelog: v25.11.0...v26.2.0
OKE RDMA Quickstart Resource Manager template v25.11.0
- Add option to install OCIR credential helper
- Fix for Metrics Server
- Add support to use image URIs
Full Changelog: v25.10.0...v25.11.0
OKE RDMA Quickstart Resource Manager template v25.10.0
- Kubernetes upgrade: Added support for Kubernetes v1.34
- Documentation: New guide — Deploying Prometheus & Grafana Stack with Dashboards and Alerts manually
- Health checks:
  - Added RCCL tests
  - Added ROCm Validation Suite (RVS) gst_single for AMD validation
- Grafana access link: Default domain updated to endpoint.oci-hpc.ai, configurable for custom domains
- Component updates: Refreshed dependencies and minor fixes across the stack
Full Changelog: v25.9.0...v25.10.0
OKE RDMA Quickstart Resource Manager template v25.9.0
- Option to provision a shared Lustre file system and a PV backed by the Lustre file system
- Fully private clusters using Resource Manager Private Endpoint for deployment
- Same dashboards and notifications as the Slurm stack
- Option to use Oracle Linux for non-RDMA pools
- Component updates
OKE RDMA Quickstart Resource Manager template v25.5.1
This is a hotfix release to fix the breaking Helm provider change.
More info about the change here: hashicorp/terraform-provider-helm#1637
OKE RDMA Quickstart Resource Manager template v25.5.0
- Added AMD Device Metrics Exporter
- Added AMD dashboards
OKE RDMA Quickstart Resource Manager template v25.4.0
- Added Kubernetes v1.32
- Changed the default number of maximum pods per node to 110
OKE RDMA Quickstart Resource Manager template v25.3.1
- OKE AMD GPU device plugin is enabled for BM.GPU.MI300X.8 shape
- OKE DCGM Exporter is disabled (upstream DCGM Exporter is deployed)
- Helm fix for Grafana load balancer not being deleted properly on Terraform destroy
- Updated the health checks for Node Problem Detector
- Updated Grafana dashboards
- Added the required policies for Oracle Cloud Agent GPU/RDMA monitoring
OKE RDMA Quickstart Resource Manager template v25.3.0
- VCN-native pod networking is now the default option for pod networking instead of Flannel.
- Node Problem Detector is now deployed as part of the stack and integrated with the Prometheus/Grafana stack for alerting.
- Switched to using the upstream OKE Terraform module.
OKE RDMA Quickstart Resource Manager template v25.3.0-beta
- VCN-native pod networking is now the default option for pod networking instead of Flannel.
- Node Problem Detector is now deployed as part of the stack.
- Fixed a Node Exporter issue preventing metrics from being streamed from bare metal GPU nodes.