redhat-ai-americas/rhoai-argo

rhoai-argo

Deployment of Red Hat OpenShift AI (RHOAI) and its required infrastructure stack using Helm and ArgoCD.

Automation Architecture

| Sync Wave | Description | Resources |
|---|---|---|
| 0 | Namespaces | All operators |
| 5 | RHOAI Dependencies & Utilities | job-set-operator, cma-operator, cert-manager, leader-worker-set, Kueue, SR-IOV, OpenTelemetry, Tempo, ClusterObservability, kmm |
| 7 | Configs | cluster-job-set, cma-controller |
| Checkpoint | | |
| 10 | GPU Dependencies & Hardware Operators | nfd-operator |
| 15 | Configs | nfd-instance |
| 20 | NVIDIA GPU Operator | gpu-operator |
| 25 | GPU Cluster Policy | gpu-clusterpolicy |
| Checkpoint | | |
| 30 | RHOAI Operator Group + Subscription | rhoai-operator |
| 32 | RHOAI Deployment | operator-deployment |
| 33 | DataScienceCluster configuration | datasciencecluster |
| 35 | RHOAI dashboard configuration | odhdashboardconfig |

🚀 Getting Started

To initialize RHOAI, install OpenShift GitOps, configure permissions, and trigger the App-of-Apps deployment.

0. Clone the Repository

git clone https://github.com/redhat-ai-americas/rhoai-argo.git
cd rhoai-argo

1. Prepare OpenShift GitOps (~60 seconds)

  • Checks whether the OpenShift GitOps Subscription exists and applies it if it is missing, then applies the Service Account permissions. Once the operator is fully ready, applies the ArgoCD instance configuration.
# 1. Install Operator (only if missing) & Permissions
oc get subscription openshift-gitops-operator -n openshift-operators &>/dev/null || oc apply -f gitops-config/openshift-gitops-subscription.yaml
oc apply -f gitops-config/gitops-permission.yaml

echo "⏳ Finalizing OpenShift GitOps environment..."

# 2. Wait silently until the openshift-gitops-server Deployment reports Available
until oc wait deployment/openshift-gitops-server -n openshift-gitops --for=condition=Available --timeout=10s &>/dev/null; do sleep 5; done

# 3. Apply custom health checks and enable the sidebar GitOps tab
oc apply --server-side --force-conflicts -f gitops-config/argocd-instance.yaml
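The wait loop above retries forever if the operator never becomes ready. As a minimal sketch, the same `oc wait` check can be wrapped in a function with an overall deadline. The function name `wait_for_deployment` and its default values are illustrative assumptions, not part of this repository.

```shell
# Wait for a Deployment to become Available, giving up after an overall deadline.
# Usage: wait_for_deployment <name> <namespace> [max_seconds] [poll_seconds]
# Hypothetical helper; the defaults below are guesses, tune them for your cluster.
wait_for_deployment() {
  name=$1
  namespace=$2
  max_seconds=${3:-300}
  poll_seconds=${4:-5}
  start=$(date +%s)

  # Same check as the loop above: each attempt blocks for at most 10 seconds.
  until oc wait "deployment/$name" -n "$namespace" \
      --for=condition=Available --timeout=10s >/dev/null 2>&1; do
    now=$(date +%s)
    if [ $((now - start)) -ge "$max_seconds" ]; then
      echo "Timed out waiting for deployment/$name in $namespace" >&2
      return 1
    fi
    sleep "$poll_seconds"
  done
}
```

For example, `wait_for_deployment openshift-gitops-server openshift-gitops 300` returns non-zero after roughly five minutes instead of hanging, which is easier to handle in a script or pipeline.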

📦 2. Trigger "App-of-Apps" Deployment

  • Apply the YAML file for the App-of-Apps pattern, whose rollout order is controlled via sync waves.

Installation (Manual Approval)

  • Operators will require manual approval for any version upgrades in the OpenShift Console.
oc apply -f app-of-apps.yaml

Approve InstallPlans

Note

You must approve the InstallPlan requests as the operators attempt to install. This prevents automatic updates from affecting your AI workloads. To use automatic updates instead, first push the contents of this repository to an empty repository of your own and update the app templates to point to its Git URL. Then change the value at the end of the first line of argocd-applications/values.yaml from Manual to Automatic.
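The edit to argocd-applications/values.yaml can be scripted. The sketch below only assumes what the note states, namely that the first line of the file ends in `Manual`. The key name on that line is not shown here, so the substitution matches the trailing value rather than a specific key. The function name `set_automatic_approval` is hypothetical.

```shell
# Rewrite a trailing "Manual" on line 1 of a values file to "Automatic".
# Only the first line is touched; every other line is copied through unchanged.
set_automatic_approval() {
  values_file=$1
  tmp_file=$(mktemp) || return 1

  # "1s/.../" limits the substitution to line 1;
  # "[[:space:]]*$" tolerates trailing whitespace after the value.
  sed '1s/Manual[[:space:]]*$/Automatic/' "$values_file" > "$tmp_file" &&
    cat "$tmp_file" > "$values_file"
  status=$?

  rm -f "$tmp_file"
  return "$status"
}
```

Run it as `set_automatic_approval argocd-applications/values.yaml` from the repository root, then review the change with `git diff` before committing and pushing it to your own repository.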

  1. In the OpenShift Console, navigate to Home > Search.
  2. In the search bar, where it says Resources, type "InstallPlan" and select that resource type.
  3. Click the InstallPlan name in the first column (shown as install-xxxxx), then click Preview InstallPlan > Approve. (Or use the tip below.)
  4. To get back to the list, click InstallPlans in the path at the top left. Make sure All Projects is selected in the project dropdown menu at the top of the screen so that every InstallPlan is listed.

Tip

Bulk Approval: To approve all currently waiting InstallPlans at once, run:

oc get installplan -A --no-headers | grep "false" | awk '{print $1, $2}' | xargs -L1 sh -c 'oc patch installplan $1 -n $0 --type merge -p "{\"spec\":{\"approved\":true}}"'

Rolling Approval: To approve all pending and future InstallPlans, run:

oc get installplan -A -w -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase --no-headers | while read -r namespace name phase; do
  if [ "$phase" = "RequiresApproval" ]; then
    echo "Detected ready InstallPlan: $name in $namespace. Approving..."
    oc patch installplan "$name" -n "$namespace" --type merge -p '{"spec":{"approved":true}}'
  fi
done
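Before approving anything in bulk, it can help to see which InstallPlans are actually waiting. The following sketch reuses the custom-columns output from the rolling approval loop and only prints the pending entries, without patching them. The function name `pending_installplans` is an illustrative assumption.

```shell
# Print "namespace name" for every InstallPlan whose phase is RequiresApproval.
# Read-only: nothing is approved or modified.
pending_installplans() {
  oc get installplan -A --no-headers \
    -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase |
    awk '$3 == "RequiresApproval" { print $1, $2 }'
}
```

Running `pending_installplans` gives a quick dry-run list. Matching on the phase column avoids accidentally selecting rows where a text search for "false" matches part of a name.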

🖥️ 3. Monitor ArgoCD

  • The ArgoCD dashboard is available via the Waffle Menu in the OpenShift Console header. Alternatively, retrieve the URL directly:
# Get the ArgoCD URL
oc get route openshift-gitops-server -n openshift-gitops -o jsonpath='{.spec.host}'

Wait for the rhoai-deployment ArgoCD application to reach a Healthy state, and... Enjoy using RHOAI!
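The health state can also be checked from the command line instead of the dashboard. This is a minimal sketch that reads `.status.health.status` from the Application resource and polls until it reports Healthy. It assumes the rhoai-deployment Application lives in the openshift-gitops namespace; the function names and default intervals are hypothetical.

```shell
# Print the health status of an ArgoCD Application (e.g. Healthy, Progressing, Degraded).
# Assumption: Applications are created in the openshift-gitops namespace.
app_health() {
  app=$1
  namespace=${2:-openshift-gitops}
  oc get applications.argoproj.io "$app" -n "$namespace" \
    -o jsonpath='{.status.health.status}'
}

# Poll until the Application reports Healthy, or give up after max_attempts polls.
# Usage: wait_until_healthy <app> [max_attempts] [poll_seconds]
wait_until_healthy() {
  app=$1
  max_attempts=${2:-60}
  poll_seconds=${3:-10}
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    status=$(app_health "$app" 2>/dev/null)
    echo "[$attempt/$max_attempts] $app health: ${status:-unknown}"
    [ "$status" = "Healthy" ] && return 0
    sleep "$poll_seconds"
    attempt=$((attempt + 1))
  done
  return 1
}
```

For example, `wait_until_healthy rhoai-deployment` polls every 10 seconds for up to 60 attempts. Pending InstallPlans still need approval while this runs, otherwise the Application is unlikely to become Healthy.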


πŸ› οΈ 4. Configure Hardware (Optional)

If your workloads require GPU acceleration, navigate to the hardware-profile/ directory. This folder contains automation scripts and documentation to streamline your GPU setup rather than relying on manual cluster configurations.

Please see the Hardware Profile Guide for step-by-step instructions on how to do any of the following:

  1. Provision GPU Nodes: Automatically create a GPU MachineSet with the correct AWS instance types, labels, and taints.
  2. Install GPU Dashboards: Deploy the NVIDIA DCGM Exporter dashboard directly to your OpenShift web console.
  3. Enable Time-Slicing: Dynamically partition physical GPUs into virtual replicas so multiple workloads can share resources.
  4. Register the Hardware Profile: The final RHOAI dashboard steps to make the GPUs selectable for deployments and workbenches.
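After working through the Hardware Profile Guide, one way to confirm that GPUs are visible to the scheduler is to read the allocatable `nvidia.com/gpu` resource on each node. This is a sketch, not part of the hardware-profile/ scripts, and the function names are hypothetical.

```shell
# List each node name with its allocatable nvidia.com/gpu count, tab-separated.
# The count is blank for nodes that expose no GPUs.
gpu_allocatable() {
  oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
}

# Sum the per-node counts into a single cluster-wide total.
total_gpus() {
  gpu_allocatable | awk -F '\t' '{ total += $2 } END { print total + 0 }'
}
```

`gpu_allocatable` shows the per-node breakdown and `total_gpus` prints one number. With time-slicing enabled, the reported count may reflect the virtual replicas rather than the physical GPUs.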

RHOAI 3.x Dependencies

| Operator | Description |
|---|---|
| Job-set | Kubernetes-native API for managing groups of jobs as a unit |
| openshift-custom-metrics-autoscaler | Custom Metrics Autoscaler Operator |
| cert-manager | Cert-manager Operator for Red Hat OpenShift |
| Leader Worker Set | LWS-rhel9 Operator |
| Red Hat Connectivity Link | Unified framework for multicloud connectivity and API management |
| Kueue (RHBOK) | Red Hat Build of Kueue |
| SR-IOV | Single Root I/O Virtualization (SR-IOV) Network Operator |
| GPU Operator | NVIDIA GPU provisioning and management |
| OpenTelemetry | Vendor-neutral observability framework for metrics, logs, and traces |
| Tempo | High-scale distributed tracing backend |
| Cluster Observability | Standalone monitoring stacks for independent service configuration |

References

Repository Structure

rhoai-argo/
├── app-of-apps.yaml
├── README.md
├── gitops-config/
│   ├── gitops-permission.yaml
│   └── openshift-gitops-subscription.yaml
├── argocd-applications/
│   ├── Chart.yaml
│   └── templates/
│       ├── gpu-operator.yaml
│       ├── infrastrcuture-operators.yaml
│       ├── observability-operators.yaml
│       ├── rhoai-application.yaml
│       └── scaling-operators.yaml
└── helm/
    ├── rhoai-stack/
    │   ├── Chart.yaml
    │   ├── values.yaml
    │   └── templates/
    │       ├── configs/
    │       │   ├── 32-operator-deployment.yaml
    │       │   ├── 33-datasciencecluster.yaml
    │       │   ├── 35-dashboard-deployment.yaml
    │       │   └── 35-odhdashboardconfig.yaml
    │       └── operators/
    │           └── 30-rhoai-operator.yaml
    ├── workload-scaling/
    │   ├── Chart.yaml
    │   └── templates/
    │       ├── configs/
    │       │   ├── 07-cluster-job-set.yaml
    │       │   └── 07-cma-controller.yaml
    │       └── operators/
    │           ├── 05-cma-operator.yaml
    │           ├── 05-job-set.yaml
    │           ├── 05-leader-worker-set.yaml
    │           └── 05-rhbok.yaml
    ├── gpu-operator-installation/
    │   ├── Chart.yaml
    │   ├── values.yaml
    │   └── templates/
    │       ├── configs/
    │       │   ├── 15-nfd-instance.yaml
    │       │   └── 25-gpu-clusterpolicy.yaml
    │       └── operators/
    │           ├── 10-nfd-operator.yaml
    │           └── 20-gpu-operator.yaml
    ├── infrastructure-utilities/
    │   ├── Chart.yaml
    │   └── templates/
    │       ├── configs/
    │       └── operators/
    │           ├── 05-kmm.yaml
    │           ├── 05-rhcl.yaml
    │           └── 05-sriov.yaml
    └── observability-stack/
        ├── Chart.yaml
        └── templates/
            ├── configs/
            └── operators/
                ├── 05-cluster-observability.yaml
                ├── 05-open-telemetry.yaml
                └── 05-tempo.yaml
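The two-digit prefixes on the template file names appear to mirror the sync waves in the Automation Architecture table (for example, 30-rhoai-operator.yaml and wave 30). Assuming that convention holds, this sketch lists the templates in rollout order by reading the prefix from each file name. It does not inspect the sync-wave annotations inside the manifests, and the function name `templates_by_wave` is hypothetical.

```shell
# Print "<prefix> <path>" for each template whose file name starts with a
# two-digit prefix, sorted numerically by that prefix.
# Usage: templates_by_wave [root_dir]   (defaults to ./helm)
templates_by_wave() {
  root=${1:-helm}
  find "$root" -type f -name '[0-9][0-9]-*.yaml' |
    while IFS= read -r path; do
      file=${path##*/}     # strip the directory part
      wave=${file%%-*}     # keep everything before the first hyphen
      printf '%s %s\n' "$wave" "$path"
    done |
    sort -n
}
```

Run `templates_by_wave` from the repository root to get a flat, ordered view across all charts, which can be easier to scan than the tree when checking what deploys before a checkpoint.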
