This guide explains how to adapt saif-ai-pod for your environment.
The project uses a data-driven approach where cluster-specific values are defined in a single location and templates generate the rest.
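The flow can be sketched as follows — a minimal illustration only, using dict literals that mimic what `cluster-mappings.yaml` would parse into (the real renderer is `scripts/render-cluster-config.py`; the merge shown here is an assumption about the approach, not the project's actual code):

```python
# Sketch of the data-driven flow: shared infrastructure values plus one
# cluster's entry are merged into a single context for the templates.
# The dicts mimic a parsed cluster-mappings.yaml (hypothetical shape).

mappings = {
    "infrastructure": {
        "base_domain": "example.com",
        "gateway": "192.168.1.1",
    },
    "clusters": [
        {"name": "my-cluster-1", "ip": "192.168.1.100"},
    ],
}

def render_context(mappings: dict, cluster_name: str) -> dict:
    """Merge the shared infrastructure values with one cluster's entry."""
    cluster = next(c for c in mappings["clusters"] if c["name"] == cluster_name)
    return {**mappings["infrastructure"], **cluster}

ctx = render_context(mappings, "my-cluster-1")
print(ctx["base_domain"], ctx["ip"])  # → example.com 192.168.1.100
```
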
File: `openshift/cluster-mappings.yaml`
This is the primary configuration file. Define:
- Cluster names
- Node IPs
- Server serial numbers (from Intersight)
- UCS profile names
Example:

```yaml
clusters:
  - name: my-cluster-1
    ucs_profile: my-ucs-profile
    serial: FCH12345678
    ip: 192.168.1.100
    hostname: my-cluster-1.example.com
    interface: eth0
    installation_disk: /dev/sda
```

File: `openshift/cluster-mappings.yaml` (infrastructure section)
Define values common to all clusters:
- Base domain
- DNS servers
- Gateway
- Registry hostname
- Network CIDR
Example:

```yaml
infrastructure:
  base_domain: example.com
  vlan: 100
  network_cidr: 192.168.1.0/24
  gateway: 192.168.1.1
  dns_servers:
    - 192.168.1.10
    - 192.168.1.11
  registry:
    hostname: registry.example.com
    port: 5000
```

Configure these secrets in your GitHub repository settings:
| Secret | Source |
|---|---|
| `INTERSIGHT_API_KEY_ID` | Intersight Console → API Keys |
| `INTERSIGHT_SECRET_KEY` | Intersight Console → API Keys (PEM file contents) |
| `REDHAT_PULL_SECRET` | console.redhat.com → Downloads → Pull Secret |
| `SSH_PUBLIC_KEY` | Your SSH public key for node access |
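In a workflow step these secrets arrive as environment variables. A hedged sketch of a pre-flight check (the secret names match the table; the helper itself is illustrative, not part of the project):

```python
# Verify that all required secrets are present before running a pipeline.
# In CI you would pass os.environ; a partial dict is used here for clarity.
REQUIRED_SECRETS = [
    "INTERSIGHT_API_KEY_ID",
    "INTERSIGHT_SECRET_KEY",
    "REDHAT_PULL_SECRET",
    "SSH_PUBLIC_KEY",
]

def missing_secrets(env: dict) -> list:
    """Return the names of required secrets that are absent or empty."""
    return [name for name in REQUIRED_SECRETS if not env.get(name)]

print(missing_secrets({"INTERSIGHT_API_KEY_ID": "abc123"}))
# → ['INTERSIGHT_SECRET_KEY', 'REDHAT_PULL_SECRET', 'SSH_PUBLIC_KEY']
```
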
Directory: `ucs/data/`
Customize if your hardware differs:
- Pool sizes (MAC, UUID, IP)
- VLAN IDs
- Storage policies (RAID vs no-RAID)
- Server templates
Create these records for each cluster:
| Record | Type | Value |
|---|---|---|
| `api.<cluster>.<domain>` | A | Node IP |
| `api-int.<cluster>.<domain>` | A | Node IP |
| `*.apps.<cluster>.<domain>` | A | Node IP |
| `registry.<domain>` | A | Runner VM IP |
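The required record set above is mechanical, so it can be generated. A small sketch (the runner IP `192.168.1.50` is a made-up example value):

```python
# Build the (record, type, value) tuples for one cluster, matching the
# DNS table above: api, api-int, and the *.apps wildcard point at the
# node IP; the registry record points at the runner VM.
def required_records(cluster: str, domain: str, node_ip: str, runner_ip: str):
    return [
        (f"api.{cluster}.{domain}", "A", node_ip),
        (f"api-int.{cluster}.{domain}", "A", node_ip),
        (f"*.apps.{cluster}.{domain}", "A", node_ip),
        (f"registry.{domain}", "A", runner_ip),
    ]

for record, rtype, value in required_records(
        "my-cluster-1", "example.com", "192.168.1.100", "192.168.1.50"):
    print(f"{record}  {rtype}  {value}")
```
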
1. Fork/clone this repository
2. Edit `openshift/cluster-mappings.yaml` with your values
3. Configure GitHub secrets in your repository
4. Create DNS records for your clusters
5. Update the Intersight organization name in `ucs/data/*/global_settings.ezi.yaml`
6. Verify server serial numbers match your hardware
7. Run UCS validation: `gh workflow run ucs/pipeline.yaml -f action=DRY-RUN`
8. Run OpenShift validation: `gh workflow run openshift/pipeline.yaml -f action=DRY-RUN`
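Before the DRY-RUN steps, a quick local check that each cluster entry carries every field the example requires can save a pipeline round-trip. A minimal sketch (the key list is taken from the example entry above; this is not the project's actual validator):

```python
# Required keys for one clusters[] entry, per the example in this guide.
REQUIRED_KEYS = {"name", "ucs_profile", "serial", "ip",
                 "hostname", "interface", "installation_disk"}

def validate_cluster(entry: dict) -> list:
    """Return the required keys missing from one clusters[] entry."""
    return sorted(REQUIRED_KEYS - entry.keys())

incomplete = {"name": "my-cluster-1", "serial": "FCH12345678"}
print(validate_cluster(incomplete))
# → ['hostname', 'installation_disk', 'interface', 'ip', 'ucs_profile']
```
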
To find the correct values for your environment:
| Value | How to Find |
|---|---|
| Server serials | Intersight → Servers → Serial column |
| MAC addresses | Generated by UCS workflow, stored in `cluster-macs.yaml` |
| Interface names | Boot a live ISO on hardware, run `ip link show` |
| Disk paths | Boot a live ISO, run `lsblk` |
- Edit `openshift/cluster-mappings.yaml`:
  - Update `network_cidr`
  - Update `gateway`
  - Update cluster `ip` values
- Re-render configs: `python scripts/render-cluster-config.py <cluster>`
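After changing the network, it is easy to leave a gateway or node IP outside the new CIDR. The standard-library `ipaddress` module can catch that before re-rendering (a hypothetical helper, not part of the project):

```python
import ipaddress

def check_addressing(network_cidr: str, gateway: str, cluster_ips: list) -> list:
    """Return any addresses that fall outside the given network_cidr."""
    net = ipaddress.ip_network(network_cidr)
    candidates = [gateway, *cluster_ips]
    return [ip for ip in candidates if ipaddress.ip_address(ip) not in net]

# An empty list means the gateway and node IPs are consistent with the CIDR.
print(check_addressing("192.168.1.0/24", "192.168.1.1", ["192.168.1.100"]))
# → []
```
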
- Edit `openshift/cluster-mappings.yaml`:
  - Update `base_domain`
- Create DNS records for the new domain
- Re-render configs
- Add an entry to `openshift/cluster-mappings.yaml`
- Run the UCS workflow to deploy the profile and capture the MAC
- Create DNS records
- Run the renderer: `python scripts/render-cluster-config.py <new-cluster>`
- Commit the rendered configs
- Run the OpenShift workflow
- Update `openshift_version` in the cluster entry
- Update `mirror/images-list.yaml` with the new version
- Run `sync-images.yaml` to mirror the new version
- Re-render configs
- Edit the cluster entry in `cluster-mappings.yaml`:

```yaml
gpu:
  model: NVIDIA A100
  count: 2
```

- Ensure the BIOS policy has Secure Boot disabled
- The GPU Operator will auto-detect and configure the GPUs
Workflows use a pre-built container with all tools. To customize:
- Fork `github-actions-container`
- Modify Dockerfile with your tools
- Build and push to your registry
- Update workflow files to reference your image
Follow the existing pattern:
- The validate workflow reads from `cluster-mappings.yaml`
- The deploy workflow takes `cluster_name` and `action` inputs
- The test workflow verifies the deployment
```shell
# Check cluster-mappings.yaml syntax
python scripts/render-cluster-config.py <cluster> --dry-run

# Verify Intersight credentials
isctl get compute physical-summary --top 1
```

```shell
# Test from your network
nslookup api.<cluster>.<domain>

# Verify the DNS server is authoritative
dig +trace api.<cluster>.<domain>
```

```shell
# Run a UCS deploy to capture the MAC
gh workflow run ucs/pipeline.yaml -f cluster_name=<cluster> -f action=DEPLOY

# Check the generated file
cat openshift/cluster-macs.yaml
```

The container registry stores mirrored images for air-gapped installation. Two options are supported:
Image: `registry:2`
Pros:
- Simple to deploy and operate
- No authentication required (simpler pull secret)
- Lightweight resource requirements
Cons:
- No built-in UI
- No image pruning/garbage collection by default
- Not officially supported by Red Hat
Configuration:

```yaml
infrastructure:
  registry:
    hostname: registry.example.com
    port: 5000
    tls: true  # Self-signed certificate with additionalTrustBundle
```

Image: mirror-registry (Quay-based)
Pros:
- Official Red Hat supported approach
- Built-in web UI
- Image pruning and garbage collection
- Integrated with oc-mirror workflows
Cons:
- Requires authentication setup
- Higher resource requirements
- More complex initial configuration
When to use each:
- Development/Lab: Docker registry:2 (simpler)
- Production/CVD: Red Hat mirror-registry (supported)
The registry must be accessible via HTTPS for OpenShift installation. Three approaches are available:
Current approach in this project.
Requirements:
- CA certificate and key
- `additionalTrustBundle` in install-config.yaml
- `additionalTrustBundlePolicy: Always` to ensure the CA is injected
Configuration in install-config.yaml:

```yaml
additionalTrustBundlePolicy: Always
additionalTrustBundle: |
  -----BEGIN CERTIFICATE-----
  <your CA certificate content>
  -----END CERTIFICATE-----
```

Advantages:
- No external DNS or internet access required
- Works in fully air-gapped environments
- Full control over certificate lifecycle
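The PEM content must be indented under the `additionalTrustBundle: |` literal scalar for install-config.yaml to parse it. A sketch of automating that step (a hypothetical helper; the truncated certificate body is a placeholder):

```python
import textwrap

def trust_bundle_yaml(pem: str) -> str:
    """Render the additionalTrustBundle snippet with the PEM body
    indented two spaces under the YAML literal block scalar."""
    body = textwrap.indent(pem.strip(), "  ")
    return ("additionalTrustBundlePolicy: Always\n"
            "additionalTrustBundle: |\n" + body + "\n")

pem = "-----BEGIN CERTIFICATE-----\n<your CA certificate content>\n-----END CERTIFICATE-----"
print(trust_bundle_yaml(pem))
```
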
Requirements:
- Public DNS zone with API access (e.g., IONOS, Cloudflare, Route53)
- Registry hostname in public DNS zone
- Network allows outbound HTTPS to Let's Encrypt
Advantages:
- Publicly trusted certificate - no `additionalTrustBundle` needed
- Automatic renewal possible
Setup:

```shell
# Example with IONOS DNS
certbot certonly \
  --authenticator dns-ionos \
  --dns-ionos-credentials /path/to/credentials.ini \
  -d registry.your-public-domain.com
```

Requirements:
- Certificate from internal CA
- CA cert in `additionalTrustBundle`
When to use:
- Enterprise environments with mandatory internal CA
- When aligning with existing PKI infrastructure
True Air-Gap:
- No internet connectivity at any point
- Requires "sneakernet" transfer of images
- Use mirror-to-disk, then disk-to-mirror workflow
Semi-Connected (This Project):
- Internet available during preparation
- Air-gapped during installation
- Direct mirroring to internal registry
All images must be mirrored before installation:
| Category | Images | Config File |
|---|---|---|
| OpenShift Release | ~150 images | mirror/images-list.yaml |
| Red Hat Operators | NFD, etc. | mirror/images-list.yaml |
| Certified Operators | GPU Operator, CLIFE | mirror/images-list.yaml |
| Additional Images | CLIFE, custom images | mirror/images-list.yaml |
```shell
# Mirror to registry directly (semi-connected)
oc-mirror --config images-list.yaml docker://registry:5000

# Mirror to disk (true air-gap)
oc-mirror --config images-list.yaml file:///path/to/disk

# Upload from disk to registry
oc-mirror --from /path/to/disk docker://registry:5000
```

CRITICAL: The registry must be accessible from the target server network during installation.
- Agent boots and needs to pull images
- Verify network path exists before generating ISO
- Test: `curl -v https://registry:5000/v2/_catalog`
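The same reachability check can be scripted, which is useful as a pre-flight step before generating the ISO. A sketch, assuming the registry uses a self-signed certificate (so verification is disabled here, mirroring what `additionalTrustBundle` later solves properly):

```python
import ssl
from urllib.request import urlopen

def catalog_url(hostname: str, port: int = 5000) -> str:
    """Build the Docker Registry v2 catalog endpoint URL."""
    return f"https://{hostname}:{port}/v2/_catalog"

def check_registry(hostname: str, port: int = 5000, timeout: float = 5.0) -> bool:
    """Return True if the registry answers on its v2 catalog endpoint.
    Certificate verification is disabled because the project registry
    uses a self-signed certificate."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urlopen(catalog_url(hostname, port), timeout=timeout, context=ctx) as resp:
            return resp.status == 200
    except OSError:
        return False
```

Run this from the same network segment the agent will boot on; a `False` result before ISO generation means the installation would stall pulling images.
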
Update VLANs in UCS policies:
File: `ucs/data/example/policies/network-policies.ezi.yaml`

```yaml
vlan_policies:
  - name: saif-ai-pod-vlan
    vlans:
      - vlan_list: "100"  # Change to your VLAN
        name: ai-pod-vlan
        auto_allow_on_uplinks: true
```

For ACI-attached UCS-X:
- Create EPG in ACI for cluster VLAN
- Configure static path bindings to FI uplinks
- Ensure ARP flooding and unicast routing are enabled on the Bridge Domain
- See `TOPOLOGY.md` for an example ACI config
For jumbo frames (MTU 9216):
UCS Policy:

```yaml
ethernet_network_control_policies:
  - name: saif-ai-pod-netctrl
    lldp_transmit: true
    lldp_receive: true
```

Agent Config (NMState):

```yaml
networkConfig:
  interfaces:
    - name: eth0
      mtu: 9000
```

The 100G IFMs have specific requirements:
- NO server ports needed - IFM handles server port mapping internally
- NO EthNetworkGroupPolicy on uplinks - VLANs controlled via VLAN policy
- Use `auto_allow_on_uplinks: true` in the VLAN policy
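A quick sanity check for the jumbo-frame pairing above: the host payload MTU (9000 in the NMState config) must fit within the fabric's 9216-byte maximum once Ethernet framing is added. The overhead constant below is an assumption for illustration, not a value from this project:

```python
# Hypothetical headroom: Ethernet header (14) + 802.1Q VLAN tag (4) + FCS (4).
L2_OVERHEAD = 22

def mtu_fits(host_mtu: int, fabric_mtu: int = 9216) -> bool:
    """True if the host payload MTU plus L2 overhead fits the fabric MTU."""
    return host_mtu + L2_OVERHEAD <= fabric_mtu

print(mtu_fits(9000))  # → True: 9000-byte payloads fit a 9216-byte fabric
print(mtu_fits(9216))  # → False: setting the host to the fabric max leaves no headroom
```
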
Different servers may have 1 or 2 M.2 drives:

| Configuration | Storage Policy | Notes |
|---|---|---|
| 2x M.2 disks | `saif-ucsx-storage` | RAID1 for redundancy |
| 1x M.2 disk | `saif-ucsx-storage-noraid` | No RAID possible |
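The mapping in the table can be expressed as a small selection helper, driven by the disk count from the Intersight/iserver inventory (a hypothetical sketch; policy names are the ones defined above):

```python
# Pick a storage policy name from the number of M.2 drives reported
# by the disk inventory, mirroring the table above.
def storage_policy(m2_count: int) -> str:
    if m2_count >= 2:
        return "saif-ucsx-storage"         # RAID1 for redundancy
    if m2_count == 1:
        return "saif-ucsx-storage-noraid"  # no RAID possible
    raise ValueError("no M.2 drive found - cannot select a storage policy")

print(storage_policy(2))  # → saif-ucsx-storage
```
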
Detection: Check Intersight or iserver for the disk inventory before deployment.
Profile Override: Set the storage policy per-profile in `profiles/server.ezi.yaml`:

```yaml
profiles:
  server:
    - name: saif-ai-pod-1
      storage_policy: saif-ucsx-storage-noraid
```

GPU nodes require:
- Secure Boot disabled in BIOS policy
- GPU detected in PCI inventory (verify in Intersight)
- No additional UCS configuration - GPU Operator handles the rest
Non-GPU nodes:
- Can use standard BIOS policy
- Skip GPU Operator deployment
Two templates are provided:

| Template | Use Case |
|---|---|
| `saif-ai-pod-template` | Standard config, RAID M.2 |
| `saif-ai-pod-template-noraid` | Single M.2 disk |
To change a profile's template, detach and reattach:

```shell
# Detach
isctl update server profile moid <MOID> --SrcTemplate 'null'

# Update the profile YAML, then run terraform apply.
# The profile will associate with the new template.
```

For issues with this project:
- Check existing GitHub Issues
- Review Method of Procedure for detailed steps
- Consult `TOPOLOGY.md` for infrastructure reference
- See `mop/troubleshooting.md` for common issues