
Customization Guide

This guide explains how to adapt saif-ai-pod for your environment.

Overview

The project uses a data-driven approach where cluster-specific values are defined in a single location and templates generate the rest.

Files to Customize

1. Cluster Definitions (Required)

File: openshift/cluster-mappings.yaml

This is the primary configuration file. Define:

  • Cluster names
  • Node IPs
  • Server serial numbers (from Intersight)
  • UCS profile names

Example:

clusters:
  - name: my-cluster-1
    ucs_profile: my-ucs-profile
    serial: FCH12345678
    ip: 192.168.1.100
    hostname: my-cluster-1.example.com
    interface: eth0
    installation_disk: /dev/sda
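To catch typos before a pipeline run, the required fields of a cluster entry can be sanity-checked with a short script. This is a hypothetical helper, not part of the repository; the field list mirrors the example above and should be adjusted to the real schema:

```python
import ipaddress
import re

# Required keys for each cluster entry (assumed from the example above --
# adjust to match your actual cluster-mappings.yaml schema).
REQUIRED_KEYS = {"name", "ucs_profile", "serial", "ip", "hostname",
                 "interface", "installation_disk"}

def validate_cluster(entry: dict) -> list:
    """Return a list of problems found in one cluster entry."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - entry.keys())]
    if "ip" in entry:
        try:
            ipaddress.ip_address(entry["ip"])
        except ValueError:
            problems.append(f"invalid ip: {entry['ip']}")
    # Cisco serials are typically 11 alphanumeric characters (e.g. FCH12345678).
    if "serial" in entry and not re.fullmatch(r"[A-Z0-9]{11}", entry["serial"]):
        problems.append(f"suspicious serial: {entry['serial']}")
    return problems

entry = {
    "name": "my-cluster-1", "ucs_profile": "my-ucs-profile",
    "serial": "FCH12345678", "ip": "192.168.1.100",
    "hostname": "my-cluster-1.example.com",
    "interface": "eth0", "installation_disk": "/dev/sda",
}
print(validate_cluster(entry))  # → []
```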

2. Shared Infrastructure (Required)

File: openshift/cluster-mappings.yaml (infrastructure section)

Define values common to all clusters:

  • Base domain
  • DNS servers
  • Gateway
  • Registry hostname
  • Network CIDR

Example:

infrastructure:
  base_domain: example.com
  vlan: 100
  network_cidr: 192.168.1.0/24
  gateway: 192.168.1.1
  dns_servers:
    - 192.168.1.10
    - 192.168.1.11
  registry:
    hostname: registry.example.com
    port: 5000
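
A quick consistency check on these values can save a failed install: the gateway and every node IP should fall inside network_cidr. (DNS servers may legitimately sit on another routed subnet, so they are skipped here.) A minimal stdlib sketch, using the example values above:

```python
import ipaddress

# Hypothetical values copied from the example above.
infra = {
    "network_cidr": "192.168.1.0/24",
    "gateway": "192.168.1.1",
    "dns_servers": ["192.168.1.10", "192.168.1.11"],
}
node_ips = ["192.168.1.100"]  # the ip values from your cluster entries

net = ipaddress.ip_network(infra["network_cidr"])
for ip in [infra["gateway"], *node_ips]:
    if ipaddress.ip_address(ip) not in net:
        raise SystemExit(f"{ip} is outside {net}")
print("gateway and node IPs are inside", net)
```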

3. GitHub Secrets (Required)

Configure in your GitHub repository settings:

| Secret | Source |
| --- | --- |
| INTERSIGHT_API_KEY_ID | Intersight Console → API Keys |
| INTERSIGHT_SECRET_KEY | Intersight Console → API Keys (PEM file contents) |
| REDHAT_PULL_SECRET | console.redhat.com → Downloads → Pull Secret |
| SSH_PUBLIC_KEY | Your SSH public key for node access |

4. UCS Configuration (If using different hardware)

Directory: ucs/data/

Customize if your hardware differs:

  • Pool sizes (MAC, UUID, IP)
  • VLAN IDs
  • Storage policies (RAID vs no-RAID)
  • Server templates

5. DNS Records (Required)

Create these records for each cluster:

| Record | Type | Value |
| --- | --- | --- |
| api.<cluster>.<domain> | A | Node IP |
| api-int.<cluster>.<domain> | A | Node IP |
| *.apps.<cluster>.<domain> | A | Node IP |
| registry.<domain> | A | Runner VM IP |
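
The table above can be expanded into concrete A records programmatically, which is handy when scripting against a DNS provider's API. A sketch (required_dns_records and the runner IP 192.168.1.50 are illustrative, not part of the repo):

```python
def required_dns_records(cluster: str, domain: str, node_ip: str, runner_ip: str) -> dict:
    """Expand the per-cluster DNS table into concrete A records."""
    return {
        f"api.{cluster}.{domain}": node_ip,
        f"api-int.{cluster}.{domain}": node_ip,
        f"*.apps.{cluster}.{domain}": node_ip,   # wildcard for application routes
        f"registry.{domain}": runner_ip,         # shared, not per-cluster
    }

records = required_dns_records("my-cluster-1", "example.com",
                               "192.168.1.100", "192.168.1.50")
for name, ip in records.items():
    print(f"{name} A {ip}")
```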

Customization Checklist

  • Fork/clone this repository
  • Edit openshift/cluster-mappings.yaml with your values
  • Configure GitHub secrets in your repository
  • Create DNS records for your clusters
  • Update Intersight organization name in ucs/data/*/global_settings.ezi.yaml
  • Verify server serial numbers match your hardware
  • Run validation: gh workflow run ucs/pipeline.yaml -f action=DRY-RUN
  • Run validation: gh workflow run openshift/pipeline.yaml -f action=DRY-RUN

Value Reference

To find the correct values for your environment:

| Value | How to Find |
| --- | --- |
| Server serials | Intersight → Servers → Serial column |
| MAC addresses | Generated by UCS workflow, stored in cluster-macs.yaml |
| Interface names | Boot a live ISO on the hardware, run ip link show |
| Disk paths | Boot a live ISO, run lsblk |

Common Customizations

Different Network Subnet

  1. Edit openshift/cluster-mappings.yaml:
    • Update network_cidr
    • Update gateway
    • Update cluster ip values
  2. Re-render configs: python scripts/render-cluster-config.py <cluster>

Different Base Domain

  1. Edit openshift/cluster-mappings.yaml:
    • Update base_domain
  2. Create DNS records for new domain
  3. Re-render configs

Adding a New Cluster

  1. Add entry to openshift/cluster-mappings.yaml
  2. Run UCS workflow to deploy profile and capture MAC
  3. Create DNS records
  4. Run renderer: python scripts/render-cluster-config.py <new-cluster>
  5. Commit rendered configs
  6. Run OpenShift workflow

Different OpenShift Version

  1. Update openshift_version in cluster entry
  2. Update mirror/images-list.yaml with new version
  3. Run sync-images.yaml to mirror new version
  4. Re-render configs

Different GPU Configuration

  1. Edit the cluster entry in cluster-mappings.yaml:
     gpu:
       model: NVIDIA A100
       count: 2
  2. Ensure BIOS policy has Secure Boot disabled
  3. GPU Operator will auto-detect and configure

Workflow Customization

Container Image

Workflows use a pre-built container with all tools. To customize:

  1. Fork github-actions-container
  2. Modify Dockerfile with your tools
  3. Build and push to your registry
  4. Update workflow files to reference your image

Adding New Workflows

Follow the existing pattern:

  • Validate workflow reads from cluster-mappings.yaml
  • Deploy workflow takes cluster_name and action inputs
  • Test workflow verifies deployment

Troubleshooting

Validation Fails

# Check cluster-mappings.yaml syntax
python scripts/render-cluster-config.py <cluster> --dry-run

# Verify Intersight credentials
isctl get compute physical-summary --top 1

DNS Not Resolving

# Test from your network
nslookup api.<cluster>.<domain>

# Verify DNS server is authoritative
dig +trace api.<cluster>.<domain>

MAC Address Missing

# Run UCS deploy to capture MAC
gh workflow run ucs/pipeline.yaml -f cluster_name=<cluster> -f action=DEPLOY

# Check generated file
cat openshift/cluster-macs.yaml

Registry Configuration Options

The container registry stores mirrored images for air-gapped installation. Two options are supported:

Option 1: Docker Registry (Current)

Image: registry:2

Pros:

  • Simple to deploy and operate
  • No authentication required (simpler pull secret)
  • Lightweight resource requirements

Cons:

  • No built-in UI
  • No image pruning/garbage collection by default
  • Not officially supported by Red Hat

Configuration:

infrastructure:
  registry:
    hostname: registry.example.com
    port: 5000
    tls: true  # Self-signed certificate with additionalTrustBundle

Option 2: Red Hat Mirror Registry (Recommended for Production)

Image: mirror-registry (Quay-based)

Pros:

  • Official Red Hat supported approach
  • Built-in web UI
  • Image pruning and garbage collection
  • Integrated with oc-mirror workflows

Cons:

  • Requires authentication setup
  • Higher resource requirements
  • More complex initial configuration

When to use each:

  • Development/Lab: Docker registry:2 (simpler)
  • Production/CVD: Red Hat mirror-registry (supported)

TLS Certificate Options

The registry must be accessible via HTTPS for OpenShift installation. Three approaches are available:

Option 1: Self-Signed CA Certificate (Current)

Current approach in this project.

Requirements:

  • CA certificate and key
  • additionalTrustBundle in install-config.yaml
  • additionalTrustBundlePolicy: Always to ensure CA is injected

Configuration in install-config.yaml:

additionalTrustBundlePolicy: Always
additionalTrustBundle: |
  -----BEGIN CERTIFICATE-----
  <your CA certificate content>
  -----END CERTIFICATE-----

Advantages:

  • No external DNS or internet access required
  • Works in fully air-gapped environments
  • Full control over certificate lifecycle
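
Before pasting a CA into additionalTrustBundle, it is worth verifying the bundle is well-formed PEM, since a mismatched BEGIN/END pair only fails later, at install time. A minimal sketch (looks_like_pem_bundle is a hypothetical helper, not part of this repo):

```python
import re

def looks_like_pem_bundle(text: str) -> bool:
    """Cheap sanity check before embedding a CA bundle in install-config.yaml:
    at least one certificate, and every BEGIN marker has a matching END."""
    begins = re.findall(r"-----BEGIN CERTIFICATE-----", text)
    ends = re.findall(r"-----END CERTIFICATE-----", text)
    return len(begins) >= 1 and len(begins) == len(ends)

good = "-----BEGIN CERTIFICATE-----\nMIIB...\n-----END CERTIFICATE-----\n"
print(looks_like_pem_bundle(good))          # True
print(looks_like_pem_bundle("not a cert"))  # False
```

Remember that the certificate lines must also be indented consistently under the `additionalTrustBundle: |` block scalar, or the YAML will not parse.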

Option 2: Let's Encrypt with Public DNS

Requirements:

  • Public DNS zone with API access (e.g., IONOS, Cloudflare, Route53)
  • Registry hostname in public DNS zone
  • Network allows outbound HTTPS to Let's Encrypt

Advantages:

  • Publicly trusted certificate; no additionalTrustBundle needed
  • Automatic renewal possible

Setup:

# Example with IONOS DNS
certbot certonly \
  --authenticator dns-ionos \
  --dns-ionos-credentials /path/to/credentials.ini \
  -d registry.your-public-domain.com

Option 3: Enterprise CA Certificate

Requirements:

  • Certificate from internal CA
  • CA cert in additionalTrustBundle

When to use:

  • Enterprise environments with mandatory internal CA
  • When aligning with existing PKI infrastructure

Air-Gap Considerations

True Air-Gap vs. Semi-Connected

True Air-Gap:

  • No internet connectivity at any point
  • Requires "sneakernet" transfer of images
  • Use mirror-to-disk, then disk-to-mirror workflow

Semi-Connected (This Project):

  • Internet available during preparation
  • Air-gapped during installation
  • Direct mirroring to internal registry

Image Mirroring Requirements

All images must be mirrored before installation:

| Category | Images | Config File |
| --- | --- | --- |
| OpenShift Release | ~150 images | mirror/images-list.yaml |
| Red Hat Operators | NFD, etc. | mirror/images-list.yaml |
| Certified Operators | GPU Operator, CLIFE | mirror/images-list.yaml |
| Additional Images | CLIFE, custom images | mirror/images-list.yaml |

oc-mirror Workflow

# Mirror to registry directly (semi-connected)
oc-mirror --config images-list.yaml docker://registry:5000

# Mirror to disk (true air-gap)
oc-mirror --config images-list.yaml file:///path/to/disk

# Upload from disk to registry
oc-mirror --from /path/to/disk docker://registry:5000

Registry Accessibility During Installation

CRITICAL: The registry must be accessible from the target server network during installation.

  • Agent boots and needs to pull images
  • Verify network path exists before generating ISO
  • Test: curl -v https://registry:5000/v2/_catalog
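
The curl test above can also be scripted as a pre-flight check before generating the ISO. A sketch using only the standard library (check_registry assumes an unauthenticated registry; pass your CA file when using a self-signed certificate):

```python
import json
import ssl
import urllib.request

def catalog_url(hostname, port=5000):
    """The Docker Registry v2 endpoint that lists repositories."""
    return f"https://{hostname}:{port}/v2/_catalog"

def check_registry(hostname, port=5000, ca_file=None):
    """Fetch the catalog; raises on TLS or network failure (sketch, no auth)."""
    ctx = ssl.create_default_context(cafile=ca_file)
    with urllib.request.urlopen(catalog_url(hostname, port), context=ctx) as resp:
        return json.load(resp)

print(catalog_url("registry.example.com"))  # → https://registry.example.com:5000/v2/_catalog
```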

Network Customization

VLAN Configuration

Update VLANs in UCS policies:

File: ucs/data/example/policies/network-policies.ezi.yaml

vlan_policies:
  - name: saif-ai-pod-vlan
    vlans:
      - vlan_list: "100"  # Change to your VLAN
        name: ai-pod-vlan
        auto_allow_on_uplinks: true

ACI Integration

For ACI-attached UCS-X:

  1. Create EPG in ACI for cluster VLAN
  2. Configure static path bindings to FI uplinks
  3. Ensure ARP flooding and unicast routing are enabled on the Bridge Domain
  4. See TOPOLOGY.md for example ACI config

MTU / Jumbo Frames

For jumbo frames (MTU 9216 on the UCS fabric, MTU 9000 on host interfaces):

UCS Policy:

ethernet_network_control_policies:
  - name: saif-ai-pod-netctrl
    lldp_transmit: true
    lldp_receive: true

Agent Config (NMState):

networkConfig:
  interfaces:
    - name: eth0
      mtu: 9000

UCSX-S9108-100G IFM Notes

The 100G IFMs have specific requirements:

  • No server ports needed: the IFM handles server port mapping internally
  • No EthNetworkGroupPolicy on uplinks: VLANs are controlled via the VLAN policy
  • Use auto_allow_on_uplinks: true in the VLAN policy

Hardware Variations

M.2 Storage Configuration

Different servers may have 1 or 2 M.2 drives:

| Configuration | Storage Policy | Notes |
| --- | --- | --- |
| 2x M.2 disks | saif-ucsx-storage | RAID1 for redundancy |
| 1x M.2 disk | saif-ucsx-storage-noraid | No RAID possible |

Detection: Check Intersight or iserver for disk inventory before deployment.

Profile Override: Set storage policy per-profile in profiles/server.ezi.yaml:

profiles:
  server:
    - name: saif-ai-pod-1
      storage_policy: saif-ucsx-storage-noraid

GPU vs Non-GPU Nodes

GPU nodes require:

  • Secure Boot disabled in BIOS policy
  • GPU detected in PCI inventory (verify in Intersight)
  • No additional UCS configuration; the GPU Operator handles the rest

Non-GPU nodes:

  • Can use standard BIOS policy
  • Skip GPU Operator deployment

Server Profile Templates

Two templates are provided:

| Template | Use Case |
| --- | --- |
| saif-ai-pod-template | Standard config, RAID M.2 |
| saif-ai-pod-template-noraid | Single M.2 disk |

To change a profile's template, detach and reattach:

# Detach
isctl update server profile moid <MOID> --SrcTemplate 'null'

# Update profile YAML, run terraform apply
# Profile will associate with new template

Support

For issues with this project: