
Customization Guide

This guide explains how to adapt saif-ai-pod for your environment.

Overview

The project uses a data-driven approach where cluster-specific values are defined in a single location and templates generate the rest.

Files to Customize

1. Cluster Definitions (Required)

File: openshift/cluster-mappings.yaml

This is the primary configuration file. Define:

  • Cluster names
  • Node IPs
  • Server serial numbers (from Intersight)
  • UCS profile names

Example:

clusters:
  - name: my-cluster-1
    ucs_profile: my-ucs-profile
    serial: FCH12345678
    ip: 192.168.1.100
    hostname: my-cluster-1.example.com
    interface: eth0
    installation_disk: /dev/sda
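To catch typos before a pipeline run, the required fields of a cluster entry can be sanity-checked with a short script. This is a hypothetical helper, not part of the repository; the field list mirrors the example above and should be adjusted to the real schema:

```python
import ipaddress
import re

# Required keys for each cluster entry (assumed from the example above --
# adjust to match your actual cluster-mappings.yaml schema).
REQUIRED_KEYS = {"name", "ucs_profile", "serial", "ip", "hostname",
                 "interface", "installation_disk"}

def validate_cluster(entry: dict) -> list:
    """Return a list of problems found in one cluster entry."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - entry.keys())]
    if "ip" in entry:
        try:
            ipaddress.ip_address(entry["ip"])
        except ValueError:
            problems.append(f"invalid ip: {entry['ip']}")
    # Cisco serials are typically 11 alphanumeric characters (e.g. FCH12345678).
    if "serial" in entry and not re.fullmatch(r"[A-Z0-9]{11}", entry["serial"]):
        problems.append(f"suspicious serial: {entry['serial']}")
    return problems

entry = {
    "name": "my-cluster-1", "ucs_profile": "my-ucs-profile",
    "serial": "FCH12345678", "ip": "192.168.1.100",
    "hostname": "my-cluster-1.example.com",
    "interface": "eth0", "installation_disk": "/dev/sda",
}
print(validate_cluster(entry))  # → []
```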

2. Shared Infrastructure (Required)

File: openshift/cluster-mappings.yaml (infrastructure section)

Define values common to all clusters:

  • Base domain
  • DNS servers
  • Gateway
  • Registry hostname
  • Network CIDR

Example:

infrastructure:
  base_domain: example.com
  vlan: 100
  network_cidr: 192.168.1.0/24
  gateway: 192.168.1.1
  dns_servers:
    - 192.168.1.10
    - 192.168.1.11
  registry:
    hostname: registry.example.com
    port: 5000
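
A quick consistency check on these values can save a failed install: the gateway and every node IP should fall inside network_cidr. (DNS servers may legitimately sit on another routed subnet, so they are skipped here.) A minimal stdlib sketch, using the example values above:

```python
import ipaddress

# Hypothetical values copied from the example above.
infra = {
    "network_cidr": "192.168.1.0/24",
    "gateway": "192.168.1.1",
    "dns_servers": ["192.168.1.10", "192.168.1.11"],
}
node_ips = ["192.168.1.100"]  # the ip values from your cluster entries

net = ipaddress.ip_network(infra["network_cidr"])
for ip in [infra["gateway"], *node_ips]:
    if ipaddress.ip_address(ip) not in net:
        raise SystemExit(f"{ip} is outside {net}")
print("gateway and node IPs are inside", net)
```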

3. GitHub Secrets (Required)

Configure in your GitHub repository settings:

| Secret | Source |
| --- | --- |
| INTERSIGHT_API_KEY_ID | Intersight Console → API Keys |
| INTERSIGHT_SECRET_KEY | Intersight Console → API Keys (PEM file contents) |
| REDHAT_PULL_SECRET | console.redhat.com → Downloads → Pull Secret |
| SSH_PUBLIC_KEY | Your SSH public key for node access |

4. UCS Configuration (If using different hardware)

Directory: ucs/data/

Customize if your hardware differs:

  • Pool sizes (MAC, UUID, IP)
  • VLAN IDs
  • Storage policies (RAID vs no-RAID)
  • Server templates

5. DNS Records (Required)

Create these records for each cluster:

| Record | Type | Value |
| --- | --- | --- |
| api.<cluster>.<domain> | A | Node IP |
| api-int.<cluster>.<domain> | A | Node IP |
| *.apps.<cluster>.<domain> | A | Node IP |
| registry.<domain> | A | Runner VM IP |
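
The table above can be expanded into concrete A records programmatically, which is handy when scripting against a DNS provider's API. A sketch (required_dns_records and the runner IP 192.168.1.50 are illustrative, not part of the repo):

```python
def required_dns_records(cluster: str, domain: str, node_ip: str, runner_ip: str) -> dict:
    """Expand the per-cluster DNS table into concrete A records."""
    return {
        f"api.{cluster}.{domain}": node_ip,
        f"api-int.{cluster}.{domain}": node_ip,
        f"*.apps.{cluster}.{domain}": node_ip,   # wildcard for application routes
        f"registry.{domain}": runner_ip,         # shared, not per-cluster
    }

records = required_dns_records("my-cluster-1", "example.com",
                               "192.168.1.100", "192.168.1.50")
for name, ip in records.items():
    print(f"{name} A {ip}")
```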

Customization Checklist

  • Fork/clone this repository
  • Edit openshift/cluster-mappings.yaml with your values
  • Configure GitHub secrets in your repository
  • Create DNS records for your clusters
  • Update Intersight organization name in ucs/data/*/global_settings.ezi.yaml
  • Verify server serial numbers match your hardware
  • Run validation: gh workflow run ucs/pipeline.yaml -f action=DRY-RUN
  • Run validation: gh workflow run openshift/pipeline.yaml -f action=DRY-RUN

Value Reference

To find the correct values for your environment:

| Value | How to Find |
| --- | --- |
| Server serials | Intersight → Servers → Serial column |
| MAC addresses | Generated by UCS workflow, stored in cluster-macs.yaml |
| Interface names | Boot a live ISO on the hardware, run ip link show |
| Disk paths | Boot a live ISO, run lsblk |

Common Customizations

Different Network Subnet

  1. Edit openshift/cluster-mappings.yaml:
    • Update network_cidr
    • Update gateway
    • Update cluster ip values
  2. Re-render configs: python scripts/render-cluster-config.py <cluster>

Different Base Domain

  1. Edit openshift/cluster-mappings.yaml:
    • Update base_domain
  2. Create DNS records for new domain
  3. Re-render configs

Adding a New Cluster

  1. Add entry to openshift/cluster-mappings.yaml
  2. Run UCS workflow to deploy profile and capture MAC
  3. Create DNS records
  4. Run renderer: python scripts/render-cluster-config.py <new-cluster>
  5. Commit rendered configs
  6. Run OpenShift workflow

Different OpenShift Version

  1. Update openshift_version in cluster entry
  2. Update mirror/images-list.yaml with new version
  3. Run sync-images.yaml to mirror new version
  4. Re-render configs

Different GPU Configuration

  1. Edit the cluster entry in cluster-mappings.yaml:
     gpu:
       model: NVIDIA A100
       count: 2
  2. Ensure BIOS policy has Secure Boot disabled
  3. GPU Operator will auto-detect and configure

Workflow Customization

Container Image

Workflows use a pre-built container with all tools. To customize:

  1. Fork github-actions-container
  2. Modify Dockerfile with your tools
  3. Build and push to your registry
  4. Update workflow files to reference your image

Adding New Workflows

Follow the existing pattern:

  • Validate workflow reads from cluster-mappings.yaml
  • Deploy workflow takes cluster_name and action inputs
  • Test workflow verifies deployment

Troubleshooting

Validation Fails

# Check cluster-mappings.yaml syntax
python scripts/render-cluster-config.py <cluster> --dry-run

# Verify Intersight credentials
isctl get compute physical-summary --top 1

DNS Not Resolving

# Test from your network
nslookup api.<cluster>.<domain>

# Verify DNS server is authoritative
dig +trace api.<cluster>.<domain>

MAC Address Missing

# Run UCS deploy to capture MAC
gh workflow run ucs/pipeline.yaml -f cluster_name=<cluster> -f action=DEPLOY

# Check generated file
cat openshift/cluster-macs.yaml

Registry Configuration Options

The container registry stores mirrored images for air-gapped installation. Two options are supported:

Option 1: Docker Registry (Current)

Image: registry:2

Pros:

  • Simple to deploy and operate
  • No authentication required (simpler pull secret)
  • Lightweight resource requirements

Cons:

  • No built-in UI
  • No image pruning/garbage collection by default
  • Not officially supported by Red Hat

Configuration:

infrastructure:
  registry:
    hostname: registry.example.com
    port: 5000
    tls: true  # Self-signed certificate with additionalTrustBundle

Option 2: Red Hat Mirror Registry (Recommended for Production)

Image: mirror-registry (Quay-based)

Pros:

  • Official Red Hat supported approach
  • Built-in web UI
  • Image pruning and garbage collection
  • Integrated with oc-mirror workflows

Cons:

  • Requires authentication setup
  • Higher resource requirements
  • More complex initial configuration

When to use each:

  • Development/Lab: Docker registry:2 (simpler)
  • Production/CVD: Red Hat mirror-registry (supported)

TLS Certificate Options

The registry must be accessible via HTTPS for OpenShift installation. Three approaches are available:

Option 1: Self-Signed CA Certificate (Current)

Current approach in this project.

Requirements:

  • CA certificate and key
  • additionalTrustBundle in install-config.yaml
  • additionalTrustBundlePolicy: Always to ensure CA is injected

Configuration in install-config.yaml:

additionalTrustBundlePolicy: Always
additionalTrustBundle: |
  -----BEGIN CERTIFICATE-----
  <your CA certificate content>
  -----END CERTIFICATE-----

Advantages:

  • No external DNS or internet access required
  • Works in fully air-gapped environments
  • Full control over certificate lifecycle
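
Before pasting a CA into additionalTrustBundle, it is worth verifying the bundle is well-formed PEM, since a mismatched BEGIN/END pair only fails later, at install time. A minimal sketch (looks_like_pem_bundle is a hypothetical helper, not part of this repo):

```python
import re

def looks_like_pem_bundle(text: str) -> bool:
    """Cheap sanity check before embedding a CA bundle in install-config.yaml:
    at least one certificate, and every BEGIN marker has a matching END."""
    begins = re.findall(r"-----BEGIN CERTIFICATE-----", text)
    ends = re.findall(r"-----END CERTIFICATE-----", text)
    return len(begins) >= 1 and len(begins) == len(ends)

good = "-----BEGIN CERTIFICATE-----\nMIIB...\n-----END CERTIFICATE-----\n"
print(looks_like_pem_bundle(good))          # True
print(looks_like_pem_bundle("not a cert"))  # False
```

Remember that the certificate lines must also be indented consistently under the `additionalTrustBundle: |` block scalar, or the YAML will not parse.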

Option 2: Let's Encrypt with Public DNS

Requirements:

  • Public DNS zone with API access (e.g., IONOS, Cloudflare, Route53)
  • Registry hostname in public DNS zone
  • Network allows outbound HTTPS to Let's Encrypt

Advantages:

  • Publicly trusted certificate; no additionalTrustBundle needed
  • Automatic renewal possible

Setup:

# Example with IONOS DNS
certbot certonly \
  --authenticator dns-ionos \
  --dns-ionos-credentials /path/to/credentials.ini \
  -d registry.your-public-domain.com

Option 3: Enterprise CA Certificate

Requirements:

  • Certificate from internal CA
  • CA cert in additionalTrustBundle

When to use:

  • Enterprise environments with mandatory internal CA
  • When aligning with existing PKI infrastructure

Air-Gap Considerations

True Air-Gap vs. Semi-Connected

True Air-Gap:

  • No internet connectivity at any point
  • Requires "sneakernet" transfer of images
  • Use mirror-to-disk, then disk-to-mirror workflow

Semi-Connected (This Project):

  • Internet available during preparation
  • Air-gapped during installation
  • Direct mirroring to internal registry

Image Mirroring Requirements

All images must be mirrored before installation:

| Category | Images | Config File |
| --- | --- | --- |
| OpenShift Release | ~150 images | mirror/images-list.yaml |
| Red Hat Operators | NFD, etc. | mirror/images-list.yaml |
| Certified Operators | GPU Operator, CLIFE | mirror/images-list.yaml |
| Additional Images | CLIFE, custom images | mirror/images-list.yaml |

oc-mirror Workflow

# Mirror to registry directly (semi-connected)
oc-mirror --config images-list.yaml docker://registry:5000

# Mirror to disk (true air-gap)
oc-mirror --config images-list.yaml file:///path/to/disk

# Upload from disk to registry
oc-mirror --from /path/to/disk docker://registry:5000

Registry Accessibility During Installation

CRITICAL: The registry must be accessible from the target server network during installation.

  • Agent boots and needs to pull images
  • Verify network path exists before generating ISO
  • Test: curl -v https://registry:5000/v2/_catalog
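
The curl test above can also be scripted as a pre-flight check before generating the ISO. A sketch using only the standard library (check_registry assumes an unauthenticated registry; pass your CA file when using a self-signed certificate):

```python
import json
import ssl
import urllib.request

def catalog_url(hostname, port=5000):
    """The Docker Registry v2 endpoint that lists repositories."""
    return f"https://{hostname}:{port}/v2/_catalog"

def check_registry(hostname, port=5000, ca_file=None):
    """Fetch the catalog; raises on TLS or network failure (sketch, no auth)."""
    ctx = ssl.create_default_context(cafile=ca_file)
    with urllib.request.urlopen(catalog_url(hostname, port), context=ctx) as resp:
        return json.load(resp)

print(catalog_url("registry.example.com"))  # → https://registry.example.com:5000/v2/_catalog
```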

Network Customization

VLAN Configuration

Update VLANs in UCS policies:

File: ucs/data/example/policies/network-policies.ezi.yaml

vlan_policies:
  - name: saif-ai-pod-vlan
    vlans:
      - vlan_list: "100"  # Change to your VLAN
        name: ai-pod-vlan
        auto_allow_on_uplinks: true

ACI Integration

For ACI-attached UCS-X:

  1. Create EPG in ACI for cluster VLAN
  2. Configure static path bindings to FI uplinks
  3. Ensure ARP flooding and unicast routing are enabled on the Bridge Domain
  4. See TOPOLOGY.md for example ACI config

MTU / Jumbo Frames

For jumbo frames (MTU 9216 on the UCS fabric, MTU 9000 on host interfaces):

UCS Policy:

ethernet_network_control_policies:
  - name: saif-ai-pod-netctrl
    lldp_transmit: true
    lldp_receive: true

Agent Config (NMState):

networkConfig:
  interfaces:
    - name: eth0
      mtu: 9000

UCSX-S9108-100G IFM Notes

The 100G IFMs have specific requirements:

  • No server ports needed: the IFM handles server port mapping internally
  • No EthNetworkGroupPolicy on uplinks: VLANs are controlled via the VLAN policy
  • Use auto_allow_on_uplinks: true in the VLAN policy

Hardware Variations

M.2 Storage Configuration

Different servers may have 1 or 2 M.2 drives:

| Configuration | Storage Policy | Notes |
| --- | --- | --- |
| 2x M.2 disks | saif-ucsx-storage | RAID1 for redundancy |
| 1x M.2 disk | saif-ucsx-storage-noraid | No RAID possible |

Detection: Check Intersight or iserver for disk inventory before deployment.

Profile Override: Set storage policy per-profile in profiles/server.ezi.yaml:

profiles:
  server:
    - name: saif-ai-pod-1
      storage_policy: saif-ucsx-storage-noraid

GPU vs Non-GPU Nodes

GPU nodes require:

  • Secure Boot disabled in BIOS policy
  • GPU detected in PCI inventory (verify in Intersight)
  • No additional UCS configuration; the GPU Operator handles the rest

Non-GPU nodes:

  • Can use standard BIOS policy
  • Skip GPU Operator deployment

Server Profile Templates

Two templates are provided:

| Template | Use Case |
| --- | --- |
| saif-ai-pod-template | Standard config, RAID M.2 |
| saif-ai-pod-template-noraid | Single M.2 disk |

To change a profile's template, detach and reattach:

# Detach
isctl update server profile moid <MOID> --SrcTemplate 'null'

# Update profile YAML, run terraform apply
# Profile will associate with new template

Support

For issues with this project: