Troubleshooting Guide

Quick solutions to common issues with OpenShift DPF automation deployments.

Quick Diagnostic Commands

# Overall cluster health
oc get nodes
oc get pods --all-namespaces | grep -v Running | grep -v Completed

# DPF-specific status
oc get pods -n dpf-operator-system
oc get hostedcluster -n clusters
oc get dpudeployment -n dpf-operator-system

# Check automation status
make cluster-status
make worker-status
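
Filtering with `grep -v Running` also hides any line that merely contains the word "Running" somewhere else, for example in a pod name. A minimal sketch of a stricter filter that looks only at the STATUS column is shown here. The function name `unhealthy_pods` is made up for illustration, and the column positions assume the default `oc get pods --all-namespaces --no-headers` layout.

```shell
# List pods whose STATUS column is neither Running nor Completed.
# Reads `oc get pods --all-namespaces --no-headers` output on stdin, where the
# columns are assumed to be: NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
    awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " (" $4 ")" }'
}

# Usage (requires a logged-in oc session):
#   oc get pods --all-namespaces --no-headers | unhealthy_pods
```

A pod that is Running but not yet ready (for example `0/1`) is not reported by this sketch; check the READY column separately if that matters.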

Deployment Issues

Cluster Creation Fails

Problem: make create-cluster fails or hangs

# Check aicli configuration
aicli list clusters

# Verify Red Hat credentials
cat ~/.aicli/config.yaml

# Check .env configuration
grep -E "CLUSTER_NAME|BASE_DOMAIN|OPENSHIFT_VERSION" .env

# Common fixes:
# 1. Verify internet connectivity
# 2. Check Red Hat account access
# 3. Ensure unique cluster name
# 4. Verify domain ownership
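
The `grep` above shows the three variables but does not tell you when one is missing or empty. A small sketch that reports exactly those cases follows. `missing_env_vars` is a hypothetical helper, not part of the automation, and it assumes plain `NAME=value` lines in `.env`.

```shell
# Print the name of every listed variable that is missing or empty in an
# env file, and return non-zero if any were reported.
# Usage: missing_env_vars FILE VAR [VAR...]
missing_env_vars() {
    file=$1
    shift
    rc=0
    for name in "$@"; do
        # A variable counts as set only if "NAME=" is followed by at least one character.
        if ! grep -Eq "^${name}=.+" "$file"; then
            echo "$name"
            rc=1
        fi
    done
    return "$rc"
}

# Usage:
#   missing_env_vars .env CLUSTER_NAME BASE_DOMAIN OPENSHIFT_VERSION
```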

VM Creation Fails

Problem: make create-vms fails

# Check libvirt status
sudo systemctl status libvirtd

# Verify available resources
free -h                      # Need 32GB+ RAM
df -h /var/lib/libvirt      # Need 200GB+ storage

# Check VM configuration
grep -E "VM_COUNT|RAM|VCPUS" .env

# Common fixes:
# 1. Increase host RAM/storage
# 2. Reduce VM_COUNT or RAM settings
# 3. Start libvirt service
# 4. Fix /var/lib/libvirt permissions
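
The RAM and storage thresholds above can be checked in a script instead of by eye. The following is a rough sketch under stated assumptions: the helper names are invented for this example, RAM is read from `/proc/meminfo` (Linux only), and values are truncated to whole GiB.

```shell
# Total RAM in whole GiB, read from a meminfo-style file (default: /proc/meminfo).
mem_total_gib() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# Free space in whole GiB on the filesystem that holds the given path.
disk_avail_gib() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Compare a measured value against a minimum and report the result.
# Usage: check_minimum LABEL ACTUAL REQUIRED
check_minimum() {
    if [ "$2" -ge "$3" ]; then
        echo "OK: $1 ${2}G (need ${3}G+)"
    else
        echo "LOW: $1 ${2}G (need ${3}G+)"
        return 1
    fi
}

# Usage, with the thresholds quoted in this guide:
#   check_minimum RAM  "$(mem_total_gib)" 32
#   check_minimum disk "$(disk_avail_gib /var/lib/libvirt)" 200
```

`MemTotal` is usually slightly lower than the installed RAM, so a host with exactly 32GB may report 31 here; treat a borderline result as a hint rather than a hard failure.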

DPF Installation Fails

Problem: make deploy-dpf fails

# Check prerequisite operators
oc get pods -n cert-manager
oc get pods -n openshift-nfd
oc get pods -n openshift-sriov-network-operator

# Verify pull secrets
oc get secret pull-secret -n dpf-operator-system -o yaml

# Check DPF operator logs
oc logs -n dpf-operator-system deployment/dpf-operator-controller-manager

# Common fixes:
# 1. Wait for prerequisites to be ready (5+ minutes)
# 2. Verify NGC pull secret credentials
# 3. Check internet connectivity from cluster
# 4. Verify DPF version compatibility
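
"Wait for prerequisites to be ready" can be scripted as a bounded retry loop rather than a manual pause. This is a generic sketch; `wait_for` is a hypothetical helper, and the `oc wait` line in the usage comment assumes the prerequisite operators run as Deployments in the namespaces listed above.

```shell
# Re-run a command until it succeeds or the timeout (in seconds) expires.
# Usage: wait_for TIMEOUT INTERVAL COMMAND [ARG...]
wait_for() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out after ${elapsed}s: $*" >&2
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
}

# Usage: poll every 10 seconds for up to 5 minutes, then deploy.
#   wait_for 300 10 oc wait --for=condition=Available deployment --all \
#       -n cert-manager --timeout=10s && make deploy-dpf
```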

Worker Provisioning Fails

Problem: Workers don't join the cluster

# Check BMC connectivity (replace 192.168.1.101 with your BMC address)
ping -c 3 192.168.1.101
curl -k https://192.168.1.101/redfish/v1/

# Check worker provisioning status
oc get bmh -n openshift-machine-api
oc get csr | grep Pending

# Check automation logs
oc logs -n openshift-machine-api deployment/metal3

# Common fixes:
# 1. Verify BMC credentials and IP
# 2. Check boot MAC address is correct
# 3. Manually approve CSRs if needed
# 4. Verify network connectivity to cluster
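
For fix 3, `grep Pending` also matches any line that contains "Pending" outside the condition column. A small sketch that selects only truly pending requests and prints their names follows. `pending_csr_names` is a made-up helper, and it assumes CONDITION is the last column of `oc get csr --no-headers`.

```shell
# Print the names of CSRs whose CONDITION column is exactly "Pending".
# Reads `oc get csr --no-headers` output on stdin; CONDITION is assumed to be
# the last column.
pending_csr_names() {
    awk '$NF == "Pending" { print $1 }'
}

# Usage (review the list before approving anything):
#   oc get csr --no-headers | pending_csr_names
#   oc get csr --no-headers | pending_csr_names | xargs -r oc adm certificate approve
```

`xargs -r` (skip the command when the list is empty) is a GNU extension. Only approve requests you expect, since approval admits the requesting node to the cluster.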

Network Issues

Cluster Network Not Accessible

Problem: Cannot reach cluster API

# Check cluster status
make cluster-status

# Verify VM network
virsh net-list
sudo ip addr show br-dpf

# Check cluster IPs
grep -E "API_VIP|INGRESS_VIP" .env

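# Test cluster connectivity
# (API_VIP is defined in .env, not in your shell; this assumes .env is shell-sourceable)
. ./.env
ping -c 3 "${API_VIP}"
curl -k "https://${API_VIP}:6443/healthz"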

# Common fixes:
# 1. Wait for cluster installation to complete
# 2. Verify network bridge configuration
# 3. Check firewall rules
# 4. Restart libvirt networking
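
If you would rather not source the whole `.env` into your shell, a single value can be extracted instead. This is a sketch only: `env_value` is a hypothetical helper and assumes simple `KEY=value` lines, optionally quoted, with no inline comments.

```shell
# Print the value of KEY from an env file, stripping optional surrounding quotes.
# If the key appears more than once, the last assignment wins.
# Usage: env_value FILE KEY
env_value() {
    sed -n "s/^$2=//p" "$1" | tail -n 1 |
        sed -e 's/^"\(.*\)"$/\1/' -e "s/^'\(.*\)'\$/\1/"
}

# Usage:
#   API_VIP=$(env_value .env API_VIP)
#   curl -k "https://${API_VIP}:6443/healthz"
```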

DPU Network Issues

Problem: DPU interfaces not configured

# Check SR-IOV operator status
oc get pods -n openshift-sriov-network-operator

# Verify DPU interface configuration
oc get sriovnetworkpolicy -n openshift-sriov-network-operator
oc get sriovnetwork -n openshift-sriov-network-operator

# Check DPU interface on worker nodes
# (replace worker-01 and ens7f0 with your node name and DPU interface)
oc debug node/worker-01 -- chroot /host ip link show ens7f0

# Common fixes:
# 1. Wait for SR-IOV operator to configure interfaces (10+ minutes)
# 2. Verify DPU_INTERFACE setting in .env
# 3. Check DPU hardware is properly installed
# 4. Verify NUM_VFS configuration

Storage Issues

Persistent Volume Issues

Problem: Pods stuck in Pending state due to storage

# Check storage class
oc get storageclass

# For SNO/single-node (uses LVMS)
oc get pods -n openshift-local-storage

# For multi-node (uses ODF)
oc get pods -n openshift-storage

# Check available storage
oc get pv
oc get pvc --all-namespaces

# Common fixes:
# 1. Wait for storage operators to be ready (10+ minutes)
# 2. Verify disk space on nodes
# 3. Check storage operator logs
# 4. For multi-node: ensure 3+ worker nodes
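
On a busy cluster the full PVC list is long. The sketch below narrows it to claims that are still Pending. `pending_pvcs` is an illustrative helper name, and it assumes STATUS is the third column of `oc get pvc --all-namespaces --no-headers`.

```shell
# Print "namespace/name" for every PVC whose STATUS column is Pending.
# Reads `oc get pvc --all-namespaces --no-headers` output on stdin, where the
# first three columns are assumed to be: NAMESPACE NAME STATUS.
pending_pvcs() {
    awk '$3 == "Pending" { print $1 "/" $2 }'
}

# Usage:
#   oc get pvc --all-namespaces --no-headers | pending_pvcs
#   oc describe pvc NAME -n NAMESPACE    # the Events section usually names the cause
```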

Performance Issues

Slow Deployment

Problem: Deployment takes longer than expected

# Check resource usage on host
top
iostat 1        # Check disk I/O
free -h         # Check memory usage

# Check cluster resource usage
oc adm top nodes
oc adm top pods -n dpf-operator-system

# Common fixes:
# 1. Allocate more RAM to VMs
# 2. Use faster storage (SSD)
# 3. Increase CPU cores
# 4. Close other applications

DPU Performance Issues

Problem: Poor DPU performance

# Check DPU utilization
oc get dpudeployment -n dpf-operator-system -o wide

# Verify SR-IOV VF allocation
oc describe node worker-01 | grep "openshift.io/bf3"

# Test DPU networking
# Run network performance tests between DPU-enabled pods

# Common optimizations:
# 1. Tune NUM_VFS for your workload
# 2. Enable jumbo frames (NODES_MTU=9000)
# 3. Optimize DPU interface settings
# 4. Check for hardware issues

Configuration Issues

Environment Variables

Problem: Invalid configuration values

# Validate .env configuration
make validate-environment

# Check for common issues
grep -nE "^[A-Z0-9_]+=.*[[:space:]]+$" .env    # Trailing whitespace
grep -nE "^[A-Z0-9_]+=$" .env                  # Empty values

# Verify required variables are set
grep -E "CLUSTER_NAME|BASE_DOMAIN|OPENSHIFT_VERSION" .env

# Common issues:
# 1. Trailing spaces in values
# 2. Empty required variables
# 3. Invalid IP addresses or hostnames
# 4. Conflicting network ranges
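
The first two issues can be caught in one pass, with line numbers. A minimal sketch follows; `lint_env_file` is a hypothetical helper and is not a substitute for `make validate-environment`, whose checks are not described in this guide.

```shell
# Report lines in an env file that have an empty value or trailing whitespace.
# Prints "LINE: PROBLEM: NAME" per finding; returns 1 if anything was found.
# Comment lines and lines without NAME= are ignored.
lint_env_file() {
    awk -F= '
        /^[A-Za-z_][A-Za-z0-9_]*=/ {
            if ($0 ~ /^[^=]+=[ \t]*$/) {
                print NR ": empty value: " $1; bad = 1
            } else if ($0 ~ /[ \t]+$/) {
                print NR ": trailing whitespace: " $1; bad = 1
            }
        }
        END { exit bad + 0 }
    ' "$1"
}

# Usage:
#   lint_env_file .env || echo "fix the lines listed above"
```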

Pull Secret Issues

Problem: Image pull failures

# Check pull secret format
jq . openshift_pull.json
cat pull-secret.txt

# Verify pull secret is applied
oc get secret pull-secret -n dpf-operator-system

# Test NGC registry access
# (--password-stdin expects only the NGC API key on stdin, not a JSON pull secret)
podman login nvcr.io --username '$oauthtoken' --password-stdin < pull-secret.txt

# Common fixes:
# 1. Re-download Red Hat pull secret
# 2. Verify NGC API key is valid
# 3. Merge pull secrets correctly
# 4. Check internet connectivity
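
For fix 3, one way to merge two pull secrets is with `jq`, assuming both files use the usual registry auth layout with a top-level `auths` object. The function names and the `ngc_pull.json` file name below are hypothetical; this guide does not say how the automation itself merges secrets.

```shell
# Merge two registry auth files (JSON with a top-level "auths" object) into one.
# Entries in the second file win when both define the same registry.
# Usage: merge_pull_secrets FIRST.json SECOND.json > merged.json
merge_pull_secrets() {
    jq -s '{ auths: ((.[0].auths // {}) + (.[1].auths // {})) }' "$1" "$2"
}

# Succeed only if the file has a credentials entry for the given registry.
# Usage: has_registry_auth FILE REGISTRY
has_registry_auth() {
    jq -e --arg reg "$2" '.auths[$reg].auth != null' "$1" >/dev/null
}

# Usage (ngc_pull.json is a hypothetical file holding the nvcr.io entry):
#   merge_pull_secrets openshift_pull.json ngc_pull.json > merged_pull.json
#   has_registry_auth merged_pull.json nvcr.io && echo "nvcr.io credentials present"
```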

Recovery Procedures

Clean Recovery

Problem: Need to start completely fresh

# Complete cleanup (WARNING: Destroys everything)
make clean-all
make delete-cluster
make delete-vms

# Remove any leftover resources
sudo virsh net-destroy dpf-net 2>/dev/null || true
sudo virsh net-undefine dpf-net 2>/dev/null || true

# Start fresh
cp .env.example .env
# Edit .env with your settings
make all

Partial Recovery

Problem: Need to recover specific component

# Recreate VMs only
make delete-vms
make create-vms

# Redeploy DPF only
oc delete namespace dpf-operator-system
make deploy-dpf

# Re-provision workers only
make delete-workers  # If target exists
make add-worker-nodes
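
The "If target exists" note can be checked before running the command. The sketch below is a heuristic only: `has_make_target` is a made-up helper that looks for an explicit `target:` rule in a single Makefile, so targets defined in included files, pattern rules, or multi-target lines are not detected.

```shell
# Heuristic: succeed if MAKEFILE defines TARGET as an explicit rule.
# Lines such as "NAME := value" are not treated as rules.
# Usage: has_make_target TARGET [MAKEFILE]
has_make_target() {
    grep -Eq "^$1[[:space:]]*:([^=]|\$)" "${2:-Makefile}"
}

# Usage:
#   if has_make_target delete-workers; then make delete-workers; fi
#   make add-worker-nodes
```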

Getting Help

Log Collection

# Collect automation logs
make collect-logs > dpf-deployment.log 2>&1

# Collect cluster logs
oc adm must-gather --image=quay.io/openshift/origin-must-gather

# Collect DPF-specific logs
oc logs -n dpf-operator-system deployment/dpf-operator-controller-manager > dpf-operator.log

Debug Mode

# Enable debug output for automation
export DEBUG=true
make deploy-dpf

# Verbose OpenShift commands
oc get pods -v=6
oc describe node worker-01

Common Error Patterns

| Error Message | Likely Cause | Quick Fix |
|---|---|---|
| "connection refused" | Service not ready | Wait 5+ minutes |
| "pull secret" | Registry auth issue | Check pull secrets |
| "no such host" | DNS/network issue | Check network config |
| "insufficient resources" | Resource limits | Increase RAM/CPU |
| "timeout" | Process taking too long | Wait or check logs |

Next Steps

If you can't resolve the issue:

1. Check logs: Collect relevant logs using the commands above
2. Search documentation: Check other guides for specific topics
3. File issue: Report the problem with logs and configuration
4. Community: Ask for help in project discussions

For complex issues, include your .env configuration (remove sensitive data) and relevant log outputs.
