Troubleshooting

This guide covers common issues when using the DPF HCP Provisioner Operator and how to diagnose them.

Viewing Operator Logs

kubectl logs -n dpf-hcp-provisioner-system deployment/dpf-hcp-provisioner-operator -f

For more verbose output, redeploy with logLevel: debug in Helm values.

Inspecting CR Status

# Quick status overview
kubectl get dpfhcp -n <namespace>

# Detailed conditions
kubectl describe dpfhcp <name> -n <namespace>

# JSON output for scripting
kubectl get dpfhcp <name> -n <namespace> -o jsonpath='{.status.conditions}' | jq .

Common Issues

DPUCluster Not Found

Condition: DPUClusterMissing = True

Symptoms: CR stays in Pending phase.

Cause: The dpuClusterRef points to a DPUCluster that doesn't exist or is in a different namespace than specified.

Resolution:

Verify the DPUCluster exists: kubectl get dpucluster -n <namespace>
Ensure the name and namespace in dpuClusterRef match exactly
Check RBAC -- the operator needs get permissions on DPUCluster resources in the target namespace

DPUCluster Type Invalid

Condition: ClusterTypeValid = False

Symptoms: CR stays in Pending phase.

Cause: The referenced DPUCluster is not of a supported type for HyperShift provisioning.

Resolution: Ensure the DPUCluster is configured with a type compatible with HyperShift-based provisioning.

DPUCluster Already In Use

Condition: DPUClusterInUse = True

Symptoms: CR stays in Pending phase.

Cause: Another DPFHCPProvisioner is already using the same DPUCluster. Each DPUCluster can only be referenced by one DPFHCPProvisioner (1:1 mapping).

Resolution: Use a different DPUCluster or delete the existing DPFHCPProvisioner that references it.

Secrets Validation Failure

Condition: SecretsValid = False

Symptoms: CR stays in Pending phase.

Cause: SSH key or pull secret is missing, or has incorrect format.

Resolution:

Verify the SSH key secret exists and contains the id_rsa.pub key:

kubectl get secret <ssh-secret-name> -n <provisioner-namespace> -o jsonpath='{.data}' | jq 'keys'

Verify the pull secret exists and contains .dockerconfigjson:

kubectl get secret <pull-secret-name> -n <provisioner-namespace> -o jsonpath='{.data}' | jq 'keys'

Ensure both secrets are in the same namespace as the DPFHCPProvisioner CR

BlueField OCP Layer Image Lookup Failure

Condition: BlueFieldOCPLayerImageFound = False

Symptoms: CR stays in Provisioning phase.

Cause: The operator cannot find a BlueField container image tag matching the OCP version in the configured registry.

Resolution:

Check the DPFHCPProvisionerConfig for the correct registry:
```
kubectl get dpfhcpconfig default -o yaml
```
Verify the OCP version extracted from ocpReleaseImage has a matching tag in the BlueField OCP layer image repository
If the BlueField OCP layer image is known, set machineOSURL in the DPFHCPProvisioner spec to skip the lookup

HostedCluster Stuck in Provisioning

Conditions: HostedClusterAvailable = False, HostedClusterProgressing = True

Symptoms: CR stays in Provisioning phase for an extended period.

Resolution:

Check the HostedCluster directly:

kubectl get hostedcluster -A
kubectl describe hostedcluster <name> -n <namespace>

Look for issues with:
- The OCP release image (invalid or inaccessible)
- etcd storage (storage class not available)
- Node scheduling (control plane nodes not available or have resource pressure)
- Network connectivity (DNS resolution, API accessibility)
Check HyperShift operator logs for more details

MetalLB Configuration Issues

Condition: MetalLBConfigured = False

Symptoms: Services are not accessible via LoadBalancer.

Resolution:

Verify MetalLB operator is installed and running
Check that the virtualIP in the spec is a valid, routable IP on the management cluster network

Inspect MetalLB resources:

kubectl get ipaddresspool -A
kubectl get l2advertisement -A

CSR Approval Not Working

Condition: CSRAutoApprovalActive = False

Symptoms: DPU worker nodes cannot join the hosted cluster.

Cause: The operator cannot connect to the hosted cluster to watch for CSRs.

Resolution:

Ensure the HostedCluster is in Ready state
Check the hosted cluster is reachable from the management cluster

Look for CSRApproval controller errors in operator logs:

kubectl logs -n dpf-hcp-provisioner-system deployment/dpf-hcp-provisioner-operator | grep -i csr

Ignition Generation Failure

Condition: IgnitionConfigured = False

Symptoms: CR stays in IgnitionGenerating phase.

Resolution:

Verify the dpuDeploymentRef points to a valid DPUDeployment
Check that the DPUDeployment has the required DPU flavor information
If using a custom machineOSURL, verify the URL is accessible
Check operator logs for ignition generation errors

Deleting a DPFHCPProvisioner

Deletion triggers finalizer cleanup that removes the HostedCluster, MetalLB resources, and injected kubeconfig in order. If the CR is stuck in Deleting phase:

Check conditions for cleanup status:

kubectl get dpfhcp <name> -n <namespace> -o jsonpath='{.status.conditions}' | jq '.[] | select(.type=="HostedClusterCleanup")'

Check if the HostedCluster is being deleted:
```
kubectl get hostedcluster -A
```
If the HostedCluster itself is stuck deleting, investigate HyperShift operator logs

Events

The operator emits Kubernetes events on the DPFHCPProvisioner CR for key lifecycle actions. View them with:

kubectl get events -n <namespace> --field-selector involvedObject.name=<dpfhcp-name>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Troubleshooting

Viewing Operator Logs

Inspecting CR Status

Common Issues

DPUCluster Not Found

DPUCluster Type Invalid

DPUCluster Already In Use

Secrets Validation Failure

BlueField OCP Layer Image Lookup Failure

HostedCluster Stuck in Provisioning

MetalLB Configuration Issues

CSR Approval Not Working

Ignition Generation Failure

Deleting a DPFHCPProvisioner

Events

FilesExpand file tree

troubleshooting.md

Latest commit

History

troubleshooting.md

File metadata and controls

Troubleshooting

Viewing Operator Logs

Inspecting CR Status

Common Issues

DPUCluster Not Found

DPUCluster Type Invalid

DPUCluster Already In Use

Secrets Validation Failure

BlueField OCP Layer Image Lookup Failure

HostedCluster Stuck in Provisioning

MetalLB Configuration Issues

CSR Approval Not Working

Ignition Generation Failure

Deleting a DPFHCPProvisioner

Events