Package Troubleshooting

This guide helps you diagnose and resolve issues with Big Bang packages. Problems range from deployment failures and configuration errors to network connectivity failures and policy violations.

Overview

Big Bang packages are deployed using Flux and can encounter various types of issues:

  • Deployment Issues: Pods failing to start, image pull errors, resource constraints
  • Configuration Problems: Invalid Helm values, schema validation failures
  • Network Connectivity: Service mesh issues, network policies, DNS resolution
  • Policy Violations: Kyverno admission controller blocks, security policy denials
  • Resource Issues: Insufficient resources, scaling problems, persistent volume issues

Quick Diagnostics

1. Check Package Status

Start by examining the overall package health:

# Check Flux HelmRelease status
kubectl get helmreleases -A

# Check specific package status
kubectl get helmrelease <package-name> -n bigbang -o yaml

# Check pod status for the package
kubectl get pods -n <package-namespace>
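
On a cluster with many packages, the unhealthy releases are easy to miss in the full listing. The following is a minimal sketch of a filter for them. The helper name `not_ready_releases` is hypothetical, and the sketch assumes the table printed by `kubectl get helmreleases -A` has the columns NAMESPACE, NAME, AGE, READY, STATUS in that order.

```shell
# Print "namespace/name" for every HelmRelease whose READY column is not "True".
# Reads the table produced by `kubectl get helmreleases -A` on stdin.
# Assumed column layout: NAMESPACE NAME AGE READY STATUS
not_ready_releases() {
  awk 'NR > 1 && $4 != "True" { print $1 "/" $2 }'
}

# Usage (requires cluster access):
#   kubectl get helmreleases -A | not_ready_releases
```

Releases that are still reconciling report a READY value other than "True", so they appear in the output as well.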

2. Review Events

Events provide immediate insight into recent issues:

# Get events for a specific namespace
kubectl get events -n <package-namespace> --sort-by='.lastTimestamp'

# Get events for a specific pod
kubectl describe pod <pod-name> -n <package-namespace>

# Get cluster-wide events
kubectl get events --all-namespaces --sort-by='.lastTimestamp'
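
When the event list is long, a count of Warning events per reason shows where to look first. This is a sketch under assumptions: `warning_reasons` is a hypothetical helper, and it expects the default table layout of `kubectl get events --all-namespaces`, where each data row starts with namespace, last-seen time, type, and reason.

```shell
# Count Warning events per reason, most frequent first.
# Reads `kubectl get events --all-namespaces` output on stdin.
# Assumed data-row layout: NAMESPACE LAST-SEEN TYPE REASON OBJECT MESSAGE...
warning_reasons() {
  awk 'NR > 1 && $3 == "Warning" { count[$4]++ }
       END { for (r in count) print count[r], r }' | sort -rn
}

# Usage (requires cluster access):
#   kubectl get events --all-namespaces | warning_reasons
```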

Flux Troubleshooting

1. Check Flux Controllers

Verify Flux components are healthy:

# Check Flux system pods
kubectl get pods -n flux-system

# Check Flux controller logs
kubectl logs -n flux-system deployment/helm-controller
kubectl logs -n flux-system deployment/source-controller
kubectl logs -n flux-system deployment/kustomize-controller

2. HelmRelease Debugging

Examine HelmRelease status and conditions:

# Get detailed HelmRelease status
kubectl describe helmrelease <package-name> -n bigbang

# Check for reconciliation errors
kubectl get helmrelease <package-name> -n bigbang -o jsonpath='{.status.conditions[*].message}'

# Force reconciliation
flux reconcile helmrelease <package-name> -n bigbang

3. Common Flux Issues

Schema Validation Errors:

# Check for schema validation issues in HelmRelease status
kubectl get helmrelease <package-name> -n bigbang -o yaml | grep -A 10 "conditions:"

# Common schema errors indicate:
# - Invalid Helm values
# - Missing required fields
# - Type mismatches in configuration

Source Errors:

# Check GitRepository or HelmRepository status in every namespace (package sources are not always in flux-system)
kubectl get gitrepository -A
kubectl get helmrepository -A

# Check source controller logs for repository access issues
kubectl logs -n flux-system deployment/source-controller

Helm Installation Failures:

# Check Helm release status directly
helm list -A
helm status <release-name> -n <namespace>

# Get Helm release history
helm history <release-name> -n <namespace>

Kyverno Policy Troubleshooting

1. Check Policy Violations

Identify admission policy blocks:

# Check Kyverno admission controller logs
kubectl logs -n kyverno deployment/kyverno-admission-controller

# Get policy violation events
kubectl get events --all-namespaces | grep -i "blocked\|denied\|failed"

# Check specific policy status
kubectl get cpol  # ClusterPolicy
kubectl get pol -A  # Policy
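
The keyword grep above also matches unrelated events. A narrower filter can key on the event reason instead. The sketch below assumes that Kyverno records violations as events with the reason `PolicyViolation` and that the input is the default table from `kubectl get events --all-namespaces`; the helper name `policy_violations` is hypothetical.

```shell
# Keep the header plus events whose REASON column is "PolicyViolation".
# Reads `kubectl get events --all-namespaces` output on stdin.
# Assumption: Kyverno records violations as events with reason "PolicyViolation".
policy_violations() {
  awk 'NR == 1 || $4 == "PolicyViolation"'
}

# Usage (requires cluster access):
#   kubectl get events --all-namespaces | policy_violations
```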

2. Policy Reports

Review policy evaluation results:

# Get cluster policy reports
kubectl get cpolr  # ClusterPolicyReport

# Get namespace policy reports
kubectl get polr -A  # PolicyReport

# Detailed policy report for a specific resource
kubectl describe cpolr <report-name>

3. Kyverno Reporter Setup

Follow the Overview of Kyverno Reporter to set up detailed reporting and alerting for policy violations.

4. Common Policy Issues

Resource Mutation Conflicts:

  • Check if multiple policies modify the same resource
  • Review policy precedence and order
  • Examine mutating vs validating policies

Review Kyverno Exceptions for guidance on handling necessary exceptions.

Network Connectivity Issues

For network-related package problems, refer to the networking troubleshooting guide, which covers:

  • Service Mesh Issues: Istio configuration, mTLS problems, traffic routing
  • Network Policies: Connectivity blocks, policy misconfigurations
  • DNS Resolution: Service discovery failures, external DNS issues
  • Ingress Problems: Load balancer issues, certificate problems
  • Service Entries: External service access, HTTPS/TLS configuration

Quick Network Checks

# Test DNS resolution of a service from inside a pod (the image must include nslookup)
kubectl exec -it <pod-name> -n <namespace> -- nslookup <service-name>

# Check service endpoints
kubectl get endpoints <service-name> -n <namespace>

# Verify Istio sidecar injection (injected pods list an istio-proxy container)
kubectl get pods -n <namespace> -o jsonpath='{.items[*].spec.containers[*].name}'
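
The jsonpath query above prints all container names on one line, which makes it hard to tell which pod lacks a sidecar. One possible refinement, sketched here, prints one pod per line and filters for pods without an `istio-proxy` container. The helper name `pods_missing_sidecar` is hypothetical, and the sketch assumes the sidecar is listed under `spec.containers`.

```shell
# Print pods that have no container named "istio-proxy".
# Input: one pod per line, "<pod-name><TAB><space-separated container names>".
# Assumption: the sidecar is listed under spec.containers; meshes that use
# native sidecars list it under spec.initContainers instead.
pods_missing_sidecar() {
  awk -F '\t' 'NF && $2 !~ /(^| )istio-proxy( |$)/ { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get pods -n <namespace> \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].name}{"\n"}{end}' \
#     | pods_missing_sidecar
```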

Resource and Scaling Issues

1. Resource Constraints

Check for resource-related problems:

# Check node resources
kubectl top nodes
kubectl describe nodes

# Check pod resource usage
kubectl top pods -A
kubectl describe pod <pod-name> -n <namespace>

# Check resource quotas
kubectl get resourcequota -A
kubectl describe resourcequota <quota-name> -n <namespace>
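
To spot saturated nodes quickly, the `kubectl top nodes` table can be filtered by utilization. The following sketch uses a hypothetical helper, `busy_nodes`, with an arbitrary default threshold of 80 percent, and assumes the columns NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%.

```shell
# Print nodes whose CPU% or MEMORY% is at or above a threshold (default 80).
# Reads `kubectl top nodes` output on stdin.
# Assumed column layout: NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
busy_nodes() {
  awk -v limit="${1:-80}" 'NR > 1 {
    cpu = $3; mem = $5
    sub(/%/, "", cpu); sub(/%/, "", mem)
    if (cpu + 0 >= limit + 0 || mem + 0 >= limit + 0) print $1
  }'
}

# Usage (requires cluster access and metrics-server):
#   kubectl top nodes | busy_nodes 90
```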

2. Persistent Volume Issues

Debug storage problems:

# Check PVC status
kubectl get pvc -A
kubectl describe pvc <pvc-name> -n <namespace>

# Check storage classes
kubectl get storageclass

# Check persistent volumes
kubectl get pv
kubectl describe pv <pv-name>
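
Claims stuck outside the Bound state are usually the ones worth describing. A small sketch of a filter for them follows; `unbound_pvcs` is a hypothetical helper, and it assumes `kubectl get pvc -A` prints NAMESPACE, NAME, and STATUS as its first three columns.

```shell
# Print "namespace/name status" for every PVC that is not Bound.
# Reads `kubectl get pvc -A` output on stdin.
# Assumed leading columns: NAMESPACE NAME STATUS
unbound_pvcs() {
  awk 'NR > 1 && $3 != "Bound" { print $1 "/" $2, $3 }'
}

# Usage (requires cluster access):
#   kubectl get pvc -A | unbound_pvcs
```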

3. Scaling Problems

Address autoscaling issues:

# Check HPA status
kubectl get hpa -A
kubectl describe hpa <hpa-name> -n <namespace>

# Check VPA recommendations
kubectl get vpa -A
kubectl describe vpa <vpa-name> -n <namespace>

# Check deployment replica status
kubectl get deployment -n <namespace>
kubectl describe deployment <deployment-name> -n <namespace>
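
Deployments that have not reached their desired replica count can be picked out of the listing before describing each one. This sketch assumes the READY column is formatted as "ready/desired" and appears second; the helper name `underscaled_deployments` is hypothetical.

```shell
# Print deployments whose ready replica count is below the desired count.
# Reads `kubectl get deployment -n <namespace>` output on stdin.
# Assumed column layout: NAME READY UP-TO-DATE AVAILABLE AGE,
# with READY formatted as "ready/desired".
underscaled_deployments() {
  awk 'NR > 1 { split($2, r, "/"); if (r[1] + 0 < r[2] + 0) print $1, $2 }'
}

# Usage (requires cluster access):
#   kubectl get deployment -n <namespace> | underscaled_deployments
```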

Observability and Monitoring

1. Check Monitoring Stack

Use Big Bang's observability tools:

  • Grafana Dashboards: Review package-specific dashboards
  • Prometheus Metrics: Query application and infrastructure metrics
  • Tempo Tracing: Analyze request flows and performance
  • AlertManager: Check for active alerts

2. Application Logs

Examine application logs for errors:

# Get pod logs
kubectl logs <pod-name> -n <namespace>

# Get logs from all containers in a pod
kubectl logs <pod-name> -n <namespace> --all-containers

# Follow logs in real-time
kubectl logs -f <pod-name> -n <namespace>

# Get previous container logs (for crashed pods)
kubectl logs <pod-name> -n <namespace> --previous
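
The `--previous` flag is only useful for pods that have restarted. As a rough sketch, those pods can be selected from the pod listing first. The helper name `restarted_pods` is hypothetical, and the sketch assumes the columns NAME, READY, STATUS, RESTARTS, AGE.

```shell
# Print pods that have restarted at least once.
# Reads `kubectl get pods -n <namespace>` output on stdin.
# Assumed column layout: NAME READY STATUS RESTARTS AGE
restarted_pods() {
  awk 'NR > 1 && $4 + 0 > 0 { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get pods -n <namespace> | restarted_pods | while read -r pod; do
#     kubectl logs "$pod" -n <namespace> --previous --all-containers
#   done
```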

3. Custom Metrics

Enable application-specific monitoring as described in the monitoring guide:

# Add Prometheus scraping annotations
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"

Configuration and Immutability Issues

1. Configuration Drift

Check for drift between the cluster and the source, then reconcile with the Flux CLI:

# Inspect Flux resources and their conditions
flux get kustomizations -A
flux get helmreleases -A

# Inspect a specific resource's full reconciliation status
kubectl get kustomization <name> -n <namespace> -o yaml
kubectl get helmrelease <name> -n <namespace> -o yaml

# Use flux diff to compare cluster state with a local checkout of the source (detects drift)
flux diff kustomization <name> -n <namespace> --path <path-to-local-kustomization>

# flux diff covers Kustomizations; for a HelmRelease, compare the stored release manifest with the live objects
helm get manifest <release-name> -n <namespace> | kubectl diff -n <namespace> -f -

# Remediate detected drift by forcing reconciliation from source
flux reconcile kustomization <name> -n <namespace> --with-source
flux reconcile helmrelease <name> -n <namespace> --with-source

# Reconcile source if Git/Helm repository changes need to be refreshed
flux reconcile source git <repo-name> -n <namespace>

Interpretation and guidance:

  • If flux diff shows differences, those are drifted resources (cluster != Git/source).
  • Reconcile to reapply Git-desired state; if the drift is intentional, update the Git source instead of reconciling.
  • Use consistent Kustomization/HelmRelease intervals and automation to reduce manual drift.
  • Review Flux resource status (conditions and lastApplied/lastAttempted revisions) to determine why reconciliation failed and whether source updates are required.
  • Consider adding alerting around failed reconciliations or large diffs to catch drift early.

2. Immutable Field Updates

Handle immutable field errors:

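# Common immutable fields that cause issues:
# - Label selectors (spec.selector) in Deployments and StatefulSets
# - Service ClusterIP
# - PVC spec fields other than the storage request, which can only grow and only if the storage class allows expansion

# Solution: Delete the resource so it can be recreated (expect brief downtime)
kubectl delete deployment <deployment-name> -n <namespace>
# Flux recreates it on the next successful reconciliation; trigger one instead of waiting for the interval
# (if the release is not re-applied, Flux 2.2 or later accepts --force on this command)
flux reconcile helmrelease <package-name> -n bigbang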

3. Helm Value Validation

Validate Helm values before deployment:

# Dry-run Helm install
helm install <release-name> <chart> --dry-run --debug --values values.yaml

# Template and validate manifests
helm template <release-name> <chart> --values values.yaml | kubectl apply --dry-run=client -f -

Advanced Debugging

1. Debug Containers

Use debug containers for deeper investigation:

# Create debug container
kubectl debug <pod-name> -n <namespace> -it --image=busybox

# Debug with specific tools
kubectl debug <pod-name> -n <namespace> -it --image=nicolaka/netshoot

2. Package-Specific Issues

Image Pull Problems:

# Check image pull secrets
kubectl get secrets -n <namespace> | grep docker

# Verify registry access
kubectl describe pod <pod-name> -n <namespace>

Init Container Failures:

# Check init container logs
kubectl logs <pod-name> -n <namespace> -c <init-container-name>

# Check init container status
kubectl describe pod <pod-name> -n <namespace>

3. Rollback Procedures

When issues persist, consider rollback:

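# Rollback Helm release (suspend the HelmRelease first, or Flux may re-apply the failing revision)
flux suspend helmrelease <package-name> -n bigbang
helm rollback <release-name> <revision> -n <namespace>
# Once the source is fixed, hand control back to Flux
flux resume helmrelease <package-name> -n bigbang

# Rollback via Flux (revert Git commit)
git revert <commit-hash>
git push origin main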

Escalation and Support

1. Gather Debug Information

Before escalating, collect:

# Create debug bundle
kubectl cluster-info dump --output-directory=./debug-info

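# Export relevant logs (kubectl logs needs a pod name or a label selector; --tail=-1 lifts the 10-line default that applies with a selector)
kubectl logs -n <namespace> -l <label-selector> --all-containers --prefix=true --tail=-1 > package-logs.txt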

# Export events
kubectl get events --all-namespaces --sort-by='.lastTimestamp' > events.txt

2. Community Resources

  • Check Big Bang documentation and troubleshooting guides
  • Search Big Bang GitLab issues for similar problems
  • Engage with the Big Bang community for complex issues
  • Review package-specific documentation and upstream issues

3. Preventive Measures

  • Implement comprehensive monitoring and alerting
  • Use staging environments for testing changes
  • Regularly review and update package configurations
  • Maintain backup and restore procedures
  • Document custom configurations and known issues

Remember to always test fixes in a non-production environment first and maintain detailed logs of troubleshooting steps for future reference.