diff --git a/Introduction-to-kubernetes.md b/Introduction-to-kubernetes.md new file mode 100644 index 0000000..69f6b09 --- /dev/null +++ b/Introduction-to-kubernetes.md @@ -0,0 +1,51 @@
# Introduction to Kubernetes

## What is Kubernetes?
Kubernetes is an open-source container orchestration platform. It automates the deployment, scaling, and management of containerized applications across a cluster of machines.

## Kubernetes and Microservices: A Practical Example
Let's say you're running a retail store app with three main microservices:

- **Product Catalog Service:** A container listing all products and inventory
- **Shopping Cart Service:** A container managing user shopping carts
- **Payment Service:** A container processing orders and payments

Imagine a customer journey: a customer browses products (Product Catalog), adds items to the cart (Shopping Cart), and checks out (Payment).
How do these containers communicate with each other?
If the Payment Service crashes, how does it heal automatically, and why does the customer never lose their cart while the Shopping Cart Service keeps running?
This seamless orchestration is exactly what K8s provides: resilience, automation, and scale.

So instead of manually deciding which server each container runs on, monitoring them, and replacing failed ones, Kubernetes does it automatically based on rules you define.

---

## Why Kubernetes Matters in Modern DevOps
Kubernetes has become the industry standard for container orchestration across cloud platforms (AWS, Azure, GCP, on-prem). Here's why it matters for your DevOps career:

- It works on AWS EKS, Azure AKS, Google GKE, and on-premises — you learn once, work anywhere.
- It enables high availability and zero-downtime deployments for production systems.
- It is a core skill for DevOps engineers, SREs, Platform Engineers, and Cloud Architects.
- It is essential for building modern CI/CD pipelines with ArgoCD, Helm, and other GitOps tools.

In modern DevOps, Kubernetes isn't optional; it's the foundation. Whether you're optimizing costs, building reliable systems, or preparing for senior roles, Kubernetes proficiency is non-negotiable.

It defines the operational reality of cloud development.
K8s provides the common language and API for describing how infrastructure should look, enabling automated, repeatable, and scalable operations.
It is the engine of declarative infrastructure.
With this skill you will understand the core challenges, and the solutions, behind resiliency, deployment automation, and massive scalability.

---

## How mastering K8s can boost your career (roles in SRE, DevOps, Cloud, Platform Engineering)
In the rapidly evolving landscape of 2026, mastering Kubernetes is arguably one of the highest-leverage moves you can make for your tech career.

Why? Because it sits at the core of how modern organizations build, scale, and operate software across DevOps, SRE, Cloud, and Platform Engineering teams.

As organizations standardize on K8s for microservices and ML/agentic apps, engineers who can design and operate clusters remain in sustained demand into 2026 and beyond.
K8s provides a path to senior roles in Cloud Architecture, SRE, and Platform Engineering.

As a DevOps engineer, mastering K8s helps you optimize CI/CD pipelines and implement advanced deployment strategies using tools like ArgoCD and FluxCD. These tools maintain the desired state by constantly monitoring the repo.
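For a concrete flavor of what that looks like, here is a minimal, hypothetical Deployment manifest (the `product-catalog` name and image are placeholders tied to the retail example above). You declare three replicas, and Kubernetes, or a GitOps tool watching the Git repo that holds this file, keeps reconciling the cluster toward that state:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: product-catalog        # hypothetical service from the retail example
spec:
  replicas: 3                  # desired state: three identical Pods
  selector:
    matchLabels:
      app: product-catalog
  template:
    metadata:
      labels:
        app: product-catalog
    spec:
      containers:
      - name: product-catalog
        image: nginx:1.27      # stand-in image for illustration only
        ports:
        - containerPort: 80
```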
You define the infrastructure declaratively, achieving extreme velocity in getting code to production. + +In this course you’ll work on real-world Kubernetes activities such as cluster lifecycle management, security/RBAC, and networking. +--- diff --git a/Modules/module1/01-cluster-lifecycle.md b/Modules/module1/01-cluster-lifecycle.md new file mode 100644 index 0000000..2c78160 --- /dev/null +++ b/Modules/module1/01-cluster-lifecycle.md @@ -0,0 +1,301 @@ +# Cluster Architecture, Installation & Configuration (25%) + +This module covers 25% of the CKA exam and focuses on understanding, installing, and configuring Kubernetes clusters. + +## Learnings +By completing this module, you will be able to: + +- Understand Kubernetes cluster architecture and components +- Install and configure clusters using kind and kubeadm +- Set up highly available (HA) cluster configurations +- Implement Pod Security standards and troubleshoot admission errors +- Configure RBAC (Role-Based Access Control) for secure access +- Work with Custom Resource Definitions (CRDs) and Operators +- Deploy applications using Helm and Kustomize + +## Cluster Architecture + +### Control Plane Components +The control plane is the brain of the Kubernetes cluster. It maintains the desired state of the cluster and responds to changes. + +### Api-Server +Purpose: The API server is the front-end for the Kubernetes control plane and the central management entity. + +**Responsibilities:** + +Exposes the Kubernetes API (REST interface) +Validates and processes API requests +Serves as the only component that directly communicates with etcd +Handles authentication, authorization, and admission control +Provides the interface for kubectl and other clients +Key Characteristics: + +Horizontally scalable (can run multiple instances) +Stateless (all state stored in etcd) +Listens on port 6443 (default) + +Example: API Request +``` +kubectl create deployment nginx --image=nginx +# 1. kubectl sends HTTP POST to kube-apiserver +# 2. API server authenticates and authorizes the request +# 3. Admission controllers validate the request +# 4. API server writes to etcd +# 5. API server returns response to kubectl +``` + +### etcd +Purpose: Distributed, reliable key-value store that serves as Kubernetes' backing store for all cluster data. + +**Responsibilities:** + +Stores all cluster state and configuration +Maintains consistency across the cluster +Provides watch functionality for detecting changes +Ensures data persistence and reliability + +### kube-scheduler +Purpose: Watches for newly created Pods with no assigned node and selects a node for them to run on. + +Responsibilities: + +Monitors API server for unscheduled Pods +Evaluates node suitability based on multiple factors +Assigns Pods to appropriate nodes +Respects constraints and requirements + +### kube-controller-manager +Purpose: Runs controller processes that regulate the state of the cluster. + +**Responsibilities:** + +Watches cluster state through API server +Makes changes to move current state toward desired state +Runs multiple controllers as separate processes (compiled into single binary + +## Worker Node Components +Worker nodes run the actual application workloads. Each worker node contains the components necessary to run Pods and be managed by the control plane. + +### kubelet +Purpose: Primary node agent that runs on each worker node and ensures containers are running in Pods. 
**Responsibilities:**

Registers node with API server
Watches API server for Pods assigned to its node
Ensures containers described in PodSpecs are running and healthy
Reports node and Pod status back to API server
Executes liveness and readiness probes
Mounts volumes as specified in Pod specs

### kube-proxy
Purpose: Network proxy that runs on each node and maintains network rules for Pod communication.

**Responsibilities:**

Implements the Kubernetes Service abstraction
Maintains network rules on nodes
Performs connection forwarding
Enables Service discovery and load balancing

Proxy Modes:
1. iptables mode (default)
2. IPVS mode
3. nftables mode (available in recent releases)

(The legacy userspace mode has been removed in recent Kubernetes versions.)

How kube-proxy Works:
```
Client Pod → Service IP → kube-proxy rules → Backend Pod
```

### Container Runtime
Purpose: Software responsible for running containers on the node.

**Responsibilities:**

Pulls container images from registries
Unpacks and runs containers
Manages container lifecycle
Provides container isolation

Understanding how components interact is crucial for troubleshooting.

### Pod Creation Flow
```
1. User runs: kubectl create -f pod.yaml
   ↓
2. kubectl → API Server (HTTPS)
   ↓
3. API Server validates and writes to etcd
   ↓
4. Scheduler watches API Server, sees unscheduled Pod
   ↓
5. Scheduler selects node, updates Pod binding in API Server
   ↓
6. API Server writes binding to etcd
   ↓
7. kubelet on selected node watches API Server, sees new Pod
   ↓
8. kubelet tells container runtime to pull image and start container
   ↓
9. kubelet reports Pod status back to API Server
   ↓
10. API Server updates Pod status in etcd
```

### CNI (Container Network Interface) Plugin
- Provides Pod networking
- Examples: Calico, Flannel, Weave, Cilium
- Must be installed for Pod-to-Pod communication

---

## Installation for Kubernetes v1.34

**kubectl v1.34**
```
curl -LO "https://dl.k8s.io/release/v1.34.0/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/
```

**kind (use a release with v1.34 node images, e.g. v0.30.0 or newer; check the kind release notes for the node image tags matching your kind version)**
```
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.30.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind
```

**Helm 3 (compatible with v1.34)**
```
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
```

**Verify installations**
```
kubectl version --client
kind version
helm version
```

**Create a v1.34 cluster**
```
kind create cluster --name k8s-v134 --image kindest/node:v1.34.0
```

### Check Cluster Components

```bash
# Check nodes
kubectl get nodes -o wide

# Check system pods
kubectl get pods -A

# Check cluster info
kubectl cluster-info
```

## Advanced features in Kubernetes v1.34
- **Dynamic Resource Allocation** - structured allocation of devices such as GPUs, TPUs, and NICs
- **Delayed Job Pod Replacement** - this policy only creates replacement Pods when the original Pod is completely terminated
- **Security Tokens** - the kubelet can use short-lived, audience-bound ServiceAccount tokens that are automatically rotated
- **Pod-Level Resources** - enable containers to share CPU and memory from a common pod-level allocation
- **Job Success Policy** - allows Jobs to succeed when a subset of Pods complete successfully

---

## Cluster Lifecycle Management (kubeadm)

**Initialize a Single Control Plane**
```
sudo kubeadm init \
  --pod-network-cidr=10.244.0.0/16 \
  --apiserver-advertise-address=<CONTROL_PLANE_IP>
```
**Set up kubectl**
```
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
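# admin.conf is written by `kubeadm init` with root-only permissions;
# the chown below makes your copied kubeconfig readable by your regular (non-root) user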
sudo chown $(id -u):$(id -g) $HOME/.kube/config
```

Install a CNI plugin (e.g. Calico, Flannel, Cilium) so Pods can communicate.

#### Weave Net
```
kubectl apply -f https://github.com/weaveworks/weave/releases/download/v2.8.1/weave-daemonset-k8s.yaml
```

## Join Worker Nodes
```
sudo kubeadm join LOAD_BALANCER_DNS:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash>
```
(When joining an additional control plane node in an HA setup, the join command also carries `--control-plane` and `--certificate-key <key>`.)

---

## HA Configuration

**Overview**

A High Availability (HA) Kubernetes cluster eliminates single points of failure by running multiple control plane nodes. This ensures the cluster remains operational even if one or more control plane nodes fail.

### Components

```
                    ┌─────────────────┐
                    │  Load Balancer  │
                    │   (HAProxy/     │
                    │     nginx)      │
                    └────────┬────────┘
                             │
          ┌──────────────────┼──────────────────┐
          │                  │                  │
    ┌─────▼─────┐      ┌─────▼─────┐      ┌─────▼─────┐
    │  Control  │      │  Control  │      │  Control  │
    │  Plane 1  │      │  Plane 2  │      │  Plane 3  │
    │           │      │           │      │           │
    │  + etcd   │◄────►│  + etcd   │◄────►│  + etcd   │
    └───────────┘      └───────────┘      └───────────┘
          │                  │                  │
          └──────────────────┼──────────────────┘
                             │
          ┌──────────────────┼──────────────────┐
          │                  │                  │
    ┌─────▼─────┐      ┌─────▼─────┐      ┌─────▼─────┐
    │  Worker   │      │  Worker   │      │  Worker   │
    │  Node 1   │      │  Node 2   │      │  Node 3   │
    └───────────┘      └───────────┘      └───────────┘
```

### Key Concepts

**Stacked etcd Topology** (Recommended for most cases):
- etcd runs on the same nodes as the control plane components
- Simpler to set up and manage
- Requires fewer nodes (minimum 3)
- If a control plane node fails, both the control plane and the etcd member on it are lost

**External etcd Topology**:
- etcd runs on separate dedicated nodes
- More resilient (control plane and etcd failures are independent)
- Requires more nodes (3 for etcd + 2+ for control plane)
- More complex to set up and manage

### Infrastructure Requirements

**Minimum for HA:**
- 3 control plane nodes (odd number recommended: 3, 5, 7)
- 3+ worker nodes
- 1 load balancer (can be external or software-based)

**Per Control Plane Node:**
- 2 CPUs (4 recommended)
- 4GB RAM (8GB recommended)
- 50GB disk space
- Network connectivity between all nodes

**Load Balancer:**
- Can be hardware (F5, Citrix) or software (HAProxy, nginx)
- Must support TCP load balancing
- Health checks for the API server

diff --git a/Modules/module1/02-pod-security.md b/Modules/module1/02-pod-security.md new file mode 100644 index 0000000..dc66407 --- /dev/null +++ b/Modules/module1/02-pod-security.md @@ -0,0 +1,283 @@
# Pod Security Standards and Admission Control

## Overview

Kubernetes uses Pod Security Admission (PSA) to enforce the built‑in Pod Security Standards (PSS) at the namespace level.
PSA evaluates Pod create/update requests at admission time and can enforce, warn, or audit policy violations before Pods ever run.

## Pod Security Standards (PSS) levels
The three built‑in profiles:

1. Privileged (Unrestricted)
Purpose: No restrictions - allows known privilege escalations

- Allows privileged Pods, host networking, hostPath volumes, running as root, and all other capabilities.

2. Baseline

Purpose: Prevents known privilege escalations while minimizing restrictions

- Minimally restrictive; prevents known privilege escalations while allowing most default Pod specs.
- Disallows privileged containers, some host namespaces, and unsafe capabilities.

3.
Restricted +Purpose: Follows pod hardening best practices + +- Most secure, based on current Pod hardening best practices. +- Security-critical applications +- Enforces non‑root, seccomp, and limited capabilities and restricts host access patterns.​ + +## Pod Security Admission + +### Admission Modes + +Pod Security Admission operates in three modes per namespace: + +#### 1. enforce +- **Behavior**: Rejects pods that violate the policy +- **Use**: Production namespaces +- **Effect**: Pod creation fails + +#### 2. audit +- **Behavior**: Allows pods but logs violations +- **Use**: Monitoring and gradual rollout +- **Effect**: Pod created, event logged + +#### 3. warn +- **Behavior**: Allows pods but shows warning to user +- **Use**: Development and testing +- **Effect**: Pod created, warning displayed + +### Namespace Labels + +Configure Pod Security using namespace labels: + +```yaml +apiVersion: v1 +kind: Namespace +metadata: + name: my-namespace + labels: + # Enforce restricted standard + pod-security.kubernetes.io/enforce: restricted + pod-security.kubernetes.io/enforce-version: v1.34 + + # Audit baseline standard + pod-security.kubernetes.io/audit: baseline + pod-security.kubernetes.io/audit-version: v1.34 + + # Warn on privileged violations + pod-security.kubernetes.io/warn: baseline + pod-security.kubernetes.io/warn-version: v1.34 +``` + +Namespaces are labeled to select profile + mode, for example: +``` +kubectl label namespace team-a \ + pod-security.kubernetes.io/enforce=restricted \ + pod-security.kubernetes.io/audit=baseline \ + pod-security.kubernetes.io/warn=baseline +``` + +## Implementing Pod Security + +### Step 1: Check Current Configuration + +```bash +# Check if Pod Security Admission is enabled +kubectl api-resources | grep podsecurity + +# Check existing namespace labels +kubectl get namespaces --show-labels + +# Check specific namespace +kubectl get namespace default -o yaml +``` + +## Admission Controllers + +### What are Admission Controllers? + +Admission controllers are plugins that intercept requests to the Kubernetes API server before object persistence. They can: +- **Validate**: Check if request meets requirements +- **Mutate**: Modify the request +- **Reject**: Deny the request + +### Common Admission Controllers + +#### 1. PodSecurity (Validating) +- Enforces Pod Security Standards + +#### 2. NamespaceLifecycle (Validating) +- Prevents creation of objects in terminating namespaces +- Ensures system namespaces cannot be deleted + +#### 3. LimitRanger (Validating) +- Enforces resource limits on pods and containers +- Applies default limits if not specified + +#### 4. ResourceQuota (Validating) +- Enforces resource quotas per namespace +- Prevents resource exhaustion + +#### 5. ServiceAccount (Mutating) +- Automatically adds ServiceAccount to pods +- Mounts ServiceAccount token + +#### 6. DefaultStorageClass (Mutating) +- Adds default StorageClass to PVCs +- Only if no StorageClass specified + +#### 7. MutatingAdmissionWebhook (Mutating) +- Calls external webhooks to mutate objects +- Used by service meshes, policy engines + +#### 8. 
ValidatingAdmissionWebhook (Validating) +- Calls external webhooks to validate objects +- Used for custom policies + +### Checking Enabled Admission Controllers +``` +# Check enabled admission controllers +kubectl exec -n kube-system kube-apiserver- -- kube-apiserver -h | grep enable-admission-plugins +``` +--- + +## Troubleshooting Admission Errors + +### Common Error Patterns + +#### Example 1 – HostPath blocked by restricted policy + +Error: +``` +Error from server (Forbidden): error when creating "pod.yaml": +pods "nginx" is forbidden: violates PodSecurity "restricted:latest": +hostPath volumes are not allowed +``` +Cause: restricted profile disallows hostPath because it can expose the node filesystem. + +Problem Pod: +```yaml +spec: + volumes: + - name: host-logs + hostPath: + path: /var/log + containers: + - name: nginx + image: nginx + volumeMounts: + - name: host-logs + mountPath: /logs +``` + +Solution: Pod (using PVC instead of hostPath): +```yaml +spec: + volumes: + - name: app-logs + persistentVolumeClaim: + claimName: app-logs-pvc + containers: + - name: nginx + image: nginx + volumeMounts: + - name: app-logs + mountPath: /logs +``` + +**Debugging Steps:** + +```bash +# Check namespace Pod Security labels +kubectl get namespace -o yaml | grep pod-security + +# Try with warn mode first +kubectl label namespace \ + pod-security.kubernetes.io/enforce=privileged \ + pod-security.kubernetes.io/warn=restricted \ + --overwrite + +# Create pod and see warnings +kubectl apply -f pod.yaml +``` + +#### Example 2 : ResourceQuota Exceeded + +**Error Message:** +``` +Error from server (Forbidden): pods "my-pod" is forbidden: +exceeded quota: compute-quota, requested: requests.cpu=2, used: requests.cpu=8, limited: requests.cpu=10 +``` + +**Solution:** +```bash +# Check quota +kubectl get resourcequota -n +kubectl describe resourcequota compute-quota -n + +# Reduce pod resources or increase quota +kubectl edit resourcequota compute-quota -n +``` +--- + +#### Example 3: LimitRange Violation + +**Error Message:** +``` +Error from server (Forbidden): pods "my-pod" is forbidden: +maximum cpu usage per Container is 2, but limit is 4 +``` + +**Solution:** +```bash +# Check LimitRange +kubectl get limitrange -n +kubectl describe limitrange -n + +# Adjust pod resources +spec: + containers: + - name: nginx + resources: + limits: + cpu: "2" + memory: "2Gi" +``` +### Debugging Workflow + +```bash +# 1. Check admission error details +kubectl apply -f pod.yaml --dry-run=server -o yaml + +# 2. Check namespace Pod Security labels +kubectl get namespace -o yaml + +# 3. Check events +kubectl get events -n --sort-by='.lastTimestamp' + +# 4. Check API server logs +kubectl logs -n kube-system kube-apiserver- + +# 5. Test with different security levels +kubectl label namespace \ + pod-security.kubernetes.io/enforce=privileged \ + --overwrite + +# 6. Gradually increase restrictions +kubectl label namespace \ + pod-security.kubernetes.io/enforce=baseline \ + --overwrite +``` +--- + +## Exam Tips + +1. **Know the Three Standards**: Privileged, Baseline, Restricted +2. **Understand Modes**: enforce, audit, warn +3. **Label Format**: `pod-security.kubernetes.io/: ` +4. **Common Fixes**: runAsNonRoot, drop capabilities, seccomp profile +5. **Debugging**: Use `--dry-run=server` to test +6. **Quick Fix**: Temporarily set to privileged, then fix and restore +7. 
**Check Events**: `kubectl get events` shows admission errors diff --git a/Modules/module1/03-rbac.md b/Modules/module1/03-rbac.md new file mode 100644 index 0000000..455d292 --- /dev/null +++ b/Modules/module1/03-rbac.md @@ -0,0 +1,364 @@ +# RBAC (Role‑Based Access Control) + +### Let's understand Role, RoleBinding, ClusterRole, ClusterRoleBinding, ServiceAccount creation/config/troubleshooting + +## Overview +RBAC (Role‑Based Access Control) is the authorization mechanism in K8s. That allows you to control who can perform what actions on which resources. + +**Core Components:** +- **Role**: Defines permissions within a namespace +- **ClusterRole**: Defines permissions cluster-wide or resuable set of permissions +- **RoleBinding**: Grants Role permissions to subjects in a namespace +- **ClusterRoleBinding**: Grants ClusterRole permissions to subjects cluster-wide +- **ServiceAccount**: Provides identity for processes running in Pods + +## ServiceAccounts + +### What are ServiceAccounts? + +ServiceAccounts provide an identity for processes running in Pods. Every namespace has a default ServiceAccount. + +### Examples + +### ServiceAccount creation and use + +#### Create namespace and ServiceAccount +``` +kubectl create namespace dev +kubectl create serviceaccount app-sa -n dev +``` + +#### Inspect +``` +kubectl get sa -n dev +kubectl describe sa app-sa -n dev +``` + +Use this in a Pod: + +``` +apiVersion: v1 +kind: Pod +metadata: + name: nginx + namespace: dev +spec: + serviceAccountName: app-sa + containers: + - name: nginx + image: nginx +``` + + +### Role and RoleBinding (namespace‑scoped) + +Motive: Allow app-sa to list/get/watch Pods only in dev namespace. + +**Roles** +``` +apiVersion: rbac.authorization.k8s.io/v1 +kind: Role +metadata: + name: pod-reader + namespace: dev +rules: +- apiGroups: [""] + resources: ["pods"] + verbs: ["get", "list", "watch"] +``` +**RoleBinding** +``` +apiVersion: rbac.authorization.k8s.io/v1 +kind: RoleBinding +metadata: + name: pod-reader-binding + namespace: dev +subjects: +- kind: ServiceAccount + name: app-sa + namespace: dev +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: Role + name: pod-reader +``` +``` +kubectl apply -f +kubectl apply -f +``` + +## ClusterRole and ClusterRoleBinding + +Motive: Allow app-sa to list/get/watch Pods in all namespace. 
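Once the ClusterRole and ClusterRoleBinding shown below are applied, you can verify the cluster-wide access with a quick check (assuming the `dev` namespace and `app-sa` ServiceAccount created earlier):

```bash
# app-sa should now be able to list Pods in any namespace, not just dev
kubectl auth can-i list pods \
  --as=system:serviceaccount:dev:app-sa \
  --namespace=kube-system
# Expected output: yes
```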
+ +**clusterrole** + +``` +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: pod-reader-cluster +rules: +- apiGroups: [""] + resources: ["pods"] + verbs: ["get", "list", "watch"] +``` + +**ClusterRoleBinding** + +``` +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRoleBinding +metadata: + name: pod-reader-cluster-binding +subjects: +- kind: ServiceAccount + name: app-sa + namespace: dev +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: ClusterRole + name: pod-reader-cluster +``` + +## Troubleshooting + +### Checking Permissions + +#### kubectl auth can-i + +```bash +# Check if current user can create pods +kubectl auth can-i create pods + +# Check if current user can delete deployments in namespace +kubectl auth can-i delete deployments --namespace=production + +# Check as another user +kubectl auth can-i get pods --as=jane + +# Check as ServiceAccount +kubectl auth can-i list secrets --as=system:serviceaccount:default:app-sa + +# List all permissions for current user +kubectl auth can-i --list + +# List all permissions in namespace +kubectl auth can-i --list --namespace=production +``` +### Examples + +#### Example 1: ServiceAccount Cannot Access Resources + +**Error:** +``` +Error from server (Forbidden): deployments.apps is forbidden: +User "system:serviceaccount:default:app-sa" cannot list resource "deployments" +``` + +**Debug:** +```bash +# Check ServiceAccount permissions +kubectl auth can-i list deployments \ + --as=system:serviceaccount:default:app-sa \ + --namespace=default + +# Check RoleBindings for ServiceAccount +kubectl get rolebindings -n default -o yaml | grep -A 10 "app-sa" + +# Describe ServiceAccount +kubectl describe serviceaccount app-sa -n default +``` +**Solution:** + +### Create Role and RoleBinding +```yaml +apiVersion: rbac.authorization.k8s.io/v1 +kind: Role +metadata: + name: deployment-reader + namespace: default +rules: +- apiGroups: ["apps"] + resources: ["deployments"] + verbs: ["get", "list", "watch"] +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: RoleBinding +metadata: + name: app-sa-deployment-reader + namespace: default +subjects: +- kind: ServiceAccount + name: app-sa + namespace: default +roleRef: + kind: Role + name: deployment-reader + apiGroup: rbac.authorization.k8s.io +``` +#### Example 2: Forbidden - User Cannot Perform Action + +**Error:** +``` +Error from server (Forbidden): pods is forbidden: +User "jane" cannot list resource "pods" in API group "" in the namespace "default" +``` + +**Debug:** +```bash +# Check if user has permission +kubectl auth can-i list pods --as=jane --namespace=default + +# Check RoleBindings for user +kubectl get rolebindings -n default -o yaml | grep -A 5 "name: jane" + +# Check ClusterRoleBindings for user +kubectl get clusterrolebindings -o yaml | grep -A 5 "name: jane" +``` + +**Solution:** +```bash +# Create appropriate RoleBinding +kubectl create rolebinding jane-pod-reader \ + --role=pod-reader \ + --user=jane \ + --namespace=default +``` + +#### Example 3: Multiple Resources: + +- To assign roles for multiple resources + +```yaml +apiVersion: rbac.authorization.k8s.io/v1 +kind: Role +metadata: + name: app-manager + namespace: production +rules: +- apiGroups: [""] + resources: ["pods", "services", "configmaps", "secrets"] + verbs: ["get", "list", "watch", "create", "update", "patch", "delete"] +- apiGroups: ["apps"] + resources: ["deployments", "replicasets"] + verbs: ["get", "list", "watch", "create", "update", "patch", "delete"] +``` + +#### Example 4: Missing 
RoleBinding! + +**Problem: No RoleBinding exists for the user** + +##### ❌ Incorrect Configuration + +```yaml +# Only Role exists, no binding +apiVersion: rbac.authorization.k8s.io/v1 +kind: Role +metadata: + name: pod-reader + namespace: default +rules: +- apiGroups: [""] + resources: ["pods"] + verbs: ["get", "list", "watch"] +``` +##### ✅ Correct Configuration + +```yaml +# Role definition +apiVersion: rbac.authorization.k8s.io/v1 +kind: Role +metadata: + name: pod-reader + namespace: default +rules: +- apiGroups: [""] + resources: ["pods"] + verbs: ["get", "list", "watch"] +--- +# RoleBinding to grant permissions to user +apiVersion: rbac.authorization.k8s.io/v1 +kind: RoleBinding +metadata: + name: jane-pod-reader + namespace: default +subjects: +- kind: User + name: jane + apiGroup: rbac.authorization.k8s.io +roleRef: + kind: Role + name: pod-reader + apiGroup: rbac.authorization.k8s.io +``` + +#### Example 5: Incorrect Configuration + +**Problem: Wrong API group specified (empty string instead of "apps")** + +```yaml +apiVersion: rbac.authorization.k8s.io/v1 +kind: Role +metadata: + name: deployment-manager + namespace: production +rules: +- apiGroups: [""] # ❌ WRONG! Deployments are not in core API group + resources: ["deployments"] + verbs: ["get", "list", "create", "update", "delete"] +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: RoleBinding +metadata: + name: bob-deployment-manager + namespace: production +subjects: +- kind: User + name: bob + apiGroup: rbac.authorization.k8s.io +roleRef: + kind: Role + name: deployment-manager + apiGroup: rbac.authorization.k8s.io +``` + +### ✅ Correct Configuration + +```yaml +apiVersion: rbac.authorization.k8s.io/v1 +kind: Role +metadata: + name: deployment-manager + namespace: production +rules: +- apiGroups: ["apps"] # ✅ CORRECT! Deployments are in "apps" API group + resources: ["deployments"] + verbs: ["get", "list", "create", "update", "delete"] +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: RoleBinding +metadata: + name: bob-deployment-manager + namespace: production +subjects: +- kind: User + name: bob + apiGroup: rbac.authorization.k8s.io +roleRef: + kind: Role + name: deployment-manager + apiGroup: rbac.authorization.k8s.io +``` + +### Verification +```bash +# Test permission +kubectl auth can-i create deployments --as=bob --namespace=production +# Output: yes + +# Try creating a deployment +kubectl create deployment nginx --image=nginx --as=bob --namespace=production +``` + +--- diff --git a/Modules/module1/04-crd's.md b/Modules/module1/04-crd's.md new file mode 100644 index 0000000..0d2eeee --- /dev/null +++ b/Modules/module1/04-crd's.md @@ -0,0 +1,411 @@ +# Understanding role of CRDs and operators + +## Overview + +Custom Resource Definitions (CRDs) extend the Kubernetes API by allowing you to define your own custom resources. Operators use +CRDs to manage complex applications by encoding operational knowledge into software. + +**What are CRDs?** +CRDs allow you to extend Kubernetes by defining new resource types without modifying the Kubernetes source code. + +**Why Use CRDs?** + +- Extend Kubernetes API with domain-specific resources (for example databases.example.com), with its own schema, versions, and scope +- Manage complex applications declaratively +- Leverage Kubernetes features (RBAC, kubectl, API server) +- Enable GitOps workflows +- Provide a contract that Operators/controllers can watch and reconcile, encoding operational runbooks into code. 
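As a minimal sketch (the `databases.example.com` group is the same illustrative example used later on this page), a CRD is declared like any other Kubernetes object:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  # name must be <plural>.<group>
  name: databases.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: databases
    singular: database
    kind: Database
    shortNames: ["db"]
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              engine:
                type: string
              replicas:
                type: integer
```

Once applied, `kubectl get databases` (or `kubectl get db`) works just like a built-in resource, and RBAC rules can target the new resource type.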
+ +### CRD Architecture + +``` +┌─────────────────────────────────────┐ +│ Kubernetes API │ +│ │ +│ Built-in Resources CRDs │ +│ ├── Pod ├── Database │ +│ ├── Service ├── Backup │ +│ └── Deployment └── App │ +└─────────────────────────────────────┘ + │ │ + ▼ ▼ + ┌─────────┐ ┌──────────┐ + │ etcd │ │ etcd │ + │(built-in)│ │ (custom) │ + └─────────┘ └──────────┘ +``` +### Understanding Operators + +**What is an Operator?** +Operator is a method of packaging, deploying, and managing a Kubernetes application. It extends Kubernetes by using custom resources and controllers to automate operational tasks. + +Operator = CRD + Controller + Operational Knowledge + +### Operator Pattern + +``` +┌─────────────────────────────────────────┐ +│ Kubernetes API │ +└────────────┬────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────┐ +│ Custom Resource (CR) │ +│ apiVersion: example.com/v1 │ +│ kind: Database │ +│ spec: │ +│ engine: postgres │ +│ replicas: 3 │ +└────────────┬────────────────────────────┘ + │ + │ watches + ▼ +┌─────────────────────────────────────────┐ +│ Operator (Controller) │ +│ │ +│ 1. Watch for changes │ +│ 2. Compare desired vs actual state │ +│ 3. Take action to reconcile │ +│ 4. Update status │ +└────────────┬────────────────────────────┘ + │ + │ creates/manages + ▼ +┌─────────────────────────────────────────┐ +│ Kubernetes Resources │ +│ - StatefulSet │ +│ - Service │ +│ - ConfigMap │ +│ - PersistentVolumeClaim │ +└─────────────────────────────────────────┘ +``` + +## Common Operators + +### 1. Prometheus Operator + +**Purpose**: Manages Prometheus monitoring instances + +**CRDs**: +- Prometheus +- ServiceMonitor +- AlertManager +- PrometheusRule + +**Example:** + +```yaml +apiVersion: monitoring.coreos.com/v1 +kind: Prometheus +metadata: + name: prometheus + namespace: monitoring +spec: + replicas: 2 + serviceAccountName: prometheus + serviceMonitorSelector: + matchLabels: + team: frontend + resources: + requests: + memory: 400Mi +``` +**Installation:** + +```bash +# Install Prometheus Operator +kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/bundle.yaml + +# Verify CRDs +kubectl get crds | grep monitoring.coreos.com +``` +### 2. Cert-Manager Operator + +**Purpose**: Manages TLS certificates + +**CRDs**: +- Certificate +- Issuer +- ClusterIssuer +- CertificateRequest + +**Example:** + +```yaml +apiVersion: cert-manager.io/v1 +kind: Certificate +metadata: + name: example-com + namespace: default +spec: + secretName: example-com-tls + issuerRef: + name: letsencrypt-prod + kind: ClusterIssuer + dnsNames: + - example.com + - www.example.com +``` +**Installation:** + +```bash +# Install cert-manager +kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml + +# Verify CRDs +kubectl get crds | grep cert-manager.io +``` +### 3. ArgoCD Operator + +**Purpose**: Manages GitOps continuous delivery + +**CRDs**: +- Application +- AppProject +- ApplicationSet + +**Example:** + +```yaml +apiVersion: argoproj.io/v1alpha1 +kind: Application +metadata: + name: guestbook + namespace: argocd +spec: + project: default + source: + repoURL: https://github.com/argoproj/argocd-example-apps.git + targetRevision: HEAD + path: guestbook + destination: + server: https://kubernetes.default.svc + namespace: guestbook + syncPolicy: + automated: + prune: true + selfHeal: true +``` +### 4. 
Istio Operator + +**Purpose**: Manages Istio service mesh + +**CRDs**: +- VirtualService +- DestinationRule +- Gateway +- ServiceEntry + +**Example:** + +```yaml +apiVersion: networking.istio.io/v1 +kind: VirtualService +metadata: + name: reviews +spec: + hosts: + - reviews + http: + - match: + - headers: + end-user: + exact: jason + route: + - destination: + host: reviews + subset: v2 + - route: + - destination: + host: reviews + subset: v1 +``` + +## Troubleshooting + +**Common CRD Issues** + +#### Issue 1: CRD Not Found + +**Error:** +``` +error: the server doesn't have a resource type "databases" +``` + +**Debug:** +```bash +# Check if CRD exists +kubectl get crds | grep database + +# Check CRD details +kubectl get crd databases.example.com +``` + +**Solution:** +```bash +# Apply the CRD +kubectl apply -f database-crd.yaml + +# Verify +kubectl get crds databases.example.com +``` +### Debugging Workflow + +```bash +# 1. Check CRD exists +kubectl get crds + +# 2. Check CRD details +kubectl describe crd + +# 3. Check custom resources +kubectl get --all-namespaces + +# 4. Check resource details +kubectl describe + +# 5. Check operator pod +kubectl get pods -n + +# 6. Check operator logs +kubectl logs -n -f + +# 7. Check events +kubectl get events --sort-by='.lastTimestamp' + +# 8. Check RBAC +kubectl auth can-i --list --as=system:serviceaccount:: +``` +--- + +## Few Examples for CRD's and how to resolve + +### Example 1: Type Mismatch + +### Scenario + +You are working in a cloud platform team responsible for managing internal PostgreSQL databases using a custom Kubernetes Operator. +Your organization has defined a custom resource called Database, which developers use to request new PostgreSQL instances for their applications. A junior DevOps engineer submits the following manifest to create a new database: +```bash +spec: + replicas: "three" +``` + +However, in your CRD definition, spec.replicas is strictly defined as an integer (since it represents the number of database replicas). When they apply the manifest Kubernetes immediately rejects the request why? + +-> Because the value "three" is a string, not a number. + + +**Error Message:** +``` +The Database "my-db" is invalid: +spec.replicas: Invalid value: "three": spec.replicas in body must be of type integer: "string" +``` + +**Cause:** Field value type doesn't match schema type + +**❌ Incorrect Custom Resource:** +```yaml +apiVersion: example.com/v1 +kind: Database +metadata: + name: my-db +spec: + engine: postgres + version: "14.5" + replicas: "three" # ❌ String instead of integer +``` + +**✅ Correct Custom Resource:** +```yaml +apiVersion: example.com/v1 +kind: Database +metadata: + name: my-db +spec: + engine: postgres + version: "14.5" + replicas: 3 # ✅ Integer type +``` + +### Example 2: Multiple Storage Versions + +**Scenario** + +You are building a Kubernetes Operator for managing databases. 
+Your team wants to support two API versions of the Database CRD: + +- v1beta1 → older version +- v1 → stable version + +**Error Message:** +``` +The CustomResourceDefinition "databases.example.com" is invalid: +spec.versions: Invalid value: ...: must have exactly one version marked as storage version +``` + +**Cause:** More than one version has `storage: true` + +**❌ Incorrect CRD:** +```yaml +apiVersion: apiextensions.k8s.io/v1 +kind: CustomResourceDefinition +metadata: + name: databases.example.com +spec: + group: example.com + versions: + - name: v1 + served: true + storage: true # ❌ Both marked as storage + schema: + openAPIV3Schema: + type: object + - name: v1beta1 + served: true + storage: true # ❌ Both marked as storage + schema: + openAPIV3Schema: + type: object + scope: Namespaced + names: + plural: databases + singular: database + kind: Database +``` + +**✅ Correct CRD:** +```yaml +apiVersion: apiextensions.k8s.io/v1 +kind: CustomResourceDefinition +metadata: + name: databases.example.com +spec: + group: example.com + scope: Namespaced + names: + plural: databases + singular: database + kind: Database + versions: + - name: v1 + served: true + storage: true # ✅ Only one storage version + schema: + openAPIV3Schema: + type: object + + - name: v1beta1 + served: true + storage: false # ❗ This must be false + deprecated: true + schema: + openAPIV3Schema: + type: object +``` + +### Why? +Kubernetes allows only one storage version because the version is used to store objects in etcd. +If two versions are marked as storage, Kubernetes does not know which format to use. + + diff --git a/Modules/module2/01-pod-networking.md b/Modules/module2/01-pod-networking.md new file mode 100644 index 0000000..9135aa9 --- /dev/null +++ b/Modules/module2/01-pod-networking.md @@ -0,0 +1,14 @@ +# Installing Cluster Components with Helm and Kustomize +``` +# Add community repo +helm repo add prometheus-community https://prometheus-community.github.io/helm-charts +helm repo update + +# Install kube-prometheus-stack (Prometheus + Alertmanager + Grafana) +helm install monitoring prometheus-community/kube-prometheus-stack \ + -n monitoring --create-namespace + +# Check components +kubectl get pods -n monitoring +helm status monitoring -n monitoring +``` diff --git a/Modules/module2/04-core-dns.md b/Modules/module2/04-core-dns.md new file mode 100644 index 0000000..2d953f5 --- /dev/null +++ b/Modules/module2/04-core-dns.md @@ -0,0 +1,346 @@ +# CoreDNS Configuration and Troubleshooting + +## Overview + +CoreDNS is the default DNS server in Kubernetes clusters (since v1.13). It provides service discovery by resolving service names to cluster IPs, enabling pods to communicate using DNS names instead of IP addresses. + +## Why CoreDNS Matters + +- **Service Discovery**: Pods find services by name (e.g., `backend-service`) +- **Cross-Namespace Communication**: Access services in other namespaces +- **External DNS Resolution**: Resolves external domain names +- **Custom DNS Configuration**: Add custom DNS entries and forwarding rules + +## How Kubernetes DNS Works + +### DNS Naming Convention + +``` +..svc.cluster.local +``` + +**Examples**: +```bash +# Same namespace +curl http://backend-service + +# Different namespace +curl http://backend-service.production + +# Fully qualified domain name (FQDN) +curl http://backend-service.production.svc.cluster.local + +# External domain +curl http://google.com +``` + +### DNS Resolution Flow + +1. Pod makes DNS query (e.g., `backend-service`) +2. 
Query sent to CoreDNS (usually at `10.96.0.10`) +3. CoreDNS checks: + - Is it a Kubernetes service? → Return ClusterIP + - Is it external? → Forward to upstream DNS +4. Response returned to pod + +--- + +## Common DNS Patterns + +### Pattern 1: Custom DNS Entries + +**Use case**: Add custom DNS records for external services + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: coredns + namespace: kube-system +data: + Corefile: | + .:53 { + errors + health + ready + kubernetes cluster.local in-addr.arpa ip6.arpa { + pods insecure + fallthrough in-addr.arpa ip6.arpa + } + # Custom hosts + hosts { + 192.168.1.100 custom-db.example.com + fallthrough + } + prometheus :9153 + forward . /etc/resolv.conf + cache 30 + loop + reload + loadbalance + } +``` +**Apply changes**: +```bash +kubectl edit configmap coredns -n kube-system +# CoreDNS will auto-reload +``` +--- + +### Pattern 2: Increase Cache TTL + +**Use case**: Reduce DNS query load for stable services + +```yaml +data: + Corefile: | + .:53 { + errors + health + ready + kubernetes cluster.local in-addr.arpa ip6.arpa { + pods insecure + fallthrough in-addr.arpa ip6.arpa + ttl 300 # Increase from 30 to 300 seconds + } + prometheus :9153 + forward . /etc/resolv.conf + cache 300 # Increase cache duration + loop + reload + loadbalance + } +``` + +--- + +## Troubleshooting Guide + +### Example 1: DNS Resolution Fails (nslookup fails) + +**ERRor**: +```bash +kubectl exec -it test-pod -- nslookup kubernetes.default +# Server: 10.96.0.10 +# Address 1: 10.96.0.10 +# nslookup: can't resolve 'kubernetes.default' +``` + +**Debug Steps**: +```bash +# 1. Check if CoreDNS pods are running +kubectl get pods -n kube-system -l k8s-app=kube-dns + +# 2. Check CoreDNS logs +kubectl logs -n kube-system -l k8s-app=kube-dns + +# 3. Check CoreDNS service +kubectl get svc -n kube-system kube-dns + +# 4. Verify DNS service IP +kubectl get svc -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}' + +# 5. Check pod's DNS configuration +kubectl exec -it test-pod -- cat /etc/resolv.conf + +# 6. Test DNS from node +nslookup kubernetes.default.svc.cluster.local 10.96.0.10 +``` + +**Solutions**: + +**A. CoreDNS pods not running**: +```bash +# Check pod status +kubectl get pods -n kube-system -l k8s-app=kube-dns + +# Describe pod for errors +kubectl describe pod -n kube-system + +# Restart CoreDNS +kubectl rollout restart deployment coredns -n kube-system +``` +**B. Wrong DNS service IP in pod**: +```bash +# Check kubelet DNS configuration +# On node: +cat /var/lib/kubelet/config.yaml | grep -A 2 clusterDNS + +# Should match kube-dns service IP +kubectl get svc -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}' +``` + +**C. CoreDNS configuration error**: +```bash +# Check for syntax errors in Corefile +kubectl get configmap coredns -n kube-system -o yaml + +# Validate by checking CoreDNS logs +kubectl logs -n kube-system -l k8s-app=kube-dns | grep -i error +``` + +--- + +### Example 2: External DNS Not Working + +**Error**: +```bash +kubectl exec -it test-pod -- nslookup google.com +# Server: 10.96.0.10 +# Address 1: 10.96.0.10 +# nslookup: can't resolve 'google.com' +``` + +**Debug Steps**: +```bash +# 1. Check CoreDNS forward configuration +kubectl get configmap coredns -n kube-system -o yaml | grep -A 3 forward + +# 2. Test DNS from CoreDNS pod +kubectl exec -it -n kube-system -- nslookup google.com + +# 3. Check node's DNS configuration +cat /etc/resolv.conf + +# 4. 
Check CoreDNS logs for forwarding errors +kubectl logs -n kube-system -l k8s-app=kube-dns | grep -i forward +``` + +**Solutions**: + +**A. Upstream DNS not reachable**: +```bash +# Test from node +nslookup google.com + +# If node DNS works, check CoreDNS forward config +kubectl edit configmap coredns -n kube-system + +# Change forward to use public DNS +forward . 8.8.8.8 1.1.1.1 +``` + +**B. Network policy blocking DNS**: +```bash +# Check network policies +kubectl get networkpolicy -A + +# Add DNS egress rule +kubectl apply -f - < + +# 2. Check previous logs +kubectl logs -n kube-system --previous + +# 3. Describe pod +kubectl describe pod -n kube-system + +# 4. Check events +kubectl get events -n kube-system --sort-by='.lastTimestamp' +``` + +**Common Causes and Solutions**: + +**A. Corefile syntax error** +```bash +# Check logs for syntax errors +kubectl logs -n kube-system | grep -i error + +# Fix Corefile +kubectl edit configmap coredns -n kube-system + +# Restart CoreDNS +kubectl rollout restart deployment coredns -n kube-system +``` +**B. Loop detection**: +```bash +# Logs show: "plugin/loop: Loop detected" +# This means CoreDNS is forwarding to itself + +# Check node's /etc/resolv.conf +cat /etc/resolv.conf + +# If it points to 127.0.0.x, update CoreDNS forward +kubectl edit configmap coredns -n kube-system + +# Change: +forward . /etc/resolv.conf +# To: +forward . 8.8.8.8 1.1.1.1 +``` + +**C. Resource limits**: +```bash +# Check resource usage +kubectl top pod -n kube-system -l k8s-app=kube-dns + +# Increase limits +kubectl edit deployment coredns -n kube-system + +# Update resources: +resources: + limits: + memory: 256Mi + requests: + cpu: 100m + memory: 128Mi +``` + +--- + +## Best Practices + +1. **Monitor CoreDNS**: Set up alerts for CoreDNS pod failures +2. **Scale appropriately**: Run at least 2 replicas for HA +3. **Tune cache**: Increase cache TTL for stable services +4. **Use FQDN**: Use fully qualified names for cross-namespace communication +5. **Test DNS**: Include DNS tests in application health checks +6. **Backup Corefile**: Keep a backup of custom CoreDNS configurations +7. **Resource limits**: Set appropriate CPU and memory limits +8. **Log monitoring**: Monitor CoreDNS logs for errors and warnings + +--- + + + + + + + + + + + + + + diff --git a/lifecycle-management.md b/lifecycle-management.md new file mode 100644 index 0000000..567e0dd --- /dev/null +++ b/lifecycle-management.md @@ -0,0 +1,608 @@ +# Cluster Lifecycle Management + +## Overview + +Cluster lifecycle management involves maintaining, upgrading, backing up, and restoring Kubernetes clusters. These operations are critical for production environments. + +## Table of Contents + +1. [Cluster Upgrades](#cluster-upgrades) +2. [etcd Backup and Restore](#etcd-backup-and-restore) +3. [Node Maintenance](#node-maintenance) +4. [Certificate Management](#certificate-management) +5. [Troubleshooting](#troubleshooting) + +## Cluster Upgrades + +### Upgrade Strategy + +Kubernetes follows a version skew policy: +- Control plane components can be at most one minor version apart +- kubelet can be up to two minor versions behind API server +- kubectl can be ±1 minor version from API server + +**Upgrade Order:** +1. Control plane nodes (one at a time if HA) +2. Worker nodes (can be done in batches) + +### Pre-Upgrade Checklist + +```bash +# 1. Check current versions +kubectl version +kubectl get nodes + +# 2. Review release notes +# https://kubernetes.io/releases/ + +# 3. 
Backup etcd +ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db + +# 4. Check cluster health +kubectl get nodes +kubectl get pods -A +kubectl get componentstatuses # Deprecated but useful + +# 5. Drain control plane node (if HA) +kubectl drain --ignore-daemonsets +``` + +### Upgrade Control Plane (Ubuntu/Debian) + +#### Step 1: Upgrade kubeadm + +```bash +# Find available versions +apt-cache madison kubeadm + +# Upgrade to specific version (e.g., v1.34.1) +sudo apt-mark unhold kubeadm +sudo apt-get update +sudo apt-get install -y kubeadm=1.34.1-00 +sudo apt-mark hold kubeadm + +# Verify version +kubeadm version +``` + +#### Step 2: Plan the Upgrade + +```bash +# Check what will be upgraded +sudo kubeadm upgrade plan + +# Output shows: +# - Current version +# - Target version +# - Component versions +# - Upgrade path +``` + +#### Step 3: Apply the Upgrade + +```bash +# For first control plane node +sudo kubeadm upgrade apply v1.34.1 + +# For additional control plane nodes (if HA) +sudo kubeadm upgrade node +``` + +**Expected Output:** +``` +[upgrade/successful] SUCCESS! Your cluster was upgraded to "v1.34.1". Enjoy! + +[upgrade/kubelet] Now that your control plane is upgraded, please proceed with upgrading your kubelets if you haven't already done so. +``` + +#### Step 4: Upgrade kubelet and kubectl + +```bash +# Upgrade kubelet and kubectl +sudo apt-mark unhold kubelet kubectl +sudo apt-get update +sudo apt-get install -y kubelet=1.34.1-00 kubectl=1.34.1-00 +sudo apt-mark hold kubelet kubectl + +# Restart kubelet +sudo systemctl daemon-reload +sudo systemctl restart kubelet + +# Verify +kubectl version +kubelet --version +``` + +#### Step 5: Uncordon the Node + +```bash +kubectl uncordon + +# Verify node is Ready +kubectl get nodes +``` + +### Upgrade Worker Nodes + +Repeat for each worker node: + +#### Step 1: Drain the Node + +```bash +# From control plane +kubectl drain --ignore-daemonsets --delete-emptydir-data + +# Verify pods are evicted +kubectl get pods -o wide | grep +``` + +#### Step 2: Upgrade kubeadm (on worker node) + +```bash +# SSH to worker node +ssh + +# Upgrade kubeadm +sudo apt-mark unhold kubeadm +sudo apt-get update +sudo apt-get install -y kubeadm=1.34.1-00 +sudo apt-mark hold kubeadm +``` + +#### Step 3: Upgrade Node Configuration + +```bash +# On worker node +sudo kubeadm upgrade node +``` + +#### Step 4: Upgrade kubelet and kubectl + +```bash +# On worker node +sudo apt-mark unhold kubelet kubectl +sudo apt-get update +sudo apt-get install -y kubelet=1.34.1-00 kubectl=1.34.1-00 +sudo apt-mark hold kubelet kubectl + +# Restart kubelet +sudo systemctl daemon-reload +sudo systemctl restart kubelet +``` + +#### Step 5: Uncordon the Node + +```bash +# From control plane +kubectl uncordon + +# Verify +kubectl get nodes +``` + +### Upgrade Verification + +```bash +# Check all nodes are upgraded +kubectl get nodes + +# Expected output: +NAME STATUS ROLES AGE VERSION +control-plane-1 Ready control-plane 30d v1.34.1 +worker-1 Ready 30d v1.34.1 +worker-2 Ready 30d v1.34.1 + +# Check system pods +kubectl get pods -n kube-system + +# Check cluster info +kubectl cluster-info +``` + +--- + +## etcd Backup and Restore + +### Understanding etcd + +etcd stores all cluster data: +- All Kubernetes objects (Pods, Services, Deployments, etc.) +- Cluster configuration +- Secrets and ConfigMaps +- Resource definitions + +**Critical**: Regular backups are essential for disaster recovery. 
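If you are curious what "all cluster data" looks like on disk, you can peek at the key space directly from a control plane node (an optional, read-only sketch; the certificate paths assume the kubeadm defaults used in the backup commands below):

```bash
# List a few of the keys the API server has written under /registry (read-only)
ETCDCTL_API=3 etcdctl get /registry --prefix --keys-only --limit=10 \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```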
+ +### etcd Backup + +#### Method 1: Using etcdctl (Recommended) + +```bash +# Install etcdctl if not present +ETCD_VER=v3.5.9 +wget https://github.com/etcd-io/etcd/releases/download/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz +tar xzf etcd-${ETCD_VER}-linux-amd64.tar.gz +sudo mv etcd-${ETCD_VER}-linux-amd64/etcdctl /usr/local/bin/ +rm -rf etcd-${ETCD_VER}-linux-amd64* + +# Create backup +ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \ + --endpoints=https://127.0.0.1:2379 \ + --cacert=/etc/kubernetes/pki/etcd/ca.crt \ + --cert=/etc/kubernetes/pki/etcd/server.crt \ + --key=/etc/kubernetes/pki/etcd/server.key + +# Verify backup +ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db --write-out=table +``` + +**Output:** +``` ++----------+----------+------------+------------+ +| HASH | REVISION | TOTAL KEYS | TOTAL SIZE | ++----------+----------+------------+------------+ +| 12345678 | 12345 | 1234 | 5.0 MB | ++----------+----------+------------+------------+ +``` + +#### Method 2: Automated Backup Script + +```bash +#!/bin/bash +# /usr/local/bin/backup-etcd.sh + +BACKUP_DIR="/backup/etcd" +TIMESTAMP=$(date +%Y%m%d-%H%M%S) +BACKUP_FILE="${BACKUP_DIR}/etcd-snapshot-${TIMESTAMP}.db" + +# Create backup directory +mkdir -p ${BACKUP_DIR} + +# Create snapshot +ETCDCTL_API=3 etcdctl snapshot save ${BACKUP_FILE} \ + --endpoints=https://127.0.0.1:2379 \ + --cacert=/etc/kubernetes/pki/etcd/ca.crt \ + --cert=/etc/kubernetes/pki/etcd/server.crt \ + --key=/etc/kubernetes/pki/etcd/server.key + +# Verify snapshot +ETCDCTL_API=3 etcdctl snapshot status ${BACKUP_FILE} + +# Keep only last 7 days of backups +find ${BACKUP_DIR} -name "etcd-snapshot-*.db" -mtime +7 -delete + +echo "Backup completed: ${BACKUP_FILE}" +``` + +**Schedule with cron:** +```bash +# Edit crontab +sudo crontab -e + +# Add daily backup at 2 AM +0 2 * * * /usr/local/bin/backup-etcd.sh >> /var/log/etcd-backup.log 2>&1 +``` + +### etcd Restore + +**⚠️ Warning**: Restoring etcd will overwrite all cluster data. Only do this in disaster recovery scenarios. 
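Before you begin, confirm which data directory the running etcd member is using; the restore steps below assume the default kubeadm static-pod layout (a quick, read-only check):

```bash
# Confirm the current etcd data directory before moving any manifests
sudo grep data-dir /etc/kubernetes/manifests/etcd.yaml
# Typical output on a kubeadm cluster: --data-dir=/var/lib/etcd
```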
+ +#### Step 1: Stop API Server and etcd + +```bash +# Move manifests to stop static pods +sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/ +sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/ + +# Wait for pods to stop +docker ps | grep -E 'kube-apiserver|etcd' +``` + +#### Step 2: Restore from Snapshot + +```bash +# Restore snapshot to new directory +ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \ + --data-dir=/var/lib/etcd-restore \ + --name= \ + --initial-cluster==https://:2380 \ + --initial-advertise-peer-urls=https://:2380 + +# Example: +ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \ + --data-dir=/var/lib/etcd-restore \ + --name=control-plane-1 \ + --initial-cluster=control-plane-1=https://10.0.0.10:2380 \ + --initial-advertise-peer-urls=https://10.0.0.10:2380 +``` + +#### Step 3: Update etcd Configuration + +```bash +# Edit etcd manifest +sudo vi /tmp/etcd.yaml + +# Update data directory path: +# Change: --data-dir=/var/lib/etcd +# To: --data-dir=/var/lib/etcd-restore + +# Or use sed +sudo sed -i 's|/var/lib/etcd|/var/lib/etcd-restore|g' /tmp/etcd.yaml +``` + +#### Step 4: Restart etcd and API Server + +```bash +# Move manifests back +sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/ +sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/ + +# Wait for pods to start +watch kubectl get pods -n kube-system +``` + +#### Step 5: Verify Restore + +```bash +# Check cluster status +kubectl get nodes +kubectl get pods -A + +# Verify your data is restored +kubectl get deployments -A +kubectl get services -A +``` + +--- + +## Node Maintenance + +### Draining Nodes + +Safely evict pods before maintenance: + +```bash +# Drain node (evict all pods) +kubectl drain --ignore-daemonsets --delete-emptydir-data + +# Options: +# --ignore-daemonsets: Ignore DaemonSet-managed pods +# --delete-emptydir-data: Delete pods using emptyDir volumes +# --force: Force deletion of pods not managed by controllers +# --grace-period=: Grace period for pod termination +# --timeout=: Timeout for drain operation + +# Check pods are evicted +kubectl get pods -o wide | grep +``` + +### Cordoning Nodes + +Mark node as unschedulable without evicting pods: + +```bash +# Cordon node (mark unschedulable) +kubectl cordon + +# Verify +kubectl get nodes +# Node will show SchedulingDisabled + +# Uncordon when ready +kubectl uncordon +``` + +### Node Removal + +```bash +# 1. Drain the node +kubectl drain --ignore-daemonsets --delete-emptydir-data + +# 2. Delete the node from cluster +kubectl delete node + +# 3. On the node itself, reset kubeadm +sudo kubeadm reset + +# 4. 
Clean up +sudo rm -rf /etc/cni/net.d +sudo rm -rf $HOME/.kube/config +``` + +### Adding Nodes Back + +```bash +# Generate new join command on control plane +kubeadm token create --print-join-command + +# On the node, run the join command +sudo kubeadm join :6443 --token \ + --discovery-token-ca-cert-hash sha256: + +# Verify from control plane +kubectl get nodes +``` + +--- + +## Certificate Management + +### Check Certificate Expiration + +```bash +# Check all certificates +sudo kubeadm certs check-expiration + +# Output shows expiration dates for: +# - admin.conf +# - apiserver +# - apiserver-etcd-client +# - apiserver-kubelet-client +# - controller-manager.conf +# - etcd-healthcheck-client +# - etcd-peer +# - etcd-server +# - front-proxy-client +# - scheduler.conf +``` + +### Renew Certificates + +```bash +# Renew all certificates +sudo kubeadm certs renew all + +# Renew specific certificate +sudo kubeadm certs renew apiserver + +# Restart control plane components +sudo systemctl restart kubelet + +# For static pods, move and restore manifests +sudo mv /etc/kubernetes/manifests/*.yaml /tmp/ +sleep 10 +sudo mv /tmp/*.yaml /etc/kubernetes/manifests/ +``` + +### Manual Certificate Renewal + +```bash +# Backup current certificates +sudo cp -r /etc/kubernetes/pki /etc/kubernetes/pki.backup + +# Renew certificates +sudo kubeadm certs renew all + +# Update kubeconfig +sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config +sudo chown $(id -u):$(id -g) $HOME/.kube/config + +# Verify +kubectl get nodes +``` + +--- + +## Troubleshooting + +### Upgrade Issues + +**Issue**: kubeadm upgrade fails + +```bash +# Check kubeadm version +kubeadm version + +# Check for version skew +kubectl version + +# Review upgrade plan +sudo kubeadm upgrade plan + +# Check logs +sudo journalctl -u kubelet -f +``` + +**Issue**: Pods not starting after upgrade + +```bash +# Check pod status +kubectl get pods -A +kubectl describe pod -n + +# Check node status +kubectl get nodes +kubectl describe node + +# Check kubelet +sudo systemctl status kubelet +sudo journalctl -u kubelet -f +``` + +### Backup/Restore Issues + +**Issue**: etcdctl command not found + +```bash +# Install etcdctl +ETCD_VER=v3.5.9 +wget https://github.com/etcd-io/etcd/releases/download/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz +tar xzf etcd-${ETCD_VER}-linux-amd64.tar.gz +sudo mv etcd-${ETCD_VER}-linux-amd64/etcdctl /usr/local/bin/ +``` + +**Issue**: Backup fails with certificate errors + +```bash +# Verify certificate paths +ls -la /etc/kubernetes/pki/etcd/ + +# Use correct paths in etcdctl command +ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \ + --endpoints=https://127.0.0.1:2379 \ + --cacert=/etc/kubernetes/pki/etcd/ca.crt \ + --cert=/etc/kubernetes/pki/etcd/server.crt \ + --key=/etc/kubernetes/pki/etcd/server.key +``` + +**Issue**: Restore fails + +```bash +# Check snapshot integrity +ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db + +# Ensure cluster is stopped +sudo mv /etc/kubernetes/manifests/*.yaml /tmp/ + +# Verify no etcd process running +ps aux | grep etcd + +# Try restore again with correct parameters +``` + +## Best Practices + +1. **Regular Backups**: + - Automate daily etcd backups + - Store backups off-cluster + - Test restore procedures regularly + +2. **Upgrade Strategy**: + - Test upgrades in non-production first + - Upgrade one minor version at a time + - Keep detailed upgrade logs + +3. 
**Maintenance Windows**: + - Schedule maintenance during low-traffic periods + - Communicate with stakeholders + - Have rollback plan ready + +4. **Monitoring**: + - Monitor cluster health before/after operations + - Set up alerts for certificate expiration + - Track upgrade progress + +5. **Documentation**: + - Document cluster configuration + - Keep runbooks for common operations + - Record all changes + +## Exam Tips + +1. **Know the Commands**: Memorize upgrade and backup commands +2. **Practice Speed**: Upgrades take time - practice for efficiency +3. **Certificate Locations**: Know where certificates are stored +4. **Backup Verification**: Always verify backups after creation +5. **Drain vs Cordon**: Understand the difference +6. **Version Skew**: Understand Kubernetes version compatibility + +## References + +- [Upgrading kubeadm clusters](https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/) +- [Operating etcd clusters](https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/) +- [Certificate Management](https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/) +- [Safely Drain a Node](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/) + +--- + +[← Back to Installation](README.md) | [Next: HA Configuration →](04-ha-installation.md)