🐳 Kubernetes Pause Containers

Reference: Kubernetes Official Documentation - The Pause Container


📋 Table of Contents

  1. What is a Pause Container?
  2. Why Pause Containers Exist
  3. How Pause Containers Work
  4. Network Namespace Management
  5. Lifecycle and Behavior
  6. Inspecting Pause Containers
  7. Troubleshooting

🤔 What is a Pause Container?

The pause container (also called the infrastructure container or sandbox container) is a special hidden container that the container runtime creates for every Pod at the kubelet's request. It holds the Linux namespaces that all other containers in the Pod share.

Key Characteristics

  • Minimal Footprint: Extremely small (~500KB) container image
  • Does Nothing: Runs an almost empty binary that simply sleeps forever
  • First to Start: Created before any application containers in the Pod
  • Last to Stop: Remains alive until the entire Pod is deleted
  • Not Visible: Doesn't appear in kubectl get pods output
  • One Per Pod: Every Pod has exactly one pause container
┌──────────────────────────────────────────────────────┐
│                      Pod: web-app                     │
│                                                       │
│  ┌─────────────────────────────────────────────────┐ │
│  │        Pause Container (Hidden)                 │ │
│  │        registry.k8s.io/pause:3.9                │ │
│  │                                                 │ │
│  │  Creates and owns:                              │ │
│  │  • Network namespace                            │ │
│  │  • IPC namespace                                │ │
│  │  • PID namespace (if shareProcessNamespace)     │ │
│  │  • Pod IP address                               │ │
│  └─────────────────────────────────────────────────┘ │
│              ▲                    ▲                  │
│              │                    │                  │
│     ┌────────┴────────┐  ┌────────┴────────┐        │
│     │   Container 1   │  │   Container 2   │        │
│     │   (nginx)       │  │   (redis)       │        │
│     │                 │  │                 │        │
│     │   Joins pause   │  │   Joins pause   │        │
│     │   container's   │  │   container's   │        │
│     │   namespaces    │  │   namespaces    │        │
│     └─────────────────┘  └─────────────────┘        │
│                                                      │
└──────────────────────────────────────────────────────┘

💡 Why Pause Containers Exist

Problem: Container Lifecycle Independence

Without pause containers, if the first container in a Pod created the network namespace and then crashed/restarted, all other containers would lose their network connectivity.

❌ Without Pause Container (Bad):
┌──────────────────────────────────────────┐
│ Pod                                       │
│  ┌────────────────┐                      │
│  │ Container A    │ ← Creates network    │
│  │ (crashes)      │   namespace          │
│  └────────────────┘                      │
│         ↓                                 │
│  Network namespace destroyed!            │
│         ↓                                 │
│  ┌────────────────┐                      │
│  │ Container B    │ ← Lost network!      │
│  │ (still running)│                      │
│  └────────────────┘                      │
└──────────────────────────────────────────┘

Solution: Stable Namespace Anchor

The pause container acts as a stable anchor that owns the namespaces. Application containers can come and go without affecting the Pod's network.

✅ With Pause Container (Good):
┌──────────────────────────────────────────┐
│ Pod                                       │
│  ┌────────────────┐                      │
│  │ Pause Container│ ← Owns namespaces    │
│  │ (never dies)   │   Always stable      │
│  └────────────────┘                      │
│         ↓                                 │
│  ┌────────────────┐  ┌────────────────┐  │
│  │ Container A    │  │ Container B    │  │
│  │ (can restart)  │  │ (unaffected)   │  │
│  └────────────────┘  └────────────────┘  │
│                                          │
│  Network stays intact!                   │
└──────────────────────────────────────────┘

Benefits

| Benefit | Description |
|---|---|
| Namespace Stability | Network, IPC, and PID namespaces persist across container restarts |
| Simplified Container Management | Containers can crash/restart without affecting each other |
| Consistent Networking | Pod IP address remains stable throughout Pod lifetime |
| Resource Reaping | Can act as PID 1 to reap zombie processes (when PID namespace sharing is enabled) |
| Clean Abstraction | Separates Pod infrastructure from application logic |

🔧 How Pause Containers Work

Creation Process

  1. kubelet receives Pod specification from API server
  2. Container runtime (containerd/CRI-O) creates the pause container first
  3. The Pod's Linux namespaces are set up, and the pause process holds them open
  4. Container runtime invokes the CNI plugin to set up networking in the Pod's network namespace
  5. Pod IP address is assigned inside that network namespace
  6. Application containers join the pause container's namespaces
Sequence Diagram:
┌─────────┐     ┌──────────┐     ┌─────────┐     ┌──────────┐
│ kubelet │     │ Runtime  │     │  Pause  │     │   CNI    │
└────┬────┘     └────┬─────┘     └────┬────┘     └────┬─────┘
     │               │                │               │
     │ Create Pod    │                │               │
     ├──────────────►│                │               │
     │               │ Start pause    │               │
     │               ├───────────────►│               │
     │               │ Call CNI for the Pod netns     │
     │               ├────────────────┼──────────────►│
     │               │                │ Assign IP     │
     │               │                │◄──────────────┤
     │               │ Start app      │               │
     │               │ containers     │               │
     │               │ (join netns)   │               │
     │               ├───────────────►│               │
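
The sandbox-first ordering can be reproduced by hand with crictl, which exposes the CRI calls one at a time. The following is a minimal sketch, not how the kubelet is implemented: `start_pod_by_hand` is a hypothetical helper name, and the two JSON files are assumed to follow the pod and container config format from the crictl documentation.

```shell
# Hypothetical helper: drive the CRI steps manually, in the same order as above.
# $1 = Pod sandbox config (JSON), $2 = container config (JSON)
# Run from a root shell, because crictl talks to the runtime socket.
start_pod_by_hand() {
  pod_cfg=$1
  ctr_cfg=$2
  # 1. RunPodSandbox: the runtime creates the sandbox (pause container) and runs CNI.
  pod_id=$(crictl runp "$pod_cfg") || return 1
  # 2. CreateContainer: the app container is created inside that sandbox.
  ctr_id=$(crictl create "$pod_id" "$ctr_cfg" "$pod_cfg") || return 1
  # 3. StartContainer: the app process starts in the sandbox's namespaces.
  crictl start "$ctr_id" >/dev/null || return 1
  echo "sandbox=$pod_id container=$ctr_id"
}
```

Note that step 2 cannot run without the sandbox ID from step 1, which is the same dependency the kubelet has on the pause container.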

Container Runtime Implementation

The pause image is configured in the container runtime, not in Kubernetes core:

containerd (most common):

# /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "registry.k8s.io/pause:3.9"

CRI-O:

# /etc/crio/crio.conf
[crio.image]
pause_image = "registry.k8s.io/pause:3.9"

The Pause Binary

The pause container runs an extremely simple program that does almost nothing:

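// Simplified version of the pause binary
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

// Exit cleanly on SIGINT/SIGTERM
static void sigdown(int signo) {
  psignal(signo, "Shutting down, got signal");
  exit(0);
}

// Reap zombies when pause is PID 1 of a shared PID namespace
static void sigreap(int signo) {
  (void)signo;
  while (waitpid(-1, NULL, WNOHANG) > 0) {}
}

int main(void) {
  signal(SIGINT, sigdown);
  signal(SIGTERM, sigdown);
  signal(SIGCHLD, sigreap);

  // Sleep until a signal arrives, forever
  for (;;) pause();

  return 0;
}

Purpose: Simply exists and holds the namespaces, responding only to termination signals and reaping orphaned child processes.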


🌐 Network Namespace Management

How Pause Containers Maintain Pod Networking

The pause container is the owner of the Pod's network namespace, which contains:

┌─────────────────────────────────────────────────────────┐
│        Pause Container's Network Namespace              │
│                                                         │
│  ┌──────────────────────────────────────────────────┐  │
│  │  Network Interfaces:                              │  │
│  │  • lo (loopback): 127.0.0.1                       │  │
│  │  • eth0: 10.8.0.5 (Pod IP assigned by CNI)       │  │
│  └──────────────────────────────────────────────────┘  │
│                                                         │
│  ┌──────────────────────────────────────────────────┐  │
│  │  Network Configuration:                           │  │
│  │  • Routing table                                  │  │
│  │  • iptables rules                                 │  │
│  │  • Network sockets                                │  │
│  └──────────────────────────────────────────────────┘  │
│                                                         │
│  ┌──────────────────────────────────────────────────┐  │
│  │  Port Bindings:                                   │  │
│  │  • All containers share these ports               │  │
│  │  • Container 1: listens on :8080                  │  │
│  │  • Container 2: listens on :6379                  │  │
│  │  • No port conflicts allowed                      │  │
│  └──────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘
         ▲                              ▲
         │                              │
    ┌────┴─────┐                   ┌────┴─────┐
    │Container1│                   │Container2│
    │ (nginx)  │                   │ (redis)  │
    │          │                   │          │
    │ Joins via│                   │ Joins via│
    │ --net=   │                   │ --net=   │
    │ container│                   │ container│
    │ :pause   │                   │ :pause   │
    └──────────┘                   └──────────┘

Network Namespace Sharing

When application containers start, they join the pause container's network namespace:

# Conceptually similar to (-d keeps each container running in the background):
docker run -d --name pause registry.k8s.io/pause:3.9
docker run -d --name nginx --net=container:pause nginx
docker run -d --name redis --net=container:pause redis

This means:

  • ✅ All containers see the same network interfaces
  • ✅ All containers have the same IP address (the Pod IP)
  • ✅ Containers communicate via localhost
  • ✅ Network persists even if application containers restart
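
One way to check this sharing on a node is to compare the namespace links under /proc for two processes of the same Pod. The helper below is a small sketch; `shared_namespaces` is a name chosen here for illustration, and the PIDs are assumed to come from the runtime (for example from `crictl inspect <container-id> | jq .info.pid`).

```shell
# Print the namespace types that two processes have in common.
# $1, $2 = PIDs as seen on the node. Reading another user's /proc/<pid>/ns
# entries usually requires a root shell.
shared_namespaces() {
  shared=""
  for ns in net ipc uts pid mnt; do
    a=$(readlink "/proc/$1/ns/$ns") || return 1
    b=$(readlink "/proc/$2/ns/$ns") || return 1
    if [ "$a" = "$b" ]; then
      shared="$shared $ns"
    fi
  done
  # Strip the leading space before printing.
  echo "${shared# }"
}
```

For two containers of one Pod the output would be expected to include net and ipc, to include pid only when shareProcessNamespace is enabled, and to leave out mnt, since each container keeps its own filesystem view.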

CNI Plugin Interaction

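The CNI plugin configures networking once per Pod, in the network namespace held by the pause container, and not per application container. The container runtime invokes it when the sandbox is created. With a bridge-based plugin, the steps look like this:

┌────────────────────────────────────────────────────────┐
│              Runtime + CNI Plugin                       │
│                                                         │
│  1. Runtime calls CNI plugin with the Pod sandbox ID   │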
│     ↓                                                   │
│  2. CNI plugin creates veth pair                       │
│     ↓                                                   │
│  3. One end goes to cni0 bridge                        │
│     ↓                                                   │
│  4. Other end goes to pause container's netns          │
│     ↓                                                   │
│  5. CNI plugin assigns IP: 10.8.0.5                    │
│     ↓                                                   │
│  6. Sets up routes and iptables rules                  │
│     ↓                                                   │
│  7. Returns success                                     │
│     ↓                                                   │
│  8. Application containers join this netns             │
│                                                         │
└────────────────────────────────────────────────────────┘
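
Steps 2 to 6 can be approximated with plain iproute2 commands. The sketch below only prints the commands so they can be reviewed first; `plan_pod_network` and the `veth-<name>` interface naming are assumptions made for this example, and a real CNI plugin picks its own names and adds more rules.

```shell
# Print (do not run) iproute2 commands that roughly mirror what a
# bridge-based CNI plugin does for one Pod.
# $1 = netns name, $2 = Pod IP with prefix, $3 = gateway, $4 = bridge (default cni0)
plan_pod_network() {
  ns=$1 ip_cidr=$2 gw=$3 bridge=${4:-cni0}
  host_if="veth-$ns"   # interface names are limited to 15 characters
  cat <<EOF
ip netns add $ns
ip link add $host_if type veth peer name eth0 netns $ns
ip link set $host_if master $bridge
ip link set $host_if up
ip -n $ns link set lo up
ip -n $ns addr add $ip_cidr dev eth0
ip -n $ns link set eth0 up
ip -n $ns route add default via $gw
EOF
}
```

On a test machine that already has a cni0 bridge, `plan_pod_network demo 10.8.0.5/24 10.8.0.1 | sudo sh` would apply the plan, `sudo ip netns exec demo ip addr show` would show the result, and `sudo ip netns del demo` would clean it up.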

Network Persistence

Because the pause container never stops (until Pod deletion), the network remains stable:

Timeline of Pod with Pause Container:
═══════════════════════════════════════════════════════════

t0: Pod created
    ├─ Pause container starts
    ├─ Network namespace created
    ├─ CNI assigns IP: 10.8.0.5
    └─ eth0 interface configured

t1: Application container 1 starts (nginx)
    └─ Joins pause container's network namespace

t2: Application container 2 starts (sidecar)
    └─ Joins pause container's network namespace

t3: Container 1 crashes and restarts
    ├─ Network namespace unchanged (pause still running)
    ├─ New nginx container joins existing namespace
    └─ Pod IP still 10.8.0.5 ✅

t4: Container 2 stops and starts
    ├─ Network namespace unchanged (pause still running)
    └─ Pod IP still 10.8.0.5 ✅

t5: Pod deleted
    ├─ All application containers stopped
    ├─ Pause container stopped
    └─ Network namespace destroyed
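
The t3 step of this timeline can be checked from the API side by recording the Pod IP and restart count before and after a container restart. This is a sketch with two hypothetical helpers, `pod_ip_and_restarts` and `same_ip_after_restart`; it assumes the Pod's first container is the one being restarted.

```shell
# Print "<pod IP> <restart count of the first container>" for a Pod.
pod_ip_and_restarts() {
  kubectl get pod "$1" \
    -o jsonpath='{.status.podIP} {.status.containerStatuses[0].restartCount}{"\n"}'
}

# Succeed if the IP stayed the same while the restart count went up.
# $1 = snapshot before, $2 = snapshot after (both from pod_ip_and_restarts)
same_ip_after_restart() {
  set -- $1 $2   # word-split into: ip1 count1 ip2 count2
  [ "$1" = "$3" ] && [ "$2" -lt "$4" ]
}
```

A possible session for an nginx Pod named test-pod: take `before=$(pod_ip_and_restarts test-pod)`, force a restart with `kubectl exec test-pod -- nginx -s stop`, wait for the container to come back, take `after=$(pod_ip_and_restarts test-pod)`, then run `same_ip_after_restart "$before" "$after" && echo "IP unchanged"`.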

🔄 Lifecycle and Behavior

Pause Container Lifecycle

The pause container follows a very simple lifecycle:

┌─────────────────────────────────────────────────────┐
│           Pause Container Lifecycle                  │
│                                                      │
│  ┌────────┐                                          │
│  │ CREATE │  kubelet requests Pod creation           │
│  └───┬────┘                                          │
│      │                                               │
│      ▼                                               │
│  ┌────────┐                                          │
│  │ START  │  Pause container is the first to start  │
│  └───┬────┘                                          │
│      │                                               │
│      ▼                                               │
│  ┌────────┐                                          │
│  │ RUNNING│  Sleeps indefinitely (pause() syscall)  │
│  │  ∞     │  • Holds namespaces                     │
│  │        │  • Does nothing else                    │
│  │        │  • Minimal CPU/memory                   │
│  └───┬────┘                                          │
│      │                                               │
│      │  Application containers can:                 │
│      │  • Start                                     │
│      │  • Stop                                      │
│      │  • Restart                                   │
│      │  • Crash                                     │
│      │  (Pause container unaffected)                │
│      │                                               │
│      ▼                                               │
│  ┌────────┐                                          │
│  │  STOP  │  Only when entire Pod is deleted        │
│  └───┬────┘                                          │
│      │                                               │
│      ▼                                               │
│  ┌────────┐                                          │
│  │ DELETE │  Namespaces are cleaned up              │
│  └────────┘                                          │
│                                                      │
└─────────────────────────────────────────────────────┘

Resource Usage

The pause container is designed to be extremely lightweight:

| Resource | Typical Usage |
|---|---|
| Memory | ~500 KB |
| CPU | Nearly 0% (just sleeps) |
| Disk | ~500 KB (image size) |
| Network | 0 (doesn't send/receive packets) |
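
These figures vary with the image version and architecture, so it can be worth measuring on a real node. The sketch below reads the resident memory of a process from /proc; `rss_kb` is a helper name invented for this example.

```shell
# Print the resident set size (VmRSS, in kB) of a process.
# $1 = PID as seen on the node
rss_kb() {
  awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}
```

With containerd, the pause PID of a sandbox can be obtained with `sudo crictl inspectp <pod-id> | jq .info.pid` and passed to `rss_kb`.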

Why Pause Never Exits

The pause container must remain running because:

  1. Namespace Ownership: It holds the Pod's shared namespaces open for as long as the Pod runs
  2. Pod Health: If it exits, the kubelet treats the sandbox as dead, recreates it, and restarts every container in the Pod
  3. Container Joining: New containers (from restarts) need the namespace to still exist

🔍 Inspecting Pause Containers

Finding Pause Containers

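Pause containers are hidden from kubectl but visible to the container runtime. Under CRI the pause container is the Pod sandbox, so crictl lists it with crictl pods rather than crictl ps:

# Using crictl (containerd or CRI-O): list Pod sandboxes
sudo crictl pods
# Output (abridged):
# POD ID          CREATED         STATE    NAME       NAMESPACE   ...
# 5a4b3c2d1e...   2 minutes ago   Ready    test-pod   default     ...

# Using containerd's own CLI: the pause containers show up by image
sudo ctr -n k8s.io containers ls | grep pause

# Using Docker (only with cri-dockerd; dockershim was removed in Kubernetes 1.24)
sudo docker ps | grep pause

# View all containers for a specific Pod
# First, find the Pod sandbox ID
POD_ID=$(sudo crictl pods --name <pod-name> -q)

# Then list the application containers that run inside that sandbox
sudo crictl ps --pod "$POD_ID"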

Example: Inspecting a Pod's Containers

# Create a test Pod
kubectl run test-pod --image=nginx

# Get Pod details
kubectl get pod test-pod -o wide
# Output:
# NAME       READY   STATUS    IP          NODE
# test-pod   1/1     Running   10.8.0.42   worker1

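# On the worker node, list the application containers
sudo crictl ps
# Output (abridged) shows only the nginx container:
# CONTAINER       IMAGE   ...   NAME       POD ID          POD
# abc123def456    nginx   ...   test-pod   789ghi012jkl    test-pod

# The pause container is listed as the Pod sandbox
sudo crictl pods --name test-pod
# POD ID          ...   STATE   NAME       NAMESPACE
# 789ghi012jkl    ...   Ready   test-pod   default      ← Pause container (sandbox)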

Inspecting Network Namespace

# Find the Pod sandbox (pause container) ID
POD_ID=$(sudo crictl pods --name test-pod -q)

# Inspect the sandbox
sudo crictl inspectp "$POD_ID"

# View the network namespace (.info.pid is the pause process on containerd)
PID=$(sudo crictl inspectp "$POD_ID" | jq .info.pid)
sudo nsenter -t "$PID" -n ip addr show
# Output:
# 1: lo: <LOOPBACK,UP> ...
#     inet 127.0.0.1/8 ...
# 2: eth0@if123: <BROADCAST,MULTICAST,UP> ...
#     inet 10.8.0.42/24 ...  ← Pod IP

Checking Process Tree

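# View the process tree on the node
ps -ef --forest | grep -B1 -A2 '/pause'

# With containerd, pause and the app processes are siblings under one shim;
# pause is not their parent (PIDs are illustrative)
# root  12300  containerd-shim-runc-v2 ...
#   ├─ 12345  /pause
#   └─ 12346  nginx: master process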

🔧 Troubleshooting

Common Issues

1. Pause Container Image Not Available

Symptom: Pods stuck in ContainerCreating state

kubectl describe pod <pod-name>
# Events:
# Failed to pull image "registry.k8s.io/pause:3.9"

Solution: Ensure the pause image is available

# Pre-pull the pause image on all nodes
sudo crictl pull registry.k8s.io/pause:3.9

# Or check containerd config
sudo cat /etc/containerd/config.toml | grep sandbox_image
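
To make sure the image that gets pre-pulled is the one the runtime is actually configured to use, the value can be read from the config file rather than typed by hand. This is a sketch for the containerd config layout shown earlier; `configured_sandbox_image` is a hypothetical helper name.

```shell
# Print the sandbox_image value from a containerd config file.
# $1 = path to the config (default: /etc/containerd/config.toml)
configured_sandbox_image() {
  sed -n 's/^[[:space:]]*sandbox_image[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' \
    "${1:-/etc/containerd/config.toml}" | head -n 1
}
```

For example, `sudo crictl pull "$(configured_sandbox_image)"` would pull whatever image the node is set up to use. If the function prints nothing, the key is not set in that file and containerd falls back to its built-in default.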

2. Wrong Pause Image Version

Symptom: Pods fail to start with CRI errors

Solution: Update container runtime configuration

# Check current pause image
sudo crictl images | grep pause

# Update containerd config
sudo nano /etc/containerd/config.toml
# Set: sandbox_image = "registry.k8s.io/pause:3.9"

# Restart containerd
sudo systemctl restart containerd

3. Orphaned Pause Containers

Symptom: Pause containers (Pod sandboxes) remain after Pod deletion

Solution: Clean up manually

# List sandboxes that are no longer running
sudo crictl pods --state NotReady

# Stop and remove one of them
sudo crictl stopp <pod-id>
sudo crictl rmp <pod-id>

# Or clean all exited application containers
sudo crictl rm $(sudo crictl ps -a -q --state=Exited)

4. Network Namespace Issues

Symptom: Containers can't communicate via localhost

Solution: Verify namespace sharing

# Check if containers share the same network namespace
for container in $(sudo crictl ps -q); do
  echo "Container: $container"
  sudo crictl inspect $container | jq -r '.info.runtimeSpec.linux.namespaces[] | select(.type=="network") | .path'
done

# All containers in the same Pod should show the same network namespace path

Debugging Commands

# 1. Check pause container (Pod sandbox) status
sudo crictl pods

# 2. List the application containers inside one sandbox
sudo crictl ps -a --pod <pod-id>

# 3. Inspect pause container details
sudo crictl inspectp <pod-id> | jq .

# 4. Check network namespace
PAUSE_PID=$(sudo crictl inspectp <pod-id> | jq .info.pid)
sudo ls -la /proc/$PAUSE_PID/ns/

# 5. View network interfaces in Pod's namespace
sudo nsenter -t $PAUSE_PID -n ip addr show

# 6. Check if application containers joined the namespace
APP_PID=$(sudo crictl inspect <app-container-id> | jq .info.pid)
sudo readlink /proc/$APP_PID/ns/net
sudo readlink /proc/$PAUSE_PID/ns/net
# ↑ These should match!

📝 Summary

Key Takeaways:

  1. ✅ Every Pod has a hidden pause container that owns the Pod's namespaces
  2. ✅ The pause container is created first and destroyed last
  3. ✅ Application containers join the pause container's namespaces
  4. ✅ This design provides network stability across container restarts
  5. ✅ The pause container is extremely lightweight (~500KB, minimal CPU)
  6. ✅ Network configuration (IP, interfaces) is done only for the pause container
  7. ✅ All containers in a Pod share the same network namespace via the pause container

The pause container is a clever infrastructure component that makes Kubernetes' Pod networking model reliable and simple, even though most users never directly interact with it.