| Field | Value |
|---|---|
| Document ID | TR-2026-001 |
| Date | 2026-02-27 |
| Author | Backend.AI Architecture Team |
| Status | Final |
| Classification | Internal Technical Reference |
| Scope | Kata Containers 3.x architecture, feature parity with Docker/containerd/runc, and feasibility assessment for Backend.AI integration |
1. Executive Summary
2. Introduction and Motivation
3. Kata Containers Architecture
4. Hypervisor Landscape
5. Feature Parity Analysis
   - 5.1 Networking
   - 5.2 Storage and Volumes
   - 5.3 GPU and Device Passthrough
   - 5.4 Resource Management
   - 5.5 Container Image Handling
   - 5.6 Multi-Container Pods
   - 5.7 Security Model
6. Performance Analysis
   - 6.1 Startup Latency
   - 6.2 Memory Overhead
   - 6.3 CPU Performance
   - 6.4 I/O Performance
   - 6.5 Network Performance
7. Advanced Capabilities
8. Known Limitations and Unsupported Features
9. Production Adoption
10. Implications for Backend.AI
11. Conclusions and Recommendations
12. References
Kata Containers is an open-source project under the OpenInfra Foundation that provides hardware-isolated container workloads by running each Kubernetes pod inside a lightweight virtual machine (VM). It delivers the security guarantees of VMs while maintaining OCI runtime specification compliance and seamless Kubernetes integration via the RuntimeClass mechanism.
This report provides a comprehensive analysis of Kata Containers 3.x architecture, evaluates feature parity against native container runtimes (Docker, containerd with runc), and assesses implications for Backend.AI's container orchestration stack. The analysis covers networking, storage, GPU passthrough, resource management, security, performance, and advanced features including Confidential Containers.
Key findings:
- Kata provides strong hardware-enforced isolation with full OCI compliance, making it suitable for multi-tenant environments where untrusted workloads must be sandboxed.
- Significant feature gaps exist for Backend.AI's core use cases: multi-GPU passthrough is not supported (single GPU only via VFIO), storage I/O adds measurable overhead through the virtualization layer, and checkpoint/restore is not implemented.
- Kata 3.x's Rust rewrite and built-in Dragonball VMM deliver substantial performance improvements, with Alibaba demonstrating 3,000 secure containers launched in 6 seconds on a single host.
- Cold start overhead (125-500ms depending on hypervisor) and per-pod memory overhead (15-150MB) impact density and interactive session responsiveness compared to native containers.
- Confidential Containers (CoCo) integration with Intel TDX and AMD SEV-SNP represents a compelling future capability for privacy-sensitive AI/ML workloads.
Traditional Linux containers (runc, Docker) share the host kernel and rely on namespaces and cgroups for isolation. While efficient, this model exposes a large kernel attack surface — a kernel vulnerability in any container can potentially compromise the entire host and all co-located workloads. This is a significant concern in multi-tenant environments where untrusted code executes alongside sensitive workloads.
Kata Containers addresses this by interposing a hypervisor boundary between the host and the containerized workload. Each pod receives its own guest kernel, eliminating cross-pod kernel attack surface. The project originated from the merger of Intel Clear Containers and Hyper.sh's runV in December 2017, combining Intel's hardware virtualization expertise with Hyper's container-optimized VM technology.
This report evaluates Kata Containers from the perspective of a container orchestration platform (Backend.AI) that:
- Schedules and orchestrates containerized AI/ML workloads across distributed agent nodes
- Provides fractional and multi-GPU allocation to containers
- Mounts distributed storage (CephFS, NetApp, etc.) to containers via bind mounts
- Manages multi-node overlay networks for distributed training
- Requires low-latency interactive session startup for notebook environments
The analysis focuses on Kata 3.x (current mainline) with references to the upcoming 4.0 release where relevant.
This report synthesizes information from the following sources:
- Official Kata Containers architecture documentation and design documents (GitHub repository)
- Community blog posts and project updates (2024-2025)
- Conference presentations (KubeCon, OpenInfra Summit)
- Third-party benchmarks and production case studies (StackHPC, Red Hat, Northflank)
- Cloud provider documentation (AWS, Azure, GCP)
- Academic papers on container runtime performance analysis
Kata Containers defines three distinct execution contexts that are fundamental to understanding its behavior:
Host: The physical machine running the container manager (containerd or CRI-O), the Kata runtime shim process, and the hypervisor process. The host is considered untrusted in confidential computing scenarios.
Guest VM: A lightweight virtual machine with its own Linux kernel, root filesystem, and the kata-agent daemon. The VM provides hardware-enforced isolation using KVM. The guest kernel is a standard Linux LTS kernel stripped down for container workloads — minimal drivers, fast boot, and small memory footprint.
Container: An isolated process environment within the guest VM, created using standard Linux namespaces and cgroups. From the workload's perspective, it appears identical to a native container. From the host's perspective, it is doubly isolated — first by the VM boundary, then by namespace/cgroup isolation within the guest.
The mapping model is one VM per Kubernetes pod. All containers within a single pod share the same VM. This preserves the Kubernetes pod semantics (shared network namespace, IPC, optional PID namespace sharing) while adding a hardware isolation boundary at the pod level.
┌─────────────────────── HOST ───────────────────────┐
│ │
│ ┌──────────────────┐ ┌──────────────────────┐ │
│ │ containerd / │ │ Other pods │ │
│ │ CRI-O │ │ (runc or kata) │ │
│ └────────┬─────────┘ └──────────────────────┘ │
│ │ shimv2 API │
│ ┌────────▼─────────┐ │
│ │ containerd-shim- │ │
│ │ kata-v2 │ │
│ │ (1 per pod) │ │
│ └────────┬─────────┘ │
│ │ fork + manage │
│ ┌────────▼─────────┐ ┌──────────────────────┐ │
│ │ Hypervisor │ │ virtiofsd │ │
│ │ (QEMU/CLH/ │ │ (filesystem sharing) │ │
│ │ Dragonball) │ └──────────────────────┘ │
│ └────────┬─────────┘ │
│ │ VM boundary (KVM) │
│ ╔════════╧═══════════════════════════════════════╗ │
│ ║ GUEST VM ║ │
│ ║ ┌─────────────────────────────────────────┐ ║ │
│ ║ │ Guest Kernel (Linux LTS, minimal) │ ║ │
│ ║ └─────────────────────────────────────────┘ ║ │
│ ║ ┌─────────────────────────────────────────┐ ║ │
│ ║ │ kata-agent (Rust, ttRPC over VSOCK) │ ║ │
│ ║ └──────────┬──────────────────────────────┘ ║ │
│ ║ │ ║ │
│ ║ ┌──────────▼────┐ ┌───────────────────┐ ║ │
│ ║ │ Container A │ │ Container B │ ║ │
│ ║ │ (namespaces │ │ (namespaces │ ║ │
│ ║ │ + cgroups) │ │ + cgroups) │ ║ │
│ ║ └───────────────┘ └───────────────────┘ ║ │
│ ╚════════════════════════════════════════════════╝ │
└─────────────────────────────────────────────────────┘
The shim is the single entry point for Kata Containers on the host side. It implements the containerd Runtime V2 (shimv2) API, which replaced the earlier model requiring 2N+1 separate processes (kata-runtime, kata-shim, kata-proxy, and reaper) per pod.
Key characteristics:
- One shim process per pod: A single long-running daemon handles all containers within a pod.
- Unix domain socket communication: containerd creates a socket and passes it to the shim at startup. All subsequent lifecycle commands flow over this socket as ttRPC calls.
- Configuration: Reads from `configuration.toml` (default at `/usr/share/defaults/kata-containers/`), which specifies hypervisor type, guest kernel/image paths, resource defaults, and feature toggles.
- virtcontainers library: Internally uses the virtcontainers abstraction layer, providing a runtime-spec-agnostic, hardware-virtualized containers API.
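For illustration, a trimmed `configuration.toml` in the shape the shim consumes. Option names follow upstream Kata's QEMU configuration; exact paths and defaults vary by version and distribution, so treat this as a sketch rather than a reference:

```toml
# Illustrative excerpt of configuration.toml for the QEMU backend.
# Consult the file shipped under /usr/share/defaults/kata-containers/
# on your host for authoritative values.

[hypervisor.qemu]
path = "/usr/bin/qemu-system-x86_64"
kernel = "/usr/share/kata-containers/vmlinux.container"
image = "/usr/share/kata-containers/kata-containers.img"
default_vcpus = 1         # VM boots with this; hotplug adds more
default_memory = 2048     # MiB
shared_fs = "virtio-fs"   # host-guest filesystem sharing mechanism

[runtime]
enable_debug = false
```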
The hypervisor is spawned by the shim (or, in Dragonball's case, runs in-process) to create and manage the VM. It:
- Creates the VM with configured CPU/memory resources
- Sets up virtio devices (vsock for host-guest communication, block/fs for storage, net for networking)
- DAX-maps the guest image into VM memory for zero-copy access (on supported hypervisors)
- Supports hotplugging of CPU, memory, and block devices (hypervisor-dependent)
The agent is a Rust daemon running inside the guest VM. It serves as the supervisor for all container operations within the VM:
- Deployment modes: In rootfs image deployments, systemd is PID 1 and kata-agent runs as a systemd service. In initrd deployments, kata-agent runs directly as PID 1 (no init system).
- Communication: Runs a ttRPC server (a lightweight gRPC alternative optimized for low-overhead communication) listening on a VSOCK socket.
- API surface: `CreateSandbox`, `CreateContainer`, `StartContainer`, `ExecProcess`, `WaitProcess`, `DestroySandbox`, and related operations.
- Responsibilities: Creates and manages namespaces, cgroups, and mounts within the guest for each container. Carries stdio streams (stdout, stderr, stdin) between containers and the host runtime.
Host-guest communication uses ttRPC (a lightweight, reduced-footprint variant of gRPC) over VSOCK (virtual sockets, available since Linux kernel 4.8). VSOCK provides a direct host-guest socket without requiring network configuration. The protocol:
- Uses Protocol Buffers for serialization
- Supports multiplexed streams for concurrent container I/O
- Eliminates the need for the separate kata-proxy process that was required in pre-shimv2 architectures when using virtio-serial
The step-by-step process from a pod creation request to a running container:
1. Kubernetes kubelet or Docker sends a container creation request to the CRI runtime (containerd or CRI-O).
2. The CRI runtime identifies the RuntimeClass as `kata` and invokes the `containerd-shim-kata-v2` binary.
3. containerd creates a Unix domain socket and passes it to the shim, which starts as a long-running daemon.
4. The shim reads `configuration.toml` and launches the configured hypervisor (QEMU, Cloud Hypervisor, or Dragonball in-process).
5. The hypervisor boots the VM with the guest kernel and guest image. The guest image is DAX-mapped into VM memory for zero-copy access (on x86_64 with QEMU).
6. The kata-agent starts inside the VM (either as PID 1 for initrd or via systemd for rootfs images).
7. The shim establishes a ttRPC connection to the agent over VSOCK.
8. The shim calls `CreateSandbox` on the agent, establishing the pod-level environment.
9. For each container in the pod, the shim calls `CreateContainer`, and the agent creates the appropriate namespaces, cgroups, rootfs mounts, and environment within the guest.
10. The agent spawns the workload process and returns success.
11. The shim reports container running status to containerd.
Shutdown: The agent detects workload exit, captures the exit status, and returns it via `WaitProcess`. For forced shutdown, `DestroySandbox` terminates the VM via the hypervisor.
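From the user's side, this flow is triggered through the Kubernetes RuntimeClass mechanism. A minimal sketch; the handler name must match the runtime configured in containerd (`kata` is the common default, but this depends on your cluster setup):

```yaml
# RuntimeClass exposing the Kata handler, plus a pod that selects it.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
---
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-pod
spec:
  runtimeClassName: kata   # pod runs inside a Kata VM instead of runc
  containers:
    - name: app
      image: alpine:3.20
      command: ["sleep", "infinity"]
```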
Kata 3.x represents a fundamental architectural shift from the Go-based runtime of version 2.x:
| Component | Kata 1.x | Kata 2.x | Kata 3.x |
|---|---|---|---|
| kata-agent | Go | Rust | Rust |
| kata-runtime | Go | Go | Rust (runtime-rs) |
| Shim model | 2N+1 processes | shimv2 (single process, Go) | shimv2 (single process, Rust) |
| VMM integration | External process | External process | Built-in (Dragonball) or external |
| Policy engine | N/A | Go-based OPA | Rust-based regorus |
| Async runtime | goroutines | goroutines | Tokio (configurable worker threads) |
- Performance: Eliminates Go garbage collection pauses and reduces memory consumption. The runtime is a long-lived system daemon where GC pauses can impact container lifecycle latency.
- Memory safety: Rust's ownership model prevents null pointer dereferences, use-after-free, and data races at compile time — critical for system-level software managing VM lifecycle.
- Async runtime: Tokio provides Go-like concurrency with configurable worker threads (`TOKIO_RUNTIME_WORKER_THREADS`, default: 2), enabling high concurrency with predictable resource consumption.
The most significant architectural change in 3.x is the built-in Dragonball VMM. Instead of forking a separate hypervisor process and communicating over IPC, Dragonball is linked as a Rust library directly into the runtime process:
Kata 2.x (separated VMM):
runtime process ──(fork + RPC)──> VMM process (QEMU, CLH)
Kata 3.x (built-in VMM):
single process: runtime-rs + Dragonball VMM (shared address space)
Benefits:
- Zero IPC overhead between runtime and VMM
- Unified lifecycle management — no orphaned VMM processes on crashes
- Simplified resource cleanup and exception handling
- Out-of-the-box experience — no external VMM installation or version matching required
Dragonball is built on the rust-vmm crate ecosystem (shared building blocks with Firecracker and Cloud Hypervisor) and is specifically optimized for container workloads.
Alibaba Cloud has validated the Rust runtime + Dragonball in production:
- Launch 3,000 secure containers in 6 seconds on a single host
- Run 4,000+ secure containers simultaneously on a single host
- Support approximately 12 billion calls per day for Function Compute services
Kata Containers supports five hypervisors, all built on the Linux KVM virtualization interface:
QEMU — The most feature-complete option. Written in C with approximately 2 million lines of code. Supports all CPU architectures (x86_64, ARM, ppc64le, s390x), all device types (40+ emulated devices including legacy), full CPU/memory/disk hotplug, VFIO device passthrough, virtio-fs, and 9pfs. The trade-off is a larger attack surface, higher memory overhead, and slower boot times compared to newer Rust-based VMMs. QEMU is the default hypervisor for the Go runtime.
Cloud Hypervisor (CLH) — A Rust-based VMM with approximately 50,000 lines of code. Supports x86_64 and aarch64, CPU/memory hotplug, virtio-fs, VFIO passthrough, and approximately 16 modern virtio devices. Boots in approximately 200ms. Provides a good balance between feature completeness and security (small Rust codebase, minimal attack surface).
Firecracker — AWS's minimal VMM, also Rust-based. Extremely fast boot (approximately 125ms) and tiny footprint, but deliberately omits filesystem sharing (no virtio-fs or 9pfs), device hotplug, and VFIO passthrough. All resources must be pre-allocated at VM boot. Designed for serverless/FaaS workloads where density and startup latency are critical.
Dragonball — Alibaba's built-in VMM for the Rust runtime. Runs in-process (not as a separate process), eliminating IPC overhead. Built on the rust-vmm crate ecosystem. Supports CPU/memory/disk hotplug, VFIO, and virtio-fs. Optimized for high-density container deployments. Default for runtime-rs.
StratoVirt — Huawei's VMM supporting both MicroVM (lightweight) and StandardVM modes. Currently supports MicroVM mode in Kata with limited features. The community has discussed potentially deprecating StratoVirt support due to maintenance concerns (April 2025 PTG).
| Feature | QEMU | Cloud Hypervisor | Firecracker | Dragonball | StratoVirt |
|---|---|---|---|---|---|
| Language | C (~2M LOC) | Rust (~50K LOC) | Rust | Rust | Rust |
| Architectures | x86_64, ARM, ppc64le, s390x | x86_64, aarch64 | x86_64, aarch64 | x86_64, aarch64 | x86_64, aarch64 |
| Typical boot time | ~500ms | ~200ms | ~125ms | ~200ms | ~200ms |
| Memory footprint | 50-150 MB | 20-50 MB | 15-30 MB | 20-50 MB | 20-50 MB |
| CPU hotplug | Yes | Yes | No | Yes | No |
| Memory hotplug | Yes | Yes | No | Yes | No |
| Disk hotplug | Yes | Yes | No | Yes | Yes |
| VFIO passthrough | Yes | Yes | No | Yes | No |
| virtio-fs | Yes | Yes | No | Yes | Yes |
| 9pfs | Yes | No | No | No | No |
| Device emulation scope | 40+ (incl. legacy) | ~16 modern | Minimal (3) | Container-optimized | MicroVM-focused |
| Execution model | Separate process | Separate process | Separate process | In-process (library) | Separate process |
| Default for | Go runtime | — | — | Rust runtime | — |
| Confidential computing | TDX, SEV-SNP | TDX, SEV-SNP | No | Planned | No |
| Use Case | Recommended VMM | Rationale |
|---|---|---|
| General-purpose, maximum compatibility | QEMU | Broadest architecture and feature support |
| Cloud-native production (balanced) | Cloud Hypervisor | Good feature set with small Rust attack surface |
| Serverless / FaaS (density + speed) | Firecracker | Minimal boot time, smallest footprint |
| High-density containers (Rust runtime) | Dragonball | In-process VMM, zero IPC, optimized for containers |
| Confidential computing | QEMU | Most mature TDX/SEV-SNP support |
| Legacy hardware / non-x86 architectures | QEMU | Only option for ppc64le, s390x |
This section provides a detailed comparison of Kata Containers capabilities against native container runtimes (runc via Docker or containerd). Each subsection identifies feature gaps, behavioral differences, and performance implications.
Native containers use Linux network namespaces with veth (virtual Ethernet) pairs connecting containers to a bridge (e.g., docker0 or cni0). The packet path is straightforward:
Container process → veth (container end) → veth (host end) → bridge → host routing
Overlay networks (VXLAN, Geneve) for multi-host communication add encapsulation/decapsulation at the host level. CNI plugins manage the full lifecycle. Performance is near-native since all operations occur within the host kernel's network stack.
Kata must bridge the gap between the host network namespace (where CNI plugins create veth pairs) and the guest VM (which uses virtio-net devices). Three networking models are supported:
TC-filter (tc-redirect) — Default
This is the current default networking model, chosen for maximum CNI plugin compatibility. The flow:
- The CNI plugin creates a veth pair as usual (e.g., `eth0` in the pod netns, `vethXXXX` on the bridge).
- The Kata runtime creates a TAP device (`tap0_kata`) inside the pod network namespace.
- Linux Traffic Control (TC) mirror rules redirect traffic between the veth's ingress/egress and the TAP device's egress/ingress.
- The TAP device connects to a virtio-net device inside the VM.
- Inside the VM, the kata-agent configures the network interface with the same IP/MAC as the original veth.
Container (guest) → virtio-net → TAP (host) → TC mirror → veth → bridge → host
This model introduces up to five network hops before a packet reaches its destination, adding measurable latency and CPU overhead.
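The TC mirror rules amount to something like the following. This is a hedged sketch of the kind of redirect pair the runtime installs inside the pod netns (exact invocations in Kata differ); it requires root and real devices, so it is shown for exposition only:

```shell
# Mirror everything arriving on the CNI veth (eth0) into the TAP device
# that backs the VM's virtio-net interface:
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol all u32 match u8 0 0 \
   action mirred egress redirect dev tap0_kata

# And the reverse direction, TAP ingress back out through the veth:
tc qdisc add dev tap0_kata handle ffff: ingress
tc filter add dev tap0_kata parent ffff: protocol all u32 match u8 0 0 \
   action mirred egress redirect dev eth0
```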
MACVTAP — Alternative
Uses the Linux MACVTAP driver, which combines TUN/TAP and bridge functionality into a single module based on the MACVLAN driver. The packet path is slightly shorter than TC-filter, with similar performance characteristics.
Vhost-user — High Performance
The virtio-net backend runs in a userspace process, requiring fewer syscalls than a full VMM path. Used for high-performance scenarios and in conjunction with DPDK-based virtual switches.
| Feature | Native (runc) | Kata Containers | Gap Assessment |
|---|---|---|---|
| CNI plugin compatibility | Full | Good (TC-filter preserves compatibility) | Minor — some edge cases with unusual CNI configs |
| Overlay networks (VXLAN, Geneve) | Full | Full (via guest virtio-net) | Parity |
| Bridge networking | Full | Full (via TC-filter or MACVTAP) | Parity |
| `--net=host` (host networking) | Supported | NOT supported | Critical gap — VM cannot share host netns |
| `--link` (container linking) | Supported (deprecated) | NOT supported | Minor — deprecated in Docker |
| Network namespace sharing | Full | Broken — takes over interfaces | Significant gap |
| Service mesh (Istio/Envoy sidecar) | Full | Works (sidecar in same VM) | Parity |
| Network policies (Calico, Cilium) | Full | Full (applied at host level) | Parity |
| IPv6 | Full | Supported | Parity |
| Multicast | Full | Supported (virtio-net) | Parity |
| SR-IOV | Direct | Via VFIO passthrough | Additional configuration required |
| RDMA / InfiniBand | Direct device access | Via VFIO passthrough | Additional configuration, hypervisor-dependent |
| DNS resolution | Full | Full (resolv.conf shared via virtio-fs) | Parity |
Red Hat benchmarks (OpenShift Sandboxed Containers) show:
- CPU overhead: A sandboxed pod consumes approximately 4.5 CPU cores versus approximately 1.8 cores for a non-sandboxed pod during network-intensive workloads (approximately 2.5x overhead).
- Latency: Typically under 1ms additional latency per hop.
- Throughput: TC-filter and MACVTAP achieve comparable throughput; both are superior to the deprecated bridge mode.
Container images and volumes stored on the host must be shared with the guest VM. Kata supports multiple mechanisms:
virtio-fs (Default since Kata 2.0)
A shared filesystem protocol based on FUSE, implemented as a virtio device. A virtiofsd daemon runs on the host, serving filesystem requests from the guest. Key characteristics:
- Near-native performance when combined with DAX (Direct Access) mapping
- Full POSIX compliance
- Used for sharing container rootfs, volumes, ConfigMaps, Secrets, and configuration files (hostname, hosts, resolv.conf)
- Available since Linux 5.4, QEMU 5.0, Kata 1.7
DAX (Direct Access) Optimization: On x86_64 with QEMU, virtio-fs can use NVDIMM/DAX mapping to map the guest image directly into VM memory without going through the guest kernel page cache. This provides zero-copy access, reduced memory consumption (pages are shared, not duplicated), and faster boot times.
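The DAX window and cache policy are controlled from the hypervisor section of `configuration.toml`. An illustrative excerpt; the option names follow upstream Kata configuration files, but the values here are examples, not recommendations:

```toml
# virtio-fs tuning (QEMU section of configuration.toml) — sketch only.
[hypervisor.qemu]
shared_fs = "virtio-fs"
virtio_fs_daemon = "/usr/libexec/virtiofsd"
virtio_fs_cache = "auto"       # guest page-cache policy
virtio_fs_cache_size = 1024    # DAX window size in MiB; 0 disables DAX
```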
9pfs (Legacy, Deprecated)
Plan 9 file protocol. Simpler implementation but significantly slower — every file operation traverses the 9P protocol and host syscalls. Has POSIX compliance limitations. Only supported on QEMU. Not recommended for production.
virtio-blk / virtio-SCSI
Block-level device sharing. The runtime detects whether container images use overlay-based filesystem drivers (shared via virtio-fs) or devicemapper/other block-based snapshotters (shared via virtio-blk/scsi). Used for persistent volumes and block-based workloads.
vhost-user-blk
Highest performance option. The virtio block backend runs in a separate userspace process, bypassing the VMM entirely. Benchmarks show near host-level I/O performance.
| Feature | Native (runc) | Kata Containers | Gap Assessment |
|---|---|---|---|
| OverlayFS image layers | Direct kernel access | Shared via virtio-fs | Minor overhead |
| Bind mounts | Zero overhead | Via filesystem sharing (virtio-fs) | Measurable overhead |
| Bind mounts (TEE mode) | N/A | Files copied into guest; changes NOT synced back | Critical gap for TEE |
| tmpfs | Full | Supported | Parity |
| emptyDir (tmpfs-backed) | Full | `/dev/shm` sizing broken (capped at 64KB) | Bug |
| ConfigMaps | Full, live updates | Supported; SubPath does NOT receive live updates | Gap |
| Secrets | Full, tmpfs-backed | Supported, tmpfs-backed | Parity |
| `volumeMount.subPath` | Supported | NOT supported | Gap |
| hostPath | Direct | Depends on fs sharing; no sync-back in TEE | Behavioral difference |
| PersistentVolumes (block) | Full | Via virtio-blk/scsi | Parity (minor overhead) |
| PersistentVolumes (NFS/CephFS) | Direct mount | Via virtio-fs passthrough | Overhead |
| CSI drivers | Full | Supported (host-side) | Parity |
StackHPC benchmarks using fio with varying client counts and I/O patterns:
With 9pfs (legacy — included for reference):
| Metric | Bare Metal | Kata (9pfs) | Degradation |
|---|---|---|---|
| Sequential read bandwidth (64 clients) | Baseline | ~15% of bare metal | 6.7x worse |
| Sequential read p99 latency (64 clients) | 47,095 us | 563,806 us | 12x worse |
| Random write p99 latency (64 clients) | 159,285 us | 15,343,845 us | ~100x worse |
With virtio-fs + DAX: Dramatic improvement over 9pfs. Near-native sequential read performance. Random write latency reduced significantly but still measurable overhead due to FUSE round-trips.
With vhost-user-blk: Near host-level performance for block I/O workloads. Benchmarks by Xinnor show performance within 5-10% of bare metal for NVMe devices.
Kata Containers exposes hardware devices to guest VMs using VFIO (Virtual Function I/O), which provides safe, non-privileged userspace device access via IOMMU (Input/Output Memory Management Unit). The VFIO path:
- Device is unbound from its host driver.
- The device is bound to the `vfio-pci` driver.
- The hypervisor maps the device into the VM's address space via IOMMU.
- The guest VM accesses the device directly (near-native performance for data plane operations).
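The rebinding steps above look roughly like the following for a hypothetical GPU at PCI address `0000:65:00.0` with vendor:device ID `10de:1db4` (both invented for this example). This requires root and an enabled IOMMU (e.g., `intel_iommu=on` on the kernel command line), and is exposition rather than a copy-paste recipe:

```shell
# 1. Unbind the device from its current host driver:
echo 0000:65:00.0 > /sys/bus/pci/devices/0000:65:00.0/driver/unbind

# 2. Bind it to vfio-pci by registering its vendor/device ID:
modprobe vfio-pci
echo 10de 1db4 > /sys/bus/pci/drivers/vfio-pci/new_id

# 3. The device's IOMMU group now appears under /dev/vfio/, where the
#    hypervisor picks it up for passthrough:
ls -l /dev/vfio/
```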
| Feature | Native (runc) | Kata Containers | Gap Assessment |
|---|---|---|---|
| Single GPU access | `--device /dev/nvidia0` | VFIO passthrough (requires IOMMU) | Behavioral difference (VFIO vs. direct) |
| Multi-GPU (same container) | Trivial (`--device` per GPU) | NOT supported | Critical gap |
| Fractional GPU (MIG, vGPU) | NVIDIA MIG, vGPU supported | vGPU: NOT supported; MIG: limited | Critical gap |
| GPU sharing (time-slicing) | Supported (NVIDIA MPS, time-slicing) | NOT supported | Critical gap |
| Intel GPU (GVT-g) | Supported | Supported (mediated device via VFIO) | Parity |
| NVIDIA GPU Operator | Full | Supported (containerd only, install-only, not upgrade) | Limited |
| Large BAR GPUs (e.g., Tesla P100) | Direct | Supported (Kata 1.11+, specific config) | Minor config needed |
| FPGA passthrough | Direct (`--device`) | Via VFIO | Works but requires IOMMU |
| USB device passthrough | Direct (`--device`) | Via VFIO (limited) | Limited |
| `/dev/kvm` access (nested virt) | Direct | Not meaningful (already in VM) | N/A |
| Hypervisor | VFIO Support | Notes |
|---|---|---|
| QEMU | Full | Most mature; supports all device types |
| Cloud Hypervisor | Full | Good support for modern devices |
| Firecracker | None | Deliberately excluded (minimal VMM) |
| Dragonball | Full | Supports GPU passthrough |
| StratoVirt | None | MicroVM mode limitation |
NVIDIA Hopper-generation GPUs support single-GPU confidential compute passthrough when combined with Intel TDX or AMD SEV-SNP. The GPU's on-chip security engine participates in the attestation chain. NVIDIA Blackwell-generation extends this to multi-GPU confidential passthrough.
The NVIDIA GPU Operator (versions 24.6.x through 24.9.x) supports Kata Containers with confidential container configurations, enabling GPU-accelerated workloads within TEE-protected VMs.
Backend.AI's accelerator plugin system (backendai_accelerator_v21) supports:
- Discrete allocation (whole GPU via `DiscretePropertyAllocMap`)
- Fractional allocation (sub-GPU via `FractionAllocMap`)
- Multi-GPU allocation to single containers
- NUMA-aware placement
Kata's limitation to single-GPU VFIO passthrough per container directly conflicts with Backend.AI's multi-GPU allocation model and fractional GPU sharing via CUDA hook libraries. The VFIO approach also precludes the use of NVIDIA Container Toolkit's device mapping, which Backend.AI relies on through generate_docker_args() in its compute plugins.
Kata uses a dual-layer resource management strategy:
Host-level cgroups: A single cgroup per sandbox constrains the overall VM process (hypervisor + guest memory). The CRI runtime and resource manager can restrict or collect stats at the sandbox level.
Guest-level cgroups: Inside the VM, the kata-agent applies per-container cgroup constraints, making containers within the same VM behave like standard Linux containers with individual resource limits.
| Feature | Native (runc) | Kata Containers | Gap Assessment |
|---|---|---|---|
| CPU shares/quota | Direct cgroup | Dual-layer (host + guest cgroup) | Parity (semantic) |
| CPU pinning | Direct cgroup | VM vCPU pinning | Behavioral difference |
| Memory limits | Direct cgroup | VM memory + guest cgroup | Parity (semantic) |
| Memory swap | Supported | VM-level swap | Behavioral difference |
| OOM handling | Direct cgroup OOM killer | Guest OOM (host sees VM memory) | Different OOM behavior |
| PID limits | Direct cgroup | Guest-level PID limits | Parity |
| Block I/O weight | cgroup blkio | NOT supported | Gap |
| Block I/O limits (bps/iops) | cgroup blkio | Supported (at VM level) | Parity |
| CPU hotplug | N/A | QEMU, CLH: yes; FC, SV: no | Hypervisor-dependent |
| Memory hotplug | N/A | QEMU, CLH: yes (virtio-mem); FC: no | Hypervisor-dependent |
| Cgroups v2 | Full | Supported (not default) | Parity (with config) |
| `--sysctl` | Supported | NOT supported | Gap |
The VM starts with a configurable minimum resource allocation (default: 1 vCPU, 2 GB RAM, defined in `configuration.toml`). When containers within the pod request additional resources, the runtime hot-plugs CPU and memory into the VM. This means:
- There is a base resource cost per pod regardless of container resource requests.
- Hot-plug operations add latency to container startup (typically milliseconds for CPU, potentially longer for memory).
- Memory hotplug can fail if the guest kernel cannot online the hot-plugged memory due to insufficient initial memory.
- AMD CPU hotplug has reported failures in some configurations.
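The boot-then-hotplug sizing described above can be sketched as a small model. This is an illustrative approximation only, not Kata's actual code; the function name `vm_resources` and the return layout are invented for this example, and the real runtime additionally aligns and rounds memory for hotplug.

```python
# Simplified model of Kata pod sizing: the VM boots with the configured
# defaults, then hot-plugs the difference once container requests are known.

def vm_resources(default_vcpus, default_mem_mb, requests):
    """requests: list of (vcpus, mem_mb) tuples, one per container in the pod."""
    want_vcpus = sum(r[0] for r in requests)
    want_mem = sum(r[1] for r in requests)
    # The VM never shrinks below its boot-time defaults.
    total_vcpus = max(default_vcpus, want_vcpus)
    total_mem = max(default_mem_mb, want_mem)
    return {
        "boot": (default_vcpus, default_mem_mb),
        "hotplug": (total_vcpus - default_vcpus, total_mem - default_mem_mb),
        "total": (total_vcpus, total_mem),
    }

# A pod with two containers requesting 2+1 vCPUs and 3072+1024 MB:
r = vm_resources(1, 2048, [(2, 3072), (1, 1024)])
print(r["total"])    # (3, 4096)
print(r["hotplug"])  # (2, 2048)
```

Note the base cost: even a pod requesting almost nothing still pays for the boot-time defaults, which is exactly the density overhead discussed above.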
Kata supports two distinct image pull models:
Host Pull (Default)
The standard model matching native container behavior:
- containerd/CRI pulls and unpacks the OCI image on the host.
- Image layers are exposed to the guest via filesystem passthrough:
- virtio-fs for overlay-based graph drivers (default)
- virtio-blk/scsi for block-based graph drivers
- The guest mounts the shared filesystem and constructs the container rootfs.
- Image layers can be shared across multiple sandboxes on the same node.
Guest Pull (Confidential Containers)
Required when the host is considered untrusted:
- The containerd snapshotter on the host intercepts the image pull.
- Control is redirected to `image-rs` running inside the guest VM.
- `image-rs` downloads, verifies signatures, and decrypts the image within the TEE.
- Container images never appear in plaintext on the host.
- Trade-off: Images are NOT shared across sandboxes — each VM downloads its own copy.
Available since Kata v2.4.0, Nydus provides on-demand lazy loading of container image layers:
- Research shows 76% of container startup time is spent on image pull, but only 6.4% of pulled data is actually read during execution.
- Nydus uses the RAFS v6 (Registry Acceleration File System) format, which stores image data in content-addressable chunks.
- Chunks are fetched on-demand from the registry as the container accesses files.
- In the Kata integration, `nydusd` replaces `virtiofsd` as the FUSE/virtio-fs daemon, serving image data to the guest via virtio-fs.
- Supports P2P distribution via Dragonfly, reducing registry load by over 80%.
Kata's multi-container pod behavior is semantically equivalent to native Kubernetes pods:
- All containers in a pod share the same VM (the "sandbox").
- A single kata-agent manages all container processes within the VM.
- Containers share the network namespace (same IP, same ports), IPC namespace, and optionally PID namespace.
- Each container has its own mount namespace, cgroup, and user namespace.
The kata-agent differentiates between sandbox creation and container creation using OCI annotations:
- `io.kubernetes.cri-o.ContainerType: "sandbox"` → triggers VM creation
- `io.kubernetes.cri-o.ContainerType: "container"` → creates a container within the existing VM
| Boundary | Native (runc) | Kata Containers |
|---|---|---|
| Inter-pod isolation | Namespace + cgroup (soft) | Hardware VM boundary (hard) |
| Intra-pod isolation | Namespace + cgroup | Namespace + cgroup (inside VM) — equivalent |
| Kernel sharing (inter-pod) | Shared host kernel | Separate guest kernels per pod |
| Kernel sharing (intra-pod) | Shared host kernel | Shared guest kernel within pod |
| Threat | Native (runc) | Kata Containers | gVisor |
|---|---|---|---|
| Container escape via kernel CVE | Vulnerable (shared kernel) | Protected (separate kernel per pod) | Partially protected (reduced syscall surface) |
| Container escape via runtime CVE | Vulnerable | Protected (workload in VM, not on host) | Protected (workload in Sentry) |
| Resource exhaustion (noisy neighbor) | cgroup enforcement | VM-level + cgroup enforcement (dual-layer) | cgroup enforcement |
| Network-based attack (pod-to-pod) | Network policy (soft) | Network policy + VM boundary (hard) | Network policy (soft) |
| Privilege escalation (root in container) | Host kernel exposed | Guest kernel only; host devices NOT passed | Sentry kernel only |
| Side-channel attacks (Spectre, etc.) | Vulnerable (shared kernel/CPU) | Protected (separate VM; can use TEE hardware) | Partially protected |
| Supply chain (malicious image) | Vulnerable | Vulnerable (unless using CoCo with attestation) | Vulnerable |
Kata applies defense-in-depth:
- VM boundary (KVM): Hardware-enforced memory isolation between pods.
- Minimal guest kernel: Stripped-down Linux LTS with only virtio drivers, reducing attack surface.
- Seccomp filters: Applied to the VMM process on the host and within the guest.
- AppArmor/SELinux: Profiles applied to the VMM process and virtiofsd.
- Rootless VMM: Hypervisor process can run as an unprivileged user (QEMU, CLH).
- Minimal guest rootfs: Alpine-based initrd with only the kata-agent binary.
- Confidential computing (optional): TEE hardware (TDX, SEV-SNP) for memory encryption and attestation.
A critical difference from gVisor: Kata runs a real Linux kernel in the guest, providing full syscall compatibility (all ~330+ Linux syscalls). gVisor's Sentry intercepts syscalls and implements only a subset (~70-80%), causing compatibility issues with some workloads (particularly those using advanced filesystem operations, signaling, or less common syscalls).
This is significant for AI/ML workloads, which may use advanced Linux features (io_uring, perf events, eBPF, RDMA verbs) that gVisor does not support.
| Runtime | Cold Start (typical) | With VM Templating | Notes |
|---|---|---|---|
| Native (runc) | 50-100ms | N/A | Process fork + namespace creation |
| gVisor | 50-100ms | N/A | Sentry initialization |
| Kata + QEMU | ~500ms | ~300ms | Full VM boot; most feature-complete |
| Kata + Cloud Hypervisor | ~200ms | ~125ms | Optimized boot path |
| Kata + Firecracker | ~125ms | N/A (no hotplug) | Minimal VMM; pre-allocated resources |
| Kata + Dragonball | ~200ms | ~150ms | In-process VMM; concurrent boot optimization |
VM Templating: Pre-boots a VM and clones it for new sandboxes. Speeds up creation by up to 38.68% and saves approximately 72% of guest memory in high-density scenarios (100 containers saved 9 GB). Currently only available in the Go runtime; migration to runtime-rs is in progress.
| Runtime | Per-Pod Overhead | Components |
|---|---|---|
| Native (runc) | 1-5 MB | Namespace metadata, shim process |
| gVisor | 10-50 MB | Sentry process, gofer |
| Kata + QEMU | 50-150 MB | VMM process, guest kernel, agent, virtiofsd |
| Kata + Cloud Hypervisor | 20-50 MB | Smaller VMM, same guest components |
| Kata + Firecracker | 15-30 MB | Minimal VMM, minimal guest |
| Kata + Dragonball | 20-50 MB | In-process VMM, shared memory space |
The per-pod memory overhead directly impacts container density. On a node with 256 GB RAM:
- runc: theoretical maximum ~50,000+ containers (memory-limited by workload, not runtime)
- Kata + QEMU: ~1,700 pods before overhead alone consumes all memory
- Kata + CLH/Dragonball: ~5,000 pods before overhead alone consumes all memory
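These density ceilings are straight division of node RAM by the worst-case per-pod overhead from the table; a quick sanity check:

```python
# Back-of-envelope pod density from per-runtime overhead alone
# (ignores workload memory). Overhead figures match the table above.
NODE_RAM_MB = 256 * 1024  # 256 GB node

def max_pods(overhead_mb: float) -> int:
    """Pods supportable before runtime overhead alone fills RAM."""
    return int(NODE_RAM_MB / overhead_mb)

print(max_pods(150))  # Kata + QEMU, worst case -> 1747 (~1,700)
print(max_pods(50))   # Kata + CLH/Dragonball, worst case -> 5242 (~5,000)
```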
For CPU-bound workloads, Kata introduces approximately 4% overhead compared to bare metal. This is consistent across hypervisors because the overhead comes from VM exits for privileged instructions and timer interrupts, not from the VMM itself.
For system-call-heavy workloads, overhead increases to 5-15% due to the additional cost of VM exits per syscall. The exact overhead depends on syscall frequency and type.
Storage I/O is the area of greatest performance impact. The overhead varies dramatically by filesystem sharing mechanism:
| Mechanism | Sequential Read (vs. bare metal) | Random Write p99 Latency (vs. bare metal) | Use Case |
|---|---|---|---|
| 9pfs (legacy) | ~15% of bare metal | ~100x worse | Not recommended |
| virtio-fs | ~70-90% of bare metal | ~3-10x worse | General purpose (default) |
| virtio-fs + DAX | ~90-98% of bare metal | ~2-5x worse | Recommended for I/O workloads |
| virtio-blk | ~90-95% of bare metal | ~2-3x worse | Block-based volumes |
| vhost-user-blk | ~95-100% of bare metal | ~1-2x worse | High-performance block I/O |
For Backend.AI's use case of mounting distributed storage (CephFS, NetApp) via bind mounts, the I/O path would be:
Container → guest kernel → virtio-fs → virtiofsd (host) → host kernel VFS → distributed FS client → network → storage cluster
This adds one FUSE round-trip compared to the native path (container → host kernel VFS → distributed FS client → network → storage cluster), which is measurable but not catastrophic for typical AI/ML data loading patterns.
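For capacity planning, the table's throughput fractions can be applied to the host-side baseline; the factors below are illustrative midpoints of the ranges above, not measured values:

```python
# Rough effective-bandwidth estimate per filesystem sharing mechanism.
# Factors are illustrative midpoints of the ranges in the table above.
FACTORS = {
    "9pfs": 0.15,
    "virtio-fs": 0.80,
    "virtio-fs+DAX": 0.94,
    "virtio-blk": 0.925,
    "vhost-user-blk": 0.975,
}

def effective_mbps(baseline_mbps: float, mechanism: str) -> float:
    """Scale the host-side baseline by the mechanism's throughput factor."""
    return baseline_mbps * FACTORS[mechanism]

# A hypothetical 2,000 MB/s CephFS client on the host, seen from the guest:
print(round(effective_mbps(2000, "virtio-fs")))      # 1600
print(round(effective_mbps(2000, "virtio-fs+DAX")))  # 1880
```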
| Metric | Native (runc) | Kata (TC-filter) | Delta |
|---|---|---|---|
| Latency overhead | Baseline | <1ms additional | Minimal |
| CPU per packet | Baseline | ~2.5x | Significant for network-heavy |
| Throughput (single stream) | Near line rate | Near line rate | Parity |
| Throughput (many streams) | Near line rate | Reduced by CPU overhead | Depends on CPU budget |
| Overlay network (VXLAN) | Standard overhead | Standard + virtio overhead | Additive |
The CPU overhead is the primary concern: network-intensive distributed training workloads (e.g., NCCL all-reduce over Ethernet) would consume significantly more CPU for the same data transfer compared to native containers.
Confidential Containers is a CNCF sandbox project that combines Kata Containers with Trusted Execution Environment (TEE) hardware to protect data in use at the pod level. The enclave boundary encompasses the workload pod, kata-agent, and helper processes. Everything outside — hypervisor, other pods, control plane, host OS — is considered untrusted.
| TEE | Vendor | RuntimeClass | Capability |
|---|---|---|---|
| Intel TDX | Intel | `kata-qemu-tdx` | Full VM memory encryption + integrity |
| AMD SEV-SNP | AMD | `kata-qemu-snp` | Full VM memory encryption + integrity |
| IBM Secure Execution | IBM | `kata-qemu-se` | s390x secure execution |
| Intel SGX | Intel | N/A (enclave-based) | Application-level enclaves |
Attester side (inside guest VM):
- Attestation Agent (AA): Generates TEE attestation reports from hardware.
- Confidential Data Hub (CDH): Manages secret and key access within the TEE.
- image-rs: Downloads, verifies, and decrypts container images inside the guest.
Verifier side (external service):
- Key Broker Service (KBS): Validates guest attestation evidence and releases secrets/keys.
- Attestation Service (AS): Verifies TEE evidence against reference values.
- Reference Value Provider Service (RVPS): Manages trusted reference values for verification.
- Workload or kata-agent requests a secret via CDH.
- CDH triggers the AA to generate a TEE attestation report.
- AA sends the report to KBS using the RCAR (Request-Challenge-Attestation-Response) protocol.
- KBS forwards to AS for verification against reference values.
- Only upon successful verification does KBS release the encryption key or secret.
- Supports lazy attestation — triggered on-demand, not at boot.
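The steps above can be sketched as a sequence; every class and method name here is a hypothetical stand-in for the CDH/AA/KBS/AS components, not the real Confidential Containers APIs:

```python
# Hypothetical sketch of the CoCo attestation flow (RCAR protocol).
# Component and method names are illustrative, not the real APIs.

class AttestationAgent:
    def report(self, challenge: str) -> dict:
        # The real AA asks TEE hardware (TDX/SEV-SNP) for a signed report.
        return {"evidence": f"tee-report-for-{challenge}"}

class KeyBrokerService:
    def __init__(self, verify):
        self.verify = verify  # delegates to the Attestation Service

    def request_secret(self, aa: AttestationAgent, secret_id: str) -> str:
        challenge = "nonce-1234"                # Request-Challenge
        report = aa.report(challenge)           # Attestation
        if not self.verify(report, challenge):  # AS checks reference values
            raise PermissionError("attestation failed")
        return f"key-material-for-{secret_id}"  # Response: release secret

kbs = KeyBrokerService(lambda rep, ch: ch in rep["evidence"])
print(kbs.request_secret(AttestationAgent(), "model-decryption-key"))
```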
- Azure AKS: Confidential Containers preview with AMD SEV-SNP (DCa_cc, ECa_cc VM sizes) and Intel TDX.
- Red Hat OpenShift: Confidential containers on bare metal via Assisted Installer.
- NVIDIA GPU Operator: GPU workloads within confidential containers (Hopper GPUs).
Peer pods extend Kata so that the pod sandbox VM runs outside the Kubernetes worker node as a first-class cloud VM, created via cloud provider IaaS APIs. This addresses two key limitations:
- No nested virtualization required: The VM is a first-level instance, not nested inside a worker node VM.
- TEE VMs on demand: Cloud APIs can provision TEE-enabled VMs (e.g., AMD SEV-SNP instances) for confidential workloads.
The Cloud API Adaptor (CAA) DaemonSet runs on each worker node, translating sandbox lifecycle commands into cloud provider API calls (EC2, Azure Compute, GCE, Libvirt). A VxLAN tunnel provides network connectivity between the pod's network namespace on the worker and the peer pod VM.
Persistent volume support has been developed for peer pods, addressing storage attachment to remote VMs.
Nydus is a CNCF project providing a container image format (RAFS — Registry Acceleration File System) that supports on-demand lazy loading. In the Kata integration:
- `nydusd` replaces `virtiofsd` as the FUSE/virtiofs daemon.
- On sandbox creation, nydusd starts and the VM mounts the virtiofs share.
- On container creation, the shim instructs nydusd to mount a RAFS filesystem for the container's image layers.
- The guest constructs an overlay filesystem combining the lazy-loaded RAFS layer with snapshot directories.
- Image data is fetched from the registry on-demand as files are accessed.
Performance impact: Significantly reduces cold start time for large images. Instead of pulling a 10GB ML framework image entirely before starting, only the immediately needed files are fetched, with remaining data loaded in the background.
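Applying the earlier figure that only ~6.4% of pulled data is actually read at startup, the cold-start saving for a 10 GB image works out as follows (illustrative arithmetic, not a benchmark):

```python
# Illustrative cold-start data savings from lazy loading, using the
# "6.4% of pulled data is actually read" figure cited earlier.
IMAGE_GB = 10.0
READ_FRACTION = 0.064

eager_gb = IMAGE_GB                 # full pull before the container starts
lazy_gb = IMAGE_GB * READ_FRACTION  # fetched on-demand at startup
print(f"eager pull: {eager_gb:.1f} GB, lazy start: {lazy_gb:.2f} GB")
# The remaining ~9.36 GB streams in the background as files are touched.
```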
Kata 2.0+ implements a pull-based metrics system integrated with Prometheus:
- Zero overhead when not monitored: Metrics are collected only when Prometheus scrapes the endpoint.
- Stateless: No metrics held in memory between scrapes.
- Source: Primarily the `/proc` filesystem, for minimal performance impact.
kata-monitor is a per-node management agent that:
- Aggregates metrics from running sandboxes, attaching `sandbox_id` labels.
- Attaches Kubernetes metadata (`cri_uid`, `cri_name`, `cri_namespace`).
- Provides a single Prometheus scrape target per node.
- The shim's metrics socket is at `vc/sbs/${PODID}/shim-monitor.sock`.
- Kata Agent Metrics: Process statistics from guest `/proc/<pid>/{io, stat, status}`.
- Guest OS Metrics: Guest system-level data from `/proc/{stats, diskstats, meminfo, vmstats}` and `/proc/net/dev`.
- Hypervisor Metrics: VMM process resource consumption.
- Shim Metrics: Runtime process observations.
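kata-monitor serves these metrics in the standard Prometheus text exposition format; a minimal parser sketch (the sample payload and metric names below are illustrative, not the exact kata-monitor metric set):

```python
# Minimal parser for the Prometheus text exposition format, as served
# by kata-monitor. The sample payload and metric names are illustrative.
def parse_prometheus(text: str) -> dict:
    """Map 'name{labels}' -> float value, skipping HELP/TYPE comments."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comment or blank line
        name_labels, value = line.rsplit(" ", 1)
        metrics[name_labels] = float(value)
    return metrics

sample = """\
# HELP kata_guest_meminfo_memfree Guest free memory.
# TYPE kata_guest_meminfo_memfree gauge
kata_guest_meminfo_memfree{sandbox_id="abc123"} 104857600
kata_shim_threads{sandbox_id="abc123"} 12
"""
m = parse_prometheus(sample)
print(m['kata_guest_meminfo_memfree{sandbox_id="abc123"}'])  # 104857600.0
```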
Performance characteristics:
- End-to-end latency (Prometheus to kata-monitor): ~20ms average
- Agent RPC latency (shim to agent): ~3ms average
- Metrics payload per sandbox: ~144KB uncompressed, ~10KB gzipped
- Grafana Cloud (Prometheus-based)
- Datadog (containerd endpoint-based, zero additional setup)
- Red Hat OpenShift Monitoring (integrated for sandboxed containers)
| Feature | Status | Impact Level | Notes |
|---|---|---|---|
| `--net=host` (host networking) | Not supported | High | Fundamental VM limitation; cannot share host network namespace |
| `--link` (container linking) | Not supported | Low | Deprecated in Docker; replaced by user-defined networks |
| `--shm-size` | Not implemented | Medium | `/dev/shm` sizing broken with tmpfs emptyDir; capped at 64KB |
| `--sysctl` | Not implemented | Medium | Cannot tune guest kernel parameters from host |
| `checkpoint` / `restore` (CRIU) | Not implemented | High | No live migration or suspend/resume; open issue since April 2021 |
| Live VM migration | Not implemented | High | Underlying hypervisors support it but integration is unsolved |
| Block I/O weight (cgroups) | Not supported | Low | Other blkio controls (bps, iops) work |
| `events` (OOM notifications) | Partial | Medium | OOM and Intel RDT stats incomplete |
| `volumeMount.subPath` | Not supported | Medium | Kubernetes volume limitation |
| Podman integration | Not supported | Medium | Docker supported since 22.06 |
| Multi-GPU passthrough | Not supported | Critical | Single GPU only via VFIO |
| NVIDIA vGPU | Not supported | Critical | Cannot share a physical GPU across multiple Kata VMs |
| GPU time-slicing (MPS) | Not supported | Critical | Requires host-side MPS daemon access |
| Network namespace sharing | Broken | Medium | Takes over interfaces, breaking connectivity |
| Full rootless operation | Partial | Medium | Only VMM is rootless; shim + virtiofsd still need root |
| Docker Compose | Limited | Low | Basic support; advanced features may not work |
| `docker exec` with TTY | Supported | — | Works via kata-agent ExecProcess API |
| Privileged containers | Behavioral difference | Medium | Host devices NOT passed by default; must be explicitly configured |
| Feature | QEMU | Cloud Hypervisor | Firecracker | Dragonball | StratoVirt |
|---|---|---|---|---|---|
| CPU hotplug | Yes | Yes | No | Yes | No |
| Memory hotplug | Yes | Yes | No | Yes | No |
| Disk hotplug | Yes | Yes | No | Yes | Yes |
| VFIO passthrough | Yes | Yes | No | Yes | No |
| virtio-fs | Yes | Yes | No | Yes | Yes |
| 9pfs | Yes | No | No | No | No |
| VM templating | Yes | Yes | N/A | In progress | N/A |
| Confidential computing | TDX, SEV-SNP | TDX, SEV-SNP | No | Planned | No |
| Rootless VMM | Yes | Yes | Jailer mechanism | In progress | No |
- Docker 26.0.0: Broke Kata networking (`io.containerd.run.kata.v2`), demonstrating fragility in integration points. Workarounds were required.
- AMD CPU hotplug: Reported failures in some configurations (GitHub issue #2875).
- Memory hotplug with low initial memory: Can fail if the guest kernel cannot online hot-plugged memory.
- Cgroups v2: Supported but not default; requires explicit configuration in `configuration.toml` and the guest kernel command line.
Ant Group (Alibaba)
- Using Kata Containers in production since 2019.
- Thousands of tasks scheduled on Kata across thousands of nodes.
- Co-locates online applications and batch jobs.
- Built AntCWPP (Cloud Workload Protection Platform) combining Kata with eBPF for defense-in-depth: eBPF programs attached to LSM hooks and network control points inside the Kata VM kernel.
- Successfully eliminated container escape risks, detected reverse shells, fileless malware, and policy drift.
- Reduced per-payment energy consumption by 50% through improved workload co-location enabled by strong isolation.
IBM Cloud
- Deploys Kata for Cloud Shell and CI/CD Pipeline SaaS offerings.
- CI/CD platform scaled 30-40x over five years with sustained ~10% monthly growth.
- Uses Kubernetes with Tekton as pipeline orchestrator; Kata plugs directly into the existing stack.
- Key challenge identified: filesystem/snapshotter setup for the Kata guest is complex and needs simplification.
- Anticipates increasing demand as AI-driven automation generates more workloads requiring secure isolation.
Northflank
- Runs over 2 million microVMs per month inside Kubernetes.
- Uses Kata Containers with Cloud Hypervisor in production.
- Multi-tenant platform where security is essential for running untrusted images.
- Developed custom tooling for running Kata on GKE with custom node images.
- Had to customize virtiofs settings and enable non-default kernel features, requiring custom Kata builds.
Microsoft Azure
- Pod Sandboxing feature on AKS uses Kata Containers.
- Confidential Containers preview with AMD SEV-SNP and Intel TDX.
- Reported as "extremely well-received by internal and external AKS customers" since February 2023 preview.
AWS
- Documents Kata integration with EKS for enhanced workload isolation.
- Requires bare-metal EC2 instances (e.g., `m5.metal`, `c5.metal`) as worker nodes.
Baidu
- Runs Kata in production for Function Computing, Cloud Container Instances, and Edge Computing.
Operational lessons

- Seamless Kubernetes integration is Kata's greatest operational strength: workloads require minimal configuration changes beyond specifying `runtimeClassName`.
- Filesystem/snapshotter complexity is the biggest barrier to adoption (cited by IBM Cloud).
- Custom Kata builds may be required for advanced use cases (virtiofs tuning, kernel features — cited by Northflank).
- VMM selection significantly impacts operational profile: Cloud Hypervisor is preferred for performance, QEMU for feature completeness.
- Defense in depth is essential: Combining Kata with eBPF (as Ant Group does) provides both strong isolation and fine-grained runtime security.
- Steep learning curve: Operational knowledge of both container and VM management is required.
- Cloud provider constraints: Many cloud providers do not support nested virtualization on standard instances, requiring either bare-metal instances (expensive) or peer pods (complex).
This section maps Kata Containers' capabilities against Backend.AI's four core subsystems:
| Backend.AI Feature | Kata Compatibility | Assessment |
|---|---|---|
| Session lifecycle (PENDING→RUNNING→TERMINATED) | Compatible | Kata is transparent to the scheduler; RuntimeClass selection is the only change |
| Multi-node sessions (cluster_mode=MULTI_NODE) | Compatible | Overlay networking works with Kata (additional overhead) |
| Sokovan scheduler (FIFO, LIFO, DRF) | Compatible | No scheduler changes needed; resource accounting must include VM overhead |
| Agent-Manager RPC (ZeroMQ/Callosum) | Compatible | Agent runs on host; manages Kata containers via containerd |
| Session priority and deprioritization | Compatible | No changes needed |
| Hot-plug resources during session | Partially compatible | Depends on hypervisor; QEMU/CLH support hot-plug |
| Idle session detection and timeout | Compatible | Metrics available via kata-monitor |
Key consideration: Resource accounting must include the per-pod VM overhead (15-150 MB memory, plus CPU for the VMM). The scheduler's `available_slots` calculation on agents must subtract this overhead per expected concurrent session.
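A minimal sketch of that subtraction, assuming hypothetical overhead constants and function names modeled loosely on Backend.AI's resource-slot accounting (not actual Backend.AI code):

```python
# Illustrative resource-accounting adjustment for Kata sessions.
# Overhead figures come from the memory-overhead table earlier;
# function and key names are hypothetical stand-ins.
KATA_OVERHEAD_MB = {"qemu": 150, "clh": 50, "dragonball": 50, "fc": 30}

def available_memory_mb(node_mem_mb: int, expected_sessions: int,
                        hypervisor: str = "clh") -> int:
    """Subtract per-session VM overhead before advertising memory slots."""
    overhead = KATA_OVERHEAD_MB[hypervisor] * expected_sessions
    return max(node_mem_mb - overhead, 0)

# 256 GB node, 100 expected concurrent Kata sessions:
print(available_memory_mb(262144, expected_sessions=100))           # 257144
print(available_memory_mb(262144, 100, hypervisor="qemu"))          # 247144
```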
| Backend.AI Feature | Kata Compatibility | Assessment |
|---|---|---|
| Docker Swarm overlay networks | Compatible | Kata's TC-filter/MACVTAP works with overlay |
| Multi-container SSH (cluster_ssh_port_mapping) | Compatible | SSH within VM overlay works |
| Port exposure (service ports) | Compatible | Kata network plugin expose_ports() equivalent |
| RDMA / InfiniBand | Partially compatible | Requires VFIO passthrough; hypervisor-dependent |
| MPI over overlay (OMPI) | Compatible | Kata sets OMPI_MCA_btl_tcp_if_exclude automatically |
Key consideration: Network-intensive distributed training (NCCL all-reduce) will consume approximately 2.5x more CPU per packet due to the virtio-net path. For GPU-heavy training where network is the bottleneck, this overhead may be acceptable since GPUs consume most wall-clock time.
| Backend.AI Feature | Kata Compatibility | Assessment |
|---|---|---|
| VFolder bind mounts (VFolderMount → Docker bind mount) | Compatible | Via virtio-fs passthrough (overhead) |
| CephFS / NetApp / WekaFS mounts | Compatible | Host-side FS client + virtio-fs to guest |
| Intrinsic mounts (/home/work, /home/config) | Compatible | Via virtio-fs |
| Multiple vfolders per session | Compatible | Multiple bind mounts via virtio-fs |
| Read-only / read-write permissions | Compatible | Virtio-fs supports RO/RW |
| /home/config/resource.txt (KernelResourceSpec) | Compatible | Written via virtio-fs shared mount |
Key consideration: Every I/O operation traverses the virtio-fs FUSE path, adding measurable latency. For AI/ML workloads with large dataset loading (e.g., reading training data from CephFS), this overhead is most noticeable during data pipeline phases. GPU compute phases are unaffected.
| Backend.AI Feature | Kata Compatibility | Assessment |
|---|---|---|
| Single GPU allocation (DiscretePropertyAllocMap) | Compatible | Via VFIO passthrough |
| Multi-GPU allocation (2+ GPUs per container) | NOT compatible | Kata limits to single GPU via VFIO |
| Fractional GPU (FractionAllocMap, cuda.shares) | NOT compatible | VFIO passes whole GPU; no hook library support |
| CUDA hook libraries (generate_hooks()) | NOT compatible | Host-side hook injection does not work across VM boundary |
| NVIDIA Container Toolkit (generate_docker_args()) | NOT compatible | Toolkit's device mapping requires host kernel driver sharing |
| GPU metrics (gather_node_measures()) | Partially compatible | VFIO GPU metrics must come from guest-side monitoring |
| NUMA-aware allocation (AffinityHint) | Partially compatible | VM vCPU pinning can respect NUMA; GPU NUMA irrelevant with VFIO |
| ROCm, Habana, IPU, TPU plugins | Partially compatible | VFIO passthrough for each; single device only |
| Xilinx FPGA (fractional) | NOT compatible | Fractional allocation requires host-side arbitration |
This is the most significant incompatibility. Backend.AI's core value proposition includes fractional GPU sharing and multi-GPU allocation, both of which are fundamentally incompatible with Kata's VFIO passthrough model.
| Criterion | Feasibility | Blocking? |
|---|---|---|
| Single-GPU workloads (inference) | Feasible | No |
| Multi-GPU workloads (training) | Not feasible | Yes |
| Fractional GPU workloads | Not feasible | Yes |
| CPU-only workloads | Feasible | No |
| Storage-intensive workloads | Feasible with overhead | No |
| Network-intensive distributed training | Feasible with overhead | No |
| Interactive notebooks (startup latency) | Feasible (200-500ms overhead) | Marginal |
| High-density deployments | Feasible with reduced density | Depends on use case |
| Confidential AI workloads | Feasible (compelling use case) | No |
If Backend.AI were to support Kata as an optional runtime for specific workload classes, the integration would involve:
- RuntimeClass selection: Add `runtime_class` to `SessionCreationSpec`, allowing users to opt into Kata isolation for specific sessions.
- Resource accounting adjustment: Agents subtract per-session VM overhead from `available_slots` when the Kata runtime is selected.
- Accelerator plugin modification: When Kata is selected, constrain allocation to single whole-GPU VFIO mode, disabling fractional allocation and multi-GPU for Kata sessions.
- Storage path: No changes needed — VFolderMount → bind mount path works through virtio-fs transparently. Accept I/O overhead.
- Network path: No changes needed — Docker Swarm overlay works with Kata's TC-filter. Accept CPU overhead for network-intensive workloads.
- Monitoring: Integrate kata-monitor metrics into Backend.AI's agent statistics collection.
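The runtime selection and accelerator constraints above could be enforced with a validation guard at session-creation time; `SessionSpec` and its fields here are hypothetical illustrations, not actual Backend.AI types:

```python
# Sketch of session-creation validation if runtime_class were added.
# SessionSpec and the checks are hypothetical, not Backend.AI code.
from dataclasses import dataclass

@dataclass
class SessionSpec:
    runtime_class: str = "runc"   # "runc" or "kata"
    gpu_count: int = 0
    gpu_fraction: float = 0.0     # fractional GPU share; 0 = none

def validate(spec: SessionSpec) -> None:
    if spec.runtime_class != "kata":
        return  # native runtime: no extra constraints
    # Kata constraints from the compatibility matrix above:
    if spec.gpu_count > 1:
        raise ValueError("Kata supports only a single VFIO GPU")
    if spec.gpu_fraction > 0:
        raise ValueError("fractional GPU is incompatible with VFIO passthrough")

validate(SessionSpec(runtime_class="kata", gpu_count=1))  # accepted
try:
    validate(SessionSpec(runtime_class="kata", gpu_count=4))
except ValueError as e:
    print(e)  # Kata supports only a single VFIO GPU
```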
Kata Containers provides genuine hardware-enforced isolation for containerized workloads while maintaining OCI compliance and Kubernetes integration. The Kata 3.x Rust rewrite and built-in Dragonball VMM have significantly improved performance and operational characteristics.
However, critical feature gaps exist for Backend.AI's primary use cases:
- Multi-GPU and fractional GPU allocation are not supported, which directly conflicts with Backend.AI's core accelerator management system.
- Storage I/O overhead is measurable but acceptable for most AI/ML workload patterns (data loading is typically not the bottleneck).
- Network overhead (2.5x CPU for network-intensive workloads) is relevant for distributed training but may be acceptable when GPU compute dominates wall-clock time.
- Cold start overhead (125-500ms) is noticeable for interactive sessions but unlikely to be a dealbreaker.
- Checkpoint/restore is not supported, preventing session migration or suspend/resume features.
Short-term: Do not adopt Kata as a general-purpose runtime for Backend.AI sessions. The GPU incompatibilities are fundamental and cannot be worked around without architectural changes to either Backend.AI or Kata.
Medium-term: Consider supporting Kata as an optional isolation mode for specific workload classes:
- CPU-only workloads in multi-tenant environments
- Single-GPU inference workloads where security isolation is required
- Confidential computing workloads (CoCo with TDX/SEV-SNP) for privacy-sensitive AI
Long-term: Monitor these Kata developments:
- Multi-GPU VFIO passthrough (NVIDIA Blackwell enables this for confidential containers)
- CDI (Container Device Interface) integration for standardized device management
- Runtime-rs reaching full feature parity with the Go runtime (Kata 4.0 target)
- Improvements to vGPU and MIG support within VMs
For multi-tenant isolation without Kata's GPU limitations:
- gVisor: Provides kernel-level isolation without VMs, but has syscall compatibility limitations that may affect AI/ML workloads.
- NVIDIA MIG + namespace isolation: Hardware GPU partitioning with standard container isolation; no VM overhead but weaker isolation boundary.
- Firecracker (direct): For serverless/FaaS-style workloads without GPU requirements; excellent density and startup time.
- Kata + GPU operator evolution: NVIDIA is actively developing confidential multi-GPU support; this may resolve the GPU limitation within 1-2 years.
- Kata Containers Architecture Design. https://github.com/kata-containers/kata-containers/blob/main/docs/design/architecture/README.md
- Kata Containers Virtualization Design. https://github.com/kata-containers/kata-containers/blob/main/docs/design/virtualization.md
- Kata Containers Hypervisor Comparison. https://github.com/kata-containers/kata-containers/blob/main/docs/hypervisors.md
- Kata Containers Limitations. https://github.com/kata-containers/kata-containers/blob/main/docs/Limitations.md
- Kata Containers Guest Assets. https://github.com/kata-containers/kata-containers/blob/main/docs/design/architecture/guest-assets.md
- Kata Containers 2.0 Metrics Design. https://github.com/kata-containers/kata-containers/blob/main/docs/design/kata-2-0-metrics.md
- Kata Nydus Design Document. https://github.com/kata-containers/kata-containers/blob/main/docs/design/kata-nydus-design.md
- Kata Containers Rootless VMM Guide. https://github.com/kata-containers/kata-containers/blob/main/docs/how-to/how-to-run-rootless-vmm.md
- Kata Containers with Kubernetes. https://github.com/kata-containers/kata-containers/blob/main/docs/how-to/run-kata-with-k8s.md
- Getting Rust-y: Introducing Kata Containers 3.0.0. https://katacontainers.io/blog/getting-rust-y-introducing-kata-containers-3-0-0/
- Kata 3.0 Technical Deep Dive (Alibaba Cloud). https://www.alibabacloud.com/blog/kata-3-0-is-coming-start-experiencing-the-out-of-the-box-secure-container_599575
- Kata Containers 3.5.0 Release Overview. https://katacontainers.io/blog/kata-containers-3-5-0-release-overview/
- Kata Containers Project Updates Q4 2024. https://katacontainers.io/blog/kata-containers-project-updates-q4-2024/
- Kata Community PTG Updates October 2025. https://katacontainers.io/blog/kata-community-ptg-updates-october-2025/
- IBM Cloud Kata Case Study. https://katacontainers.io/blog/kata-containers-ibm-cloud-updates/
- Northflank Kata Case Study. https://katacontainers.io/blog/kata-containers-northflank-case-study/
- Ant Group Kata + eBPF Whitepaper. https://katacontainers.io/collateral/kata-containers-ant-group_whitepaper.pdf
- Ant Group Container Security with eBPF. https://ebpf.foundation/ant-group-secures-their-platform-with-kata-containers-and-ebpf-for-fine-grained-control/
- StackHPC Kata I/O Performance Analysis. https://www.stackhpc.com/kata-io-1.html
- Xinnor vhost-user-blk Performance. https://xinnor.io/blog/bridging-the-storage-performance-gap-in-kata-containers-with-vhost-user-and-xiraid-opus-using-phison-x200-nvme-drives/
- OpenShift Sandboxed Containers Network Performance. https://www.redhat.com/en/blog/openshift-sandboxed-containers-network-performance
- Runtime Comparison: gVisor vs Kata vs Firecracker (2025). https://onidel.com/blog/gvisor-kata-firecracker-2025
- AWS: Enhancing Kubernetes Workload Isolation with Kata. https://aws.amazon.com/blogs/containers/enhancing-kubernetes-workload-isolation-and-security-using-kata-containers/
- Azure AKS Pod Sandboxing. https://learn.microsoft.com/en-us/azure/aks/use-pod-sandboxing
- Azure AKS Confidential Containers Overview. https://learn.microsoft.com/en-us/azure/aks/confidential-containers-overview
- NVIDIA GPU Operator with Kata. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/24.9.2/gpu-operator-kata.html
- NVIDIA GPU Operator with Confidential Containers. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/24.9.1/gpu-operator-confidential-containers.html
- Confidential Containers Design Overview. https://confidentialcontainers.org/docs/architecture/design-overview/
- Red Hat: Introducing Confidential Containers on Bare Metal. https://www.redhat.com/en/blog/introducing-confidential-containers-bare-metal
- Intel: Confidential Containers Made Easy. https://www.intel.com/content/www/us/en/developer/articles/technical/confidential-containers-made-easy.html
- Cloud API Adaptor (Peer Pods) Architecture. https://github.com/confidential-containers/cloud-api-adaptor/blob/main/docs/architecture.md
- Kata vs Firecracker vs gVisor (Northflank). https://northflank.com/blog/kata-containers-vs-firecracker-vs-gvisor
- Edera Isolation Comparison. https://edera.dev/stories/kata-vs-firecracker-vs-gvisor-isolation-compared
- Kata Containers for Docker in 2024. https://medium.com/@filippfrizzy/kata-containers-for-docker-in-2024-1e98746237ca