diff --git a/docs/reports/kata-containers-feature-parity-analysis.md b/docs/reports/kata-containers-feature-parity-analysis.md new file mode 100644 index 00000000000..c029fa3c725 --- /dev/null +++ b/docs/reports/kata-containers-feature-parity-analysis.md @@ -0,0 +1,1066 @@ +# Kata Containers Feature Parity Analysis with Native Container Runtimes + +| Field | Value | +|---|---| +| **Document ID** | TR-2026-001 | +| **Date** | 2026-02-27 | +| **Author** | Backend.AI Architecture Team | +| **Status** | Final | +| **Classification** | Internal Technical Reference | +| **Scope** | Kata Containers 3.x architecture, feature parity with Docker/containerd/runc, and feasibility assessment for Backend.AI integration | + +--- + +## Table of Contents + +1. [Executive Summary](#1-executive-summary) +2. [Introduction and Motivation](#2-introduction-and-motivation) +3. [Kata Containers Architecture](#3-kata-containers-architecture) + - 3.1 [Three-Environment Model](#31-three-environment-model) + - 3.2 [Runtime Components](#32-runtime-components) + - 3.3 [Container Creation Flow](#33-container-creation-flow) + - 3.4 [Kata 3.x Architectural Evolution](#34-kata-3x-architectural-evolution) +4. [Hypervisor Landscape](#4-hypervisor-landscape) + - 4.1 [Supported Hypervisors](#41-supported-hypervisors) + - 4.2 [Feature Comparison Matrix](#42-feature-comparison-matrix) + - 4.3 [Selection Guidelines](#43-selection-guidelines) +5. [Feature Parity Analysis](#5-feature-parity-analysis) + - 5.1 [Networking](#51-networking) + - 5.2 [Storage and Volumes](#52-storage-and-volumes) + - 5.3 [GPU and Device Passthrough](#53-gpu-and-device-passthrough) + - 5.4 [Resource Management](#54-resource-management) + - 5.5 [Container Image Handling](#55-container-image-handling) + - 5.6 [Multi-Container Pods](#56-multi-container-pods) + - 5.7 [Security Model](#57-security-model) +6. 
[Performance Analysis](#6-performance-analysis) + - 6.1 [Startup Latency](#61-startup-latency) + - 6.2 [Memory Overhead](#62-memory-overhead) + - 6.3 [CPU Performance](#63-cpu-performance) + - 6.4 [I/O Performance](#64-io-performance) + - 6.5 [Network Performance](#65-network-performance) +7. [Advanced Capabilities](#7-advanced-capabilities) + - 7.1 [Confidential Containers (CoCo)](#71-confidential-containers-coco) + - 7.2 [Peer Pods / Remote Hypervisor](#72-peer-pods--remote-hypervisor) + - 7.3 [Nydus Image Acceleration](#73-nydus-image-acceleration) + - 7.4 [Monitoring and Observability](#74-monitoring-and-observability) +8. [Known Limitations and Unsupported Features](#8-known-limitations-and-unsupported-features) +9. [Production Adoption](#9-production-adoption) +10. [Implications for Backend.AI](#10-implications-for-backendai) +11. [Conclusions and Recommendations](#11-conclusions-and-recommendations) +12. [References](#12-references) + +--- + +## 1. Executive Summary + +Kata Containers is an open-source project under the OpenInfra Foundation that isolates container workloads with hardware virtualization by running each Kubernetes pod inside a lightweight virtual machine (VM). It delivers the security guarantees of VMs while maintaining OCI runtime specification compliance and seamless Kubernetes integration via the RuntimeClass mechanism. + +This report provides a comprehensive analysis of Kata Containers 3.x architecture, evaluates feature parity against native container runtimes (Docker, containerd with runc), and assesses implications for Backend.AI's container orchestration stack. The analysis covers networking, storage, GPU passthrough, resource management, security, performance, and advanced features including Confidential Containers. + +**Key findings:** + +- Kata provides strong hardware-enforced isolation with full OCI compliance, making it suitable for multi-tenant environments where untrusted workloads must be sandboxed.
+- Significant feature gaps exist for Backend.AI's core use cases: multi-GPU passthrough is not supported (single GPU only via VFIO), storage I/O adds measurable overhead through the virtualization layer, and checkpoint/restore is not implemented. +- Kata 3.x's Rust rewrite and built-in Dragonball VMM deliver substantial performance improvements, with Alibaba demonstrating 3,000 secure containers launched in 6 seconds on a single host. +- Cold start overhead (125-500ms depending on hypervisor) and per-pod memory overhead (15-150MB) impact density and interactive session responsiveness compared to native containers. +- Confidential Containers (CoCo) integration with Intel TDX and AMD SEV-SNP represents a compelling future capability for privacy-sensitive AI/ML workloads. + +--- + +## 2. Introduction and Motivation + +### 2.1 Background + +Traditional Linux containers (runc, Docker) share the host kernel and rely on namespaces and cgroups for isolation. While efficient, this model exposes a large kernel attack surface — a kernel vulnerability in any container can potentially compromise the entire host and all co-located workloads. This is a significant concern in multi-tenant environments where untrusted code executes alongside sensitive workloads. + +Kata Containers addresses this by interposing a hypervisor boundary between the host and the containerized workload. Each pod receives its own guest kernel, eliminating cross-pod kernel attack surface. The project originated from the merger of Intel Clear Containers and Hyper runV in December 2017, combining Intel's hardware virtualization expertise with Hyper's container-optimized VM technology. 
+ +### 2.2 Scope of Analysis + +This report evaluates Kata Containers from the perspective of a container orchestration platform (Backend.AI) that: + +- Schedules and orchestrates containerized AI/ML workloads across distributed agent nodes +- Provides fractional and multi-GPU allocation to containers +- Mounts distributed storage (CephFS, NetApp, etc.) to containers via bind mounts +- Manages multi-node overlay networks for distributed training +- Requires low-latency interactive session startup for notebook environments + +The analysis focuses on Kata 3.x (current mainline) with references to the upcoming 4.0 release where relevant. + +### 2.3 Methodology + +This report synthesizes information from the following sources: + +- Official Kata Containers architecture documentation and design documents (GitHub repository) +- Community blog posts and project updates (2024-2025) +- Conference presentations (KubeCon, OpenInfra Summit) +- Third-party benchmarks and production case studies (StackHPC, Red Hat, Northflank) +- Cloud provider documentation (AWS, Azure, GCP) +- Academic papers on container runtime performance analysis + +--- + +## 3. Kata Containers Architecture + +### 3.1 Three-Environment Model + +Kata Containers defines three distinct execution contexts that are fundamental to understanding its behavior: + +**Host**: The physical machine running the container manager (containerd or CRI-O), the Kata runtime shim process, and the hypervisor process. The host is considered **untrusted** in confidential computing scenarios. + +**Guest VM**: A lightweight virtual machine with its own Linux kernel, root filesystem, and the kata-agent daemon. The VM provides hardware-enforced isolation using KVM. The guest kernel is a standard Linux LTS kernel stripped down for container workloads — minimal drivers, fast boot, and small memory footprint. + +**Container**: An isolated process environment *within* the guest VM, created using standard Linux namespaces and cgroups. 
From the workload's perspective, it appears identical to a native container. From the host's perspective, it is doubly isolated — first by the VM boundary, then by namespace/cgroup isolation within the guest. + +The mapping model is **one VM per Kubernetes pod**. All containers within a single pod share the same VM. This preserves the Kubernetes pod semantics (shared network namespace, IPC, optional PID namespace sharing) while adding a hardware isolation boundary at the pod level. + +``` +┌─────────────────────── HOST ───────────────────────┐ +│ │ +│ ┌──────────────────┐ ┌──────────────────────┐ │ +│ │ containerd / │ │ Other pods │ │ +│ │ CRI-O │ │ (runc or kata) │ │ +│ └────────┬─────────┘ └──────────────────────┘ │ +│ │ shimv2 API │ +│ ┌────────▼─────────┐ │ +│ │ containerd-shim- │ │ +│ │ kata-v2 │ │ +│ │ (1 per pod) │ │ +│ └────────┬─────────┘ │ +│ │ fork + manage │ +│ ┌────────▼─────────┐ ┌──────────────────────┐ │ +│ │ Hypervisor │ │ virtiofsd │ │ +│ │ (QEMU/CLH/ │ │ (filesystem sharing) │ │ +│ │ Dragonball) │ └──────────────────────┘ │ +│ └────────┬─────────┘ │ +│ │ VM boundary (KVM) │ +│ ╔════════╧═══════════════════════════════════════╗ │ +│ ║ GUEST VM ║ │ +│ ║ ┌─────────────────────────────────────────┐ ║ │ +│ ║ │ Guest Kernel (Linux LTS, minimal) │ ║ │ +│ ║ └─────────────────────────────────────────┘ ║ │ +│ ║ ┌─────────────────────────────────────────┐ ║ │ +│ ║ │ kata-agent (Rust, ttRPC over VSOCK) │ ║ │ +│ ║ └──────────┬──────────────────────────────┘ ║ │ +│ ║ │ ║ │ +│ ║ ┌──────────▼────┐ ┌───────────────────┐ ║ │ +│ ║ │ Container A │ │ Container B │ ║ │ +│ ║ │ (namespaces │ │ (namespaces │ ║ │ +│ ║ │ + cgroups) │ │ + cgroups) │ ║ │ +│ ║ └───────────────┘ └───────────────────┘ ║ │ +│ ╚════════════════════════════════════════════════╝ │ +└─────────────────────────────────────────────────────┘ +``` + +### 3.2 Runtime Components + +#### 3.2.1 containerd-shim-kata-v2 (The Runtime Shim) + +The shim is the single entry point for Kata Containers on the host 
side. It implements the containerd Runtime V2 (shimv2) API, which replaced the earlier model requiring 2N+1 separate processes (kata-runtime, kata-shim, kata-proxy, and reaper) per pod. + +Key characteristics: + +- **One shim process per pod**: A single long-running daemon handles all containers within a pod. +- **Unix domain socket communication**: containerd creates a socket and passes it to the shim at startup. All subsequent lifecycle commands flow over this socket as ttRPC calls. +- **Configuration**: Reads from `configuration.toml` (default at `/usr/share/defaults/kata-containers/`), which specifies hypervisor type, guest kernel/image paths, resource defaults, and feature toggles. +- **virtcontainers library**: Internally uses the virtcontainers abstraction layer, providing a runtime-spec-agnostic, hardware-virtualized containers API. + +#### 3.2.2 Hypervisor + +The hypervisor is spawned by the shim (or, in Dragonball's case, runs in-process) to create and manage the VM. It: + +- Creates the VM with configured CPU/memory resources +- Sets up virtio devices (vsock for host-guest communication, block/fs for storage, net for networking) +- DAX-maps the guest image into VM memory for zero-copy access (on supported hypervisors) +- Supports hotplugging of CPU, memory, and block devices (hypervisor-dependent) + +#### 3.2.3 kata-agent + +The agent is a Rust daemon running inside the guest VM. It serves as the supervisor for all container operations within the VM: + +- **Deployment modes**: In rootfs image deployments, systemd is PID 1 and kata-agent runs as a systemd service. In initrd deployments, kata-agent runs directly as PID 1 (no init system). +- **Communication**: Runs a ttRPC server (a lightweight gRPC alternative optimized for low-overhead communication) listening on a VSOCK socket. +- **API surface**: `CreateSandbox`, `CreateContainer`, `StartContainer`, `ExecProcess`, `WaitProcess`, `DestroySandbox`, and related operations.
+- **Responsibilities**: Creates and manages namespaces, cgroups, and mounts within the guest for each container. Carries stdio streams (stdout, stderr, stdin) between containers and the host runtime. + +#### 3.2.4 Communication Protocol: ttRPC over VSOCK + +Host-guest communication uses ttRPC (a trimmed-down, more efficient alternative to gRPC designed for low-memory environments) over VSOCK (virtual socket, available since Linux kernel 4.8). VSOCK provides a direct host-guest socket without requiring network configuration. The protocol: + +- Uses Protocol Buffers for serialization +- Supports multiplexed streams for concurrent container I/O +- Eliminates the need for the separate kata-proxy process that was required in pre-shimv2 architectures when using virtio-serial + +### 3.3 Container Creation Flow + +The step-by-step process from a pod creation request to a running container: + +1. Kubernetes kubelet or Docker sends a container creation request to the CRI runtime (containerd or CRI-O). +2. CRI runtime identifies the RuntimeClass as `kata` and invokes the `containerd-shim-kata-v2` binary. +3. containerd creates a Unix domain socket and passes it to the shim, which starts as a long-running daemon. +4. The shim reads `configuration.toml` and launches the configured hypervisor (QEMU, Cloud Hypervisor, or Dragonball in-process). +5. The hypervisor boots the VM with the guest kernel and guest image. The guest image is DAX-mapped into VM memory for zero-copy access (on x86_64 with QEMU). +6. The kata-agent starts inside the VM (either as PID 1 for initrd or via systemd for rootfs images). +7. The shim establishes a ttRPC connection to the agent over VSOCK. +8. The shim calls `CreateSandbox` on the agent, establishing the pod-level environment. +9. For each container in the pod, the shim calls `CreateContainer`, and the agent creates the appropriate namespaces, cgroups, rootfs mounts, and environment within the guest. +10. The agent spawns the workload process and returns success. +11.
The shim reports container running status to containerd. + +**Shutdown**: The agent detects workload exit, captures the exit status, and returns it via `WaitProcess`. For forced shutdown, `DestroySandbox` terminates the VM via the hypervisor. + +### 3.4 Kata 3.x Architectural Evolution + +Kata 3.x represents a fundamental architectural shift from the Go-based runtime of version 2.x: + +| Component | Kata 1.x | Kata 2.x | Kata 3.x | +|---|---|---|---| +| kata-agent | Go | Rust | Rust | +| kata-runtime | Go | Go | **Rust** (runtime-rs) | +| Shim model | 2N+1 processes | shimv2 (single process, Go) | shimv2 (single process, Rust) | +| VMM integration | External process | External process | **Built-in** (Dragonball) or external | +| Policy engine | N/A | Go-based OPA | Rust-based regorus | +| Async runtime | goroutines | goroutines | **Tokio** (configurable worker threads) | + +#### 3.4.1 Rationale for the Rust Rewrite + +- **Performance**: Eliminates Go garbage collection pauses and reduces memory consumption. The runtime is a long-lived system daemon where GC pauses can impact container lifecycle latency. +- **Memory safety**: Rust's ownership model prevents null pointer dereferences, use-after-free, and data races at compile time — critical for system-level software managing VM lifecycle. +- **Async runtime**: Tokio provides Go-like concurrency with configurable worker threads (`TOKIO_RUNTIME_WORKER_THREADS`, default: 2), enabling high concurrency with predictable resource consumption. + +#### 3.4.2 Dragonball: Built-in VMM + +The most significant architectural change in 3.x is the built-in Dragonball VMM. 
Instead of forking a separate hypervisor process and communicating over IPC, Dragonball is linked as a Rust library directly into the runtime process: + +**Kata 2.x (separated VMM)**: +``` +runtime process ──(fork + RPC)──> VMM process (QEMU, CLH) +``` + +**Kata 3.x (built-in VMM)**: +``` +single process: runtime-rs + Dragonball VMM (shared address space) +``` + +Benefits: +- **Zero IPC overhead** between runtime and VMM +- **Unified lifecycle management** — no orphaned VMM processes on crashes +- **Simplified resource cleanup and exception handling** +- **Out-of-the-box experience** — no external VMM installation or version matching required + +Dragonball is built on the rust-vmm crate ecosystem (shared building blocks with Firecracker and Cloud Hypervisor) and is specifically optimized for container workloads. + +#### 3.4.3 Production Validation + +Alibaba Cloud has validated the Rust runtime + Dragonball in production: +- Launch 3,000 secure containers in 6 seconds on a single host +- Run 4,000+ secure containers simultaneously on a single host +- Support approximately 12 billion calls per day for Function Compute services + +--- + +## 4. Hypervisor Landscape + +### 4.1 Supported Hypervisors + +Kata Containers supports five hypervisors, all KVM-based (Type 2): + +**QEMU** — The most feature-complete option. Written in C with approximately 2 million lines of code. Supports all CPU architectures (x86_64, ARM, ppc64le, s390x), all device types (40+ emulated devices including legacy), full CPU/memory/disk hotplug, VFIO device passthrough, virtio-fs, and 9pfs. The trade-off is a larger attack surface, higher memory overhead, and slower boot times compared to newer Rust-based VMMs. QEMU is the default hypervisor for the Go runtime. + +**Cloud Hypervisor (CLH)** — A Rust-based VMM with approximately 50,000 lines of code. Supports x86_64 and aarch64, CPU/memory hotplug, virtio-fs, VFIO passthrough, and approximately 16 modern virtio devices. 
Boots in approximately 200ms. Provides a good balance between feature completeness and security (small Rust codebase, minimal attack surface). + +**Firecracker** — AWS's minimal VMM, also Rust-based. Extremely fast boot (approximately 125ms) and tiny footprint, but deliberately omits filesystem sharing (no virtio-fs or 9pfs), device hotplug, and VFIO passthrough. All resources must be pre-allocated at VM boot. Designed for serverless/FaaS workloads where density and startup latency are critical. + +**Dragonball** — Alibaba's built-in VMM for the Rust runtime. Runs in-process (not as a separate process), eliminating IPC overhead. Built on the rust-vmm crate ecosystem. Supports CPU/memory/disk hotplug, VFIO, and virtio-fs. Optimized for high-density container deployments. Default for runtime-rs. + +**StratoVirt** — Huawei's VMM supporting both MicroVM (lightweight) and StandardVM modes. Currently supports MicroVM mode in Kata with limited features. The community has discussed potentially deprecating StratoVirt support due to maintenance concerns (April 2025 PTG). + +### 4.2 Feature Comparison Matrix + +| Feature | QEMU | Cloud Hypervisor | Firecracker | Dragonball | StratoVirt | +|---|---|---|---|---|---| +| Language | C (~2M LOC) | Rust (~50K LOC) | Rust | Rust | Rust | +| Architectures | x86_64, ARM, ppc64le, s390x | x86_64, aarch64 | x86_64, aarch64 | x86_64, aarch64 | x86_64, aarch64 | +| Typical boot time | ~500ms | ~200ms | ~125ms | ~200ms | ~200ms | +| Memory footprint | 50-150 MB | 20-50 MB | 15-30 MB | 20-50 MB | 20-50 MB | +| CPU hotplug | Yes | Yes | No | Yes | No | +| Memory hotplug | Yes | Yes | No | Yes | No | +| Disk hotplug | Yes | Yes | No | Yes | Yes | +| VFIO passthrough | Yes | Yes | No | Yes | No | +| virtio-fs | Yes | Yes | No | Yes | Yes | +| 9pfs | Yes | No | No | No | No | +| Device emulation scope | 40+ (incl. 
legacy) | ~16 modern | Minimal (3) | Container-optimized | MicroVM-focused | +| Execution model | Separate process | Separate process | Separate process | In-process (library) | Separate process | +| Default for | Go runtime | — | — | Rust runtime | — | +| Confidential computing | TDX, SEV-SNP | TDX, SEV-SNP | No | Planned | No | + +### 4.3 Selection Guidelines + +| Use Case | Recommended VMM | Rationale | +|---|---|---| +| General-purpose, maximum compatibility | QEMU | Broadest architecture and feature support | +| Cloud-native production (balanced) | Cloud Hypervisor | Good feature set with small Rust attack surface | +| Serverless / FaaS (density + speed) | Firecracker | Minimal boot time, smallest footprint | +| High-density containers (Rust runtime) | Dragonball | In-process VMM, zero IPC, optimized for containers | +| Confidential computing | QEMU | Most mature TDX/SEV-SNP support | +| Legacy hardware / non-x86 architectures | QEMU | Only option for ppc64le, s390x | + +--- + +## 5. Feature Parity Analysis + +This section provides a detailed comparison of Kata Containers capabilities against native container runtimes (runc via Docker or containerd). Each subsection identifies feature gaps, behavioral differences, and performance implications. + +### 5.1 Networking + +#### 5.1.1 Native Container Networking + +Native containers use Linux network namespaces with veth (virtual Ethernet) pairs connecting containers to a bridge (e.g., `docker0` or `cni0`). The packet path is straightforward: + +``` +Container process → veth (container end) → veth (host end) → bridge → host routing +``` + +Overlay networks (VXLAN, Geneve) for multi-host communication add encapsulation/decapsulation at the host level. CNI plugins manage the full lifecycle. Performance is near-native since all operations occur within the host kernel's network stack. 
+ +#### 5.1.2 Kata Container Networking + +Kata must bridge the gap between the host network namespace (where CNI plugins create veth pairs) and the guest VM (which uses virtio-net devices). Three networking models are supported: + +**TC-filter (tc-redirect) — Default** + +This is the current default networking model, chosen for maximum CNI plugin compatibility. The flow: + +1. CNI plugin creates a veth pair as usual (e.g., `eth0` in pod netns, `vethXXXX` on bridge). +2. Kata runtime creates a TAP device (`tap0_kata`) inside the pod network namespace. +3. Linux Traffic Control (TC) mirror rules redirect traffic between the veth's ingress/egress and the TAP device's egress/ingress. +4. The TAP device connects to a virtio-net device inside the VM. +5. Inside the VM, the kata-agent configures the network interface with the same IP/MAC as the original veth. + +``` +Container (guest) → virtio-net → TAP (host) → TC mirror → veth → bridge → host +``` + +This model introduces up to five network hops before a packet reaches its destination, adding measurable latency and CPU overhead. + +**MACVTAP — Alternative** + +Uses the Linux MACVTAP driver, which combines TUN/TAP and bridge functionality into a single module based on the MACVLAN driver. The packet path is slightly shorter than TC-filter, with similar performance characteristics. + +**Vhost-user — High Performance** + +The virtio-net backend runs in a userspace process, requiring fewer syscalls than a full VMM path. Used for high-performance scenarios and in conjunction with DPDK-based virtual switches. 
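Because the TC-filter model mirrors traffic at the veth that the CNI plugin already created, an unmodified CNI configuration behaves identically under runc and Kata. A minimal bridge conflist illustrating this (network name, bridge name, and subnet are arbitrary examples, not Kata requirements):

```json
{
  "cniVersion": "1.0.0",
  "name": "demo-net",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "cni0",
      "isGateway": true,
      "ipMasq": true,
      "ipam": {
        "type": "host-local",
        "subnet": "10.88.0.0/16"
      }
    },
    {
      "type": "portmap",
      "capabilities": { "portMappings": true }
    }
  ]
}
```

Nothing in this file references the container runtime; the Kata shim attaches its TAP device and TC mirror rules only after the plugin has populated the pod network namespace.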
+ +#### 5.1.3 Feature Parity Matrix + +| Feature | Native (runc) | Kata Containers | Gap Assessment | +|---|---|---|---| +| CNI plugin compatibility | Full | Good (TC-filter preserves compatibility) | Minor — some edge cases with unusual CNI configs | +| Overlay networks (VXLAN, Geneve) | Full | Full (via guest virtio-net) | Parity | +| Bridge networking | Full | Full (via TC-filter or MACVTAP) | Parity | +| `--net=host` (host networking) | Supported | **NOT supported** | **Critical gap** — VM cannot share host netns | +| `--link` (container linking) | Supported (deprecated) | **NOT supported** | Minor — deprecated in Docker | +| Network namespace sharing | Full | **Broken** — takes over interfaces | **Significant gap** | +| Service mesh (Istio/Envoy sidecar) | Full | Works (sidecar in same VM) | Parity | +| Network policies (Calico, Cilium) | Full | Full (applied at host level) | Parity | +| IPv6 | Full | Supported | Parity | +| Multicast | Full | Supported (virtio-net) | Parity | +| SR-IOV | Direct | Via VFIO passthrough | Additional configuration required | +| RDMA / InfiniBand | Direct device access | Via VFIO passthrough | Additional configuration, hypervisor-dependent | +| DNS resolution | Full | Full (resolv.conf shared via virtio-fs) | Parity | + +#### 5.1.4 Performance Impact + +Red Hat benchmarks (OpenShift Sandboxed Containers) show: +- **CPU overhead**: A sandboxed pod consumes approximately 4.5 CPU cores versus approximately 1.8 cores for a non-sandboxed pod during network-intensive workloads (approximately 2.5x overhead). +- **Latency**: Typically under 1ms additional latency per hop. +- **Throughput**: TC-filter and MACVTAP achieve comparable throughput; both are superior to the deprecated bridge mode. + +### 5.2 Storage and Volumes + +#### 5.2.1 Filesystem Sharing Mechanisms + +Container images and volumes stored on the host must be shared with the guest VM. 
Kata supports multiple mechanisms: + +**virtio-fs (Default since Kata 2.0)** + +A shared filesystem protocol based on FUSE, implemented as a virtio device. A `virtiofsd` daemon runs on the host, serving filesystem requests from the guest. Key characteristics: +- Near-native performance when combined with DAX (Direct Access) mapping +- Full POSIX compliance +- Used for sharing container rootfs, volumes, ConfigMaps, Secrets, and configuration files (hostname, hosts, resolv.conf) +- Available since Linux 5.4, QEMU 5.0, Kata 1.7 + +**DAX (Direct Access) Optimization**: On x86_64 with QEMU, virtio-fs can use NVDIMM/DAX mapping to map the guest image directly into VM memory without going through the guest kernel page cache. This provides zero-copy access, reduced memory consumption (pages are shared, not duplicated), and faster boot times. + +**9pfs (Legacy, Deprecated)** + +Plan 9 file protocol. Simpler implementation but significantly slower — every file operation traverses the 9P protocol and host syscalls. Has POSIX compliance limitations. Only supported on QEMU. Not recommended for production. + +**virtio-blk / virtio-SCSI** + +Block-level device sharing. The runtime detects whether container images use overlay-based filesystem drivers (shared via virtio-fs) or devicemapper/other block-based snapshotters (shared via virtio-blk/scsi). Used for persistent volumes and block-based workloads. + +**vhost-user-blk** + +Highest performance option. The virtio block backend runs in a separate userspace process, bypassing the VMM entirely. Benchmarks show near host-level I/O performance. 
+ +#### 5.2.2 Feature Parity Matrix + +| Feature | Native (runc) | Kata Containers | Gap Assessment | +|---|---|---|---| +| OverlayFS image layers | Direct kernel access | Shared via virtio-fs | Minor overhead | +| Bind mounts | Zero overhead | Via filesystem sharing (virtio-fs) | Measurable overhead | +| Bind mounts (TEE mode) | N/A | **Files copied** into guest; changes NOT synced back | **Critical gap** for TEE | +| tmpfs | Full | Supported | Parity | +| emptyDir (tmpfs-backed) | Full | `/dev/shm` sizing **broken** (capped at 64KB) | **Bug** | +| ConfigMaps | Full, live updates | Supported; SubPath does NOT receive live updates | Gap | +| Secrets | Full, tmpfs-backed | Supported, tmpfs-backed | Parity | +| `volumeMount.subPath` | Supported | **NOT supported** | **Gap** | +| hostPath | Direct | Depends on fs sharing; no sync-back in TEE | Behavioral difference | +| PersistentVolumes (block) | Full | Via virtio-blk/scsi | Parity (minor overhead) | +| PersistentVolumes (NFS/CephFS) | Direct mount | Via virtio-fs passthrough | Overhead | +| CSI drivers | Full | Supported (host-side) | Parity | + +#### 5.2.3 I/O Performance Data + +StackHPC benchmarks using fio with varying client counts and I/O patterns: + +**With 9pfs (legacy — included for reference)**: + +| Metric | Bare Metal | Kata (9pfs) | Degradation | +|---|---|---|---| +| Sequential read bandwidth (64 clients) | Baseline | ~15% of bare metal | 6.7x worse | +| Sequential read p99 latency (64 clients) | 47,095 us | 563,806 us | 12x worse | +| Random write p99 latency (64 clients) | 159,285 us | 15,343,845 us | ~100x worse | + +**With virtio-fs + DAX**: Dramatic improvement over 9pfs. Near-native sequential read performance. Random write latency reduced significantly but still measurable overhead due to FUSE round-trips. + +**With vhost-user-blk**: Near host-level performance for block I/O workloads. Benchmarks by Xinnor show performance within 5-10% of bare metal for NVMe devices. 
+ +### 5.3 GPU and Device Passthrough + +#### 5.3.1 Mechanism: VFIO + +Kata Containers exposes hardware devices to guest VMs using VFIO (Virtual Function I/O), which provides safe, non-privileged userspace device access via IOMMU (Input/Output Memory Management Unit). The VFIO path: + +1. Device is unbound from its host driver. +2. Device is bound to the `vfio-pci` driver. +3. The hypervisor maps the device into the VM's address space via IOMMU. +4. The guest VM accesses the device directly (near-native performance for data plane operations). + +#### 5.3.2 Feature Parity Matrix + +| Feature | Native (runc) | Kata Containers | Gap Assessment | +|---|---|---|---| +| Single GPU access | `--device /dev/nvidia0` | VFIO passthrough (requires IOMMU) | Behavioral difference (VFIO vs. direct) | +| Multi-GPU (same container) | Trivial (`--device` per GPU) | **NOT supported** | **Critical gap** | +| Fractional GPU (MIG, vGPU) | NVIDIA MIG, vGPU supported | vGPU: **NOT supported**; MIG: limited | **Critical gap** | +| GPU sharing (time-slicing) | Supported (NVIDIA MPS, time-slicing) | **NOT supported** | **Critical gap** | +| Intel GPU (GVT-g) | Supported | Supported (mediated device via VFIO) | Parity | +| NVIDIA GPU Operator | Full | Supported (containerd only, install-only, not upgrade) | Limited | +| Large BAR GPUs (e.g., Tesla P100) | Direct | Supported (Kata 1.11+, specific config) | Minor config needed | +| FPGA passthrough | Direct (`--device`) | Via VFIO | Works but requires IOMMU | +| USB device passthrough | Direct (`--device`) | Via VFIO (limited) | Limited | +| `/dev/kvm` access (nested virt) | Direct | Not meaningful (already in VM) | N/A | + +#### 5.3.3 Hypervisor-Specific VFIO Support + +| Hypervisor | VFIO Support | Notes | +|---|---|---| +| QEMU | Full | Most mature; supports all device types | +| Cloud Hypervisor | Full | Good support for modern devices | +| Firecracker | **None** | Deliberately excluded (minimal VMM) | +| Dragonball | Full | Supports GPU 
passthrough | +| StratoVirt | **None** | MicroVM mode limitation | + +#### 5.3.4 Confidential Computing + GPU + +NVIDIA Hopper-generation GPUs support single-GPU confidential compute passthrough when combined with Intel TDX or AMD SEV-SNP. The GPU's on-chip security engine participates in the attestation chain. NVIDIA Blackwell-generation extends this to multi-GPU confidential passthrough. + +The NVIDIA GPU Operator (versions 24.6.x through 24.9.x) supports Kata Containers with confidential container configurations, enabling GPU-accelerated workloads within TEE-protected VMs. + +#### 5.3.5 Impact on Backend.AI + +Backend.AI's accelerator plugin system (`backendai_accelerator_v21`) supports: +- Discrete allocation (whole GPU via `DiscretePropertyAllocMap`) +- Fractional allocation (sub-GPU via `FractionAllocMap`) +- Multi-GPU allocation to single containers +- NUMA-aware placement + +Kata's limitation to single-GPU VFIO passthrough per container directly conflicts with Backend.AI's multi-GPU allocation model and fractional GPU sharing via CUDA hook libraries. The VFIO approach also precludes the use of NVIDIA Container Toolkit's device mapping, which Backend.AI relies on through `generate_docker_args()` in its compute plugins. + +### 5.4 Resource Management + +#### 5.4.1 Dual-Layer Cgroup Architecture + +Kata uses a dual-layer resource management strategy: + +**Host-level cgroups**: A single cgroup per sandbox constrains the overall VM process (hypervisor + guest memory). The CRI runtime and resource manager can restrict or collect stats at the sandbox level. + +**Guest-level cgroups**: Inside the VM, the kata-agent applies per-container cgroup constraints, making containers within the same VM behave like standard Linux containers with individual resource limits. 
+ +#### 5.4.2 Feature Parity Matrix + +| Feature | Native (runc) | Kata Containers | Gap Assessment | +|---|---|---|---| +| CPU shares/quota | Direct cgroup | Dual-layer (host + guest cgroup) | Parity (semantic) | +| CPU pinning | Direct cgroup | VM vCPU pinning | Behavioral difference | +| Memory limits | Direct cgroup | VM memory + guest cgroup | Parity (semantic) | +| Memory swap | Supported | VM-level swap | Behavioral difference | +| OOM handling | Direct cgroup OOM killer | Guest OOM (host sees VM memory) | Different OOM behavior | +| PID limits | Direct cgroup | Guest-level PID limits | Parity | +| Block I/O weight | cgroup blkio | **NOT supported** | Gap | +| Block I/O limits (bps/iops) | cgroup blkio | Supported (at VM level) | Parity | +| CPU hotplug | N/A | QEMU, CLH: yes; FC, SV: **no** | Hypervisor-dependent | +| Memory hotplug | N/A | QEMU, CLH: yes (virtio-mem); FC: **no** | Hypervisor-dependent | +| Cgroups v2 | Full | Supported (not default) | Parity (with config) | +| `--sysctl` | Supported | **NOT supported** | Gap | + +#### 5.4.3 Resource Allocation Behavior + +The VM starts with a configurable minimum resource allocation (default: 1 vCPU, 2 GB RAM defined in `configuration.toml`). When containers within the pod request additional resources, the runtime hot-plugs CPU and memory into the VM. This means: + +- There is a **base resource cost** per pod regardless of container resource requests. +- Hot-plug operations add latency to container startup (typically milliseconds for CPU, potentially longer for memory). +- Memory hotplug can fail if the guest kernel cannot online the hot-plugged memory due to insufficient initial memory. +- AMD CPU hotplug has reported failures in some configurations. + +### 5.5 Container Image Handling + +#### 5.5.1 Image Pull Models + +Kata supports two distinct image pull models: + +**Host Pull (Default)** + +The standard model matching native container behavior: +1. 
containerd/CRI pulls and unpacks the OCI image on the host. +2. Image layers are exposed to the guest via filesystem passthrough: + - virtio-fs for overlay-based graph drivers (default) + - virtio-blk/scsi for block-based graph drivers +3. The guest mounts the shared filesystem and constructs the container rootfs. +4. Image layers can be shared across multiple sandboxes on the same node. + +**Guest Pull (Confidential Containers)** + +Required when the host is considered untrusted: +1. The containerd snapshotter on the host intercepts the image pull. +2. Control is redirected to `image-rs` running inside the guest VM. +3. image-rs downloads, verifies signatures, and decrypts the image within the TEE. +4. Container images never appear in plaintext on the host. +5. **Trade-off**: Images are NOT shared across sandboxes — each VM downloads its own copy. + +#### 5.5.2 Nydus Image Acceleration + +Available since Kata v2.4.0, Nydus provides on-demand lazy loading of container image layers: + +- Research shows 76% of container startup time is spent on image pull, but only 6.4% of pulled data is actually read during execution. +- Nydus uses the RAFS v6 (Registry Acceleration File System) format, which stores image data in content-addressable chunks. +- Chunks are fetched on-demand from the registry as the container accesses files. +- In the Kata integration, `nydusd` replaces `virtiofsd` as the FUSE/virtiofs daemon, serving image data to the guest via virtio-fs. +- Supports P2P distribution via Dragonfly, reducing registry load by over 80%. + +### 5.6 Multi-Container Pods + +#### 5.6.1 Behavioral Equivalence + +Kata's multi-container pod behavior is semantically equivalent to native Kubernetes pods: + +- All containers in a pod share the **same VM** (the "sandbox"). +- A single kata-agent manages all container processes within the VM. +- Containers share the network namespace (same IP, same ports), IPC namespace, and optionally PID namespace. 
+- Each container has its own mount namespace, cgroup, and user namespace. + +The kata-agent differentiates between sandbox creation and container creation using OCI annotations: +- `io.kubernetes.cri-o.ContainerType: "sandbox"` → triggers VM creation +- `io.kubernetes.cri-o.ContainerType: "container"` → creates a container within the existing VM + +#### 5.6.2 Isolation Properties + +| Boundary | Native (runc) | Kata Containers | +|---|---|---| +| Inter-pod isolation | Namespace + cgroup (soft) | **Hardware VM boundary** (hard) | +| Intra-pod isolation | Namespace + cgroup | Namespace + cgroup (inside VM) — equivalent | +| Kernel sharing (inter-pod) | Shared host kernel | Separate guest kernels per pod | +| Kernel sharing (intra-pod) | Shared host kernel | Shared guest kernel within pod | + +### 5.7 Security Model + +#### 5.7.1 Threat Model Comparison + +| Threat | Native (runc) | Kata Containers | gVisor | +|---|---|---|---| +| Container escape via kernel CVE | **Vulnerable** (shared kernel) | **Protected** (separate kernel per pod) | **Partially protected** (reduced syscall surface) | +| Container escape via runtime CVE | Vulnerable | Protected (workload in VM, not on host) | Protected (workload in Sentry) | +| Resource exhaustion (noisy neighbor) | cgroup enforcement | VM-level + cgroup enforcement (dual-layer) | cgroup enforcement | +| Network-based attack (pod-to-pod) | Network policy (soft) | Network policy + VM boundary (hard) | Network policy (soft) | +| Privilege escalation (root in container) | Host kernel exposed | Guest kernel only; host devices NOT passed | Sentry kernel only | +| Side-channel attacks (Spectre, etc.) | Vulnerable (shared kernel/CPU) | **Protected** (separate VM; can use TEE hardware) | Partially protected | +| Supply chain (malicious image) | Vulnerable | Vulnerable (unless using CoCo with attestation) | Vulnerable | + +#### 5.7.2 Security Hardening Layers + +Kata applies defense-in-depth: + +1. 
**VM boundary** (KVM): Hardware-enforced memory isolation between pods. +2. **Minimal guest kernel**: Stripped-down Linux LTS with only virtio drivers, reducing attack surface. +3. **Seccomp filters**: Applied to the VMM process on the host and within the guest. +4. **AppArmor/SELinux**: Profiles applied to the VMM process and virtiofsd. +5. **Rootless VMM**: Hypervisor process can run as an unprivileged user (QEMU, CLH). +6. **Minimal guest rootfs**: Alpine-based initrd with only the kata-agent binary. +7. **Confidential computing** (optional): TEE hardware (TDX, SEV-SNP) for memory encryption and attestation. + +#### 5.7.3 Syscall Compatibility + +A critical differentiation from gVisor: Kata runs a **real Linux kernel** in the guest, providing full syscall compatibility (all ~330+ Linux syscalls). gVisor's Sentry intercepts syscalls and implements only a subset (~70-80%), causing compatibility issues with some workloads (particularly those using advanced filesystem operations, signaling, or less common syscalls). + +This is significant for AI/ML workloads, which may use advanced Linux features (io_uring, perf events, eBPF, RDMA verbs) that gVisor does not support. + +--- + +## 6. Performance Analysis + +### 6.1 Startup Latency + +| Runtime | Cold Start (typical) | With VM Templating | Notes | +|---|---|---|---| +| Native (runc) | 50-100ms | N/A | Process fork + namespace creation | +| gVisor | 50-100ms | N/A | Sentry initialization | +| Kata + QEMU | ~500ms | ~300ms | Full VM boot; most feature-complete | +| Kata + Cloud Hypervisor | ~200ms | ~125ms | Optimized boot path | +| Kata + Firecracker | ~125ms | N/A (no hotplug) | Minimal VMM; pre-allocated resources | +| Kata + Dragonball | ~200ms | ~150ms | In-process VMM; concurrent boot optimization | + +**VM Templating**: Pre-boots a VM and clones it for new sandboxes. Speeds up creation by up to 38.68% and saves approximately 72% of guest memory in high-density scenarios (100 containers saved 9 GB). 
Currently only available in the Go runtime; migration to runtime-rs is in progress. + +### 6.2 Memory Overhead + +| Runtime | Per-Pod Overhead | Components | +|---|---|---| +| Native (runc) | 1-5 MB | Namespace metadata, shim process | +| gVisor | 10-50 MB | Sentry process, gofer | +| Kata + QEMU | 50-150 MB | VMM process, guest kernel, agent, virtiofsd | +| Kata + Cloud Hypervisor | 20-50 MB | Smaller VMM, same guest components | +| Kata + Firecracker | 15-30 MB | Minimal VMM, minimal guest | +| Kata + Dragonball | 20-50 MB | In-process VMM, shared memory space | + +The per-pod memory overhead directly impacts container density. On a node with 256 GB RAM: +- runc: theoretical maximum ~50,000+ containers (memory-limited by workload, not runtime) +- Kata + QEMU: ~1,700 pods before overhead alone consumes all memory +- Kata + CLH/Dragonball: ~5,000 pods before overhead alone consumes all memory + +### 6.3 CPU Performance + +For CPU-bound workloads, Kata introduces approximately 4% overhead compared to bare metal. This is consistent across hypervisors because the overhead comes from VM exits for privileged instructions and timer interrupts, not from the VMM itself. + +For system-call-heavy workloads, overhead increases to 5-15% due to the additional cost of VM exits per syscall. The exact overhead depends on syscall frequency and type. + +### 6.4 I/O Performance + +Storage I/O is the area of greatest performance impact. The overhead varies dramatically by filesystem sharing mechanism: + +| Mechanism | Sequential Read (vs. bare metal) | Random Write p99 Latency (vs. 
bare metal) | Use Case | +|---|---|---|---| +| 9pfs (legacy) | ~15% of bare metal | ~100x worse | Not recommended | +| virtio-fs | ~70-90% of bare metal | ~3-10x worse | General purpose (default) | +| virtio-fs + DAX | ~90-98% of bare metal | ~2-5x worse | Recommended for I/O workloads | +| virtio-blk | ~90-95% of bare metal | ~2-3x worse | Block-based volumes | +| vhost-user-blk | ~95-100% of bare metal | ~1-2x worse | High-performance block I/O | + +For Backend.AI's use case of mounting distributed storage (CephFS, NetApp) via bind mounts, the I/O path would be: + +``` +Container → guest kernel → virtio-fs → virtiofsd (host) → host kernel VFS → distributed FS client → network → storage cluster +``` + +This adds one FUSE round-trip compared to the native path (container → host kernel VFS → distributed FS client → network → storage cluster), which is measurable but not catastrophic for typical AI/ML data loading patterns. + +### 6.5 Network Performance + +| Metric | Native (runc) | Kata (TC-filter) | Delta | +|---|---|---|---| +| Latency overhead | Baseline | <1ms additional | Minimal | +| CPU per packet | Baseline | ~2.5x | Significant for network-heavy | +| Throughput (single stream) | Near line rate | Near line rate | Parity | +| Throughput (many streams) | Near line rate | Reduced by CPU overhead | Depends on CPU budget | +| Overlay network (VXLAN) | Standard overhead | Standard + virtio overhead | Additive | + +The CPU overhead is the primary concern: network-intensive distributed training workloads (e.g., NCCL all-reduce over Ethernet) would consume significantly more CPU for the same data transfer compared to native containers. + +--- + +## 7. Advanced Capabilities + +### 7.1 Confidential Containers (CoCo) + +Confidential Containers is a CNCF sandbox project that combines Kata Containers with Trusted Execution Environment (TEE) hardware to protect data in use at the pod level. 
The enclave boundary encompasses the workload pod, kata-agent, and helper processes. Everything outside — hypervisor, other pods, control plane, host OS — is considered untrusted. + +#### 7.1.1 Supported TEE Technologies + +| TEE | Vendor | RuntimeClass | Capability | +|---|---|---|---| +| Intel TDX | Intel | `kata-qemu-tdx` | Full VM memory encryption + integrity | +| AMD SEV-SNP | AMD | `kata-qemu-snp` | Full VM memory encryption + integrity | +| IBM Secure Execution | IBM | `kata-qemu-se` | s390x secure execution | +| Intel SGX | Intel | N/A (enclave-based) | Application-level enclaves | + +#### 7.1.2 Architecture Components + +**Attester side (inside guest VM):** +- **Attestation Agent (AA)**: Generates TEE attestation reports from hardware. +- **Confidential Data Hub (CDH)**: Manages secret and key access within the TEE. +- **image-rs**: Downloads, verifies, and decrypts container images inside the guest. + +**Verifier side (external service):** +- **Key Broker Service (KBS)**: Validates guest attestation evidence and releases secrets/keys. +- **Attestation Service (AS)**: Verifies TEE evidence against reference values. +- **Reference Value Provider Service (RVPS)**: Manages trusted reference values for verification. + +#### 7.1.3 Attestation Workflow + +1. Workload or kata-agent requests a secret via CDH. +2. CDH triggers the AA to generate a TEE attestation report. +3. AA sends the report to KBS using the RCAR (Request-Challenge-Attestation-Response) protocol. +4. KBS forwards to AS for verification against reference values. +5. Only upon successful verification does KBS release the encryption key or secret. +6. Supports lazy attestation — triggered on-demand, not at boot. + +#### 7.1.4 Production Availability + +- **Azure AKS**: Confidential Containers preview with AMD SEV-SNP (DCa_cc, ECa_cc VM sizes) and Intel TDX. +- **Red Hat OpenShift**: Confidential containers on bare metal via Assisted Installer. 
+- **NVIDIA GPU Operator**: GPU workloads within confidential containers (Hopper GPUs).
+
+### 7.2 Peer Pods / Remote Hypervisor
+
+Peer pods extend Kata so that the pod sandbox VM runs **outside** the Kubernetes worker node as a first-class cloud VM, created via cloud provider IaaS APIs. This addresses two key limitations:
+
+1. **No nested virtualization required**: The VM is a first-level instance, not nested inside a worker node VM.
+2. **TEE VMs on demand**: Cloud APIs can provision TEE-enabled VMs (e.g., AMD SEV-SNP instances) for confidential workloads.
+
+The Cloud API Adaptor (CAA) DaemonSet runs on each worker node, translating sandbox lifecycle commands into cloud provider API calls (EC2, Azure Compute, GCE, Libvirt). A VXLAN tunnel provides network connectivity between the pod's network namespace on the worker and the peer pod VM.
+
+Persistent volume support has been developed for peer pods, addressing storage attachment to remote VMs.
+
+### 7.3 Nydus Image Acceleration
+
+Nydus is a CNCF project providing a container image format (RAFS — Registry Acceleration File System) that supports on-demand lazy loading. In the Kata integration:
+
+1. `nydusd` replaces `virtiofsd` as the FUSE/virtiofs daemon.
+2. On sandbox creation, nydusd starts and the VM mounts the virtiofs share.
+3. On container creation, the shim instructs nydusd to mount a RAFS filesystem for the container's image layers.
+4. The guest constructs an overlay filesystem combining the lazy-loaded RAFS layer with snapshot directories.
+5. Image data is fetched from the registry on-demand as files are accessed.
+
+Performance impact: Significantly reduces cold start time for large images. Instead of pulling a 10GB ML framework image entirely before starting, only the immediately needed files are fetched, with remaining data loaded in the background.
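The on-demand chunk model can be illustrated with a toy sketch (hypothetical code, not the RAFS implementation; the in-memory `registry` dict stands in for content-addressable chunks that nydusd would fetch over the network):

```python
class LazyImage:
    """Toy model of RAFS-style lazy loading: image data is addressed in
    chunks and fetched from the registry only on first access."""

    def __init__(self, chunk_index, registry):
        self.chunk_index = chunk_index   # file path -> list of chunk IDs
        self.registry = registry         # chunk ID -> chunk bytes
        self.fetched = set()             # chunks pulled so far

    def read_file(self, path):
        data = b""
        for chunk_id in self.chunk_index[path]:
            if chunk_id not in self.fetched:
                # On-demand fetch: a chunk is pulled at most once, and
                # chunks of files that are never read are never pulled.
                self.fetched.add(chunk_id)
            data += self.registry[chunk_id]
        return data
```

Reading one file pulls only that file's chunks, which is why cold start no longer scales with total image size.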
+
+### 7.4 Monitoring and Observability
+
+#### 7.4.1 Metrics Architecture
+
+Kata 2.0+ implements a pull-based metrics system integrated with Prometheus:
+
+- **Zero overhead when not monitored**: Metrics are collected only when Prometheus scrapes the endpoint.
+- **Stateless**: No metrics held in memory between scrapes.
+- **Source**: Primarily from `/proc` filesystem for minimal performance impact.
+
+#### 7.4.2 kata-monitor
+
+A per-node management agent that:
+- Aggregates metrics from running sandboxes, attaching `sandbox_id` labels.
+- Attaches Kubernetes metadata (`cri_uid`, `cri_name`, `cri_namespace`).
+- Provides a single Prometheus scrape target per node.
+- The shim's metrics socket is at `vc/sbs/${PODID}/shim-monitor.sock`.
+
+#### 7.4.3 Metric Categories
+
+1. **Kata Agent Metrics**: Process statistics from guest `/proc/<pid>/{io, stat, status}`.
+2. **Guest OS Metrics**: Guest system-level data from `/proc/{stat, diskstats, meminfo, vmstat}` and `/proc/net/dev`.
+3. **Hypervisor Metrics**: VMM process resource consumption.
+4. **Shim Metrics**: Runtime process observations.
+
+Performance characteristics:
+- End-to-end latency (Prometheus to kata-monitor): ~20ms average
+- Agent RPC latency (shim to agent): ~3ms average
+- Metrics payload per sandbox: ~144KB uncompressed, ~10KB gzipped
+
+#### 7.4.4 External Integrations
+
+- Grafana Cloud (Prometheus-based)
+- Datadog (containerd endpoint-based, zero additional setup)
+- Red Hat OpenShift Monitoring (integrated for sandboxed containers)
+
+---
+
+## 8. Known Limitations and Unsupported Features
+
+### 8.1 Comprehensive Limitation Inventory
+
+| Feature | Status | Impact Level | Notes |
+|---|---|---|---|
+| `--net=host` (host networking) | Not supported | High | Fundamental VM limitation; cannot share host network namespace |
+| `--link` (container linking) | Not supported | Low | Deprecated in Docker; replaced by user-defined networks |
+| `--shm-size` | Not implemented | Medium | `/dev/shm` sizing broken with tmpfs emptyDir; capped at 64KB |
+| `--sysctl` | Not implemented | Medium | Cannot tune guest kernel parameters from host |
+| `checkpoint` / `restore` (CRIU) | Not implemented | High | No live migration or suspend/resume; open issue since April 2021 |
+| Live VM migration | Not implemented | High | Underlying hypervisors support it but integration is unsolved |
+| Block I/O weight (cgroups) | Not supported | Low | Other blkio controls (bps, iops) work |
+| `events` (OOM notifications) | Partial | Medium | OOM and Intel RDT stats incomplete |
+| `volumeMount.subPath` | Not supported | Medium | Kubernetes volume limitation |
+| Podman integration | Not supported | Medium | Docker supported since 22.06 |
+| Multi-GPU passthrough | Not supported | **Critical** | Single GPU only via VFIO |
+| NVIDIA vGPU | Not supported | **Critical** | Cannot share a physical GPU across multiple Kata VMs |
+| GPU time-slicing (MPS) | Not supported | **Critical** | Requires host-side MPS daemon access |
+| Network namespace sharing | Broken | Medium | Takes over interfaces, breaking connectivity |
+| Full rootless operation | Partial | Medium | Only VMM is rootless; shim + virtiofsd still need root |
+| Docker Compose | Limited | Low | Basic support; advanced features may not work |
+| `docker exec` with TTY | Supported | — | Works via kata-agent ExecProcess API |
+| Privileged containers | Behavioral difference | Medium | Host devices NOT passed by default; must be explicitly configured |
+
+### 8.2 Hypervisor-Specific Feature Gaps
+
+| Feature | QEMU | Cloud Hypervisor | Firecracker | Dragonball | StratoVirt |
+|---|---|---|---|---|---|
+| CPU hotplug | Yes | Yes | **No** | Yes | **No** |
+| Memory hotplug | Yes | Yes | **No** | Yes | **No** |
+| Disk hotplug | Yes | Yes | **No** | Yes | Yes |
+| VFIO passthrough | Yes | Yes | **No** | Yes | **No** |
+| virtio-fs | Yes | Yes | **No** | Yes | Yes |
+| 9pfs | Yes | No | No | No | No |
+| VM templating | Yes | Yes | N/A | In progress | N/A |
+| Confidential computing | TDX, SEV-SNP | TDX, SEV-SNP | No | Planned | No |
+| Rootless VMM | Yes | Yes | Jailer mechanism | In progress | No |
+
+### 8.3 Compatibility Issues
+
+- **Docker 26.0.0**: Broke Kata networking (`io.containerd.run.kata.v2`), demonstrating fragility in integration points. Workarounds were required.
+- **AMD CPU hotplug**: Reported failures in some configurations (GitHub issue #2875).
+- **Memory hotplug with low initial memory**: Can fail if the guest kernel cannot online hot-plugged memory.
+- **Cgroups v2**: Supported but not default; requires explicit configuration in `configuration.toml` and guest kernel command line.
+
+---
+
+## 9. Production Adoption
+
+### 9.1 Major Adopters
+
+**Ant Group (Alibaba)**
+- Using Kata Containers in production since 2019.
+- Thousands of tasks scheduled on Kata across thousands of nodes.
+- Co-locates online applications and batch jobs.
+- Built AntCWPP (Cloud Workload Protection Platform) combining Kata with eBPF for defense-in-depth: eBPF programs attached to LSM hooks and network control points inside the Kata VM kernel.
+- Eliminated container escape risk while detecting reverse shells, fileless malware, and policy drift.
+- Reduced per-payment energy consumption by 50% through improved workload co-location enabled by strong isolation.
+
+**IBM Cloud**
+- Deploys Kata for Cloud Shell and CI/CD Pipeline SaaS offerings.
+- CI/CD platform scaled 30-40x over five years with sustained ~10% monthly growth.
+- Uses Kubernetes with Tekton as pipeline orchestrator; Kata plugs directly into the existing stack. +- Key challenge identified: filesystem/snapshotter setup for the Kata guest is complex and needs simplification. +- Anticipates increasing demand as AI-driven automation generates more workloads requiring secure isolation. + +**Northflank** +- Runs over 2 million microVMs per month inside Kubernetes. +- Uses Kata Containers with Cloud Hypervisor in production. +- Multi-tenant platform where security is essential for running untrusted images. +- Developed custom tooling for running Kata on GKE with custom node images. +- Had to customize virtiofs settings and enable non-default kernel features, requiring custom Kata builds. + +**Microsoft Azure** +- Pod Sandboxing feature on AKS uses Kata Containers. +- Confidential Containers preview with AMD SEV-SNP and Intel TDX. +- Reported as "extremely well-received by internal and external AKS customers" since February 2023 preview. + +**AWS** +- Documents Kata integration with EKS for enhanced workload isolation. +- Requires bare-metal EC2 instances (e.g., `m5.metal`, `c5.metal`) as worker nodes. + +**Baidu** +- Runs Kata in production for Function Computing, Cloud Container Instances, and Edge Computing. + +### 9.2 Lessons Learned from Production Deployments + +1. **Seamless Kubernetes integration** is Kata's greatest operational strength — workloads require minimal configuration changes beyond specifying `runtimeClassName`. +2. **Filesystem/snapshotter complexity** is the biggest barrier to adoption (cited by IBM Cloud). +3. **Custom Kata builds may be required** for advanced use cases (virtiofs tuning, kernel features — cited by Northflank). +4. **VMM selection significantly impacts operational profile**: Cloud Hypervisor is preferred for performance, QEMU for feature completeness. +5. 
**Defense in depth is essential**: Combining Kata with eBPF (as Ant Group does) provides both strong isolation and fine-grained runtime security. +6. **Steep learning curve**: Operational knowledge of both container and VM management is required. +7. **Cloud provider constraints**: Many cloud providers do not support nested virtualization on standard instances, requiring either bare-metal instances (expensive) or peer pods (complex). + +--- + +## 10. Implications for Backend.AI + +### 10.1 Compatibility Assessment + +This section maps Kata Containers' capabilities against Backend.AI's four core subsystems: + +#### 10.1.1 Scheduling and Orchestration + +| Backend.AI Feature | Kata Compatibility | Assessment | +|---|---|---| +| Session lifecycle (PENDING→RUNNING→TERMINATED) | Compatible | Kata is transparent to the scheduler; RuntimeClass selection is the only change | +| Multi-node sessions (cluster_mode=MULTI_NODE) | Compatible | Overlay networking works with Kata (additional overhead) | +| Sokovan scheduler (FIFO, LIFO, DRF) | Compatible | No scheduler changes needed; resource accounting must include VM overhead | +| Agent-Manager RPC (ZeroMQ/Callosum) | Compatible | Agent runs on host; manages Kata containers via containerd | +| Session priority and deprioritization | Compatible | No changes needed | +| Hot-plug resources during session | Partially compatible | Depends on hypervisor; QEMU/CLH support hot-plug | +| Idle session detection and timeout | Compatible | Metrics available via kata-monitor | + +**Key consideration**: Resource accounting must include the per-pod VM overhead (15-150MB memory, CPU for VMM). The scheduler's `available_slots` calculation on agents must subtract this overhead per expected concurrent session. 
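The required adjustment can be expressed as a simple transformation over an agent's advertised slots. The following is a hypothetical helper, not Backend.AI code; the slot keys and the default overhead constants (60 MiB memory and 0.1 CPU per sandbox) are illustrative values within the ranges quoted above:

```python
def adjust_available_slots(available_slots, expected_kata_sessions,
                           vm_mem_overhead_mib=60.0, vmm_cpu_overhead=0.1):
    """Reserve per-sandbox VM overhead (VMM process memory plus a CPU
    budget for the hypervisor) out of the slots an agent advertises
    to the scheduler."""
    adjusted = dict(available_slots)
    adjusted["mem"] = max(0.0, adjusted["mem"]
                          - expected_kata_sessions * vm_mem_overhead_mib)
    adjusted["cpu"] = max(0.0, adjusted["cpu"]
                          - expected_kata_sessions * vmm_cpu_overhead)
    return adjusted
```

For example, an agent with 256 GiB of RAM expecting 100 concurrent Kata sessions would advertise roughly 6 GiB less memory than it would for native containers.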
+ +#### 10.1.2 Overlay Networking + +| Backend.AI Feature | Kata Compatibility | Assessment | +|---|---|---| +| Docker Swarm overlay networks | Compatible | Kata's TC-filter/MACVTAP works with overlay | +| Multi-container SSH (cluster_ssh_port_mapping) | Compatible | SSH within VM overlay works | +| Port exposure (service ports) | Compatible | Kata network plugin expose_ports() equivalent | +| RDMA / InfiniBand | Partially compatible | Requires VFIO passthrough; hypervisor-dependent | +| MPI over overlay (OMPI) | Compatible | Kata sets OMPI_MCA_btl_tcp_if_exclude automatically | + +**Key consideration**: Network-intensive distributed training (NCCL all-reduce) will consume approximately 2.5x more CPU per packet due to the virtio-net path. For GPU-heavy training where network is the bottleneck, this overhead may be acceptable since GPUs consume most wall-clock time. + +#### 10.1.3 Distributed Storage + +| Backend.AI Feature | Kata Compatibility | Assessment | +|---|---|---| +| VFolder bind mounts (VFolderMount → Docker bind mount) | Compatible | Via virtio-fs passthrough (overhead) | +| CephFS / NetApp / WekaFS mounts | Compatible | Host-side FS client + virtio-fs to guest | +| Intrinsic mounts (/home/work, /home/config) | Compatible | Via virtio-fs | +| Multiple vfolders per session | Compatible | Multiple bind mounts via virtio-fs | +| Read-only / read-write permissions | Compatible | Virtio-fs supports RO/RW | +| /home/config/resource.txt (KernelResourceSpec) | Compatible | Written via virtio-fs shared mount | + +**Key consideration**: Every I/O operation traverses the virtio-fs FUSE path, adding measurable latency. For AI/ML workloads with large dataset loading (e.g., reading training data from CephFS), this overhead is most noticeable during data pipeline phases. GPU compute phases are unaffected. 
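A back-of-the-envelope model shows why this overhead is tolerable for GPU-dominated jobs. The functions below are illustrative only (not measurements); the default efficiency factor is taken from the virtio-fs range in the I/O table of Section 6.4, and only the data-loading phase is scaled:

```python
def epoch_time_with_virtiofs(gpu_compute_s, native_io_s,
                             virtiofs_efficiency=0.8):
    """Scale only the data-loading phase by a virtio-fs throughput
    efficiency factor; GPU compute time is left untouched."""
    return gpu_compute_s + native_io_s / virtiofs_efficiency

def relative_overhead(gpu_compute_s, native_io_s, virtiofs_efficiency=0.8):
    """Fractional end-to-end slowdown versus the native I/O path."""
    native = gpu_compute_s + native_io_s
    kata = epoch_time_with_virtiofs(gpu_compute_s, native_io_s,
                                    virtiofs_efficiency)
    return kata / native - 1.0
```

Under these assumptions, an epoch with 3600 s of GPU compute and 300 s of native data loading at 80% virtio-fs efficiency slows down by roughly 2% end to end.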
+ +#### 10.1.4 Accelerator Management + +| Backend.AI Feature | Kata Compatibility | Assessment | +|---|---|---| +| Single GPU allocation (DiscretePropertyAllocMap) | Compatible | Via VFIO passthrough | +| Multi-GPU allocation (2+ GPUs per container) | **NOT compatible** | Kata limits to single GPU via VFIO | +| Fractional GPU (FractionAllocMap, cuda.shares) | **NOT compatible** | VFIO passes whole GPU; no hook library support | +| CUDA hook libraries (generate_hooks()) | **NOT compatible** | Host-side hook injection does not work across VM boundary | +| NVIDIA Container Toolkit (generate_docker_args()) | **NOT compatible** | Toolkit's device mapping requires host kernel driver sharing | +| GPU metrics (gather_node_measures()) | Partially compatible | VFIO GPU metrics must come from guest-side monitoring | +| NUMA-aware allocation (AffinityHint) | Partially compatible | VM vCPU pinning can respect NUMA; GPU NUMA irrelevant with VFIO | +| ROCm, Habana, IPU, TPU plugins | Partially compatible | VFIO passthrough for each; single device only | +| Xilinx FPGA (fractional) | **NOT compatible** | Fractional allocation requires host-side arbitration | + +**This is the most significant incompatibility.** Backend.AI's core value proposition includes fractional GPU sharing and multi-GPU allocation, both of which are fundamentally incompatible with Kata's VFIO passthrough model. + +### 10.2 Integration Feasibility Summary + +| Criterion | Feasibility | Blocking? 
| +|---|---|---| +| Single-GPU workloads (inference) | **Feasible** | No | +| Multi-GPU workloads (training) | **Not feasible** | **Yes** | +| Fractional GPU workloads | **Not feasible** | **Yes** | +| CPU-only workloads | **Feasible** | No | +| Storage-intensive workloads | **Feasible with overhead** | No | +| Network-intensive distributed training | **Feasible with overhead** | No | +| Interactive notebooks (startup latency) | **Feasible** (200-500ms overhead) | Marginal | +| High-density deployments | **Feasible with reduced density** | Depends on use case | +| Confidential AI workloads | **Feasible** (compelling use case) | No | + +### 10.3 Potential Integration Architecture + +If Backend.AI were to support Kata as an optional runtime for specific workload classes, the integration would involve: + +1. **RuntimeClass selection**: Add `runtime_class` to SessionCreationSpec, allowing users to opt into Kata isolation for specific sessions. +2. **Resource accounting adjustment**: Agents subtract per-session VM overhead from `available_slots` when Kata runtime is selected. +3. **Accelerator plugin modification**: When Kata is selected, constrain allocation to single whole-GPU VFIO mode. Disable fractional allocation and multi-GPU for Kata sessions. +4. **Storage path**: No changes needed — VFolderMount → bind mount path works through virtio-fs transparently. Accept I/O overhead. +5. **Network path**: No changes needed — Docker Swarm overlay works with Kata's TC-filter. Accept CPU overhead for network-intensive workloads. +6. **Monitoring**: Integrate kata-monitor metrics into Backend.AI's agent statistics collection. + +--- + +## 11. Conclusions and Recommendations + +### 11.1 Summary of Findings + +Kata Containers provides genuine hardware-enforced isolation for containerized workloads while maintaining OCI compliance and Kubernetes integration. 
The Kata 3.x Rust rewrite and built-in Dragonball VMM have significantly improved performance and operational characteristics. + +However, critical feature gaps exist for Backend.AI's primary use cases: + +1. **Multi-GPU and fractional GPU allocation are not supported**, which directly conflicts with Backend.AI's core accelerator management system. +2. **Storage I/O overhead is measurable** but acceptable for most AI/ML workload patterns (data loading is typically not the bottleneck). +3. **Network overhead** (2.5x CPU for network-intensive workloads) is relevant for distributed training but may be acceptable when GPU compute dominates wall-clock time. +4. **Cold start overhead** (125-500ms) is noticeable for interactive sessions but unlikely to be a dealbreaker. +5. **Checkpoint/restore is not supported**, preventing session migration or suspend/resume features. + +### 11.2 Recommendations + +**Short-term**: Do not adopt Kata as a general-purpose runtime for Backend.AI sessions. The GPU incompatibilities are fundamental and cannot be worked around without architectural changes to either Backend.AI or Kata. 
+ +**Medium-term**: Consider supporting Kata as an **optional isolation mode** for specific workload classes: +- CPU-only workloads in multi-tenant environments +- Single-GPU inference workloads where security isolation is required +- Confidential computing workloads (CoCo with TDX/SEV-SNP) for privacy-sensitive AI + +**Long-term**: Monitor these Kata developments: +- Multi-GPU VFIO passthrough (NVIDIA Blackwell enables this for confidential containers) +- CDI (Container Device Interface) integration for standardized device management +- Runtime-rs reaching full feature parity with the Go runtime (Kata 4.0 target) +- Improvements to vGPU and MIG support within VMs + +### 11.3 Alternative Approaches + +For multi-tenant isolation without Kata's GPU limitations: +- **gVisor**: Provides kernel-level isolation without VMs, but has syscall compatibility limitations that may affect AI/ML workloads. +- **NVIDIA MIG + namespace isolation**: Hardware GPU partitioning with standard container isolation; no VM overhead but weaker isolation boundary. +- **Firecracker (direct)**: For serverless/FaaS-style workloads without GPU requirements; excellent density and startup time. +- **Kata + GPU operator evolution**: NVIDIA is actively developing confidential multi-GPU support; this may resolve the GPU limitation within 1-2 years. + +--- + +## 12. References + +### Official Documentation +1. Kata Containers Architecture Design. https://github.com/kata-containers/kata-containers/blob/main/docs/design/architecture/README.md +2. Kata Containers Virtualization Design. https://github.com/kata-containers/kata-containers/blob/main/docs/design/virtualization.md +3. Kata Containers Hypervisor Comparison. https://github.com/kata-containers/kata-containers/blob/main/docs/hypervisors.md +4. Kata Containers Limitations. https://github.com/kata-containers/kata-containers/blob/main/docs/Limitations.md +5. Kata Containers Guest Assets. 
https://github.com/kata-containers/kata-containers/blob/main/docs/design/architecture/guest-assets.md +6. Kata Containers 2.0 Metrics Design. https://github.com/kata-containers/kata-containers/blob/main/docs/design/kata-2-0-metrics.md +7. Kata Nydus Design Document. https://github.com/kata-containers/kata-containers/blob/main/docs/design/kata-nydus-design.md +8. Kata Containers Rootless VMM Guide. https://github.com/kata-containers/kata-containers/blob/main/docs/how-to/how-to-run-rootless-vmm.md +9. Kata Containers with Kubernetes. https://github.com/kata-containers/kata-containers/blob/main/docs/how-to/run-kata-with-k8s.md + +### Blog Posts and Case Studies +10. Getting Rust-y: Introducing Kata Containers 3.0.0. https://katacontainers.io/blog/getting-rust-y-introducing-kata-containers-3-0-0/ +11. Kata 3.0 Technical Deep Dive (Alibaba Cloud). https://www.alibabacloud.com/blog/kata-3-0-is-coming-start-experiencing-the-out-of-the-box-secure-container_599575 +12. Kata Containers 3.5.0 Release Overview. https://katacontainers.io/blog/kata-containers-3-5-0-release-overview/ +13. Kata Containers Project Updates Q4 2024. https://katacontainers.io/blog/kata-containers-project-updates-q4-2024/ +14. Kata Community PTG Updates October 2025. https://katacontainers.io/blog/kata-community-ptg-updates-october-2025/ +15. IBM Cloud Kata Case Study. https://katacontainers.io/blog/kata-containers-ibm-cloud-updates/ +16. Northflank Kata Case Study. https://katacontainers.io/blog/kata-containers-northflank-case-study/ +17. Ant Group Kata + eBPF Whitepaper. https://katacontainers.io/collateral/kata-containers-ant-group_whitepaper.pdf +18. Ant Group Container Security with eBPF. https://ebpf.foundation/ant-group-secures-their-platform-with-kata-containers-and-ebpf-for-fine-grained-control/ + +### Performance Benchmarks +19. StackHPC Kata I/O Performance Analysis. https://www.stackhpc.com/kata-io-1.html +20. Xinnor vhost-user-blk Performance. 
https://xinnor.io/blog/bridging-the-storage-performance-gap-in-kata-containers-with-vhost-user-and-xiraid-opus-using-phison-x200-nvme-drives/ +21. OpenShift Sandboxed Containers Network Performance. https://www.redhat.com/en/blog/openshift-sandboxed-containers-network-performance +22. Runtime Comparison: gVisor vs Kata vs Firecracker (2025). https://onidel.com/blog/gvisor-kata-firecracker-2025 + +### Cloud Provider Documentation +23. AWS: Enhancing Kubernetes Workload Isolation with Kata. https://aws.amazon.com/blogs/containers/enhancing-kubernetes-workload-isolation-and-security-using-kata-containers/ +24. Azure AKS Pod Sandboxing. https://learn.microsoft.com/en-us/azure/aks/use-pod-sandboxing +25. Azure AKS Confidential Containers Overview. https://learn.microsoft.com/en-us/azure/aks/confidential-containers-overview +26. NVIDIA GPU Operator with Kata. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/24.9.2/gpu-operator-kata.html +27. NVIDIA GPU Operator with Confidential Containers. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/24.9.1/gpu-operator-confidential-containers.html + +### Confidential Computing +28. Confidential Containers Design Overview. https://confidentialcontainers.org/docs/architecture/design-overview/ +29. Red Hat: Introducing Confidential Containers on Bare Metal. https://www.redhat.com/en/blog/introducing-confidential-containers-bare-metal +30. Intel: Confidential Containers Made Easy. https://www.intel.com/content/www/us/en/developer/articles/technical/confidential-containers-made-easy.html +31. Cloud API Adaptor (Peer Pods) Architecture. https://github.com/confidential-containers/cloud-api-adaptor/blob/main/docs/architecture.md + +### Comparison Articles +32. Kata vs Firecracker vs gVisor (Northflank). https://northflank.com/blog/kata-containers-vs-firecracker-vs-gvisor +33. Edera Isolation Comparison. https://edera.dev/stories/kata-vs-firecracker-vs-gvisor-isolation-compared +34. 
Kata Containers for Docker in 2024. https://medium.com/@filippfrizzy/kata-containers-for-docker-in-2024-1e98746237ca diff --git a/proposals/BEP-1051-kata-containers-agent.md b/proposals/BEP-1051-kata-containers-agent.md new file mode 100644 index 00000000000..5d53eaa43fc --- /dev/null +++ b/proposals/BEP-1051-kata-containers-agent.md @@ -0,0 +1,346 @@ +--- +Author: Kyujin Cho (kyujin@lablup.com) +Status: Draft +Created: 2026-02-27 +Created-Version: 26.3.0 +Target-Version: +Implemented-Version: +--- + + + +# Kata Containers Agent Backend + +## Related Issues + +- Technical Report: [TR-2026-001](../docs/reports/kata-containers-feature-parity-analysis.md) + +## Motivation + +Backend.AI's current container backends (Docker, Kubernetes) rely on Linux namespaces and cgroups for isolation. These share the host kernel — a single kernel vulnerability in any container can compromise the entire host and all co-located workloads. This is insufficient for multi-tenant GPU environments where untrusted code runs alongside sensitive models. + +Kata Containers 3.x addresses this by running each container inside a lightweight VM with its own guest kernel, providing hardware-enforced isolation via KVM. The Rust-based runtime achieves cold start times of 125-500ms (VMM-dependent, under optimized configurations). The VMM process overhead is 15-60MB per VM (Cloud Hypervisor ~15MB, QEMU ~60MB); total per-VM host memory is this overhead plus the guest's configured memory allocation. These overheads are acceptable for AI/ML workloads where GPU compute dominates wall-clock time. + +For GPU workloads, VFIO (Virtual Function I/O) passthrough assigns whole GPUs directly to guest VMs via IOMMU, achieving near-native compute performance. This model naturally aligns with discrete GPU allocation (`cuda.device`) — fractional GPU sharing is neither supported nor needed, as the target workloads require dedicated GPU resources with strong isolation guarantees. 
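The per-VM memory arithmetic above (fixed VMM overhead plus the guest's configured allocation) can be sketched as an agent-side capacity deduction. This is a hedged illustration only; the constant values, function name, and deduction-at-registration approach are assumptions, not existing Backend.AI code:

```python
# Hedged sketch: deducting per-VM VMM overhead from an agent's advertised
# memory capacity at registration time. Constants and the helper name are
# assumptions for illustration, not Backend.AI APIs.

# Approximate VMM process overhead per VM, from the figures above.
VMM_OVERHEAD_MIB = {
    "cloud-hypervisor": 15,
    "qemu": 60,
}

def advertised_memory_mib(host_total_mib: int, reserved_mib: int,
                          max_vms: int, hypervisor: str) -> int:
    """Memory the agent should report as schedulable after reserving
    VMM process overhead for the maximum number of concurrent Kata VMs.
    Guest memory allocations are then charged per kernel as usual."""
    overhead = VMM_OVERHEAD_MIB[hypervisor] * max_vms
    return host_total_mib - reserved_mib - overhead

# Example: 512 GiB host, 8 GiB reserved for host OS + agent,
# up to 8 concurrent VMs on Cloud Hypervisor.
print(advertised_memory_mib(512 * 1024, 8 * 1024, 8, "cloud-hypervisor"))
# 524288 - 8192 - (15 * 8) = 515976
```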
+ +Additionally, Kata's integration with Confidential Containers (CoCo) and TEE hardware (Intel TDX, AMD SEV-SNP) enables privacy-preserving AI workloads where even the host operator cannot access model weights or training data in memory. + +## Document Index + +| Document | Description | Phase | +|----------|-------------|-------| +| [Configuration & Deployment](BEP-1051/configuration-deployment.md) | `[kata]` config section, hypervisor selection, host requirements | 1 | +| [KataAgent Backend](BEP-1051/kata-agent-backend.md) | KataAgent, KataKernel, KataKernelCreationContext | 1 | +| [Storage Compatibility](BEP-1051/storage-compatibility.md) | Direct guest-side NFS/Lustre/WekaFS mounts for VFolders, virtio-fs for scratch/config, I/O analysis | 1 | +| [Networking](BEP-1051/networking.md) | Calico CNI integration, inter-VM networking, network policy, hostname resolution | 1 | +| [VFIO Accelerator Plugin](BEP-1051/vfio-accelerator-plugin.md) | CUDAVFIOPlugin, IOMMU group detection, device passthrough | 2 | +| [Scheduler Integration](BEP-1051/scheduler-integration.md) | Agent backend tracking, scaling group policy, VM overhead | 3 | +| [Migration & Compatibility](BEP-1051/migration-compatibility.md) | Additive rollout, backward compatibility, rollback plan | All | + +## Design Overview + +```mermaid +classDiagram + class AbstractAgent~KernelObjectType, KernelCreationContextType~ + AbstractAgent <|-- DockerAgent : existing + AbstractAgent <|-- KubernetesAgent : existing + AbstractAgent <|-- KataAgent : proposed + class DockerAgent~DockerKernel, DockerKernelCreationContext~ + class KubernetesAgent~KubernetesKernel, KubernetesKernelCreationContext~ + class KataAgent~KataKernel, KataKernelCreationContext~ +``` + +KataAgent manages CoCo-enabled containers via containerd with the Kata shim (`io.containerd.kata.v2`). All Kata VMs run with TEE protection (TDX/SEV-SNP) by default — the host is always untrusted. 
Container lifecycle differs from Docker: VM boot with attestation → guest-side image pull via `image-rs` → container spawn inside guest. GPU devices are assigned via VFIO passthrough at VM creation time, binding PCI devices directly into the guest's address space through IOMMU. Co-located InfiniBand HCAs are auto-attached for GPUDirect RDMA. + +VFolder storage is mounted directly inside the guest VM using the guest's own NFS/Lustre/WekaFS kernel client — preserving RDMA data paths and achieving 100% native I/O throughput. `/home/work` is a directory on the guest VM's disk (part of the guest rootfs), and vfolder subdirectories are bind-mounted into `/home/work/{vfolder}` from the guest-side NFS/Lustre mounts. The agent delivers volume mount specs via `storage-mounts.json` in `/home/config/` (virtio-fs config channel); a guest prestart hook reads these and performs the mounts before the container starts. Storage topology is visible in guest `/proc/mounts` (same as Docker — isolation is enforced by Backend.AI's vfolder permission model, not by concealing mount details). virtio-fs is used only for the config directory (`/home/config` with environ.txt, intrinsic-ports.json, storage-mounts.json). + +Network traffic traverses a virtio-net path with TC-filter mirroring — Kata's TC filter transparently redirects traffic between host-side veth interfaces and guest-side tap devices, making VMs transparent to the CNI layer. Calico CNI provides multi-host routing (BGP/VXLAN), IP allocation, and network policy enforcement for inter-session isolation. Multi-container sessions use Calico endpoint labeling for policy-based cluster communication with `/etc/hosts`-based hostname resolution. + +## Component Integration: Kata Kernel Creation Flow + +This section describes the end-to-end component interactions when Backend.AI creates a Kata kernel (CoCo mode). Steps that differ from DockerAgent are annotated.
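The `storage-mounts.json` handshake described in the Design Overview can be sketched as follows. Only the file name and its `/home/config/` location are fixed by this design; the JSON schema (`fs_type`/`server`/`export`/`options` keys) and helper names are illustrative assumptions:

```python
import json

# Hedged sketch of the storage-mounts.json handshake between the agent
# (host side) and the guest boot script. The schema keys and helper names
# are assumptions; only the file name and /home/config/ location come
# from this design.

def write_storage_mounts(config_dir: str, volumes: list[dict]) -> str:
    """Agent side: serialize volume mount specs into the virtio-fs
    config channel before VM boot."""
    path = f"{config_dir}/storage-mounts.json"
    with open(path, "w") as f:
        json.dump({"volumes": volumes}, f, indent=2)
    return path

def guest_mount_commands(spec: dict) -> list[str]:
    """Guest boot script side: turn each spec entry into a mount(8)
    invocation using the guest's native kernel client."""
    cmds = []
    for vol in spec["volumes"]:
        src = f"{vol['server']}:{vol['export']}"
        opts = ",".join(vol.get("options", []))
        cmds.append(
            f"mount -t {vol['fs_type']} -o {opts} {src} {vol['mount_point']}"
        )
    return cmds

spec = {
    "volumes": [
        {
            "fs_type": "lustre",
            "server": "10.0.0.2@o2ib",
            "export": "/vfstore",
            "mount_point": "/mnt/vfstore",
            "options": ["flock"],
        }
    ]
}
print(guest_mount_commands(spec)[0])
# mount -t lustre -o flock 10.0.0.2@o2ib:/vfstore /mnt/vfstore
```

After the volume is mounted at `/mnt/vfstore`, the boot script bind-mounts the specific vfolder subdirectories into `/home/work/{vfolder}`, mirroring DockerAgent's host-side behavior.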
+ +### Kernel Creation Sequence + +```mermaid +sequenceDiagram + participant M as Manager + participant A as KataAgent + participant C as containerd + Kata shim + participant G as Guest VM (CoCo TEE) + + M->>A: create_kernels(RPC) + + Note over A: [1] init_kernel_context() + Note over A: [2] check_image() - no host-side check (CoCo) + Note over A: [3] allocate_resources() - VFIO + RDMA + Note over A: [4] prepare_scratch() - write storage-mounts.json + Note over A: [5] apply_network() - Calico CNI + Note over A: [6] mount_vfolders() - guest-internal bind mounts + Note over A: [7] prepare_container() - OCI spec + VFIO annotations + + A->>C: [8] CreateSandbox() + Note over C: Configure VFIO devices, virtio-fs, boot VM with TDX/SEV-SNP + C->>G: Boot VM + + rect rgb(240, 248, 255) + Note over G: [8b] VM Boot + Note over G: TEE attestation + Note over G: Read storage-mounts.json + Note over G: Configure storage NIC IP (IPoIB) + Note over G: Mount NFS/Lustre (native, RDMA) + end + + C->>G: [8c] CreateContainer() + + rect rgb(240, 248, 255) + Note over G: [8d] Container start + Note over G: image-rs pulls encrypted OCI image + Note over G: Bind-mount vfolder subdirs + Note over G: entrypoint.sh, kernel runner starts + end + + G-->>A: [9] ZMQ TCP connect (via Calico network) + A->>G: [10] check_status() poll + G-->>A: kernel runner ready + service metadata + A-->>M: return kernel_data +``` + +### Key Differences from DockerAgent + +| Step | DockerAgent | KataAgent (CoCo) | +|---|---|---| +| [1] Context | `DockerKernelCreationContext` + aiodocker | `KataKernelCreationContext` + containerd gRPC | +| [2] Image | Host-side `docker.images.pull()` | No host-side check; guest-side `image-rs` pull after VM boot | +| [3] GPU | Docker `DeviceRequests` | VFIO `_kata_vfio_devices` + `clique_id` + auto-attached HCA | +| [4] Scratch | environ.txt, resource.txt | Same + `storage-mounts.json` (volume specs + storage NIC IP) | +| [6] VFolders | Host bind-mount → container | Guest-internal 
bind-mount (volume pre-mounted by boot script) | +| [7] Spec | Docker container config | OCI spec + VFIO annotations + CoCo attestation config | +| [8] Start | `docker.containers.create()` + `start()` | CreateSandbox → VM boot → TEE attestation → storage mount → CreateContainer → image pull → entrypoint | +| [9-10] ZMQ | Same (ZMQ TCP through host network) | Same (ZMQ TCP through Calico network — kernel runner is unaware of VM) | + +### Timeout Considerations + +Kata kernel creation has additional latency compared to Docker: + +| Phase | Docker | Kata (CoCo) | +|---|---|---| +| Image pull | Host-side Docker pull | Guest-side `image-rs` pull (inside TEE, after VM boot) | +| Container start | ~1-2s | VM boot (3-10s) + TEE attestation (~1s) + storage mount (~1-3s) + image pull | +| GPU setup | Docker DeviceRequests | VFIO passthrough + PCIe BAR mapping (5-30s for multi-GPU) | +| Kernel runner ready | ~2-5s | Same (after container starts) | +| **Total** | **~5-10s** | **~15-60s** (depending on GPU count and image size) | + +The `kernel_init_polling_timeout_sec` and `kernel_init_timeout_sec` configs must be increased for Kata deployments to accommodate VM boot, TEE attestation, storage mount, and guest-side image pull latencies. + +## Implementation Plan + +> All phases assume CoCo (Confidential Containers) by default. The host is always untrusted. There is no non-CoCo Kata mode. 
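Phase 1 deployment configuration will need these raised timeouts. A minimal sketch of deriving a Kata-appropriate `kernel_init_timeout_sec` from the worst-case phase budgets in the Timeout Considerations table above (the helper name, the per-phase upper bounds, and the 2x safety margin are assumptions, not existing config logic):

```python
# Hedged sketch: deriving a Kata-appropriate kernel init timeout from
# worst-case phase budgets. The budget values are illustrative upper
# bounds (image pull is image-size dependent); the 2x margin and helper
# name are assumptions.

PHASE_BUDGET_SEC = {
    "vm_boot": 10,             # upper bound from the table above
    "tee_attestation": 1,
    "storage_mount": 3,
    "image_pull": 30,          # guest-side image-rs pull, size-dependent
    "gpu_vfio_setup": 30,      # multi-GPU PCIe BAR mapping upper bound
    "kernel_runner_ready": 5,
}

def kata_kernel_init_timeout_sec(margin: float = 2.0) -> int:
    """Sum worst-case phase budgets, then apply a safety margin."""
    return int(sum(PHASE_BUDGET_SEC.values()) * margin)

print(kata_kernel_init_timeout_sec())  # (10+1+3+30+30+5) * 2 = 158
```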
+ +### Phase 1: CoCo Agent Foundation +- Monolithic attested guest rootfs build pipeline (NVIDIA driver, MLNX_OFED, storage clients, krunner, DCGM/Node Exporter, storage mount boot script) +- `KataConfig` Pydantic model and `[kata]` config section (including volume mount points for `storage-mounts.json`) +- `KataAgent`, `KataKernel`, `KataKernelCreationContext` classes +- Container lifecycle via containerd gRPC API with Kata shim (CoCo runtime class) +- CoCo integration: TDX/SEV-SNP attestation, guest-side image pull via `image-rs`, KBS for sealed secrets +- CPU-only workload support (no GPU yet) +- VFolder storage via direct guest-side NFS/Lustre/WekaFS mount (native kernel client, RDMA-preserving); agent writes `storage-mounts.json` to `/home/config/`, guest boot script performs mounts +- virtio-fs for scratch/config directories only (`/home/config` with environ.txt, intrinsic-ports.json, storage-mounts.json) +- Calico CNI integration for multi-host inter-VM networking (BGP/VXLAN) +- Calico network policy enforcement for inter-session traffic isolation +- `/etc/hosts` injection for cluster hostname resolution +- DCGM Exporter + Node Exporter inside guest, scraped by Prometheus over Calico network + +### Phase 2: VFIO GPU + InfiniBand Passthrough +- `CUDAVFIOPlugin` compute plugin with IOMMU group detection and `max_gpus_per_vm` safety limit +- VFIO device binding and passthrough configuration with Kata VRA PCIe topology replication (`switch-port` mode, `clique-id` annotations) +- `DiscretePropertyAllocMap` with `cuda.device` slot type +- `RDMAVFIOPlugin` as non-schedulable co-plugin: auto-attaches co-located InfiniBand HCA when GPU is allocated +- GPUDirect RDMA validation (GPU + HCA co-passthrough, `nvidia-peermem` in guest) +- GPU device attestation (H100+ CC mode) and HCA attestation (ConnectX-7+) + +### Phase 3: Scheduler Integration +- `backend` column on `AgentRow` with Alembic migration +- Homogeneous scaling group policy (one backend type per group) +- VM 
memory overhead accounting in `available_slots` +- Agent heartbeat protocol extension +- Remote attestation endpoint exposed via manager API + +## Decision Log + +| Date | Decision | Rationale | Alternatives Considered | +|------|----------|-----------|------------------------| +| 2026-02-27 | Full AbstractAgent backend, not RuntimeClass | RuntimeClass is K8s-only; Kata lifecycle (VM boot, VSOCK, virtio-fs) differs fundamentally from Docker API; Backend.AI already has the backend abstraction | RuntimeClass flag on DockerAgent; Docker `--runtime kata` flag | +| 2026-02-27 | VFIO-only for GPU, no fractional allocation | VFIO provides hardware isolation matching the VM boundary; fractional sharing (MPS, hook libraries) requires host kernel driver sharing which breaks isolation; target workloads need whole GPUs | nvidia-container-runtime inside guest VM (complex, shares driver stack); MIG (NVIDIA-specific, limited configurations) | +| 2026-02-27 | Separate CUDAVFIOPlugin, not modified CUDAPlugin | VFIO generates fundamentally different config (PCI addresses, IOMMU groups) vs Docker DeviceRequests; plugin discovery should keep them independent; avoids conditional logic in existing plugin | Single CUDAPlugin with backend-aware mode flag | +| 2026-02-27 | Homogeneous scaling groups | Simpler scheduling — no need for backend-aware agent filtering within a group; existing scaling group mechanism already routes sessions to agent pools | Mixed-backend groups with `allowed_backends` filter; per-session backend preference flag | +| 2026-02-27 | VM overhead as agent-level deduction | Overhead is predictable (configurable per-VM); deducting at registration avoids changing slot semantics visible to users; scheduler sees accurate available capacity | New `kata.vm_overhead` slot type; per-kernel memory inflation; soft margin in scheduler | +| 2026-02-27 | No new storage interface; reuse existing Mount abstraction with virtio-fs | Kata shim automatically translates bind mounts to 
virtio-fs shares — the agent passes the same mount specs and the runtime handles the VM boundary transparently; avoids duplicating mount logic | New `MountTypes.VIRTIO_FS` type (unnecessary abstraction); agent-managed virtiofsd (complexity without benefit) | +| 2026-02-27 | Agent socket skipped entirely for Kata | The agent socket (`/opt/kernel/agent.sock`) is a ZMQ REP socket used only by C binaries (jail pid translation, `is-jail-enabled`), not by the Python kernel runner. Since jail is skipped for Kata and PID translation is irrelevant (VM boundary isolates PIDs), the socket, socat relay, and handler are all unnecessary. The primary agent↔kernel-runner channel (ZMQ PUSH/PULL) is already TCP-based and works across the VM boundary without changes. | TCP replacement (unnecessary — the agent socket isn't needed, not just unreachable); VSOCK (solves wrong problem) | +| 2026-02-27 | ~~virtio-fs for all storage backends; no direct guest mounts~~ **Superseded (2026-03-25)**: VFolder data now uses direct guest-side NFS/Lustre mounts (native kernel client) to preserve RDMA; virtio-fs retained for scratch/config only | Original rationale: no vendor Kata integration; CephFS benchmarks favored virtio-fs. Superseded because virtio-fs breaks RDMA data paths for high-performance storage backends. 
| — | +| 2026-02-27 | Calico CNI for inter-VM networking | Production deployments require multi-host networking and inter-session traffic isolation; Calico provides both via BGP/VXLAN routing and network policy; works with Kata's TC filter transparently; supports standalone containerd (etcd datastore) and Kubernetes; consistent CNI layer across deployment modes | CNI bridge plugin (single-host only, no network policy); Cilium (MTU mismatch issues with Kata); Flannel (no network policy); Docker overlay (not available without Docker) | +| 2026-02-27 | `/etc/hosts` injection for cluster hostname resolution | Docker embedded DNS not available with containerd; `/etc/hosts` via virtio-fs is simple, immediate, and sufficient for small clusters (2-8 containers) | CoreDNS sidecar (heavyweight); containerd-managed DNS (not available); mDNS/Avahi (complex, unreliable) | +| 2026-03-03 | ~~krunner binaries shared via virtio-fs (Phase 1), bake into guest rootfs (Phase 2)~~ **Superseded (2026-03-25)**: CoCo-by-default requires all binaries baked into the attested guest rootfs from day one (host is untrusted — cannot share executables via virtio-fs) | Original rationale: Phase 1 simplicity. Superseded by CoCo trust model — runtime loading from untrusted host eliminated. | — | +| 2026-03-03 | Kata entrypoint.sh skips LD_PRELOAD, jail, and agent.sock | The runner's `entrypoint.sh` sets up `LD_PRELOAD` with libbaihook.so (unnecessary — guest kernel provides accurate sysconf), chowns `agent.sock` (unnecessary — agent socket not mounted), and references jail paths. Kata variant skips these via env var check (`BACKENDAI_CONTAINER_BACKEND=kata`) or a separate entrypoint file. All other operations (user/group creation, SSH setup, dotfiles, password generation, su-exec) remain identical. 
| Unified entrypoint with no conditional (would fail on missing files); completely separate entrypoint (too much duplication) | +| 2026-03-03 | resource.txt is agent-recovery-only, not consumed by kernel runner | `resource.txt` contains serialized `KernelResourceSpec` (slot allocations, mount lists, device mappings). The agent reads it for recovery after restarts (`resources.py:887-910`, `scratch/utils.py:100-103`) and resource usage tracking. The kernel runner never reads this file — it gets its configuration from `environ.txt` (environment variables for child processes) and `intrinsic-ports.json` (ZMQ/service port assignments). The VM hypervisor enforces resource limits at the VM level. | N/A (factual clarification, not a design choice) | +| 2026-03-03 | Containerd client API (`containers.v1`, `tasks.v1`) for single-container; Sandbox API (`sandbox.v1`) for multi-container | CRI `RuntimeService` is Kubernetes-specific and should not be used by a non-Kubernetes service. The `ctr` CLI is fragile for async use. The containerd client gRPC API is the correct programmatic interface. For multi-container sessions sharing a VM, the Sandbox API (containerd v2+) provides `create_sandbox()` → `create_container(sandbox_id=...)`, which is required because `containers.v1`/`tasks.v1` alone create one VM per container. | CRI `RuntimeService` (K8s-only); `ctr`/`kata-runtime` CLI (fragile, not async); `containers.v1`/`tasks.v1` only (cannot share VM across containers) | +| 2026-03-03 | virtiofsd is one process per sandbox, not per mount | Kata runs a single virtiofsd per sandbox serving a shared directory tree. Individual bind mounts are subdirectories, not separate processes. The hybrid storage model (virtio-blk for read-only infra) is motivated by I/O performance (avoiding FUSE overhead for static files), not by virtiofsd process count. 
| N/A (factual correction based on Kata architecture review) | +| 2026-03-25 | VFolder storage via direct guest-side NFS/Lustre mount (native kernel client), not virtio-fs | virtio-fs introduces FUSE overhead and breaks RDMA data paths for high-performance storage backends (Lustre over InfiniBand, WekaFS RDMA, GPFS). The correct architecture mirrors Docker: mount the **storage volume** (e.g., `/mnt/vfstore`) directly inside the guest VM using the guest's own NFS/Lustre/WekaFS kernel client (preserving RDMA, 100% native throughput), then bind-mount the specific vfolder subdirectory into the container — identical to how DockerAgent operates on the host. The agent writes `storage-mounts.json` to `/home/config/` (virtio-fs config channel) containing volume mount specs (fs_type, server, export path, mount options); a guest boot script reads this and performs the mounts before the container starts. Storage topology is visible in guest `/proc/mounts` — this is acceptable because Backend.AI's isolation model is the same as Docker: container users can see mount info but cannot mount their own disks (no `CAP_SYS_ADMIN`, no mount binary). Isolation is enforced by vfolder permissions, not by concealing mount details. Supersedes the earlier decision "virtio-fs for all storage backends" (2026-02-27) for vfolder data; virtio-fs remains for scratch/config directories only. | virtio-fs for all storage (kills RDMA, adds FUSE overhead); FUSE shim for topology concealment (adds complexity, single point of failure — unnecessary since mount concealment is not required); 9p/virtio-9p (worse performance than virtio-fs) | +| 2026-03-25 | GPUDirect RDMA via Kata VRA: GPU + InfiniBand HCA co-passthrough with PCIe topology replication | Multi-node GPU workloads require GPUDirect RDMA (direct GPU↔NIC DMA without CPU). 
Kata VRA architecture supports this via: (1) `hotplug_vfio = "switch-port"` to replicate host PCIe switch topology inside the guest, (2) `clique-id` CDI annotations to group GPU and NIC for P2P, (3) ACS/ATS configuration for Direct Translated P2P. Guest image must include full NVIDIA + MLNX_OFED driver stack with `nvidia-peermem`. Architecturally feasible but not yet demonstrated end-to-end in Kata (upstream issue [#10796](https://github.com/kata-containers/kata-containers/issues/10796)). NVLink/NVSwitch for intra-node GPU-GPU P2P works via VFIO BAR mapping of NVLink registers. | No RDMA support (unacceptable for multi-node training); SR-IOV VF passthrough only (currently broken in Kata [#11910](https://github.com/kata-containers/kata-containers/issues/11910)); userspace RDMA without GPUDirect (CPU bottleneck) | +| 2026-03-25 | CoCo by default for all Kata workloads | The host is always untrusted. All Kata VMs run with TEE protection (TDX/SEV-SNP). This collapses the original 4-phase plan — CoCo is the starting point. Eliminates conditional logic for CoCo vs non-CoCo modes and resolves several open questions (see below). | Non-CoCo Kata mode for initial development (defers trust model, but creates throwaway code); CoCo as opt-in (adds configuration complexity for a mode that should always be on) | +| 2026-03-25 | Guest-side image pull via `image-rs` (no host-side pull) | Host is untrusted and must not see image contents. `image-rs` inside the TEE pulls encrypted images, decrypts with keys from KBS after remote attestation. Host-side containerd pull is not an option. Agent's role reduces to: pass image reference + attestation config to Kata shim, wait for kernel runner ZMQ connection. Registry credentials delivered via KBS (attestation-gated), not via host. 
| Host-side containerd pull (breaks CoCo trust model); credentials via IVSHMEM (less standard than KBS for CoCo) | +| 2026-03-25 | CUDAVFIOPlugin implements current `generate_docker_args()` API with `_kata_vfio_devices` return | BEP-1016 is Draft with no implementation timeline. CoCo does not affect plugin interface — VFIO device assignment is a host-side operation regardless of guest encryption. The `_kata_vfio_devices` convention maps directly to future `WorkloadConfig` fields. Migrate to BEP-1016 `create_lifecycle_hook()` when available. | Wait for BEP-1016 (unknown delay); implement BEP-1016 as part of BEP-1051 (scope creep) | +| 2026-03-25 | No `backend` field on `ScalingGroupOpts`; homogeneous groups by convention | CoCo-by-default eliminates the kata/kata-confidential distinction. Only two backend types exist: Docker and Kata (always CoCo). `AgentRow.backend` column provides admin visibility. Homogeneous groups are even more sufficient with no CoCo/non-CoCo split. | `allowed_backends` field (unnecessary guard with simplified backend model); `attestation_policy` field (low-priority future option) | +| 2026-03-25 | Exclude shared IOMMU groups with warning; `max_gpus_per_vm` safety limit; validate TEE constraints empirically | Target hardware is DGX/HGX with isolated IOMMU groups (one GPU per group). Shared groups are rare on server hardware — exclude with `log.warning()`. Multi-GPU from separate groups is supported. `max_gpus_per_vm` config (default 8) prevents extreme VM boot times from PCIe BAR mapping overhead. TDX Connect / SEV-SNP ASID interaction with multi-device VFIO requires empirical validation on target hardware — not a design blocker. 
| Atomic device bundles for shared groups (complex scheduler change for rare case); no GPU limit (risks 30+ minute boot) | +| 2026-03-25 | Monolithic attested guest rootfs; all drivers baked in from day one | CoCo trust model requires the guest rootfs to be part of the attestation chain — its measurement (hash) is verified during remote attestation before secrets (KBS keys, storage credentials) are released. Runtime module loading from host is eliminated (untrusted source). krunner binaries, NVIDIA drivers, MLNX_OFED, storage clients, DCGM/Node Exporter, and storage mount boot script are all baked into a single versioned image (`kata-rootfs-{version}.qcow2`). Attestation measurement in KBS must be updated on every image rebuild. | Layered with attested chain (incremental updates but more complex attestation); runtime module loading (eliminated — host is untrusted) | +| 2026-03-25 | Whole-device InfiniBand HCA passthrough only; SR-IOV deferred | CoCo attestation favors PF (Physical Function) — PF identity is directly verifiable by the guest via device attestation (ConnectX-7+). SR-IOV VF attestation is less mature, and VF passthrough is currently broken in Kata ([#11910](https://github.com/kata-containers/kata-containers/issues/11910)). Whole-device passthrough maps naturally to DGX/HGX topology (8 GPUs, 8 HCAs). | SR-IOV VF passthrough (broken upstream, immature VF attestation); wait for SR-IOV fix (blocks multi-node GPU interconnect) | +| 2026-03-25 | Agent-side volume mount configuration via `storage-mounts.json` | The `[kata]` config section lists storage volume mount points with filesystem type and mount options. KataAgent matches each `VFolderMount.host_path` to a configured volume by longest-prefix match, then writes `storage-mounts.json` to `/home/config/` (virtio-fs config channel) before VM boot. A guest boot script reads this file and performs NFS/Lustre mounts. No manager or storage-proxy API changes needed — the agent is self-contained. 
Trade-off: duplicates storage topology information (must be manually synced with storage-proxy config), but volume mount points change infrequently and are infrastructure-level configuration. | Storage-proxy API extension (authoritative source, but requires storage-proxy + manager changes); manager RPC field (single source of truth, but changes RPC protocol) | +| 2026-03-25 | `RDMAVFIOPlugin` as non-schedulable co-plugin for InfiniBand HCA passthrough | On DGX/HGX hardware, every GPU node has co-located InfiniBand HCAs — RDMA is a node property, not a separately schedulable resource. The plugin discovers HCAs and validates IOMMU groups but does not expose schedulable slots. When a GPU is allocated, `KataKernelCreationContext` auto-attaches the co-located HCA with matching `clique_id`. No scheduler changes needed. MNNVL (NVL72) requires a separate BEP. | Unified `CUDAVFIOPlugin` (conflates GPU + NIC; misleading name); separate schedulable `RDMAVFIOPlugin` (adds scheduler complexity for a resource that's always present on target hardware) | +| 2026-03-25 | `containerd exec` blocked by CoCo policy — GPU metrics via DCGM Exporter + Node Exporter over Calico network | The kata-agent has a built-in OPA policy engine (Regorus) that evaluates every tRPC API call from the host. CoCo policy (`allow-all-except-exec-process.rego`) blocks `ExecProcessRequest` — the host cannot execute commands inside the guest. However, the CoCo policy controls tRPC API calls only, NOT network traffic. DCGM Exporter (`:9400`) and Node Exporter (`:9100`) run inside the guest as systemd services (baked into attested rootfs) and are scraped by Prometheus over the standard Calico network — pure TCP/IP, no kata-agent involvement. VM-level cgroup stats remain visible to the host via containerd task metrics API. 
| Allow exec in CoCo policy (breaks trust model); host-side GPU metrics via sysfs (unavailable when GPU is bound to vfio-pci); VSOCK-based metrics daemon (unnecessary — network path works and is simpler) | + +## Resolved Questions + +> **Design premise (2026-03-25):** All Kata Container workloads run on CoCo (Confidential Containers) by default. The host is always untrusted. There is no "non-CoCo Kata" mode. This collapses the original 4-phase plan — CoCo is the starting point, not a future phase. + +The following questions were resolved by the CoCo-by-default premise and earlier design decisions. See Decision Log entries dated 2026-03-25. + +| # | Question | Resolution | +|---|----------|------------| +| Q1 | Image pulling — host or guest? | Guest-side pull via `image-rs` only. Host is untrusted. Registry credentials via KBS after remote attestation. | +| Q2 | Metrics collection in CoCo VMs | DCGM Exporter + Node Exporter inside guest, scraped by Prometheus over Calico network. `containerd exec` blocked by CoCo policy. | +| Q3 | Multi-GPU IOMMU validation | Exclude shared groups with warning; `max_gpus_per_vm` safety limit. TEE constraints validated empirically on target DGX/HGX hardware. | +| Q4 | BEP-1016 relationship | Use current `generate_docker_args()` with `_kata_vfio_devices`; migrate to BEP-1016 when available. CoCo is orthogonal. | +| Q5 | ScalingGroupOpts backend field | Homogeneous-by-convention. CoCo-by-default eliminates kata/kata-confidential distinction. | +| Q6 | CoCo bind mount / vfolder storage | Direct guest-side NFS/Lustre mount (native kernel client, RDMA-preserving). Mount specs via `storage-mounts.json`. | +| Q9 | Guest base image pipeline | Monolithic attested rootfs; all drivers baked in. Runtime loading eliminated (untrusted host). | +| Q10 | InfiniBand HCA — whole-device vs SR-IOV | Whole-device only. PF attestation is simpler; SR-IOV broken upstream and VF attestation immature. 
| +| Q7 | InfiniBand HCA plugin architecture | `RDMAVFIOPlugin` as non-schedulable co-plugin. Auto-attaches co-located HCA when GPU is allocated — RDMA is implicit for agents that have it. No scheduler changes needed. | +| Q8 | Agent volume mount point awareness | Agent-side configuration. `[kata]` config lists volume mount points. Agent writes `storage-mounts.json` to `/home/config/`; guest boot script mounts volumes. No manager/storage-proxy changes. | + +### ~~Q2: Metrics Collection in CoCo VMs~~ — Resolved + +GPU and container metrics collected via **DCGM Exporter** (`:9400`) and **Node Exporter** (`:9100`) running inside the guest VM, scraped by Prometheus over the standard Calico network. + +**Why `containerd exec` doesn't work:** The kata-agent's built-in OPA policy engine (Regorus) evaluates every tRPC API call from the host. The recommended CoCo policy (`allow-all-except-exec-process.rego`) blocks `ExecProcessRequest` — the host cannot execute commands inside the guest. `ReadStreamRequest` (stdout/stderr) is also policy-controllable. The policy is baked into the attested guest rootfs and the host cannot modify it. + +**Why network-based scraping works:** The CoCo kata-agent policy controls **tRPC API calls** (exec, copy file, read stream), NOT **network traffic**. The VM has a Calico-assigned IP and a virtio-net interface that carries standard TCP/IP traffic — the same path workloads use for any network I/O. Prometheus exporters inside the guest listen on standard ports and are reachable over the Calico network. No kata-agent involvement — pure network I/O. 
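As one concrete illustration of the scrape path described above, the agent (or a small controller) could publish each VM's exporter endpoints as Prometheus file-based service discovery target groups. This is a hedged sketch; the label names and output layout are assumptions, not part of this design:

```python
import json

# Hedged sketch: publishing per-VM exporter endpoints as Prometheus
# file_sd target groups. Label names and layout are assumptions; the
# exporter ports (:9400 DCGM, :9100 Node Exporter) come from the design.

def file_sd_targets(vm_ips: dict[str, str]) -> list[dict]:
    """Build one target group per kernel, covering DCGM Exporter (:9400)
    and Node Exporter (:9100) at the VM's Calico-assigned IP."""
    groups = []
    for kernel_id, ip in sorted(vm_ips.items()):
        groups.append({
            "targets": [f"{ip}:9400", f"{ip}:9100"],
            "labels": {"kernel_id": kernel_id, "backend": "kata"},
        })
    return groups

groups = file_sd_targets({"kern-01": "10.233.64.17"})
print(json.dumps(groups, indent=2))
```

Writing this JSON to a file watched by a `file_sd_configs` entry lets Prometheus pick up new VMs without a config reload.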
+ +``` +Prometheus (infra) Guest VM (CoCo) +┌──────────────┐ ┌──────────────────────────────┐ +│ scrape │── HTTP GET ──────────→ │ :9400 DCGM Exporter (GPU) │ +│ targets: │ (Calico network, │ :9100 Node Exporter (CPU, │ +│ - vm_ip:9400 │ virtio-net, │ mem, disk, net) │ +│ - vm_ip:9100 │ standard TCP/IP) │ │ +└──────────────┘ │ kata-agent policy: N/A │ + │ (network traffic bypasses │ + │ tRPC policy engine) │ + └──────────────────────────────┘ +``` + +**Metrics architecture:** + +| Metric source | Exporter | Port | What it provides | +|---|---|---|---| +| GPU | [DCGM Exporter](https://github.com/NVIDIA/dcgm-exporter) | `:9400` | GPU utilization, memory usage, temperature, power, ECC errors, NVLink throughput | +| CPU / Memory / Disk / Network | [Node Exporter](https://github.com/prometheus/node_exporter) | `:9100` | Guest-visible CPU, memory, disk I/O, network I/O — accurate because guest kernel sees VM's allocated resources | +| VM boot time | Host-side (agent) | N/A | Delta between `create_task()` and first kernel runner ZMQ connection — purely host-side measurement, no guest cooperation needed | +| VM cgroup stats | Host-side (containerd) | N/A | containerd task metrics API returns VM-level cgroup stats — describes hypervisor process, allowed by CoCo policy | + +**Guest base image additions:** Both `dcgm-exporter` and `node_exporter` binaries are baked into the attested guest rootfs and started as systemd services. DCGM Exporter requires the NVIDIA driver and DCGM libraries (already in the image for GPU compute). Node Exporter is a single static binary (~20 MB). + +**Backend.AI integration:** The host agent does NOT collect metrics via the stats pipeline (`gather_container_measures()`) for GPU and per-container details. 
Instead: +- Prometheus scrapes DCGM + Node Exporter at the VM's Calico IP +- Backend.AI's monitoring/billing systems read from Prometheus (or Grafana) — this aligns with existing observability infrastructure +- The agent's intrinsic plugins report only host-observable metrics: VM cgroup stats (containerd task metrics) and VM boot time + +**Security note:** Network traffic between Prometheus and the exporters traverses the host network stack (virtio-net + TC filter). GPU utilization metrics reveal workload intensity but not workload content — acceptable for most CoCo threat models. If metrics are considered sensitive, TLS can be enabled on the exporters with certificates provisioned via the KBS attestation flow. + +See Decision Log (2026-03-25) for the `containerd exec` policy finding. + +### ~~Q7: InfiniBand HCA Plugin Architecture~~ — Resolved + +**Decision:** `RDMAVFIOPlugin` as non-schedulable co-plugin (Option C). The plugin discovers InfiniBand HCAs (Mellanox vendor ID `0x15b3`, device class `0x0207`, bound to `vfio-pci`) and validates their IOMMU groups, but does NOT expose schedulable slots. When `CUDAVFIOPlugin` allocates a GPU, `KataKernelCreationContext` queries the RDMA plugin for co-located HCAs (same PCIe root complex / switch) and auto-attaches them with matching `clique_id` annotations for Kata VRA P2P topology replication. + +**Rationale:** On target hardware (DGX/HGX), every GPU node has co-located InfiniBand HCAs — RDMA capability is a property of the node, not a separately schedulable resource. Making RDMA implicit avoids scheduler complexity (no need to co-allocate `cuda.device` + `rdma.device`), matches the 1:1 GPU↔HCA topology of DGX/HGX systems, and requires no changes to existing scheduling logic. Agents without HCAs simply have no RDMA devices to auto-attach — GPU-only workloads work unchanged. + +**Scope:** Traditional GPU servers (HGX/DGX + InfiniBand) only. 
Rack-scale NVLink systems (NVL72) require a separate BEP — see "Future: MNNVL Support" below. + +### Future: MNNVL Support (Separate BEP Required) + +Rack-scale NVLink systems (GB200 NVL72, Vera Rubin NVL72) use **Multi-Node NVLink (MNNVL)** — a fundamentally different interconnect architecture from InfiniBand-based GPU servers. This requires a separate BEP, not an extension of BEP-1051. + +**Why a separate BEP:** + +| Aspect | BEP-1051 (InfiniBand) | MNNVL (future BEP) | +|---|---|---| +| GPU interconnect | InfiniBand (400 Gbps) | NVLink fabric (1.8 TB/s per GPU — 36x IB) | +| Isolation model | Kata VM + VFIO passthrough | IMEX domains (per-workload NVLink partition) — may not need VMs | +| CC trust domain | Per-VM (CPU TEE + GPU) | Rack-scale (72 GPUs + NVLink encryption hardware) | +| Multi-node orchestration | VFIO device assignment + clique-id | IMEX daemon lifecycle + Fabric Manager | +| Agent architecture | KataAgent (containerd + Kata shim) | Potentially new agent type for MNNVL | + +**Key technical findings** (from analysis of [NVIDIA k8s-dra-driver-gpu](https://github.com/NVIDIA/k8s-dra-driver-gpu) source): + +- **IMEX is a standalone daemon** (`nvidia-imex`), NOT Kubernetes-specific. Runs as a systemd service, configured via `nodes.cfg` (peer IP list) and `imexd.cfg`. Backend.AI can orchestrate it directly. +- **Fabric Manager** (`nv-fabricmanager`) is also standalone — assigns cluster UUIDs and NVLink clique IDs to GPUs via NVSwitch topology configuration. +- **Clique ID is hardware-derived** — read from GPU via NVML `GetGpuFabricInfo()`, NOT assigned by software. Format: `"clusterUUID.cliqueID"`. +- **2048 IMEX channels** per node (`/dev/nvidia-caps-imex-channels/channel0..2047`) — device nodes injected into containers for cross-node GPU memory import/export. +- **NVIDIA's DRA + ComputeDomains are thin K8s wrappers** around these standalone daemons. 
Backend.AI would replace the K8s orchestration layer (ComputeDomain CR → DaemonSet → kubelet plugin) with its own session-based IMEX lifecycle management. +- **No GPU partitioning at fabric level** — NVLink fabric is all-to-all within a clique. Cross-node grouping is purely "select N nodes from same clique." Per-node GPU partitioning (MIG) is orthogonal. + +**What the future MNNVL BEP must address:** +1. New agent type or KataAgent extension for IMEX domain management +2. Scheduler awareness of NVLink clique topology (nodes sharing a clique ID) +3. IMEX daemon lifecycle tied to session lifecycle (start peers → wait for quorum → inject channels → cleanup) +4. Whether Kata VMs are needed on NVL72 (NVLink CC + IMEX isolation may suffice without VM boundary) +5. Interaction between MNNVL NVLink (GPU traffic) and InfiniBand (storage traffic) on the same system + +### ~~Q8: Agent Volume Mount Point Awareness~~ — Resolved + +**Decision:** Agent-side configuration (Option B). The `[kata]` config section lists storage volume mount points with filesystem type and mount options. KataAgent matches each `VFolderMount.host_path` to a configured volume by longest-prefix match, then writes `storage-mounts.json` to `/home/config/` (virtio-fs config channel) before VM boot. A guest boot script reads this file and performs NFS/Lustre mounts. Vfolder bind mounts are passed to the OCI container spec unchanged — they operate as guest-internal bind mounts since the volume is already mounted inside the guest. + +**Rationale:** No manager or storage-proxy API changes needed — the agent is self-contained and independently configurable per deployment. The storage-proxy and agent are typically co-located on the same host (or tightly coupled in deployment), so keeping volume mount config in the agent's `[kata]` section is operationally natural. 
The trade-off (manual sync with storage-proxy config) is acceptable because volume mount points change infrequently and are infrastructure-level configuration, not per-session state. + +## References + +- [BEP-1002: Agent Architecture](BEP-1002-agent-architecture.md) — resource management and kernel runner patterns +- [BEP-1016: Accelerator Interface v2](BEP-1016-accelerator-interface-v2.md) — lifecycle hook API proposal +- [BEP-1000: Redefining Accelerator Metadata](BEP-1000-redefining-accelerator-metadata.md) — device metadata model +- [BEP-1044: Multi-Agent Device Split](BEP-1044-multi-agent-device-split.md) — multi-agent GPU allocation +- [TR-2026-001: Kata Containers Feature Parity Analysis](../docs/reports/kata-containers-feature-parity-analysis.md) +- [Kata Containers Architecture](https://github.com/kata-containers/kata-containers/blob/main/docs/design/architecture/README.md) +- [Linux VFIO Documentation](https://docs.kernel.org/driver-api/vfio.html) +- [Calico CNI Plugin Configuration](https://docs.tigera.io/calico/latest/reference/configure-cni-plugins) +- [Calico Network Policy](https://docs.tigera.io/calico/latest/network-policy/get-started/calico-policy/calico-network-policy) +- [Kata VRA (Virtualization Reference Architecture)](https://github.com/kata-containers/kata-containers/blob/main/docs/design/kata-vra.md) — GPU/accelerator passthrough design with GPUDirect P2P/RDMA support +- [NVIDIA GPUDirect RDMA Documentation](https://docs.nvidia.com/cuda/gpudirect-rdma/index.html) +- [MNNVL User Guide](https://docs.nvidia.com/multi-node-nvlink-systems/mnnvl-user-guide/overview.html) — Multi-Node NVLink architecture, IMEX daemon, Fabric Manager +- [Enabling Multi-Node NVLink on Kubernetes](https://developer.nvidia.com/blog/enabling-multi-node-nvlink-on-kubernetes-for-gb200-and-beyond/) — DRA driver, ComputeDomains concept (Kubernetes-native, but underlying daemons are standalone) +- [NVIDIA Confidential 
Computing](https://www.nvidia.com/en-us/data-center/solutions/confidential-computing/) — Blackwell TEE-IO, NVLink encryption, rack-scale CC diff --git a/proposals/BEP-1051/configuration-deployment.md b/proposals/BEP-1051/configuration-deployment.md new file mode 100644 index 00000000000..59de9e4d51a --- /dev/null +++ b/proposals/BEP-1051/configuration-deployment.md @@ -0,0 +1,346 @@ + + +# BEP-1051: Configuration and Deployment + +## Summary + +This document defines the configuration schema additions for the Kata backend, including hypervisor selection, VFIO settings, VM resource defaults, and host deployment prerequisites. The `[kata]` section is added to the agent's unified config (`AgentUnifiedConfig` at `src/ai/backend/agent/config/unified.py`). + +## Current Design + +The agent backend is selected via `local_config.agent_common.backend`, which reads the `AgentBackend` enum (`src/ai/backend/agent/types.py:36`). The `get_agent_discovery()` function (`types.py:84`) dynamically imports the backend package. + +Docker-specific configuration lives in the `[container]` section: + +```toml +[container] +stats-type = "docker" +sandbox-type = "jail" +scratch-root = "/tmp/scratches" +``` + +There is no VM or hypervisor configuration in the current schema. + +## Proposed Design + +### Backend Selection + +```toml +[agent] +backend = "kata" # "docker" | "kubernetes" | "kata" +``` + +When `backend = "kata"`, the agent imports `ai.backend.agent.kata` and instantiates `KataAgentDiscovery`. 
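One consequence of this selection scheme is worth stating explicitly: the `[kata]` section is optional in the unified config schema but mandatory once `backend = "kata"` is chosen. A minimal guard illustrating the invariant (a hypothetical helper; the real check would live in config validation):

```python
from typing import Any


def check_backend_config(config: dict[str, Any]) -> None:
    """Reject a config that selects the kata backend without a [kata] section."""
    backend = config.get("agent", {}).get("backend", "docker")
    if backend == "kata" and config.get("kata") is None:
        raise ValueError(
            "backend = 'kata' requires a [kata] section in the agent config"
        )


check_backend_config({"agent": {"backend": "docker"}})            # OK: no [kata] needed
check_backend_config({"agent": {"backend": "kata"}, "kata": {}})  # OK
```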
+ +### Kata Configuration Section + +```toml +[kata] +# --- Hypervisor --- +hypervisor = "cloud-hypervisor" # "cloud-hypervisor" | "qemu" | "dragonball" + +# --- VM Defaults --- +default-vcpus = 2 # Initial vCPUs per VM (hot-plugged as needed) +default-memory-mb = 2048 # Initial memory per VM in MB +vm-overhead-mb = 64 # Per-VM VMM process + guest kernel + kata-agent overhead (MB), deducted from host capacity + +# --- Guest VM Image (NOT the container image — see note below) --- +kernel-path = "/opt/kata/share/kata-containers/vmlinux.container" +initrd-path = "" # Empty = use rootfs image instead of initrd +rootfs-path = "/opt/kata/share/kata-containers/kata-containers.img" + +# --- Storage --- +shared-fs = "virtio-fs" # "virtio-fs" | "virtio-9p" (9p deprecated) +virtiofsd-path = "/opt/kata/libexec/virtiofsd" +virtio-fs-cache-size = 0 # DAX window in MB; 0 = disabled (recommended default) + +# --- Networking --- +network-model = "tcfilter" # "tcfilter" | "macvtap" + +# --- VFIO --- +enable-iommu = true +hotplug-vfio = "root-port" # "root-port" | "bridge-port" | "no-port" + +# --- Containerd --- +containerd-socket = "/run/containerd/containerd.sock" +kata-runtime-class = "kata" # RuntimeClass name registered in containerd + +# --- Confidential Computing (CoCo-by-default) --- +confidential-guest = true # CoCo is always enabled; no non-CoCo mode +guest-attestation = "tdx" # "tdx" | "sev-snp" (required) +kbs-endpoint = "http://kbs.internal:8080" # Key Broker Service for sealed secrets + +# --- Storage (guest-side direct mount) --- +kernel-modules = ["nfs", "nfsv4"] # Modules loaded during create_sandbox() +guest-hook-path = "/usr/share/oci/hooks" # OCI prestart hook path in guest rootfs + +[[kata.storage-mounts]] +fs-type = "nfs4" +source = "storage-server:/exports/vfstore" +mountpoint = "/mnt/vfstore" +options = "vers=4.1,rsize=1048576,wsize=1048576" + +[[kata.storage-mounts]] +fs-type = "lustre" +source = "lustre-mgs@tcp:/scratch" +mountpoint = "/mnt/lustre" +options 
= "" + +[kata.storage-nic] +device = "ib0" +address = "10.0.100.5/24" +gateway = "10.0.100.1" +``` + +### Guest VM Image vs Container Image + +Kata Containers uses **two separate filesystem layers** that must not be confused: + +1. **Guest VM rootfs** (`rootfs-path` above): A minimal mini-OS image containing only the kata-agent, systemd, and essential utilities. This is the VM's boot disk — shared across all VMs, read-only, and mounted via DAX on a `/dev/pmem*` device inside the guest. It is **not** the user's container image. This is an infrastructure-level asset analogous to a VM template. + +2. **Container image** (e.g., `cr.backend.ai/stable/python-tensorflow:2.15-py312-cuda12.3`): The user-selected OCI image that Backend.AI's image management system resolves. Under CoCo-by-default, images are pulled and decrypted **inside the guest** using `image-rs` — the host is untrusted and must not see image contents. Registry credentials are delivered via KBS (Key Broker Service) after remote attestation. + +```mermaid +flowchart TB + boot["Guest VM boots from kata-containers.img (attested rootfs)"] + boot --> attest["TEE attestation completes"] + attest --> pull["image-rs pulls encrypted OCI image from registry"] + pull --> decrypt["Decrypts image with keys from KBS (attestation-gated)"] + decrypt --> mount["kata-agent mounts container rootfs inside guest"] + mount --> run["Container process runs on OCI image filesystem"] +``` + +**Image management changes for CoCo:** The image registry and image selection flow are unchanged (manager resolves image references identically). The pull mechanism differs — instead of host-side containerd pull, `image-rs` inside the guest TEE handles pull + integrity verification + decryption. The agent cannot verify image presence on the host; it passes the image reference to the Kata shim and waits for the kernel runner ZMQ connection. 
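This changes the agent's readiness model: with no host-side pull there is no pull-progress signal to observe, so the kernel runner's ZMQ connection becomes the only readiness indicator, and its timeout must absorb VM boot, attestation, and the guest-side pull. A sketch of that wait (a hypothetical helper; the event name and default timeout are illustrative):

```python
import asyncio


async def wait_for_kernel_runner(
    connected: asyncio.Event, timeout: float = 300.0
) -> None:
    """Wait until the kernel runner inside the guest connects back over ZMQ.

    Under CoCo the host sees no image-pull progress, so this single event
    covers VM boot + TEE attestation + guest-side image pull via image-rs.
    """
    try:
        await asyncio.wait_for(connected.wait(), timeout)
    except asyncio.TimeoutError:
        raise TimeoutError(
            "kernel runner did not connect in time; "
            "guest-side image pull or attestation may have failed"
        ) from None
```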
+ +### Pydantic Config Model + +```python +class KataConfig(BaseConfigSchema): + hypervisor: Literal["cloud-hypervisor", "qemu", "dragonball"] = Field( + default="cloud-hypervisor", + validation_alias=AliasChoices("hypervisor"), + ) + + # VM defaults + default_vcpus: int = Field( + default=2, + validation_alias=AliasChoices("default_vcpus", "default-vcpus"), + ) + default_memory_mb: int = Field( + default=2048, + validation_alias=AliasChoices("default_memory_mb", "default-memory-mb"), + ) + vm_overhead_mb: int = Field( + default=64, + description="VMM process + guest kernel + kata-agent overhead (MB)", + validation_alias=AliasChoices("vm_overhead_mb", "vm-overhead-mb"), + ) + + # Guest image + kernel_path: Path = Field( + default=Path("/opt/kata/share/kata-containers/vmlinux.container"), + validation_alias=AliasChoices("kernel_path", "kernel-path"), + ) + initrd_path: Path | None = Field( + default=None, + validation_alias=AliasChoices("initrd_path", "initrd-path"), + ) + rootfs_path: Path = Field( + default=Path("/opt/kata/share/kata-containers/kata-containers.img"), + validation_alias=AliasChoices("rootfs_path", "rootfs-path"), + ) + + # Storage + shared_fs: Literal["virtio-fs", "virtio-9p"] = Field( + default="virtio-fs", + validation_alias=AliasChoices("shared_fs", "shared-fs"), + ) + virtiofsd_path: Path = Field( + default=Path("/opt/kata/libexec/virtiofsd"), + validation_alias=AliasChoices("virtiofsd_path", "virtiofsd-path"), + ) + virtio_fs_cache_size: int = Field( + default=0, + validation_alias=AliasChoices("virtio_fs_cache_size", "virtio-fs-cache-size"), + ) + + # Networking + network_model: Literal["tcfilter", "macvtap"] = Field( + default="tcfilter", + validation_alias=AliasChoices("network_model", "network-model"), + ) + + # VFIO + enable_iommu: bool = Field( + default=True, + validation_alias=AliasChoices("enable_iommu", "enable-iommu"), + ) + hotplug_vfio: Literal["root-port", "bridge-port", "no-port"] = Field( + default="root-port", + 
validation_alias=AliasChoices("hotplug_vfio", "hotplug-vfio"), + ) + + # Containerd + containerd_socket: Path = Field( + default=Path("/run/containerd/containerd.sock"), + validation_alias=AliasChoices("containerd_socket", "containerd-socket"), + ) + kata_runtime_class: str = Field( + default="kata", + validation_alias=AliasChoices("kata_runtime_class", "kata-runtime-class"), + ) + + # Confidential computing (CoCo-by-default) + confidential_guest: bool = Field( + default=True, + description="CoCo is always enabled; no non-CoCo Kata mode", + validation_alias=AliasChoices("confidential_guest", "confidential-guest"), + ) + guest_attestation: Literal["tdx", "sev-snp"] = Field( + default="tdx", + description="TEE attestation type (required)", + validation_alias=AliasChoices("guest_attestation", "guest-attestation"), + ) + kbs_endpoint: str = Field( + default="", + description="Key Broker Service endpoint for sealed secrets and image decryption keys", + validation_alias=AliasChoices("kbs_endpoint", "kbs-endpoint"), + ) + + # Storage (guest-side direct mount) + storage_mounts: list[StorageMountSpec] = Field( + default_factory=list, + description="Storage volumes to mount inside the guest VM via native NFS/Lustre client", + validation_alias=AliasChoices("storage_mounts", "storage-mounts"), + ) + storage_nic: StorageNicSpec | None = Field( + default=None, + description="Storage network interface (VFIO passthrough IPoIB/Ethernet) IP configuration", + validation_alias=AliasChoices("storage_nic", "storage-nic"), + ) + + # Guest hooks + guest_hook_path: str = Field( + default="/usr/share/oci/hooks", + description="Path inside guest rootfs where OCI prestart hooks are scanned", + validation_alias=AliasChoices("guest_hook_path", "guest-hook-path"), + ) + kernel_modules: list[str] = Field( + default_factory=lambda: ["nfs", "nfsv4"], + description="Kernel modules to load inside guest during sandbox creation", + validation_alias=AliasChoices("kernel_modules", "kernel-modules"), + ) 
+ + +class StorageMountSpec(BaseConfigSchema): + """A storage volume to mount directly inside the guest VM.""" + fs_type: str = Field(description="Filesystem type: nfs, nfs4, lustre, wekafs") + source: str = Field(description="Mount source (e.g., 'storage-server:/exports/vfstore')") + mountpoint: str = Field(description="Guest-side mount point (e.g., '/mnt/vfstore')") + options: str = Field(default="", description="Mount options (e.g., 'vers=4.1,rdma')") + + +class StorageNicSpec(BaseConfigSchema): + """Storage network interface IP configuration for guest VM.""" + device: str = Field(description="Guest-side network device name (e.g., 'ib0', 'eth1')") + address: str = Field(description="IP address with prefix (e.g., '10.0.100.5/24')") + gateway: str = Field(default="", description="Gateway address (optional)") +``` + +Integration in `AgentUnifiedConfig`: + +```python +class AgentUnifiedConfig(BaseConfigSchema): + # ... existing fields ... + kata: KataConfig | None = None # Only present when backend = "kata" +``` + +### Startup Validation + +At `KataAgent.__ainit__()`, validate: + +1. `kernel_path` exists and is readable +2. `rootfs_path` (or `initrd_path`) exists and is readable +3. `virtiofsd_path` exists and is executable +4. `containerd_socket` exists and is connectable +5. Kata runtime class is registered in containerd config +6. If `enable_iommu`: IOMMU is enabled in the kernel (`/sys/class/iommu` is non-empty) +7. `vhost_vsock` kernel module is loaded + +Fail fast with descriptive error messages on any validation failure. 
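The checks above can be written as a pure function that accumulates errors instead of raising on the first failure, so operators see every unmet prerequisite at once. A sketch under stated assumptions (`__ainit__()` raises when the returned list is non-empty; checks 4 and 5 need a live containerd connection and are elided here):

```python
import os
from pathlib import Path


def check_kata_prereqs(
    kernel_path: Path,
    rootfs_path: Path,
    virtiofsd_path: Path,
    enable_iommu: bool = True,
    iommu_dir: Path = Path("/sys/class/iommu"),
    modules_file: Path = Path("/proc/modules"),
) -> list[str]:
    """Return human-readable errors for unmet host prerequisites (empty = OK)."""
    errors: list[str] = []
    if not kernel_path.is_file():
        errors.append(f"guest kernel not found: {kernel_path}")
    if not rootfs_path.is_file():
        errors.append(f"guest rootfs not found: {rootfs_path}")
    if not (virtiofsd_path.is_file() and os.access(virtiofsd_path, os.X_OK)):
        errors.append(f"virtiofsd missing or not executable: {virtiofsd_path}")
    if enable_iommu and not any(iommu_dir.glob("*")):
        errors.append("IOMMU not enabled (check intel_iommu=on / amd_iommu=on)")
    # Note: a built-in (non-module) vsock would need a different check.
    if modules_file.is_file() and "vhost_vsock" not in modules_file.read_text():
        errors.append("vhost_vsock kernel module not loaded")
    return errors
```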
+ +### Hypervisor Comparison + +| Feature | Cloud Hypervisor | QEMU | Dragonball | +|---------|-----------------|------|------------| +| Boot time | ~125-200ms | ~300-500ms | ~100ms (est., no published benchmark) | +| Memory overhead | ~15MB | ~60MB | ~10MB | +| VFIO passthrough | Yes | Yes | Yes | +| virtio-fs | Yes | Yes | Yes | +| CPU/mem hotplug | Yes | Yes | Yes | +| Confidential (TDX/SEV) | Yes | Yes | No | +| Code size | ~50K LOC (Rust) | ~2M LOC (C) | Rust (in-process) | +| Maturity | Production | Production | Production (Alibaba) | + +**Default: Cloud Hypervisor** — best balance of feature completeness, security (small Rust codebase), and performance. QEMU recommended only for confidential computing on older hardware or non-x86 architectures. + +## Host Deployment Requirements + +### Kernel and Hardware + +- Linux kernel >= 5.10 with KVM enabled +- Intel VT-x/VT-d or AMD-V/AMD-Vi (IOMMU) +- Kernel parameters: `intel_iommu=on iommu=pt` (Intel) or `amd_iommu=on iommu=pt` (AMD) +- Kernel modules: `kvm`, `kvm_intel`/`kvm_amd`, `vfio`, `vfio_pci`, `vfio_iommu_type1`, `vhost_vsock` + +### Software + +- Kata Containers 3.x (`kata-runtime`, guest kernel, guest rootfs/initrd) +- containerd >= 1.7 with Kata shim (`io.containerd.kata.v2`) +- virtiofsd (included in Kata installation) + +### containerd Configuration + +```toml +# /etc/containerd/config.toml +[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata] + runtime_type = "io.containerd.kata.v2" + privileged_without_host_devices = true + + [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata.options] + ConfigPath = "/opt/kata/share/defaults/kata-containers/configuration.toml" +``` + +### VFIO GPU Binding + +GPUs intended for VFIO passthrough must be unbound from the `nvidia` driver and bound to `vfio-pci`: + +```bash +# Identify GPU PCI addresses +lspci -nn | grep -i nvidia + +# Bind to vfio-pci (example for 0000:41:00.0) +echo "0000:41:00.0" > /sys/bus/pci/devices/0000:41:00.0/driver/unbind 
+echo "10de 2684" > /sys/bus/pci/drivers/vfio-pci/new_id + +# Verify +ls -la /dev/vfio/ +``` + +For persistent binding, use `driverctl` or modprobe configuration. + +## Implementation Notes + +- `KataConfig` is validated at agent startup; the agent refuses to start if prerequisites are not met +- The `generate-sample` CLI command should include the `[kata]` section with comments +- Hypervisor binary paths can be auto-detected from `kata-runtime env` output +- The `[container]` section's `scratch-root` and `port-range` settings are reused by KataAgent diff --git a/proposals/BEP-1051/kata-agent-backend.md b/proposals/BEP-1051/kata-agent-backend.md new file mode 100644 index 00000000000..40dfcfd1324 --- /dev/null +++ b/proposals/BEP-1051/kata-agent-backend.md @@ -0,0 +1,512 @@ + + +# BEP-1051: KataAgent Backend + +## Summary + +KataAgent is the third `AbstractAgent` implementation that manages containers inside lightweight VMs via Kata Containers 3.x. It communicates with containerd's gRPC API to create containers using the Kata runtime shim, replacing Docker API calls with containerd's native client API (containers, tasks, and sandbox services). 
+ +## Current Design + +### Backend Abstraction + +The existing backend enum and dynamic import pattern (`src/ai/backend/agent/types.py`): + +```python +class AgentBackend(enum.StrEnum): + DOCKER = "docker" + KUBERNETES = "kubernetes" + DUMMY = "dummy" + +def get_agent_discovery(backend: AgentBackend) -> AbstractAgentDiscovery: + agent_mod = importlib.import_module(f"ai.backend.agent.{backend.value}") + return cast(AbstractAgentDiscovery, agent_mod.get_agent_discovery()) +``` + +`DockerAgent` (`src/ai/backend/agent/docker/agent.py`) manages containers via `aiodocker`: +- Creates containers with `docker.containers.create(container_config)` +- Merges accelerator plugin args via `update_nested_dict(container_config, plugin_args)` +- Monitors events via `docker.events.subscribe()` +- Shares storage via Docker bind mounts (zero overhead, direct kernel VFS) + +### Three-Package Architecture Inside Containers + +Backend.AI runs three software layers inside every container: + +| Package | Role | Lifecycle | +|---------|------|-----------| +| `ai.backend.runner` (`entrypoint.sh` + static binaries) | Bootstrap: user/group setup, SSH keys, LD_PRELOAD, password generation | Runs once at container start, then `exec`s the main program | +| `ai.backend.kernel` (`BaseRunner` subclasses) | Long-running process (`bai-krunner`): receives commands from agent, executes user code, manages services | Runs for container lifetime | +| `ai.backend.agent` (host-side) | Writes config files to scratch dirs, communicates with kernel runner via ZMQ TCP | Runs on host | + +**Boot sequence:** + +```mermaid +sequenceDiagram + participant A as Agent (host) + participant G as Container (guest in Kata) + + Note over A: [1] get_intrinsic_mounts() + Note over A: [2] Resource allocation + Note over A: [3] prepare_scratch() - create dirs + Note over A: [4] Write environ.txt, resource.txt,
intrinsic-ports.json, storage-mounts.json, SSH keys + Note over A: [5] apply_network(), prepare_ssh(),
mount_vfolders(), process_mounts() + A->>G: [6] Create container with mounts + + Note over G: [7] entrypoint.sh runs + Note over G: Create user/group, dotfiles, password + Note over G: exec su-exec -> bai-krunner + + Note over G: [8] BaseRunner.__init__() + Note over G: Read /home/config/environ.txt + + Note over G: [9] BaseRunner._init() + Note over G: Read intrinsic-ports.json + Note over G: Bind ZMQ PULL/PUSH sockets + + Note over G: [10] main_loop() + Note over G: Start sshd + ttyd, enter ZMQ loop + + G-->>A: [11-12] ZMQ TCP connect (kernel runner ready) +``` + +### Config Files in /home/config + +The agent writes several files to `scratch_dir/config/` before container creation. These are mounted read-only at `/home/config` inside the container: + +| File | Writer | Reader | Purpose | +|------|--------|--------|---------| +| `environ.txt` | Agent (`docker/agent.py:844-854`) | **Kernel runner** (`BaseRunner.__init__`, line 187) | `key=value` pairs loaded into `child_env` and `os.environ` for all subprocess execution | +| `resource.txt` | Agent (`docker/agent.py:856-867`) | **Agent only** (recovery: `resources.py:887-910`, `scratch/utils.py:100-103`) | Serialized `KernelResourceSpec` — slot allocations, mounts, device mappings. 
NOT read by the kernel runner | +| `intrinsic-ports.json` | Agent | **Kernel runner** (`BaseRunner._init`, line 244) | Maps service names to ports: `{"replin": 2000, "replout": 2001, "sshd": 2200, "ttyd": 7681}` | +| `ssh/id_cluster` | Agent | **Kernel runner** (`intrinsic.py:89-130`) | Private key for inter-container SSH in multi-node sessions | +| `ssh/id_cluster.pub` | Agent | **Kernel runner** (`intrinsic.py:130-140`) | Public key appended to `~/.ssh/authorized_keys` | +| `ssh/port-mapping.json` | Agent | **Kernel runner** (`intrinsic.py:100-112`) | Cluster node SSH port mapping → written to `~/.ssh/config` | +| `docker-creds.json` | Agent | Container bootstrap | Docker-specific; **NOT used for Kata** — images pulled guest-side via `image-rs` with KBS-delivered credentials | +| `environ_base.txt` | Agent (backup copy) | Agent (recovery) | Baseline copy of environ.txt for hot-update diffing | +| `resource_base.txt` | Agent (backup copy) | Agent (recovery) | Baseline copy of resource.txt for hot-update diffing | + +**Key insight for Kata**: `environ.txt` and `intrinsic-ports.json` must be present **before** the kernel runner process starts — they are read during initialization. `resource.txt` is read only by the agent (host-side) for recovery after restarts and for tracking resource usage — it never crosses into the guest's runtime. + +### Agent Socket (agent.sock) — Skipped for Kata + +The Docker agent socket (`/opt/kernel/agent.sock`) is a ZMQ REP socket for jail/C binaries (PID translation, jail status). **Not used for Kata** — the jail sandbox is not applicable (VM boundary is stronger) and PID translation is irrelevant (guest PIDs are isolated). The primary agent↔kernel-runner communication (ZMQ PUSH/PULL) is TCP-based and works over the Calico network without any changes. 
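To make the config-file contract concrete, the `key=value` format of `environ.txt` can be parsed in a few lines (a minimal re-implementation for illustration only, not the actual `BaseRunner` code):

```python
def parse_environ_txt(text: str) -> dict[str, str]:
    """Parse environ.txt (one key=value pair per line).

    Values may themselves contain '=' (e.g. PATH-like entries), so split
    only on the first occurrence; blank lines are skipped.
    """
    env: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        env[key] = value
    return env


sample = "HOME=/home/work\nPYTHONPATH=/opt/backend.ai/lib\n"
assert parse_environ_txt(sample)["HOME"] == "/home/work"
```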
+ +## Proposed Design + +### Package Structure + +``` +src/ai/backend/agent/kata/ +├── __init__.py # KataAgentDiscovery + get_agent_discovery() +├── agent.py # KataAgent + KataKernelCreationContext +├── kernel.py # KataKernel +├── resources.py # load_resources(), scan_available_resources() +├── intrinsic.py # CPU/Memory compute plugins for Kata +└── containerd_client.py # Async containerd gRPC client wrapper +``` + +### AgentBackend Enum Extension + +```python +# src/ai/backend/agent/types.py +class AgentBackend(enum.StrEnum): + DOCKER = "docker" + KUBERNETES = "kubernetes" + KATA = "kata" # NEW + DUMMY = "dummy" +``` + +### KataAgentDiscovery + +```python +# src/ai/backend/agent/kata/__init__.py +class KataAgentDiscovery(AbstractAgentDiscovery): + def get_agent_cls(self) -> type[AbstractAgent[Any, Any]]: + return KataAgent + + async def load_resources(self, etcd, local_config): + return await load_resources(etcd, local_config) + + async def scan_available_resources(self, compute_device_types): + return await scan_available_resources(compute_device_types) + + async def prepare_krunner_env(self, local_config): + # CoCo: all krunner binaries and Python libraries are baked into + # the attested guest rootfs. The host is untrusted and must not + # be a source of executables. No Docker volumes or virtio-fs + # binary sharing needed. 
+ return await prepare_krunner_env_kata(local_config) + +def get_agent_discovery() -> AbstractAgentDiscovery: + return KataAgentDiscovery() +``` + +### KataAgent + +```python +class KataAgent(AbstractAgent[KataKernel, KataKernelCreationContext]): +``` + +Key method overrides: + +**`__ainit__()`** — Initialize containerd client, validate Kata runtime availability: + +```python +async def __ainit__(self): + await super().__ainit__() + kata_config = self.local_config.kata + self._containerd = ContainerdClient(kata_config.containerd_socket) + await self._containerd.connect() + + # Validate Kata runtime is registered + runtime_info = await self._containerd.get_runtime_info( + kata_config.kata_runtime_class + ) + if not runtime_info: + raise AgentError("Kata runtime not found in containerd") + + self._hypervisor = kata_config.hypervisor + log.info("KataAgent initialized with hypervisor: {}", self._hypervisor) +``` + +**Container lifecycle** — The core difference from DockerAgent: + +```mermaid +flowchart LR + subgraph Docker["DockerAgent"] + d1["docker.containers.create()"] --> d2["docker.containers.start()"] + end + subgraph Kata["KataAgent"] + k1["containerd.create_container()"] --> k2["containerd.create_task()"] --> k3["task.start()"] + k3 -.->|internally| k4["Kata shim boots VM"] --> k5["config share setup"] --> k6["kata-agent"] --> k7["container"] + end +``` + +**`destroy_kernel()`** — Stop and remove via containerd: + +```python +async def destroy_kernel(self, kernel_id, container_id): + await self._containerd.stop_task(container_id) + await self._containerd.delete_task(container_id) + await self._containerd.delete_container(container_id) + # Kata shim automatically tears down VM when last container is removed +``` + +**Event monitoring** — Subscribe to containerd events instead of Docker: + +```python +async def _handle_container_events(self): + async for event in self._containerd.subscribe_events(): + if event.topic == "/tasks/exit": + kernel_id = 
self._extract_kernel_id(event) + await self._handle_kernel_exit(kernel_id, event.exit_status) +``` + +**No agent socket handler** — `handle_agent_socket()` is not started. The jail sandbox and PID translation operations it serves are both irrelevant inside a VM. This eliminates the socat relay container dependency. + +### KataKernelCreationContext + +Key overrides that differ from `DockerKernelCreationContext`: + +**`prepare_scratch()`** — Create directories shared via virtio-fs: + +The VM boot disk (`kata-containers.img`) is a read-only, shared mini-OS containing only the kata-agent. Scratch directories serve a different purpose — they carry per-session configuration and workspace: + +- `/home/config` (RO): Agent-written config files consumed by entrypoint.sh and kernel runner at startup +- `/home/work` (RW): User's persistent workspace and vfolder mount point + +```python +async def prepare_scratch(self): + scratch_dir = self.scratch_root / str(self.kernel_id) + config_dir = scratch_dir / "config" + config_dir.mkdir(parents=True, exist_ok=True) + (scratch_dir / "work").mkdir(parents=True, exist_ok=True) +``` + +**`write_config_files()`** — Write environ.txt and resource.txt: + +```python +async def write_config_files(self, environ, resource_spec): + # environ.txt — consumed by BaseRunner.__init__() inside the guest. + # Every key=value pair becomes part of the child process environment. + with StringIO() as buf: + for k, v in environ.items(): + buf.write(f"{k}={v}\n") + for env in self.computer_docker_args.get("Env", []): + buf.write(f"{env}\n") + (self.config_dir / "environ.txt").write_bytes(buf.getvalue().encode("utf8")) + + # resource.txt — read only by the agent (host-side) for recovery. + # The kernel runner does NOT read this file. 
+    with StringIO() as buf:
+        resource_spec.write_to_file(buf)
+        for dev_type, device_alloc in resource_spec.allocations.items():
+            device_plugin = self.computers[dev_type].instance
+            kvpairs = await device_plugin.generate_resource_data(device_alloc)
+            for k, v in kvpairs.items():
+                buf.write(f"{k}={v}\n")
+        (self.config_dir / "resource.txt").write_bytes(buf.getvalue().encode("utf8"))
+
+    # intrinsic-ports.json — consumed by BaseRunner.__init__() for
+    # ZMQ socket binding and service port assignment.
+    ports = {
+        "replin": self._repl_in_port,
+        "replout": self._repl_out_port,
+        "sshd": 2200,
+        "ttyd": 7681,
+    }
+    (self.config_dir / "intrinsic-ports.json").write_bytes(
+        json.dumps(ports).encode("utf8")
+    )
+
+    # storage-mounts.json — consumed by the guest prestart hook
+    # (OCI hook at /usr/share/oci/hooks/prestart/setup-storage.sh).
+    # Contains volume mount specs and storage NIC IP for the guest
+    # boot script to configure before the container starts.
+    kata_config = self.local_config.kata
+    storage_config = {
+        "storage_nic": (
+            {
+                "device": kata_config.storage_nic.device,
+                "address": kata_config.storage_nic.address,
+                "gateway": kata_config.storage_nic.gateway,
+            }
+            if kata_config.storage_nic
+            else None
+        ),
+        "volumes": [
+            {
+                "fs_type": m.fs_type,
+                "source": m.source,
+                "mountpoint": m.mountpoint,
+                "options": m.options,
+            }
+            for m in kata_config.storage_mounts
+        ],
+    }
+    (self.config_dir / "storage-mounts.json").write_bytes(
+        json.dumps(storage_config).encode("utf8")
+    )
+```
+
+The guest prestart hook (`/usr/share/oci/hooks/prestart/setup-storage.sh`, baked into the attested guest rootfs) reads this file during `create_container()` and:
+1. Configures the storage NIC IP (IPoIB or Ethernet) via `ip addr add`
+2. Mounts each volume using the guest kernel's native NFS/Lustre client
+3. 
Volumes are then available for container bind mounts at the specified mount points + +**`get_intrinsic_mounts()`** — Skip mounts that are unnecessary or incompatible with Kata: + +```python +async def get_intrinsic_mounts(self) -> Sequence[Mount]: + mounts = [] + # KEEP: config dir via virtio-fs (host-originated config delivery) + mounts.append(Mount(MountTypes.BIND, self.config_dir, Path("/home/config"), + MountPermission.READ_ONLY)) + # /home/work is on the guest disk (part of guest rootfs), NOT a virtio-fs mount. + # VFolder subdirectories are bind-mounted into /home/work/{vfolder} + # from guest-side NFS/Lustre mounts by the OCI prestart hook. + # CHANGE: /tmp as guest-side tmpfs (no need to cross VM boundary) + mounts.append(Mount(MountTypes.TMPFS, None, Path("/tmp"), + MountPermission.READ_WRITE)) + # SKIP: /home/work — already exists on guest disk + # SKIP: timezone — baked into guest rootfs + # SKIP: lxcfs — guest kernel provides accurate /proc and /sys + # SKIP: agent.sock — only used by jail/C binaries, both skipped for Kata + # SKIP: domain socket proxies — UDS cannot cross VM boundary + # SKIP: deep learning samples — deprecated Docker volume + return mounts +``` + +**`mount_krunner()`** — No-op under CoCo; all executables in attested rootfs: + +```python +async def mount_krunner(self, resource_spec, environ): + # CoCo-by-default: all krunner executables (su-exec, entrypoint.sh, + # dropbearmulti, etc.) and Python libraries (ai.backend.kernel, + # ai.backend.helpers) are baked into the attested guest rootfs. + # The host is untrusted — no host→guest executable sharing via virtio-fs. 
+ # + # SKIP: krunner binary mounts — already in guest rootfs + # SKIP: Python library mounts — already in guest rootfs + # SKIP: libbaihook.so + LD_PRELOAD — guest kernel provides accurate sysconf + # SKIP: jail binary — VM boundary is stronger isolation + # SKIP: accelerator LD_PRELOAD hooks — VFIO passthrough makes GPU + # natively visible in guest; no hook libraries needed + pass +``` + +**`apply_accelerator_allocation()`** — Collect VFIO device info from plugins: + +```python +async def apply_accelerator_allocation(self): + vfio_devices = [] + for dev_type, device_alloc in self.resource_spec.allocations.items(): + computer = self.computers[dev_type].instance + plugin_args = await computer.generate_docker_args(None, device_alloc) + # CUDAVFIOPlugin returns {"_kata_vfio_devices": [...]} + if "_kata_vfio_devices" in plugin_args: + vfio_devices.extend(plugin_args["_kata_vfio_devices"]) + self._vfio_devices = vfio_devices +``` + +**`spawn()`** — Create container via containerd with Kata runtime: + +```python +async def spawn(self): + container_config = { + "image": self.image_ref, + "runtime": {"name": "io.containerd.kata.v2"}, + "env": self.environ, + "mounts": self._build_container_mounts(), # config via virtio-fs + vfolder bind mounts (guest-internal) + "labels": self._build_labels(), + "annotations": { + "io.katacontainers.config.hypervisor.hotplug_vfio_on_root_bus": "true", + **self._build_vfio_annotations(), + }, + } + container = await self._containerd.create_container( + container_id=str(self.kernel_id), + config=container_config, + ) + task = await self._containerd.create_task(container.id) + await task.start() + return container.id +``` + +### Kata-Specific entrypoint.sh + +The runner's `entrypoint.sh` needs a Kata-specific variant. 
The Docker version performs several operations that are incompatible or unnecessary: + +| Operation | Docker | Kata | Reason | +|-----------|--------|------|--------| +| Create user/group | Yes | **Yes** | Same user setup needed inside guest | +| `LD_PRELOAD` → `/etc/ld.so.preload` | Yes | **No** | libbaihook.so is not mounted; guest kernel provides accurate sysconf | +| `chown agent.sock` | Yes | **No** | Agent socket is not mounted | +| Extract dotfiles | Yes | **Yes** | Same user environment customization | +| SSH key setup | Yes | **Yes** | Same inter-container SSH support | +| Password generation | Yes | **Yes** | Same session security model | +| `exec su-exec` | Yes | **Yes** | Same privilege drop to user UID/GID | + +The Kata variant skips the `LD_PRELOAD` block and the `chown agent.sock` line, but is otherwise identical. + +### KataKernel + +```python +class KataKernel(AbstractKernel): + container_id: ContainerId + sandbox_id: str # Kata sandbox (VM) identifier + + async def create_code_runner(self): + # The agent↔kernel-runner communication channel is ZMQ TCP, + # which already works across the VM boundary without modification. + # + # Docker: agent connects to tcp://localhost:{host_mapped_port} + # (Docker maps container port to a host port) + # Kata: agent connects to tcp://{guest_ip}:{container_port} + # (guest IP assigned by Calico CNI, reachable via TC filter) + # + # The kernel runner inside the guest binds: + # insock = tcp://*:{repl_in_port} (ZMQ PULL) + # outsock = tcp://*:{repl_out_port} (ZMQ PUSH) + # These ports come from /home/config/intrinsic-ports.json. 
+ return DockerCodeRunner( + kernel_id=self.kernel_id, + kernel_host=self._guest_ip, + repl_in_port=self._repl_in_port, + repl_out_port=self._repl_out_port, + exec_timeout=self.exec_timeout, + client_features=client_features, + ) + + async def check_status(self): + status = await self._containerd.get_task_status(self.container_id) + return status == "running" +``` + +The `DockerCodeRunner` class (or a renamed `CodeRunner` base) is reusable because its only transport is ZMQ over TCP — `get_repl_in_addr()` returns `tcp://{host}:{port}`. The difference is the address: Docker uses `localhost` with port mapping; Kata uses the guest's Calico-assigned IP with direct port access. + +### ContainerdClient + +Async wrapper around containerd's gRPC API. For **single-container sessions**, each container gets its own Kata sandbox (VM) via the standard `containers.v1` / `tasks.v1` APIs. For **multi-container sessions** (clusters on a single host), the containerd Sandbox API (`sandbox.v1`, available in containerd v2+) is used to create a shared sandbox first, then add containers into it: + +```python +class ContainerdClient: + def __init__(self, socket_path: Path): + self._socket = socket_path + + async def connect(self): ... + async def close(self): ... + + # Image operations + async def list_images(self) -> list[ImageInfo]: ... + async def pull_image(self, ref: str, auth: dict | None = None): ... + + # Sandbox operations (containerd v2+ Sandbox API) + # Required for multi-container sessions sharing a single Kata VM. + # Without the Sandbox API, each container creates a separate VM. + async def create_sandbox(self, sandbox_id: str, runtime: str, + config: dict) -> SandboxInfo: ... + async def stop_sandbox(self, sandbox_id: str): ... + async def delete_sandbox(self, sandbox_id: str): ... + + # Container operations + async def create_container(self, container_id: str, config: dict, + sandbox_id: str | None = None) -> ContainerInfo: ... 
+ async def delete_container(self, container_id: str): ... + + # Task operations (a "task" is a running process in containerd) + async def create_task(self, container_id: str) -> TaskInfo: ... + async def start_task(self, container_id: str): ... + async def stop_task(self, container_id: str, timeout: int = 10): ... + async def delete_task(self, container_id: str): ... + async def get_task_status(self, container_id: str) -> str: ... + + # Exec (run command inside running container) + async def exec_in_container(self, container_id: str, + command: list[str]) -> ExecResult: ... + + # Events + async def subscribe_events(self) -> AsyncIterator[Event]: ... + + # Runtime info + async def get_runtime_info(self, runtime_class: str) -> dict | None: ... +``` + +This client uses `grpcio` (or `grpclib` for async) to communicate with containerd's Unix socket. The proto definitions come from the containerd API (`containerd.services.containers.v1`, `containerd.services.tasks.v1`, `containerd.services.sandbox.v1`). + +**API choice rationale:** The containerd client API (`containers.v1`, `tasks.v1`, `sandbox.v1`) is the correct programmatic interface for a non-Kubernetes service. CRI `RuntimeService` is the Kubernetes kubelet's API and should not be used directly. The Kata shim v2 API (`containerd.runtime.v2.Task`) is internal to containerd and not called by external clients. 
+ +## Interface / API + +| Class | Extends | Key Responsibility | +|-------|---------|-------------------| +| `KataAgent` | `AbstractAgent[KataKernel, KataKernelCreationContext]` | Container lifecycle via containerd + Kata shim; no agent socket handler | +| `KataKernel` | `AbstractKernel` | Guest VM container state; code runner connection via ZMQ TCP to guest IP | +| `KataKernelCreationContext` | `AbstractKernelCreationContext` | Config file writing (environ.txt, resource.txt, intrinsic-ports.json); VFIO device collection; mount filtering (skip lxcfs, jail, agent.sock, libbaihook) | +| `KataAgentDiscovery` | `AbstractAgentDiscovery` | Plugin loading, resource scanning, krunner env | +| `ContainerdClient` | (standalone) | Async containerd gRPC wrapper | + +## Implementation Notes + +- Kata runtime must be installed on the host; the agent does not install it +- Image format: standard OCI images work unchanged; containerd handles the pull and unpack +- The `[container]` section settings (`scratch-root`, `port-range`) are shared between Docker and Kata backends +- Intrinsic CPU/Memory plugins (`intrinsic.py`) can largely reuse the Docker versions; the main difference is that resource limits apply to the VM (host cgroup) rather than directly to the container process +- Log collection: containerd tasks expose stdout/stderr via FIFO pipes, similar to Docker's log stream +- `entrypoint.sh` needs a Kata variant that skips the `LD_PRELOAD` setup and `agent.sock` chown. This can be a conditional branch in the existing script (check an env var like `BACKENDAI_CONTAINER_BACKEND=kata`) or a separate entrypoint file +- The `DockerCodeRunner` class can be reused directly for Kata — its only dependency is ZMQ TCP addressing, which works identically with a guest IP instead of localhost +- `resource.txt` is written to `/home/config/` for agent-side recovery consistency, but the kernel runner never reads it. 
The hypervisor enforces resource limits (vCPU, memory) at the VM level; environ.txt communicates resource metadata to the kernel runner's child environment diff --git a/proposals/BEP-1051/migration-compatibility.md b/proposals/BEP-1051/migration-compatibility.md new file mode 100644 index 00000000000..ffde4d1f52b --- /dev/null +++ b/proposals/BEP-1051/migration-compatibility.md @@ -0,0 +1,323 @@ + + +# BEP-1051: Migration and Compatibility + +## Summary + +Introducing KataAgent is an additive change — new enum value, new package, new config section, new DB column with a safe default. Existing Docker and Kubernetes deployments are completely unaffected. This document details the rollout strategy, compatibility guarantees, and rollback plan. + +## Backward Compatibility + +### No Breaking Changes + +| Component | Change | Impact on Existing Deployments | +|-----------|--------|-------------------------------| +| `AgentBackend` enum | Add `KATA = "kata"` | None — existing `DOCKER`/`KUBERNETES` values unchanged | +| `AgentRow` | Add `backend` column | None — `server_default="docker"` for all existing rows | +| Agent heartbeat | Add `backend` field | None — manager defaults to `"docker"` if field missing | +| Accelerator plugins | New `cuda_vfio` entry point | None — only loaded if agent config allows it | +| Scaling groups | No schema change | None — new groups created for Kata, existing groups unchanged | +| Manager RPC | No protocol change | None — `create_kernel` payload is identical | +| User API | No change | None — users select scaling groups, not backends | +| `AgentMeta` / `AgentInfo` | Add `backend` field with default | None — default `"docker"` preserves behavior | + +### Package Independence + +The `ai.backend.agent.kata` package is a new addition: + +``` +src/ai/backend/agent/ +├── docker/ ← unchanged +├── kubernetes/ ← unchanged +├── dummy/ ← unchanged +└── kata/ ← new (only imported when backend="kata") +``` + +The `ai.backend.accelerator.cuda_vfio` 
package is similarly independent: + +``` +src/ai/backend/accelerator/ +├── cuda_open/ ← unchanged (Docker agents continue using this) +└── cuda_vfio/ ← new (only loaded by Kata agents) +``` + +### Config Compatibility + +The `[kata]` config section is only read when `[agent] backend = "kata"`. Docker agents ignore it entirely. Existing `agent.toml` files require no changes. + +## Migration Steps + +### Phase 1: Infrastructure Preparation + +Before deploying any Backend.AI changes: + +1. **Prepare Kata hosts** (dedicated machines or a subset of existing agent nodes): + - Install Kata Containers 3.x (`kata-runtime`, guest kernel, rootfs image) + - Configure containerd with Kata shim (`io.containerd.kata.v2`) + - Enable IOMMU in kernel: `intel_iommu=on iommu=pt` (Intel) or `amd_iommu=on iommu=pt` (AMD) + - Load kernel modules: `vfio`, `vfio_pci`, `vfio_iommu_type1`, `vhost_vsock` + - Bind target GPUs to `vfio-pci` driver + - Verify: `kata-runtime check` passes, `/dev/vfio/` contains group devices + +2. **Verify IOMMU groups** on each host: + ```bash + for d in /sys/kernel/iommu_groups/*/devices/*; do + echo "IOMMU Group $(basename $(dirname $(dirname $d))):" \ + "$(lspci -nns $(basename $d))" + done + ``` + Confirm each GPU is in its own IOMMU group (or shares only with its audio companion). + +### Phase 2: Backend.AI Deployment + +1. **Run Alembic migration**: + - Adds `agents.backend` column with `server_default="docker"` + - Non-destructive; can be run while existing agents are online + - Existing agent rows automatically get `backend="docker"` + +2. **Deploy updated manager**: + - Manager accepts the new `backend` field in agent heartbeats + - Falls back to `"docker"` for agents that don't send it (backward compatible) + +3. **Create Kata scaling group(s)**: + ```sql + INSERT INTO scaling_groups (name, driver, ...) VALUES ('kata-gpu', 'static', ...); + ``` + Or via admin API / WebUI. + +4. 
**Deploy Kata agent on prepared hosts**: + - Config file with `[agent] backend = "kata"` and `[kata]` section + - Agent registers with the Kata scaling group + - Agent heartbeat includes `backend="kata"` + +### Phase 3: Validation + +1. **Agent registration**: Verify Kata agents appear in `agents` table with `backend="kata"`: + ```sql + SELECT id, backend, scaling_group, available_slots FROM agents WHERE backend = 'kata'; + ``` + +2. **Session creation**: Create a test session targeting the Kata scaling group: + - CPU-only session first (Phase 1 validation) + - Single GPU session (Phase 2 validation) + - Multi-GPU session (if IOMMU groups allow) + +3. **GPU verification**: Inside the Kata session: + ```bash + nvidia-smi # Should show passthrough GPU(s) + python -c "import torch; print(torch.cuda.device_count())" + ``` + +4. **Storage verification**: Confirm vfolder mounts are accessible: + ```bash + ls /home/work/ # Should show mounted vfolders + echo "test" > /home/work/test.txt # Write should succeed + ``` + +5. **Existing Docker sessions**: Verify no impact: + - Create sessions in Docker scaling groups — should work identically + - Check Docker agent heartbeats — `backend="docker"` (or missing, defaulted) + - Confirm no scheduler behavior changes for Docker groups + +### Phase 4: Gradual Rollout + +1. **Start small**: One Kata agent, one scaling group, internal testing only +2. **Assign specific projects/users** to the Kata scaling group for controlled testing +3. **Monitor**: + - VM boot times (target: < 500ms) + - Memory overhead (should match `kata.vm-overhead-mb` config) + - GPU compute performance (should be near-native via VFIO) + - I/O performance on vfolder mounts (direct guest-side NFS/Lustre) +4. **Scale out**: Add more Kata agents as confidence grows + +## Rollback Plan + +Rollback is straightforward because all changes are additive: + +1. **Remove Kata agents** from the Kata scaling group (or set `schedulable=false`) +2. 
**Reassign users/projects** back to Docker scaling groups +3. **Optionally remove Kata agents** from the cluster entirely +4. **No schema rollback needed**: The `agents.backend` column stays (all values are `"docker"` after Kata agents are removed); the Alembic migration does not need to be reversed + +The `backend` column, `AgentBackend.KATA` enum value, and `cuda_vfio` plugin are inert when no Kata agents are deployed — they add no runtime overhead or behavioral changes. + +## Future Extensibility + +### CoCo Architecture (Built-In from Phase 1) + +CoCo (Confidential Containers) is enabled by default for all Kata workloads — there is no non-CoCo Kata mode. The following are part of the baseline architecture, not a future phase: + +- `confidential_guest = true` and `guest_attestation = "tdx"` (or `"sev-snp"`) in Kata config — defaults +- Guest-side image pull via `image-rs` — images downloaded and decrypted inside the TEE +- Key Broker Service (KBS) integration for sealed secrets and registry credentials +- Remote attestation endpoint exposed via the manager API +- All executables baked into the attested guest rootfs — no host→guest sharing of binaries + +#### VFolder Storage: Direct Guest-Side NFS/Lustre Mount + +virtio-fs introduces FUSE overhead on the host side and — critically — breaks RDMA data paths for high-performance storage backends (Lustre over InfiniBand, WekaFS RDMA, GPFS). Since Backend.AI's target deployments use RDMA-capable network-attached storage, virtio-fs is unsuitable for vfolder data I/O. + +The correct architecture mirrors Docker: mount the **storage volume** directly inside the guest VM using the guest's own NFS/Lustre/WekaFS kernel client (preserving RDMA, 100% native throughput), then bind-mount the specific vfolder subdirectory into the container. 
+ +```mermaid +flowchart LR + subgraph Docker["Docker (current)"] + SC1["Storage cluster"] -->|"NFS/Lustre (RDMA)"| Host["/mnt/vfstore (host)"] + Host -->|bind-mount| DC["container:/home/work/data"] + end +``` + +```mermaid +flowchart LR + subgraph Kata["Kata (proposed)"] + SC2["Storage cluster"] -->|"NFS/Lustre (RDMA, direct)"| GuestVM["/mnt/vfstore (guest VM)"] + GuestVM -->|bind-mount| KC["container:/home/work/data"] + end +``` + +**How mount specs and storage network config reach the guest:** + +1. The KataAgent's `[kata]` config lists storage volume mount points (`[[kata.storage-mounts]]`) and storage NIC IP (`[kata.storage-nic]`). See [configuration-deployment.md](configuration-deployment.md) for the full schema. +2. Before VM boot, `KataKernelCreationContext.write_config_files()` writes `storage-mounts.json` to `/home/config/` (virtio-fs config channel — same mechanism as environ.txt and intrinsic-ports.json). +3. During `create_sandbox()`, the Kata shim loads kernel modules specified in `kernel-modules` config (e.g., `["nfs", "nfsv4"]`) via `modprobe`, and the `guest_hook_path` is scanned for OCI prestart hooks. +4. During `create_container()`, the OCI prestart hook (`/usr/share/oci/hooks/prestart/setup-storage.sh`, baked into the attested guest rootfs) executes and: + - Reads `storage-mounts.json` from the virtio-fs shared `/home/config/` + - Configures the storage NIC IP via `ip addr add` + - Mounts each storage volume using the guest kernel's native NFS/Lustre client +5. After the hook completes, the OCI container spec's bind mounts (`/mnt/vfstore//subpath` → `/home/work/data`) are applied as guest-internal bind mounts — identical to Docker. 
+ +**`storage-mounts.json` format:** + +```json +{ + "storage_nic": { + "device": "ib0", + "address": "10.0.100.5/24", + "gateway": "10.0.100.1" + }, + "volumes": [ + { + "fs_type": "nfs4", + "source": "storage-server:/exports/vfstore", + "mountpoint": "/mnt/vfstore", + "options": "vers=4.1,rsize=1048576,wsize=1048576" + } + ] +} +``` + +**Guest prestart hook** (`/usr/share/oci/hooks/prestart/setup-storage.sh`, baked into attested rootfs): + +```bash +#!/bin/bash +# Baked into attested guest rootfs — tamper-proof under CoCo. +# Executed by kata-agent as an OCI prestart hook during create_container(). +# Kernel modules (nfs, nfsv4, lustre) are already loaded by create_sandbox(). + +CONFIG="/run/kata-containers/shared/containers/*/rootfs/home/config/storage-mounts.json" +# Resolve the glob — virtio-fs shared directory has one container entry +CONFIG_FILE=$(ls $CONFIG 2>/dev/null | head -1) +[ -z "$CONFIG_FILE" ] && exit 0 + +# Configure storage NIC IP (IPoIB or Ethernet) +NIC_DEV=$(jq -r '.storage_nic.device // empty' "$CONFIG_FILE") +if [ -n "$NIC_DEV" ]; then + NIC_ADDR=$(jq -r '.storage_nic.address' "$CONFIG_FILE") + NIC_GW=$(jq -r '.storage_nic.gateway // empty' "$CONFIG_FILE") + ip addr add "$NIC_ADDR" dev "$NIC_DEV" 2>/dev/null + ip link set "$NIC_DEV" up + [ -n "$NIC_GW" ] && ip route add default via "$NIC_GW" dev "$NIC_DEV" table 100 +fi + +# Mount each storage volume +jq -c '.volumes[]' "$CONFIG_FILE" | while read -r vol; do + FS_TYPE=$(echo "$vol" | jq -r '.fs_type') + SOURCE=$(echo "$vol" | jq -r '.source') + MOUNTPOINT=$(echo "$vol" | jq -r '.mountpoint') + OPTIONS=$(echo "$vol" | jq -r '.options // empty') + mkdir -p "$MOUNTPOINT" + if [ -n "$OPTIONS" ]; then + mount -t "$FS_TYPE" -o "$OPTIONS" "$SOURCE" "$MOUNTPOINT" + else + mount -t "$FS_TYPE" "$SOURCE" "$MOUNTPOINT" + fi +done +``` + +**Storage NIC IP allocation:** The storage NIC (InfiniBand HCA or dedicated Ethernet NIC) is VFIO-passthrough to the VM as a whole device. 
In single-tenancy deployments (one VM per agent node — the target for Phase 1), the VM simply takes the static storage IP that would otherwise belong to the host. The IP, subnet, and gateway are configured in the agent's `[kata]` config and written into `storage-mounts.json`. No dynamic IP allocation is needed — the storage network topology is static infrastructure-level configuration. For future multi-tenancy (multiple VMs per node via SR-IOV VFs), each VF would need its own IP, requiring DHCP or an IP pool manager — this is deferred along with SR-IOV support. + +**Storage topology visibility:** Container users can see mount info in `/proc/mounts` (server addresses, export paths). This is acceptable — the same isolation model as Docker. Backend.AI's security boundary is the vfolder permission model (users can only access vfolders they're authorized for), not mount concealment. Container users cannot mount their own disks (no `CAP_SYS_ADMIN`). + +**What uses virtio-fs (still):** +- Scratch/config directories: `/home/config` (environ.txt, intrinsic-ports.json, storage-mounts.json) +- These are small, non-performance-critical, host-originated configuration data + +**Guest VM base image requirements:** + +The Kata guest rootfs (or initrd) must include filesystem client drivers for every storage backend that Backend.AI claims to support. This is a build-time dependency — the base image is a managed infrastructure artifact, not a user-provided image. 
+ +| Storage Backend | Required Guest Kernel Module / Userspace | Notes | +|-----------------|------------------------------------------|-------| +| NFS (VAST, NetApp, Pure, Dell EMC, Hammerspace) | `nfs`, `nfsv4` kernel modules; `mount.nfs` helper | Most universal; covers majority of supported backends | +| Lustre (DDN EXAScaler) | `lustre` kernel module (out-of-tree, DKMS) | Must match Lustre server version; RDMA (o2ib) requires OFED | +| WekaFS | `wekafs` kernel module + userspace agent | Proprietary; requires WekaFS client package and DPDK for RDMA | +| CephFS | `ceph` kernel module | Native kernel client; alternatively use NFS gateway | +| GPFS (IBM Storage Scale) | `mmfs` kernel module + userspace daemon | Proprietary; requires cluster membership config | + +The base image build pipeline must: +1. Start from a minimal Linux rootfs (same as the standard Kata guest rootfs) +2. Install a storage mount boot script as a systemd service (`Before=local-fs.target`) that reads `/home/config/storage-mounts.json` and mounts volumes +3. Include kernel modules for all supported storage backends (compiled against the guest kernel version) +4. Include any required userspace helpers (`mount.nfs`, `mount.lustre`, WekaFS agent, etc.) +5. Version the image: `kata-rootfs-{krunner_version}-{storage_version}.qcow2` to enable atomic rollover when drivers are updated + +Not all deployments need all drivers. A modular approach (base image + per-backend overlay or module loading) can reduce image size, but the simplest initial approach is a single "fat" image with all supported clients. 
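The versioned naming in step 5 lends itself to a simple atomic-rollover scheme. The sketch below assumes a `current.qcow2` symlink that the Kata hypervisor config would point at — the symlink convention and directory layout are assumptions of this sketch, not part of the proposal:

```python
import os
import tempfile
from pathlib import Path


def publish_guest_image(image_dir: Path, krunner_version: str,
                        storage_version: str) -> Path:
    """Place a newly built rootfs image and atomically repoint 'current' to it."""
    name = f"kata-rootfs-{krunner_version}-{storage_version}.qcow2"
    target = image_dir / name
    target.touch()  # stand-in for copying the built image artifact into place

    # Build the new symlink aside, then rename it over the old one.
    # os.replace() is atomic on POSIX, so a hypervisor config resolving
    # current.qcow2 never observes a half-updated link.
    tmp_link = image_dir / "current.qcow2.tmp"
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    tmp_link.symlink_to(name)
    os.replace(tmp_link, image_dir / "current.qcow2")
    return target


with tempfile.TemporaryDirectory() as d:
    publish_guest_image(Path(d), "25.06", "3")
    publish_guest_image(Path(d), "25.06", "4")  # driver update: atomic rollover
    print(os.readlink(Path(d) / "current.qcow2"))  # kata-rootfs-25.06-4.qcow2
```

Old versioned images stay on disk, so already-running VMs keep their open image file while new sandboxes pick up the updated link.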
+ +**Limitations:** +- Guest kernel must include storage client modules (managed via the base image build pipeline above) +- Only applicable to network-attached storage — host-local storage still requires virtio-fs +- Proprietary storage clients (WekaFS, GPFS) require vendor licensing and may have distribution restrictions for guest images + +### BEP-1016 Alignment + +When Accelerator Interface v2 (BEP-1016) is implemented: + +- Migrate `CUDAVFIOPlugin.generate_docker_args()` → `create_lifecycle_hook()` +- The `WorkloadConfig` struct replaces the `_kata_vfio_devices` convention: + ```python + class WorkloadConfig: + mounts: list[MountInfo] + env_vars: dict[str, str] + resource_data: dict[str, str] + # VFIO devices expressed as mounts or a dedicated field + ``` +- Both `CUDAPlugin` and `CUDAVFIOPlugin` implement the same interface +- No user-facing or scheduler changes needed + +### Mixed Scaling Groups (Future Option) + +If operational needs require mixed Docker/Kata agents in the same scaling group: + +- Add `allowed_backends: list[str]` to `ScalingGroupOpts` +- Add backend filtering to `AgentSelector`: + ```python + candidates = [a for a in agents if a.backend in sgroup.allowed_backends] + ``` +- Add `preferred_backend` to `SessionCreationSpec` (optional, user-specified) +- This is NOT proposed for the initial implementation — homogeneous groups are simpler and sufficient + +## References + +- [Configuration & Deployment](configuration-deployment.md) — host requirements checklist +- [Scheduler Integration](scheduler-integration.md) — `AgentRow.backend` column details +- [Kata Containers Installation Guide](https://github.com/kata-containers/kata-containers/blob/main/docs/install/README.md) diff --git a/proposals/BEP-1051/networking.md b/proposals/BEP-1051/networking.md new file mode 100644 index 00000000000..a20e46e2de3 --- /dev/null +++ b/proposals/BEP-1051/networking.md @@ -0,0 +1,339 @@ + + +# BEP-1051: Multi-Container Networking + +## Summary + +Backend.AI 
supports multi-container sessions (clusters) where containers communicate via a shared bridge network with hostname-based DNS resolution. For KataAgent, each container runs in its own VM, so inter-container networking becomes inter-VM networking. This document analyzes Kata's networking model, specifies Calico CNI as the required networking layer, and details the integration design for both standalone containerd and Kubernetes deployments. + +## Current Design: DockerAgent Networking + +DockerAgent uses Docker bridge networks for container communication. For single-container sessions, the default bridge provides connectivity with service ports (SSH, ttyd, Jupyter) mapped to host ports from a configured range (default: 30000-31000). For multi-container sessions, a dedicated bridge is created per cluster via `create_local_network()` (`src/ai/backend/agent/agent.py:3496`), and containers join with cluster hostnames as DNS aliases (`src/ai/backend/agent/docker/agent.py:680`). Docker's embedded DNS resolves these hostnames for inter-container communication. + +Backend.AI has a network plugin system (`src/ai/backend/agent/plugin/network.py`) with `AbstractNetworkAgentPlugin` supporting `join_network()`, `leave_network()`, and `expose_ports()`. Built-in plugins include `OverlayNetworkPlugin` (Docker overlay for multi-host) and `HostNetworkPlugin` (host network mode). + +## Kata Containers Networking Model + +### Per-VM Network Path + +Kata transparently bridges the host-side network namespace into the guest VM using TC (Traffic Control) filter redirection: + +```mermaid +flowchart TB + subgraph HOST["Host Side"] + CNI["CNI assigns IP"] --> veth["eth0 (veth pair)"] + veth -->|"TC filter: ingress/egress"| tap["tap0_kata (TAP device)"] + end + subgraph VM["Guest VM (KVM boundary)"] + geth["eth0 (virtio-net, same IP as host veth)"] + end + tap -->|virtio-net| geth +``` + +This TC filter mechanism is Kata's default and requires no additional configuration. 
The guest VM's `eth0` gets the same IP address that the CNI assigned to the host-side veth, making the VM transparent to the network infrastructure. + +### Sandbox Model: Multiple Containers Per VM + +A Kata **sandbox** is a single VM instance. The kata-agent inside the guest can manage multiple containers within the same sandbox — each container gets its own PID namespace (via libcontainer) but **shares the VM's network namespace**. All containers in the same sandbox communicate via `localhost`, consistent with Kubernetes pod semantics. + +The flow for multi-container sandboxes: +1. CRI creates a sandbox → Kata boots the VM, kata-agent starts +2. CRI calls `CreateContainer` → kata-agent creates a container inside the existing VM (image rootfs shared via virtio-fs or hot-plugged as a block device) +3. Additional `CreateContainer` calls → more containers in the same VM, sharing network + +For Backend.AI's **multi-container sessions** (cluster mode), the mapping depends on deployment model: +- **Single-host cluster**: All containers in the session can share one Kata sandbox (one VM) using the containerd Sandbox API (`sandbox.v1`, containerd v2+). The agent calls `create_sandbox()` first, then adds containers into it via `create_container(sandbox_id=...)`. Containers share `localhost` and the VM's network namespace. Note: the standard `containers.v1` / `tasks.v1` APIs create one VM per container — the Sandbox API is required for sharing. +- **Multi-host cluster**: Each host runs a separate Kata sandbox (separate VM). Containers across VMs communicate via Calico CNI — each VM gets its own IP, and inter-VM traffic flows through BGP/VXLAN routing. This is the expected model for distributed training workloads. 
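The single-host/multi-host mapping above reduces to one grouping rule: one sandbox per (session, host) pair. A minimal sketch, with kernel placement represented as a plain dict and the sandbox-ID scheme purely illustrative (the real scheduler data structures differ):

```python
from collections import defaultdict


def plan_sandboxes(session_id: str,
                   kernel_hosts: dict[str, str]) -> dict[str, list[str]]:
    """kernel_hosts maps kernel_id -> agent host; returns sandbox_id -> kernels."""
    by_host: dict[str, list[str]] = defaultdict(list)
    for kernel_id, host in kernel_hosts.items():
        by_host[host].append(kernel_id)
    # One sandbox (VM) per (session, host) pair: containers inside it share
    # the VM network namespace; traffic between sandboxes rides Calico.
    return {f"{session_id}-{host}": kernels for host, kernels in by_host.items()}


plan = plan_sandboxes("sess42", {"main1": "nodeA", "sub1": "nodeA", "sub2": "nodeB"})
print(plan)  # {'sess42-nodeA': ['main1', 'sub1'], 'sess42-nodeB': ['sub2']}
```

Kernels grouped into the same sandbox reach each other on `localhost`; cross-sandbox peers must use the Calico-assigned VM IPs.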
+ +### What Kata Provides Built-In + +| Feature | Support | Notes | +|---------|---------|-------| +| TC filter veth↔tap redirection | Built-in (default) | Transparent to CNI plugins | +| MACVTAP mode | Built-in (legacy) | Alternative to TC filter, similar performance | +| virtio-net device | Built-in | Para-virtualized NIC inside guest | +| CNI plugin compatibility | Full | Works with Calico, Cilium, Flannel, bridge | +| Multi-container sandbox | Supported | Containers share VM network namespace via kata-agent | +| `--net=host` | **Not supported** | VM isolation prevents host network access | +| `--net=container:` | **Not supported** | Cannot share network namespace across VMs | +| Docker Compose custom networks | **Limited** | Docker DNS service not fully compatible | + +### What Kata Does NOT Provide + +Kata has **no built-in inter-VM networking solution**. Each VM is an isolated network endpoint. Inter-VM communication depends entirely on the underlying CNI plugin — with Kubernetes, the cluster's CNI handles pod-to-pod networking transparently; without Kubernetes, a CNI plugin must be configured explicitly on each host. 
+ +### CNI Compatibility Assessment + +| CNI Plugin | Kata Compatibility | Multi-Host | Network Policy | Assessment | +|------------|-------------------|------------|---------------|------------| +| **Calico** | Works with TC filter | BGP / VXLAN / IPIP | Full (L3/L4) | **Selected** — multi-host + policy + standalone support | +| Cilium | Known [MTU mismatch](https://docs.cilium.io/en/stable/network/kubernetes/kata/) | VXLAN / Geneve | Full (L3/L4/L7) | MTU issues with Kata; eBPF requires Kubernetes | +| Flannel | Works with TC filter | VXLAN | None | No network policy; insufficient for multi-tenant isolation | +| CNI bridge | Works with TC filter | **Single-host only** | None | Development/testing only | + +## Proposed Design: Calico CNI Integration + +### Why Calico + +KataAgent requires Calico as the CNI networking layer for the following reasons: + +1. **Multi-host networking**: Production deployments span multiple agent hosts. Calico provides cross-host routing via BGP (direct peering or route reflectors) or VXLAN overlay — containers on different hosts can communicate transparently. +2. **Network policy enforcement**: In a multi-tenant GPU environment, sessions from different users must be isolated at the network level. Calico's Felix agent programs iptables/eBPF rules to enforce `NetworkPolicy`-style rules between Kata VMs, preventing unauthorized inter-session traffic. +3. **Kata TC filter transparency**: Calico creates veth pairs in the container network namespace. Kata's TC filter redirects traffic from the veth to a tap device and into the VM. From Calico's perspective, the Kata VM is indistinguishable from a regular container — no special integration needed. +4. **Standalone containerd support**: Calico supports non-Kubernetes deployments using etcd as its datastore. The `calico-node` agent (Felix + BIRD) runs on each host, and the Calico CNI plugin is invoked by containerd. This matches KataAgent's containerd-based architecture. +5. 
**Kubernetes compatibility**: When KataAgent is deployed on Kubernetes, the cluster's existing Calico installation handles networking — no additional configuration needed. This provides a consistent networking model across deployment modes. + +### Calico Architecture with Kata + +```mermaid +flowchart TB + subgraph HostA["Agent Host A"] + VM1["Kata VM (sess-1)"] -->|virtio-net| tap1[tap0] -->|TC filter| veth1[veth] + VM2["Kata VM (sess-2)"] -->|virtio-net| tap2[tap0] -->|TC filter| veth2[veth] + veth1 & veth2 --> FelixA["Felix (policy) + BIRD (BGP)"] + end + subgraph HostB["Agent Host B"] + VM3["Kata VM (sess-3)"] -->|virtio-net| tap3[tap0] -->|TC filter| veth3[veth] + VM4["Kata VM (sess-4)"] -->|virtio-net| tap4[tap0] -->|TC filter| veth4[veth] + veth3 & veth4 --> FelixB["Felix (policy) + BIRD (BGP)"] + end + FelixA <-->|"BGP peering / VXLAN"| FelixB +``` + +**Felix** programs iptables rules on each host to enforce network policies between Kata VMs. **BIRD** exchanges routing information so that VMs on Host A can reach VMs on Host B via BGP or VXLAN encapsulation. 
+ +### Deployment Modes + +#### Standalone containerd (non-Kubernetes) + +Calico runs in standalone mode with etcd as the datastore: + +**Components per host:** +- `calico-node` service (Felix + BIRD daemons) +- Calico CNI plugin binary (`/opt/cni/bin/calico`) +- Calico IPAM plugin binary (`/opt/cni/bin/calico-ipam`) + +**Infrastructure:** +- etcd cluster (can share Backend.AI's existing etcd, or separate) +- `calicoctl` CLI for IP pool and network policy management + +**CNI configuration** (`/etc/cni/net.d/10-calico.conflist`): + +```json +{ + "name": "kata-calico", + "cniVersion": "1.0.0", + "plugins": [ + { + "type": "calico", + "datastore_type": "etcdv3", + "etcd_endpoints": "http://127.0.0.1:2379", + "ipam": { + "type": "calico-ipam", + "assign_ipv4": "true" + }, + "policy_type": "calico", + "log_level": "info" + }, + { + "type": "portmap", + "capabilities": {"portMappings": true}, + "snat": true + } + ] +} +``` + +**IP pool configuration** (via `calicoctl`): + +```yaml +apiVersion: projectcalico.org/v3 +kind: IPPool +metadata: + name: kata-sessions +spec: + cidr: 10.244.0.0/16 + ipipMode: Never + vxlanMode: CrossSubnet # VXLAN for cross-subnet, direct for same-subnet + natOutgoing: true + nodeSelector: all() +``` + +#### Kubernetes deployment + +When KataAgent is deployed on a Kubernetes cluster with Calico already installed, no additional CNI configuration is needed. The existing Calico installation handles: + +- IP allocation via Calico IPAM +- Cross-node routing via BGP or VXLAN +- Network policy enforcement via Felix + +`create_local_network()` is a no-op (same as KubernetesAgent) — Kubernetes Services and DNS handle inter-pod discovery. + +### Network Policy for Session Isolation + +Calico network policies prevent unauthorized traffic between sessions. 
The agent labels each Kata VM's network endpoint with session metadata, and a default-deny policy isolates sessions from each other: + +```yaml +# Default deny: sessions cannot reach other sessions +apiVersion: projectcalico.org/v3 +kind: GlobalNetworkPolicy +metadata: + name: default-deny-inter-session +spec: + selector: has(ai.backend.session-id) + types: + - Ingress + - Egress + ingress: + - action: Deny + source: + notSelector: ai.backend.session-id == "{{session_id}}" + egress: + - action: Allow # Allow outbound (storage, internet) +``` + +```yaml +# Allow intra-cluster: containers in the same session can communicate +apiVersion: projectcalico.org/v3 +kind: GlobalNetworkPolicy +metadata: + name: allow-intra-session +spec: + selector: has(ai.backend.session-id) + order: 100 + types: + - Ingress + ingress: + - action: Allow + source: + selector: ai.backend.session-id == "{{session_id}}" +``` + +Note: The `{{session_id}}` placeholders above are **templates** — Calico does not support variable substitution in policy YAML. The agent generates per-session policies with actual session IDs substituted programmatically before applying via `calicoctl apply`. + +KataAgent applies these labels when creating containers via containerd, and Calico's Felix agent enforces the rules in real-time on the host's iptables. + +### Single-Container Sessions + +For single-container sessions, Calico provides connectivity: + +1. Containerd invokes the Calico CNI plugin when creating the container's network namespace +2. Calico assigns an IP from the configured IP pool and programs routes on the host +3. Kata's TC filter redirects traffic from the Calico-created veth to a tap device +4. The guest VM gets the Calico-assigned IP, reachable from other hosts via BGP/VXLAN +5. 
Service ports are exposed via the portmap CNI plugin + +### Multi-Container Sessions (Clusters) + +For multi-container sessions, all containers in the cluster receive IPs from the same Calico IP pool and are labeled with the same `session-id`. The network policy allows intra-session traffic while blocking inter-session traffic. + +`create_local_network()` for Calico standalone mode ensures the session's network policy rules are applied: + +```python +async def create_local_network(self, network_name: str) -> None: + """Apply Calico network policy for inter-VM cluster communication.""" + session_id = network_name # network_name is derived from session_id + # Apply session-scoped network policy via calicoctl + await self._apply_calico_policy(session_id) + +async def destroy_local_network(self, network_name: str) -> None: + """Remove session-scoped Calico network policy.""" + session_id = network_name + await self._remove_calico_policy(session_id) +``` + +Containers within the same cluster can reach each other by IP. Hostname resolution is handled by `/etc/hosts` injection (see below). + +### DNS / Hostname Resolution + +Docker provides embedded DNS for container-to-container name resolution. Kata with standalone containerd does **not** have this. KataAgent must handle hostname resolution explicitly. 
+ +**Approach: `/etc/hosts` injection** + +After all containers in a cluster are created and have IPs assigned, KataAgent writes `/etc/hosts` entries into each container's scratch directory (shared via virtio-fs): + +```python +async def _inject_cluster_hosts( + self, + cluster_info: ClusterInfo, + containers: dict[str, ContainerInfo], # hostname → container info +) -> None: + """Write /etc/hosts with cluster hostname mappings into each container.""" + hosts_entries = [] + for hostname, info in containers.items(): + hosts_entries.append(f"{info.ip_address}\t{hostname}") + hosts_content = "\n".join([ + "127.0.0.1\tlocalhost", + "::1\tlocalhost ip6-localhost ip6-loopback", + *hosts_entries, + ]) + for hostname, info in containers.items(): + hosts_path = info.scratch_dir / "config" / "hosts" + hosts_path.write_text(hosts_content) +``` + +The `/home/config/hosts` file is bind-mounted (via virtio-fs) as `/etc/hosts` inside the guest. This provides immediate hostname resolution without requiring a DNS server. + +**Alternative considered: CoreDNS sidecar** — too heavyweight for per-cluster DNS; `/etc/hosts` is simpler and sufficient for the small number of containers in a Backend.AI cluster (typically 2-8). + +### Service Port Exposure + +Service ports (SSH, ttyd, Jupyter, etc.) are mapped from container ports to host ports via the CNI portmap plugin (included in the Calico conflist above). The portmap plugin programs DNAT rules to forward host ports to the VM's Calico-assigned IP — the standard CNI approach, transparent to Kata's TC filter. 
+ +### Network Plugin Compatibility + +| Plugin | Docker | Kata | Notes | +|--------|--------|------|-------| +| Default | Docker bridge API | Calico CNI | Multi-host routing + network policy | +| Overlay | Docker overlay driver | Calico VXLAN mode | Cross-subnet encapsulation | +| Host | `--net=host` | **Not supported** | VM isolation prevents this | +| RDMA | Device mount | SR-IOV VF passthrough (future) | Out of scope for Phase 1 | + +## Interface / API + +| Abstraction | Location | Change for Kata | +|-------------|----------|-----------------| +| `create_local_network()` | `AbstractAgent` method | Apply Calico session-scoped network policy | +| `destroy_local_network()` | `AbstractAgent` method | Remove Calico session-scoped network policy | +| `apply_network()` | `KernelCreationContext` method | Assign Calico endpoint labels, configure CNI network | +| `AbstractNetworkAgentPlugin` | `src/ai/backend/agent/plugin/network.py` | Kata-specific Calico plugin implementation | +| Service port mapping | Agent port pool | Use CNI portmap plugin | +| Calico endpoint labeling | New in KataAgent | Label VMs with session-id for policy matching | + +## Implementation Notes + +- **Calico deployment**: Standalone mode requires `calico-node` (Felix + BIRD) on each agent host and an etcd cluster. The etcd instance can be shared with Backend.AI's existing etcd if present, or deployed separately. For Kubernetes deployments, Calico is assumed to be pre-installed as the cluster CNI. +- **Calico CNI binaries**: `calico` and `calico-ipam` plugins must be installed in `/opt/cni/bin/`. The portmap plugin (from standard CNI plugins) is also required for service port exposure. +- **TC filter**: Kata's default networking mode since v1.6. No configuration needed — the Kata shim sets up TC rules automatically when it detects a veth interface in the container namespace. This is fully transparent to Calico. +- **IP pool sizing**: The default `10.244.0.0/16` pool provides ~65k IPs. 
For large deployments with many concurrent sessions, consider multiple IP pools or a larger CIDR. +- **Network policy lifecycle**: Session-scoped Calico network policies are created in `create_local_network()` and removed in `destroy_local_network()`. The agent interacts with the Calico datastore (etcd or Kubernetes API) via `calicoctl` or the Calico Python client library. +- **Endpoint labeling**: Each Kata VM's network endpoint must be labeled with `ai.backend.session-id`, `ai.backend.scaling-group`, and `ai.backend.cluster-role` for policy matching. Labels are applied via Calico's workload endpoint API. +- For Kubernetes deployments, `create_local_network()` is a no-op (same as KubernetesAgent) — the cluster CNI and Kubernetes NetworkPolicy handle pod networking. +- The `gwbridge_subnet` detection and `OMPI_MCA_btl_tcp_if_exclude` logic from DockerAgent should be adapted for Calico's host-local interface to support MPI workloads. +- VSOCK could theoretically be used for inter-VM communication (each VM has a unique CID), but this requires custom socket code in user workloads and is not practical for transparent hostname-based networking. 
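The policy lifecycle described in the notes above (generate a per-session policy, apply it via `calicoctl`) can be sketched as follows. The template mirrors the `allow-intra-session` policy example earlier in this document; the function names and the use of `calicoctl apply -f -` reading from stdin are assumptions, not confirmed agent code:

```python
import subprocess

# Per-session policy template; {session_id} is substituted programmatically,
# since Calico itself performs no variable substitution in policy YAML.
ALLOW_INTRA_TEMPLATE = """\
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: allow-intra-session-{session_id}
spec:
  selector: ai.backend.session-id == "{session_id}"
  order: 100
  types:
  - Ingress
  ingress:
  - action: Allow
    source:
      selector: ai.backend.session-id == "{session_id}"
"""


def render_session_policy(session_id: str) -> str:
    """Substitute the concrete session ID into the policy template."""
    return ALLOW_INTRA_TEMPLATE.format(session_id=session_id)


def apply_session_policy(session_id: str) -> None:
    """Apply the rendered policy (requires calicoctl on PATH and a reachable
    Calico datastore)."""
    subprocess.run(
        ["calicoctl", "apply", "-f", "-"],
        input=render_session_policy(session_id).encode(),
        check=True,
    )
```

`destroy_local_network()` would perform the inverse with `calicoctl delete`, keyed on the same per-session policy name.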
+ +## Host Requirements (Calico) + +In addition to the Kata host requirements in [configuration-deployment.md](configuration-deployment.md), Calico requires: + +| Requirement | Details | +|-------------|---------| +| `calico-node` service | Felix (policy agent) + BIRD (BGP daemon), runs as systemd unit or container | +| CNI plugins | `/opt/cni/bin/calico`, `/opt/cni/bin/calico-ipam`, `/opt/cni/bin/portmap` | +| CNI config | `/etc/cni/net.d/10-calico.conflist` | +| etcd (standalone mode) | etcd v3 cluster accessible from all agent hosts | +| `calicoctl` | CLI for IP pool and network policy management | +| Network connectivity | BGP peering (port 179) or VXLAN (UDP 4789) between agent hosts | +| IP forwarding | `net.ipv4.ip_forward = 1` (usually already enabled for Kata) | diff --git a/proposals/BEP-1051/scheduler-integration.md b/proposals/BEP-1051/scheduler-integration.md new file mode 100644 index 00000000000..631fb2dd7ad --- /dev/null +++ b/proposals/BEP-1051/scheduler-integration.md @@ -0,0 +1,249 @@ + + +# BEP-1051: Scheduler Integration + +## Summary + +Manager and scheduler changes to track agent backend types, route sessions to Kata agents via scaling group assignment, and account for per-VM memory overhead in resource scheduling. All changes are additive — existing Docker scheduling behavior is unchanged. + +## Current Design + +### Agent Registration + +Agents register via heartbeat events. 
The manager creates/updates `AgentRow` (`src/ai/backend/manager/models/agent/row.py:62`): + +```python +class AgentRow(Base): + id: Mapped[str] # Agent ID + status: Mapped[AgentStatus] + region: Mapped[str] + scaling_group: Mapped[str] # FK to scaling_groups.name + available_slots: Mapped[ResourceSlot] # Total capacity + occupied_slots: Mapped[ResourceSlot] # Currently used + architecture: Mapped[str] # e.g., "x86_64" + compute_plugins: Mapped[dict] # Loaded plugin metadata + # No backend type field exists +``` + +### Agent Selection + +The scheduler selects agents via `AgentSelector` implementations. `AgentInfo` (`src/ai/backend/manager/sokovan/scheduler/provisioner/selectors/selector.py`) contains: + +```python +@dataclass +class AgentInfo: + id: AgentId + available_slots: ResourceSlot + occupied_slots: ResourceSlot + architecture: str + # No backend type field +``` + +The selector filters by `available_slots >= requested_slots` within the target scaling group. + +### Scaling Groups + +`ScalingGroupOpts` (`src/ai/backend/manager/models/scaling_group/row.py`) configures: + +```python +class ScalingGroupOpts: + allowed_session_types: list[SessionTypes] + agent_selection_strategy: AgentSelectionStrategy # DISPERSED, CONCENTRATED, etc. + # No backend preference +``` + +Sessions are assigned to a scaling group; the scheduler then picks an agent within that group. + +## Proposed Design + +### Agent Backend Tracking + +Add a `backend` column to `AgentRow`: + +```python +class AgentRow(Base): + # ... existing columns ... + backend: Mapped[str] = mapped_column( + "backend", + sa.String(length=16), + nullable=False, + server_default="docker", + default="docker", + ) +``` + +**Alembic migration:** + +```python +def upgrade(): + op.add_column( + "agents", + sa.Column( + "backend", + sa.String(length=16), + nullable=False, + server_default="docker", + ), + ) +``` + +This is non-destructive: all existing agents get `backend="docker"` via the server default. 
+ +### Agent Heartbeat Extension + +The agent heartbeat payload already includes metadata (architecture, version, compute_plugins). Add `backend`: + +```python +# In agent heartbeat handler (manager side) +async def handle_agent_heartbeat(self, agent_id, heartbeat_data): + backend = heartbeat_data.get("backend", "docker") # backward-compatible default + await self._update_agent_row(agent_id, backend=backend, ...) +``` + +On the agent side, the heartbeat includes `AgentBackend.value`: + +```python +# In AbstractAgent heartbeat sender +heartbeat_data = { + "backend": self.local_config.agent_common.backend.value, + "architecture": platform.machine(), + # ... existing fields ... +} +``` + +### Homogeneous Scaling Groups + +The recommended approach: each scaling group contains agents of a single backend type. No new columns on `ScalingGroupRow`. + +**Why homogeneous:** + +1. **Simplicity**: The existing scheduling flow is unchanged — sessions target a scaling group, the scheduler picks an agent within it. Backend type is implicit from the group. +2. **Operational clarity**: Operators know exactly which groups run Kata and which run Docker. +3. **No agent filtering changes**: The `AgentSelector` does not need backend-aware filtering. +4. **Resource semantics match**: `cuda.device` means the same thing in both backends (1 whole GPU), but the attachment mechanism differs. Keeping backends in separate groups prevents accidental mixing. + +**Deployment example:** + +``` +Scaling Groups: +├── default → DockerAgent instances (general compute) +├── gpu-docker → DockerAgent instances with GPUs (NVIDIA runtime) +└── gpu-kata → KataAgent instances with GPUs (VFIO + CoCo, always confidential) +``` + +Users/projects are assigned to scaling groups as before; selecting `gpu-kata` routes the session to a Kata agent. + +### VM Memory Overhead Accounting + +Each Kata VM consumes host memory for the hypervisor process, guest kernel, and kata-agent. 
This overhead must be reflected in schedulable resources. + +**Approach: Agent-side deduction** + +The Kata agent deducts VM overhead from `available_slots.mem` at registration: + +```python +# In KataAgent.scan_available_resources() +async def scan_available_resources(self, compute_device_types): + base_slots = await super().scan_available_resources(compute_device_types) + + kata_config = self.local_config.kata + # Reserve memory for max concurrent VMs + # Each session = 1 VM, so max_sessions * overhead + max_concurrent = self._estimate_max_concurrent_sessions(base_slots) + overhead_bytes = max_concurrent * kata_config.vm_overhead_mb * 1024 * 1024 + + adjusted_mem = base_slots.get(SlotName("mem"), Decimal(0)) - Decimal(overhead_bytes) + if adjusted_mem < 0: + adjusted_mem = Decimal(0) + + result = dict(base_slots) + result[SlotName("mem")] = adjusted_mem + return result +``` + +**Why agent-side deduction:** + +1. The scheduler sees accurate `available_slots` — no special-case logic needed +2. VM overhead is a property of the agent's configuration, not the session +3. The overhead amount is configurable via `kata.vm-overhead-mb` +4. No changes to `ResourceSlot` semantics visible to users or the API + +### Resource Slot Compatibility + +| Slot Name | Docker (CUDAPlugin) | Kata (CUDAVFIOPlugin) | Scheduler View | +|-----------|---------------------|------------------------|----------------| +| `cpu` | cgroup CPU shares | VM vCPU allocation | Identical | +| `mem` | cgroup memory limit | VM memory (minus overhead) | Identical | +| `cuda.device` | NVIDIA Container Toolkit | VFIO passthrough | Identical | +| `cuda.shares` | Fractional (hook libraries) | **Not available** | N/A for Kata agents | + +The scheduler treats `cuda.device` identically regardless of backend. The CUDAVFIOPlugin does not define `cuda.shares`, so it simply does not appear in Kata agents' `available_slots`. 
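The snippet above calls `self._estimate_max_concurrent_sessions()` without defining it. One plausible heuristic — an assumption for illustration, not the actual implementation — bounds concurrency by passthrough GPU count on GPU agents and by a minimum per-session memory floor otherwise:

```python
from decimal import Decimal


def estimate_max_concurrent_sessions(
    total_mem_bytes: Decimal,
    gpu_count: int,
    min_session_mem_bytes: int = 1 * 1024**3,  # assumed 1 GiB floor per session
) -> int:
    """Heuristic upper bound on concurrent Kata VMs for overhead reservation.

    GPU agents are bounded by device count (whole-GPU VFIO passthrough means
    at most one session per GPU); CPU-only agents are bounded by an assumed
    minimum memory allocation per session.
    """
    mem_bound = int(total_mem_bytes // min_session_mem_bytes)
    if gpu_count > 0:
        return max(1, min(gpu_count, mem_bound))
    return max(1, mem_bound)
```

For example, with 8 passthrough GPUs and an assumed `vm_overhead_mb = 160`, the agent would reserve 8 × 160 MB = 1280 MB of host memory from `available_slots.mem`.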
+ +### AgentMeta / AgentInfo Updates + +Add `backend` to data transfer objects used by the scheduler: + +```python +# src/ai/backend/manager/repositories/scheduler/types/agent.py +@dataclass +class AgentMeta: + id: AgentId + addr: str + architecture: str + available_slots: ResourceSlot + scaling_group: str + backend: str = "docker" # NEW, backward-compatible default + +# src/ai/backend/manager/sokovan/scheduler/provisioner/selectors/selector.py +@dataclass +class AgentInfo: + id: AgentId + available_slots: ResourceSlot + occupied_slots: ResourceSlot + architecture: str + backend: str = "docker" # NEW +``` + +These fields are informational in the homogeneous scaling group model — the selector does not need to filter by backend. They are useful for: +- Admin visibility (GQL queries: "show me all Kata agents") +- Future mixed-group support (if ever needed) +- Monitoring and dashboards + +### GraphQL Exposure + +Add `backend` to the `Agent` GQL type: + +```python +@strawberry.type +class AgentNode: + # ... existing fields ... + backend: str # "docker" | "kubernetes" | "kata" +``` + +This lets the admin UI display which agents use which backend. 
+ +## Interface / API + +| Change | Location | Type | +|--------|----------|------| +| `AgentRow.backend` column | `src/ai/backend/manager/models/agent/row.py` | DB schema (Alembic) | +| `AgentMeta.backend` field | `src/ai/backend/manager/repositories/scheduler/types/agent.py` | Dataclass | +| `AgentInfo.backend` field | `src/ai/backend/manager/sokovan/scheduler/provisioner/selectors/selector.py` | Dataclass | +| Heartbeat `backend` field | Agent heartbeat sender + Manager heartbeat handler | Protocol | +| `AgentNode.backend` | GQL Agent type | API | + +## Implementation Notes + +- The Alembic migration is non-destructive: adds a column with `server_default="docker"` +- Agent heartbeat backward compatibility: if `backend` is missing from heartbeat, default to `"docker"` +- No changes to the RPC protocol between Manager and Agent for `create_kernel` / `destroy_kernel` — the payload is the same; only the agent-side handling differs +- The `compute_plugins` JSONB column on `AgentRow` already captures which plugins are loaded (e.g., `cuda` vs `cuda_vfio`), providing implicit backend information. The explicit `backend` column adds clarity. diff --git a/proposals/BEP-1051/storage-compatibility.md b/proposals/BEP-1051/storage-compatibility.md new file mode 100644 index 00000000000..2d9affa351e --- /dev/null +++ b/proposals/BEP-1051/storage-compatibility.md @@ -0,0 +1,466 @@ + + +# BEP-1051: Storage and Volume Mount Compatibility + +## Summary + +Backend.AI's existing `Mount` abstraction (bind mount with source path, target path, and permission) works unchanged for KataAgent at the container level. Storage mounts are split into two paths: (1) VFolder data is mounted directly inside the guest VM using the guest's own NFS/Lustre/WekaFS kernel client (preserving RDMA, 100% native throughput), and containers bind-mount vfolder subdirectories from this guest-side mount — identical to Docker's model. 
(2) Scratch/config directories (`/home/config`) use virtio-fs for host→guest sharing of non-performance-critical, host-originated configuration data. No new storage management interface is required. This document details the mount compatibility layer, identifies mounts that require Kata-specific handling, and analyzes the I/O performance implications. + +## Current Design: Docker Bind Mount Path + +When `DockerAgent` creates a container, all storage mounts follow a single path: + +``` +Host filesystem ──(Linux bind mount)──→ Container mount namespace +``` + +This is zero-overhead because Docker containers share the host kernel's VFS. The mount specification flows through: + +1. Manager resolves `VFolderMount.host_path` via Storage Proxy (`get_mount_path()` RPC) +2. Agent constructs `Mount(MountTypes.BIND, host_path, kernel_path, permission)` +3. `process_mounts()` converts to Docker API format: `{"Type": "bind", "Source": "...", "Target": "..."}` +4. Docker creates the bind mount in the container's mount namespace + +Key files: +- `VFolderMount` type: `src/ai/backend/common/types.py:1299` +- `mount_vfolders()`: `src/ai/backend/agent/agent.py:557` (inherited from `AbstractKernelCreationContext`) +- `get_intrinsic_mounts()`: `src/ai/backend/agent/docker/agent.py:512` +- `process_mounts()`: `src/ai/backend/agent/docker/agent.py:764` + +### Mount Inventory (DockerAgent) + +| Category | Mount Point(s) | Source | Type | Perm | +|----------|---------------|--------|------|------| +| Scratch workspace | `/home/config` | `scratch_dir/config` | BIND | RO | +| | `/home/work` | `scratch_dir/work` | BIND | RW | +| | `/tmp` | `scratch_dir_tmp` (tmpfs) | BIND | RW | +| Timezone | `/etc/localtime`, `/etc/timezone` | Host files | BIND | RO | +| lxcfs | `/proc/{cpuinfo,meminfo,stat,...}` | `/var/lib/lxcfs/proc/*` | BIND | RW | +| | `/sys/devices/system/cpu{,/online}` | `/var/lib/lxcfs/sys/...` | BIND | RW | +| Agent IPC | `/opt/kernel/agent.sock` | Agent Unix socket | BIND | RW | +| 
Domain socket proxy | `{host_sock_path}` | `ipc_base_path/proxy/*.sock` | BIND | RW | +| Core dump | `debug.coredump.core_path` | `debug.coredump.path` | BIND | RW | +| Deep learning samples | `/home/work/samples` | Docker volume `deeplearning-samples` | VOLUME | RO | +| krunner volume | `/opt/backend.ai` | Docker volume per distro/arch | VOLUME | RO | +| krunner binaries | `/opt/kernel/{su-exec,entrypoint.sh,...}` | Agent package resources (15+ files) | BIND | RO | +| LD_PRELOAD hooks | `/opt/kernel/libbaihook.so` | Agent package resources | BIND | RO | +| Jail sandbox | `/opt/kernel/jail` | Agent package resources | BIND | RO | +| Python libraries | `/opt/backend.ai/.../ai/backend/{kernel,helpers}` | Agent package resources | BIND | RO | +| Accelerator hooks | `/opt/kernel/{hook}.so` | Compute plugin paths | BIND | RO | +| VFolders | `/home/work/{vfolder}` | NFS/CephFS/local path | BIND | RO/RW | + +Total: **20-30+ mounts** per container (varies by accelerator and vfolder count). + +## Proposed Design: virtio-fs Compatibility Layer + +### How Kata Handles Bind Mounts + +When a container is created via containerd with the Kata runtime class, the Kata shim **automatically** translates bind mount specifications into virtio-fs shares: + +```mermaid +flowchart TB + subgraph HOST["Host Side"] + hostfs["Host filesystem path (/mnt/vfolders/project)"] + virtiofsd["virtiofsd daemon (FUSE server, 1 per sandbox)"] + hostfs --> virtiofsd + end + subgraph VM["Guest VM (KVM boundary)"] + vfsmount["virtio-fs mount (guest kernel FUSE client)"] + container["Container mount namespace (/home/work/project)"] + vfsmount --> container + end + virtiofsd -->|"virtio-fs device"| vfsmount +``` + +The Kata shim handles this translation internally: +1. Reads bind mount specs from the OCI runtime config +2. Configures virtiofsd to serve the host directories +3. Inside the guest, kata-agent mounts the virtio-fs share +4. 
kata-agent creates the container's mount namespace with the target paths

**No changes to the `Mount` abstraction are needed.** The `KataKernelCreationContext.process_mounts()` method passes the same `Mount(BIND, source, target, permission)` objects to containerd, and the Kata shim does the rest.

### KataAgent process_mounts() Implementation

```python
async def process_mounts(self, mounts: Sequence[Mount]) -> None:
    """Convert Backend.AI mounts to containerd mount specs.

    For Kata containers, bind mounts are automatically translated
    to virtio-fs shares by the Kata shim — we just pass them through
    as standard OCI bind mounts.
    """
    oci_mounts = []
    for mount in mounts:
        if mount.type == MountTypes.BIND:
            oci_mounts.append({
                "destination": str(mount.target),
                "source": str(mount.source),
                "type": "bind",
                "options": self._build_mount_options(mount),
            })
        elif mount.type == MountTypes.TMPFS:
            oci_mounts.append({
                "destination": str(mount.target),
                "type": "tmpfs",
                "options": ["nosuid", "nodev", "size=65536k"],
            })
    self._container_mounts.extend(oci_mounts)
```

### Intrinsic Mount Evaluation for Kata

Each mount is evaluated against the fundamental difference: Docker containers share the host kernel, while Kata VMs run a separate guest kernel inside KVM. This changes which host-side resources are visible, which kernel interfaces are shared, and which IPC mechanisms work across the boundary.

#### Mounts That Work Transparently (KEEP)

**Scratch config directory** (`/home/config` RO): The agent creates this on the host and the Kata shim shares it via virtio-fs. `/home/config` contains per-session configuration files written by the agent before container start:
- `environ.txt` — read by `BaseRunner.__init__()` to populate the child process environment (`child_env` and `os.environ`)
- `intrinsic-ports.json` — read by `BaseRunner._init()` for ZMQ socket binding and intrinsic service port assignment
- `resource.txt` — read only by the **agent** (host-side) for recovery/resource tracking; NOT consumed by the kernel runner
- `ssh/` — cluster SSH keys and port mappings, read by the kernel runner's `init_sshd_service()`

**Work directory** (`/home/work` RW): The user's persistent workspace and vfolder mount point. The directory itself lives on the guest VM's own disk (part of the guest rootfs); it is NOT a virtio-fs mount. In the Kata design, the underlying storage system (NFS/Lustre/WekaFS) is mounted directly inside the VM by the guest kernel's own filesystem client (preserving RDMA), and containers bind-mount vfolder subdirectories into `/home/work/{vfolder}` from that guest-side mount — identical to Docker's host-level mount model.

**Timezone files** (`/etc/localtime`, `/etc/timezone`): Baked into the attested guest rootfs with the host's timezone. The guest rootfs is a managed infrastructure image, so timezone configuration is set at build time.

**VFolder mounts** (`/home/work/{vfolder}`): User storage is **not** exposed via virtio-fs, because that breaks RDMA and adds FUSE overhead on high-performance storage backends. Instead, the storage volume is mounted natively inside the guest using the NFS/Lustre/WekaFS kernel client. Containers bind-mount vfolder subdirectories from this guest-side path. The same `Mount(BIND, host_path, target, permission)` spec is used at the container level, but `host_path` refers to the guest-visible path (e.g., `/mnt/vfstore//subpath`). Mount specs are delivered to the guest via `storage-mounts.json` in `/home/config/` (virtio-fs config channel).

**krunner binaries** (`/opt/kernel/su-exec`, `entrypoint.sh`, `dropbearmulti`, `sftp-server`, `tmux`, `ttyd`, etc.)
and **Python libraries** (`ai.backend.kernel`, `ai.backend.helpers`): All executables and libraries are **baked into the attested guest rootfs**. Under CoCo-by-default, the host is untrusted and must not be a source of executables — no host→guest sharing of binaries via virtio-fs. These are present in the guest rootfs at the expected paths (`/opt/kernel/`, `/opt/backend.ai/lib/...`). + +#### Mounts That Change Mechanism (CHANGE) + +**`/tmp` tmpfs**: In Docker, `/tmp` is backed by a host-side tmpfs bind-mounted into the container. For Kata, sharing host tmpfs across the VM boundary via virtio-fs adds unnecessary overhead for ephemeral data. Use guest-side tmpfs instead: `Mount(TMPFS, None, "/tmp", RW)` — allocated inside guest RAM, never crosses the VM boundary. + +**krunner volume** (`/opt/backend.ai`): This Docker named volume contains the pre-built Python runtime environment (distro-matched libc, interpreter, packages). Docker named volumes are not available with containerd. **Phase 1**: extract krunner to a host directory and share via virtio-fs bind mount (transparent translation). **Production**: bake into the guest rootfs image (faster startup, eliminates one virtio-fs share; requires rebuild on krunner update). + +**Core dump mount**: In Docker, the host kernel's `/proc/sys/kernel/core_pattern` governs container processes because they share the host kernel. The agent bind-mounts the host's coredump directory so dumps land on the host filesystem. In Kata, the **guest kernel** has its own `core_pattern` — host-side configuration doesn't affect guest processes. To capture guest core dumps: configure the guest kernel's `core_pattern` to write to a guest-local path (e.g., `/home/work/.coredumps` on the guest disk). Low priority. 
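For the guest `core_pattern` configuration mentioned above, a minimal sketch of a sysctl fragment baked into the guest rootfs image — the file path and pattern specifiers are illustrative, and the target directory matches the guest-local path suggested in the text:

```ini
# /etc/sysctl.d/50-coredump.conf in the guest rootfs (illustrative)
# %e = executable name, %p = PID, %t = epoch timestamp
kernel.core_pattern = /home/work/.coredumps/core.%e.%p.%t
```

Because this is guest kernel state, it applies uniformly to every container in the sandbox and requires no per-session agent logic.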
#### Mounts That Are Skipped (SKIP)

All of the following are Docker-specific mechanisms that are unnecessary or incompatible with the Kata VM model:

- **Agent socket** (`/opt/kernel/agent.sock`): ZMQ REP socket for jail/C binaries (PID translation, jail status). Irrelevant — jail is not used and the VM boundary isolates PIDs. The primary agent↔kernel-runner channel (ZMQ PUSH/PULL) is TCP-based and works over the Calico network.
- **lxcfs** (`/proc/cpuinfo`, `/proc/meminfo`, etc.): Provides cgroup-scoped `/proc` for Docker containers that share the host kernel. Unnecessary — the guest kernel's `/proc` already reflects the VM's allocated resources.
- **`libbaihook.so`** (LD_PRELOAD): Hooks `sysconf()` to return cgroup-scoped CPU count. Unnecessary — the guest kernel already returns the VM's vCPU count.
- **`jail` sandbox** (`/opt/kernel/jail`): ptrace-based syscall tracer. Redundant — the VM boundary (KVM) is a stronger isolation layer.
- **Domain socket proxies**: Host-side Unix socket access for service containers. UDS cannot cross the VM boundary.
- **Deep learning samples volume**: Deprecated Docker named volume. Not available with containerd.

#### Mounts That Differ Entirely (DIFFERENT)

**Accelerator plugin mounts**: For Docker, the `CUDAPlugin` injects hook `.so` files via bind mount (appended to `LD_PRELOAD`) and generates Docker `DeviceRequests` for the NVIDIA Container Toolkit.
For Kata with VFIO passthrough, this is entirely replaced:
- No `LD_PRELOAD` GPU hooks — the GPU is natively visible inside the VM via VFIO PCI passthrough
- `CUDAVFIOPlugin` generates VFIO device configuration (PCI addresses, IOMMU groups) passed to the Kata shim, not Docker `DeviceRequests`
- `nvidia-smi` and CUDA work natively inside the guest with the passthrough GPU
- See [vfio-accelerator-plugin.md](vfio-accelerator-plugin.md) for details

### Mount Compatibility Matrix

| Mount | Docker | Kata | Verdict |
|-------|--------|------|---------|
| `/home/config` (scratch) | Bind mount | virtio-fs | **KEEP** — same spec |
| `/home/work` (workspace) | Bind mount | Guest-side storage mount | **CHANGE** — mounted via guest NFS/Lustre client, not virtio-fs |
| `/tmp` (tmpfs) | Host tmpfs bind | Guest-side tmpfs | **CHANGE** — `TMPFS` type |
| `/etc/localtime`, `/etc/timezone` | Bind mount | Guest rootfs (baked) | **CHANGE** — set at rootfs build time |
| lxcfs `/proc/*`, `/sys/*` | Bind mount | N/A | **SKIP** — guest kernel provides |
| `libbaihook.so` + `LD_PRELOAD` | Bind mount | N/A | **SKIP** — guest kernel provides |
| `jail` sandbox | Bind mount | N/A | **SKIP** — VM is stronger isolation |
| `/opt/kernel/agent.sock` | Unix socket (jail/C binaries only) | N/A | **SKIP** — jail skipped, PID translation irrelevant in VM |
| Domain socket proxies | Unix socket | N/A | **SKIP** — defer to future phase |
| Core dump path | Host core_pattern | Guest core_pattern | **CHANGE** — guest-side config |
| krunner volume (`/opt/backend.ai`) | Docker volume | virtio-fs / rootfs | **CHANGE** — no Docker volumes |
| Deep learning samples | Docker volume | N/A | **SKIP** — legacy, deprecated |
| krunner binaries (15+ files) | Bind mount | Guest rootfs (baked) | **CHANGE** — baked into attested rootfs (host untrusted under CoCo) |
| Python libs (`kernel`, `helpers`) | Bind mount | Guest rootfs (baked) | **CHANGE** — baked into attested rootfs (host untrusted under CoCo) |
| Accelerator hooks | Plugin `.so` | VFIO PCI config | **DIFFERENT** — new plugin |
| VFolders | Bind mount | Guest-side NFS/Lustre bind mount | **CHANGE** — NOT virtio-fs; mounted via guest kernel client |

### I/O Performance Characteristics

#### Throughput (vs Native)

Benchmark data from Red Hat virtio-fs mailing list, Kata Containers issues, and Proxmox community testing:

| Workload | virtio-fs (no DAX) | virtio-fs (DAX) | virtio-9p | Source |
|----------|-------------------|-----------------|-----------|--------|
| Sequential read (psync, 4K) | 98 MB/s | 660 MB/s | 99 MB/s | Red Hat ML |
| Sequential read (mmap, 4 threads) | ~219 MB/s | 2,849-3,107 MB/s | 140 MB/s | LWN RFC |
| Sequential write (psync) | ~98 MB/s | 487 MB/s | — | Red Hat ML |
| Random 4K read IOPS | 36.2k | 64.2k | — | Proxmox |
| Random 4K write IOPS | 15.5k | 21.4k | — | Proxmox |
| Large file random R/W (4 GB) | 211 / 70.6 MB/s | — | 43 / 14 MB/s | Kata #2815 |

**Relative to native host performance** (tuned, kernel 6.11+, `cache=never`, `direct-io=1`):

| Workload | % of Native | Notes |
|----------|-------------|-------|
| Sequential read | 85-95% (DAX) | DAX critical for large sequential reads |
| Sequential write | 75-85% (DAX) | `cache=none` outperforms `cache=always` for writes |
| Random 4K read | ~85% (tuned) | 25.4k vs 29.8k IOPS (Proxmox 6.11) |
| Random 4K write | ~85% (tuned) | 13.7k vs 16.1k IOPS |
| Metadata (file create) | ~50-70% (est.) | 3.7x faster than 9p (714 vs 194 files/sec) |
| Directory listing | ~60-80% (est.) | 6.7x faster than 9p with caching (13.5k vs 2k files/sec) |

#### DAX: Critical Caveats

DAX maps host page cache directly into guest memory, providing near-native read performance for sequential workloads. However:

1. **DAX thrashing**: When the working set exceeds the DAX window size, performance **degrades catastrophically** — Kata #2138 measured 3.6k IOPS with DAX enabled vs 252k IOPS with DAX disabled for random workloads. DAX mappings thrash as the VM continuously faults in and evicts pages.

2. **DAX window sizing**: Each 2 MB DAX chunk requires 32 KB of guest page descriptors. A 16 GB window costs ~256 MB in page descriptor overhead (device memory, not counted against VM RAM). Undersized windows trigger thrashing; oversized windows waste host memory reservation.

3. **Small file penalty**: Files under 32 KB should not use DAX — the 32 KB page descriptor overhead per 2 MB chunk exceeds the benefit. Linux 5.17+ supports per-file DAX (`dax=inode`) to selectively enable DAX only for large files.

**Recommendation**: **Disable DAX by default** (`virtio_fs_cache_size = 0` in `kata.toml`). AI/ML workloads alternate between large sequential reads (training data loading — DAX beneficial) and heavy random I/O (checkpointing, Python imports, pip installs — DAX harmful). The penalty from DAX thrashing on random I/O is far worse than the throughput loss from no-DAX on sequential reads. Enable DAX only for workloads with known large-sequential-read-dominant I/O patterns.

#### Host-Side Overhead

Kata runs **one `virtiofsd` process per sandbox** (per VM), not one per bind mount. The virtiofsd serves a shared top-level directory (`/run/kata-containers/shared/sandboxes//`) that contains all container mounts underneath it. Individual bind mounts are subdirectories within this shared tree:

| Resource | Per virtiofsd (1 VM) | 10 VMs |
|----------|---------------------|--------|
| Memory (RSS) | ~15 MB idle | ~150 MB |
| Heap under load | 5-6 MB | ~50-60 MB |
| File descriptors | Up to 80k under heavy load | Cumulative FD pressure |
| Shared memory | VM RAM size (shared with hypervisor, not double-counted) | Per-VM |

The `virtiofsd` Rust implementation is recommended over the legacy C version for better CPU efficiency. Use `--thread-pool-size=1` for most workloads — the default 64 threads causes lock contention that reduces IOPS by ~27%.
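The window-sizing arithmetic above is easy to sanity-check; the constants are taken from the text (32 KB of guest page descriptors per 2 MB DAX chunk):

```python
MIB = 1024 * 1024


def dax_descriptor_overhead(window_bytes: int) -> int:
    """Guest page-descriptor memory consumed by a DAX window of this size:
    one 32 KiB descriptor block per 2 MiB chunk (figures from the text)."""
    chunks = window_bytes // (2 * MIB)
    return chunks * 32 * 1024


# A 16 GiB window -> 8192 chunks -> 256 MiB of descriptor overhead.
print(dax_descriptor_overhead(16 * 1024 * MIB) // MIB)  # 256
```

This overhead scales linearly with the window, which is why oversized windows waste host memory reservation even when the guest never touches most of the window.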
**Mount count impact**: Since virtiofsd is shared per sandbox, the 15+ individual krunner binary mounts do not create additional processes. However, consolidating them into a single directory mount simplifies the OCI mount spec and reduces the kata-agent's mount setup time inside the guest.

#### AI/ML Workload Impact Assessment

| Phase | I/O Pattern | Dominant Factor | virtio-fs Impact |
|-------|-------------|-----------------|-----------------|
| Data loading | Large sequential read from vfolders | Throughput | 85-95% native (no DAX sufficient; NVMe/SSD saturates first) |
| Training | GPU compute, minimal I/O | GPU | Negligible — I/O is not on critical path |
| Checkpointing | Large sequential write | Throughput | 75-85% native |
| pip install / imports | Many small file metadata ops | Latency | 50-70% native (one-time cost at session start) |
| Jupyter notebook | Small random R/W | Latency | ~85% native (tuned) |

For typical AI/ML workloads, GPU compute dominates wall-clock time. The virtio-fs overhead is concentrated in data loading (first minutes of a training run) and session startup (pip install, Python imports). Once training begins, the GPU utilization rate determines throughput — not storage I/O.

## Interface / API

No new storage interfaces are introduced.
The existing abstractions are reused:

| Abstraction | Location | Change for Kata |
|-------------|----------|-----------------|
| `VFolderMount` | `src/ai/backend/common/types.py:1299` | None |
| `MountTypes` | `src/ai/backend/common/types.py:612` | None |
| `Mount` | `src/ai/backend/agent/resources.py:843` | None |
| `MountPermission` | `src/ai/backend/common/types.py:603` | None |
| `mount_vfolders()` | Agent method | Reuse as-is |
| `get_intrinsic_mounts()` | Agent method | Override: skip lxcfs, agent socket, domain socket proxies, deep learning samples; use guest-side tmpfs |
| `mount_krunner()` | Agent method | Override: skip libbaihook.so + LD_PRELOAD, skip jail; convert krunner Docker volume to bind mount; skip accelerator LD_PRELOAD hooks |
| `process_mounts()` | Agent method | Override to produce OCI mount format (vs Docker API format) |

## Implementation Notes

- The virtiofsd daemon is managed by the Kata shim, not by the Backend.AI agent. No virtiofsd lifecycle code is needed in KataAgent.
- virtio-fs shares are set up per-sandbox (per-VM). Multiple containers in the same Kata sandbox share the same virtio-fs instance. Since Backend.AI uses one container per session (one sandbox per container), this is a 1:1 mapping.
- File ownership (UID/GID) is preserved across the virtio-fs boundary. The `kernel_uid`/`kernel_gid` settings work as expected.
- Symbolic links within mounted directories work correctly with virtio-fs.
- File locking (flock, fcntl) works with virtio-fs but may have different semantics for distributed filesystems (CephFS, NFS) that are already mounted on the host.
- `KataKernelCreationContext.mount_krunner()` should consolidate the 15+ individual krunner binary mounts into a single directory mount (e.g., bind-mount the entire `runner/` directory to `/opt/kernel/`) to simplify the OCI mount spec and reduce kata-agent's mount setup time inside the guest.
While virtiofsd is shared per sandbox (not per mount), fewer mount entries reduce guest-side setup overhead. Docker uses individual file mounts to overlay binaries into an existing container filesystem; Kata can use a directory mount since the guest rootfs is purpose-built.
- The `ContainerSandboxType` config option is irrelevant for Kata — always use `DOCKER` (no-op sandbox) or introduce a `VM` type. Never use `JAIL` with Kata.
- **VFolder data does NOT use virtio-fs.** virtio-fs introduces FUSE overhead and breaks RDMA data paths for high-performance storage backends (Lustre over IB, WekaFS RDMA, GPFS). Instead, storage volumes are mounted directly inside the guest VM using the guest's own NFS/Lustre/WekaFS kernel client (preserving RDMA, 100% native throughput). The agent writes `storage-mounts.json` to `/home/config/` (virtio-fs config channel) containing volume mount specs; a guest boot script reads this and performs the mounts before the container starts. Vfolder subdirectories are then bind-mounted into the container — identical to Docker's host-level mount model. Storage topology is visible in guest `/proc/mounts` (acceptable — same isolation model as Docker; users can see mount info but cannot mount their own disks). virtio-fs is retained only for scratch/config directories (`/home/config`). See [migration-compatibility.md](migration-compatibility.md) "VFolder Storage" section for full architecture.

## Direct Storage Access from Guest VMs

### Motivation

Backend.AI supports 12+ storage backends (`vfs`, `xfs`, `netapp`, `dellemc-onefs`, `weka`, `gpfs`, `cephfs`, `vast`, `exascaler`, `purestorage`, `hammerspace`, etc.), each exposing vfolders to agent hosts via different protocols. The adopted design mounts storage volumes **directly inside the guest VM** using the guest's own NFS/Lustre/WekaFS kernel client (preserving RDMA), then bind-mounts vfolder subdirectories into the container at `/home/work/{vfolder}`.
This section documents the analysis that led to this decision, including why virtio-fs was rejected for vfolder data I/O.

### Storage Provider Protocol Survey

| Provider | Client Protocol | Native VM Client Feasible? | Notes |
|----------|----------------|---------------------------|-------|
| VAST | NFS v3/v4 | Yes (NFS client in guest) | Pure NFS appliance; no vendor-specific VM integration |
| NetApp ONTAP | NFS v3/v4/v4.1 | Yes (NFS client in guest) | Supports KVM environments via NFS; no Kata-specific support |
| Weka | NFS v3/v4.1 or WekaFS POSIX | NFS: Yes; WekaFS: No | WekaFS native client requires SR-IOV NIC passthrough in VMs — too complex |
| Pure Storage FlashBlade | NFS v3 | Yes (NFS client in guest) | Standard NFS appliance; no vendor-specific VM integration |
| Dell EMC PowerScale (OneFS) | NFS v3/v4/v4.2 | Yes (NFS client in guest) | Standard NFS; guest OS mount documented for VMware VMs |
| Hammerspace | NFS (pNFS/FlexFiles) | Yes (NFS client in guest) | Metadata orchestration layer; supports VM and container environments |
| CephFS | Kernel ceph client or NFS gateway | Yes (kernel ceph client in guest) | Native client in guest achieves ~46% of virtio-fs throughput (see benchmarks below) |
| Lustre (DDN EXAScaler) | Lustre kernel client | Possible but complex | Requires Lustre kernel modules in guest rootfs; no Kata integration exists |
| GPFS (IBM Storage Scale) | GPFS kernel client or NFS gateway | Possible but complex | CSI driver for K8s exists; native client requires cluster membership |
| XFS / VFS (local) | Local filesystem | No | Local storage is host-attached; direct guest access not applicable |

### Why virtio-fs Is the Right Default

The Kata project explicitly chose virtio-fs over network protocols for host-guest file sharing:

> "These techniques [shared memory, DAX mapping] are not applicable to network file systems since the communications channel is bypassed by taking advantage of shared memory on a local machine."

For **host-local storage** (XFS, local NVMe), virtio-fs is unambiguously faster — it uses shared memory between host and guest processes, avoiding any network stack overhead.

An early Kata RFC ([kata-containers/runtime#279](https://github.com/kata-containers/runtime/issues/279)) explored NFS-over-VSOCK as an alternative to 9p (the predecessor to virtio-fs). While optimized NFS achieved 17-35x raw R/W performance over 9p, virtio-fs with DAX superseded this approach, and the NFS-over-VSOCK kernel patches were never upstreamed.

### Double-Hop Analysis for Network Storage

For storage that is **already network-attached** (NFS, CephFS, etc.), the current design creates a double-hop:

```
Storage cluster ──(NFS/CephFS)──→ Host mount ──(virtio-fs)──→ Guest mount
```

A direct guest mount would eliminate one hop:

```
Storage cluster ──(NFS/CephFS)──→ Guest mount (directly)
```

However, benchmarks from CephFS testing on Proxmox suggest virtio-fs passthrough still outperforms direct guest mounts:

| Path | Write (MiB/s) | Read (MiB/s) |
|------|---------------|--------------|
| Host CephFS → virtio-fs → guest | 2145 | 3488 |
| Guest native ceph client → cluster | 984 | 1601 |

The virtio-fs path wins because the host-side ceph client benefits from the host page cache, and DAX mapping shares this cache directly with the guest — avoiding the overhead of running a separate ceph client (with its own cache, network connections, and OSD interactions) inside the resource-constrained guest kernel.

For NFS-based providers, the comparison is less clear-cut since NFS clients are lighter weight than ceph clients, but the shared-memory advantage of virtio-fs likely still applies.

### Practical Barriers to Direct Guest Mount

Even where technically feasible, direct guest NFS mount has significant practical issues:

1. **Guest kernel configuration**: The default Kata guest kernel does not include NFS client modules.
A custom kernel with `CONFIG_NFS_FS=y`, `CONFIG_NFS_V4=y` is required. Kata [issue #10509](https://github.com/kata-containers/kata-containers/issues/10509) documents that even after adding NFS support, automatic volume mounts fail silently — only manual mounts work.

2. **Network connectivity**: The guest must have network access to the storage cluster. In the virtio-fs model, only the host needs storage network access — the guest's network is limited to the container workload's needs.

3. **Credential management**: NFS exports, CephFS keyrings, Lustre/GPFS client configurations must be provisioned inside the guest. This conflicts with the security model — especially for confidential computing where minimizing guest configuration surface is important.

4. **Mount lifecycle**: Backend.AI's agent manages mount lifecycle (create scratch dirs, write config files, clean up). With virtio-fs, this host-side management works transparently. Direct guest mounts would require the agent to coordinate mount operations inside the guest via kata-agent or containerd exec.

5. **Storage proxy integration**: Backend.AI's Storage Proxy resolves `host_path` for vfolders. These paths are host-local. Direct guest mount would require the guest to resolve storage paths independently, bypassing the Storage Proxy entirely.
### Recommendation

**Use direct guest-side mounts for VFolder data; reserve virtio-fs for scratch/config only.** This aligns with the CoCo-by-default trust model and preserves RDMA performance:

- VFolder data is mounted **inside the guest VM** via the native NFS/Lustre/WekaFS kernel client, achieving 100% native throughput with RDMA preserved end-to-end
- Containers bind-mount vfolder subdirectories from the guest filesystem — the same model as Docker
- virtio-fs is retained only for non-performance-critical, host-originated paths (scratch/config directories), where the FUSE overhead is acceptable and simplifies host→guest config file delivery
- Storage topology visible in guest `/proc/mounts` — acceptable per the same isolation model as Docker (vfolder permissions enforce access control, not mount concealment)

The earlier analysis of virtio-fs performance (DAX, throughput benchmarks) remains relevant as background context for understanding the I/O characteristics of scratch/config mounts, but **virtio-fs is not used for VFolder data I/O** in the final design.

## Per-VM Cloned Disk: Hybrid Storage Model

### Motivation

While virtiofsd runs as a single process per sandbox (not per mount), the virtio-fs-for-everything model has I/O performance implications: every file operation on read-only infrastructure mounts (krunner binaries, Python libraries, timezone files) crosses the VM boundary via the FUSE protocol, even though these files never change at runtime. Block devices (virtio-blk) provide near-native I/O for read-only content because the guest kernel accesses them directly without FUSE overhead.

Conventional VM hypervisors (VMware, Proxmox, OpenStack) solve this differently: they clone a template disk image for each VM and the guest mounts it as a local block device. Kata Containers supports a similar model via **qcow2 CoW (Copy-on-Write) cloning** and **virtio-blk device passthrough**.
### Block Device Options in Kata

**qcow2 CoW cloning**: Create instant thin-clone disks from a base image. `qemu-img create -f qcow2 -b base.qcow2 -F qcow2 clone.qcow2` takes <10ms and produces a file that starts at near-zero size, growing only as the guest writes new data. Reads fall through to the base image. QEMU and Cloud Hypervisor both support multi-level backing chains.

**devicemapper snapshotter**: containerd's devicemapper snapshotter stores each OCI image layer as a thin-provisioned block device. The Kata shim attaches these as `virtio-blk` devices to the guest VM. The kata-agent inside the guest assembles the layers using devicemapper. No virtiofsd needed for the container rootfs.

**EROFS snapshotter** (containerd 2.1+): Each OCI image layer is stored as a read-only EROFS (Enhanced Read-Only File System) blob and exposed to the guest via `virtio-blk`. EROFS is optimized for read-heavy container workloads with better compression and lower memory overhead than ext4/xfs overlays.
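The thin-clone step can be wrapped in a small helper that mirrors the `qemu-img` invocation above; the template/clone paths and version string below are illustrative, not prescribed by the document:

```python
from pathlib import Path


def qcow2_clone_cmd(base: Path, clone: Path) -> list[str]:
    """Build the qemu-img command line for a CoW thin clone of `base`.

    The clone records `base` as its backing file: reads fall through to the
    backing image, and only new guest writes land in the clone file.
    """
    return [
        "qemu-img", "create",
        "-f", "qcow2",    # format of the new clone
        "-b", str(base),  # backing (template) image
        "-F", "qcow2",    # format of the backing image
        str(clone),
    ]


# Versioned template naming (kata-rootfs-{version}.qcow2) enables atomic
# rollover when the template is rebuilt; paths here are hypothetical.
cmd = qcow2_clone_cmd(
    Path("/var/lib/kata/templates/kata-rootfs-1.0.0.qcow2"),
    Path("/run/kata/sandboxes/sb-01/rootfs.qcow2"),
)
# Execute with: subprocess.run(cmd, check=True)
```

Because the clone is a plain file created in milliseconds, per-sandbox disk provisioning stays off the VM-start critical path.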
### Two Models Compared

**Model A: virtio-fs for everything** (reference — NOT the adopted model for vfolders)

```mermaid
flowchart LR
    subgraph HOST["Host"]
        virtiofsd["virtiofsd (1 per VM)"]
    end
    subgraph VM["Guest VM"]
        config["/home/config"]
        work["/home/work"]
        kernel["/opt/kernel/*"]
        data["/home/work/data"]
    end
    virtiofsd -->|virtio-fs| config & work & kernel & data
```

**Model B: Per-VM cloned disk + selective virtio-fs** (hybrid)

```mermaid
flowchart LR
    subgraph HOST["Host"]
        qcow2["base.qcow2 → clone.qcow2 (CoW)"]
        vfsd["virtiofsd (scratch only)"]
    end
    subgraph VM["Guest VM"]
        root["/ (rootfs: krunner, pylibs, timezone)"]
        scratch["/home/config"]
        vf1["/home/work/data"]
        vf2["/home/work/models"]
    end
    qcow2 -->|virtio-blk| root
    vfsd -->|virtio-fs| scratch
    NFS["Storage cluster"] -->|"NFS/Lustre (direct)"| vf1 & vf2
```

### Trade-off Analysis

| Dimension | Model A (virtio-fs only) | Model B (hybrid) |
|-----------|-------------------------|-------------------|
| Host memory overhead | ~15 MB per VM (1 virtiofsd) | ~15 MB per VM (same — 1 virtiofsd for remaining shares) |
| Read-only I/O perf | 85-95% native via FUSE path | Near-native (virtio-blk, direct block I/O) |
| Random 4K IOPS | ~85% native (tuned) | Near-native for rootfs; ~85% for vfolders |
| Bidirectional sharing | Native (inherent to virtio-fs) | Only for virtio-fs mounts (scratch, vfolders) |
| Config injection | Write to host path → visible in guest | Write to host path → visible only for virtio-fs mounts |
| Disk space per VM | Zero (shared host paths) | Thin clone overhead (~100-500 MB per VM for CoW writes) |
| Startup complexity | Low (Kata shim handles) | Medium (template management, clone lifecycle) |
| Image update rollout | Immediate (host files are live) | Requires template rebuild + new clones |

### Critical Constraint: Bidirectional File Sharing

Backend.AI's architecture relies on the agent writing config files
(environ.txt, resource.txt, SSH keys, accelerator configs) to scratch directories **after** container creation — the kernel runner reads these at startup. This requires **bidirectional** host-guest file sharing:

1. Agent writes `scratch_dir/config/environ.txt` on the host
2. Kernel runner reads `/home/config/environ.txt` inside the guest
3. User writes files to `/home/work/` inside the guest
4. VFolder sync reads those files from `scratch_dir/work/` on the host

Block devices (qcow2 clones, virtio-blk) are **guest-local** — the host cannot write to them after the VM boots without offline image manipulation. This means scratch directories **must** remain on virtio-fs regardless of what model is chosen for read-only infrastructure mounts; vfolders achieve bidirectional visibility through the shared network filesystem itself, since host and guest both mount the same NFS/Lustre volume.

### Recommended Hybrid Approach

Use block devices for **read-only infrastructure** (VM rootfs, container image, krunner binaries, Python libraries), virtio-fs for **config delivery** (scratch/config dirs), and direct guest-side NFS/Lustre mounts for **vfolder data**:

| Mount Category | Transport | Rationale |
|---------------|-----------|-----------|
| VM boot rootfs | virtio-pmem (existing) | Kata default — read-only shared base image |
| Container OCI image | virtio-blk (devicemapper/EROFS) | Read-only layers; eliminates largest virtiofsd share |
| krunner + Python libs | Baked into guest rootfs template **or** virtio-blk overlay | Eliminates 15+ individual virtio-fs shares |
| Timezone | Baked into guest rootfs template | Static config, no runtime changes |
| Scratch dirs (`/home/config`) | virtio-fs | Bidirectional sharing required for config delivery |
| VFolders | **Direct guest-side NFS/Lustre mount** | RDMA-preserving; NOT virtio-fs |

Since virtiofsd is already one process per sandbox, the hybrid model does not reduce process count. Its benefit is **I/O performance**: read-only infrastructure content is served via direct block I/O instead of the FUSE protocol.
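The config-delivery pair in steps 1-2 above amounts to a fixed host↔guest path mapping over the virtio-fs scratch share. A sketch of that mapping (the helper name is hypothetical; the directory layout follows the document):

```python
from pathlib import Path, PurePosixPath

# Where the scratch config share surfaces inside the guest (per the document).
GUEST_CONFIG_DIR = PurePosixPath("/home/config")


def guest_config_path(scratch_dir: Path, host_file: Path) -> PurePosixPath:
    """Translate a host-side file under `scratch_dir/config` into the path
    the kernel runner sees inside the guest via the virtio-fs share."""
    rel = host_file.relative_to(scratch_dir / "config")
    return GUEST_CONFIG_DIR / rel
```

Because the share is live in both directions, the agent can write after VM boot and the kernel runner observes the file immediately — exactly the property block devices cannot provide.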
### Implementation Considerations

- **Template lifecycle**: The guest rootfs template containing krunner binaries, Python libraries, DCGM/Node Exporter, and storage mount scripts must be rebuilt when these components are updated. Versioned template naming (`kata-rootfs-{version}.qcow2`) enables atomic rollover.
- **devicemapper setup**: Requires a dedicated thin pool on the host (LVM thin provisioning or loopback device). containerd's devicemapper snapshotter handles pool management, but initial setup is more complex than the default overlayfs snapshotter.
- **EROFS availability**: Requires containerd >= 2.1 and the `nydus-snapshotter` plugin. EROFS support is maturing but not yet the default in most distributions.
- **Phase 1 storage model**: VFolders use direct guest-side NFS/Lustre mounts from day one (RDMA required). virtio-fs is used only for scratch/config directories (`/home/config`). krunner binaries and Python libraries are baked into the attested guest rootfs (CoCo — host untrusted). The hybrid block device model (virtio-blk for container rootfs) is a future optimization.

diff --git a/proposals/BEP-1051/vfio-accelerator-plugin.md b/proposals/BEP-1051/vfio-accelerator-plugin.md
new file mode 100644
index 00000000000..86ff94f1993
--- /dev/null
+++ b/proposals/BEP-1051/vfio-accelerator-plugin.md
@@ -0,0 +1,429 @@

# BEP-1051: VFIO Accelerator Plugin

## Summary

A new `CUDAVFIOPlugin` compute plugin manages NVIDIA GPUs via VFIO passthrough for Kata VMs. Unlike the existing `CUDAPlugin`, which uses Docker's `DeviceRequests` API and NVML, this plugin discovers devices via sysfs PCI enumeration, validates IOMMU group isolation, and produces VFIO device assignment configuration consumed by `KataKernelCreationContext`.
## Current Design

The existing `CUDAPlugin` (`src/ai/backend/accelerator/cuda_open/plugin.py`):

```python
# Device discovery via NVML/libcudart
async def list_devices(self) -> Collection[CUDADevice]:
    count = libcudart.get_device_count()
    for idx in range(count):
        props = libcudart.get_device_props(idx)
        # Uses nvidia driver to enumerate devices

# Docker-specific GPU attachment
async def generate_docker_args(self, docker, device_alloc):
    return {
        "HostConfig": {
            "DeviceRequests": [{
                "Driver": "nvidia",
                "DeviceIDs": assigned_device_ids,
                "Capabilities": [["utility", "compute", "video", "graphics"]],
            }]
        }
    }

# Allocation via DiscretePropertyAllocMap
async def create_alloc_map(self) -> AbstractAllocMap:
    return DiscretePropertyAllocMap(
        device_slots={
            dev.device_id: DeviceSlotInfo(SlotTypes.COUNT, SlotName("cuda.device"), Decimal(1))
            for dev in devices
        },
    )
```

Key incompatibilities with Kata/VFIO:
1. NVML requires the `nvidia` driver loaded; in VFIO mode, GPUs are bound to `vfio-pci`
2. `DeviceRequests` is a Docker API concept; Kata uses VFIO device annotations
3. Host-side metrics (NVML) are unavailable when device is bound to `vfio-pci`

## Proposed Design

### Package Structure

```
src/ai/backend/accelerator/cuda_vfio/
├── __init__.py
├── plugin.py    # CUDAVFIOPlugin
├── device.py    # CUDAVFIODevice, PCI/sysfs scanning
├── iommu.py     # IOMMU group detection and validation
└── BUILD        # Pants build with backendai_accelerator_v21 entry point
```

### Entry Point Registration

```python
# BUILD
python_distribution(
    name="dist",
    entry_points={
        "backendai_accelerator_v21": {
            "cuda_vfio": "ai.backend.accelerator.cuda_vfio.plugin:CUDAVFIOPlugin",
        }
    },
)
```

The agent config's plugin allowlist/blocklist controls which plugin loads:
- Docker agents: load `cuda` (existing CUDAPlugin)
- Kata agents: load `cuda_vfio` (new CUDAVFIOPlugin)

### CUDAVFIODevice

```python
@attrs.define(slots=True)
class CUDAVFIODevice(AbstractComputeDevice):
    pci_address: str       # "0000:41:00.0"
    iommu_group: int       # IOMMU group number
    vfio_device_path: str  # "/dev/vfio/42"
    pci_vendor_id: str     # "10de" (NVIDIA)
    pci_device_id: str     # "2684" (e.g., RTX 4090)
    model_name: str        # Mapped from PCI device ID
    memory_size: int       # GPU memory in bytes (from PCI BAR or lookup table)
    numa_node: int         # NUMA node affinity
    clique_id: str | None  # PCIe P2P group ID from Kata VRA (e.g., "clusterUUID.0")
```

### Device Discovery via sysfs

Since GPUs are bound to `vfio-pci` (not `nvidia`), NVML is unavailable.
Discovery uses PCI sysfs:

```python
async def list_devices(self) -> Collection[CUDAVFIODevice]:
    devices = []
    pci_devices_path = Path("/sys/bus/pci/devices")

    for pci_dir in pci_devices_path.iterdir():
        vendor = (pci_dir / "vendor").read_text().strip()
        if vendor != "0x10de":  # NVIDIA vendor ID
            continue

        device_class = (pci_dir / "class").read_text().strip()
        if not device_class.startswith("0x0302"):  # 3D controller (GPU)
            # Also check 0x0300 (VGA) for display GPUs
            if not device_class.startswith("0x0300"):
                continue

        driver_link = pci_dir / "driver"
        if not driver_link.is_symlink():
            # No driver bound: the /dev/vfio node does not exist, so the
            # device must not be offered for allocation.
            log.warning("GPU {} has no driver bound (expected vfio-pci), skipping",
                        pci_dir.name)
            continue
        driver = driver_link.resolve().name
        if driver != "vfio-pci":
            log.warning("GPU {} bound to {} (expected vfio-pci), skipping",
                        pci_dir.name, driver)
            continue

        iommu_group = self._get_iommu_group(pci_dir)
        if iommu_group is None:
            log.warning("GPU {} has no IOMMU group, skipping", pci_dir.name)
            continue

        pci_device_id = (pci_dir / "device").read_text().strip().removeprefix("0x")
        model_name = PCI_DEVICE_ID_MAP.get(pci_device_id, f"NVIDIA GPU ({pci_device_id})")

        devices.append(CUDAVFIODevice(
            device_id=DeviceId(pci_dir.name),
            hw_location=pci_dir.name,
            numa_node=self._read_numa_node(pci_dir),
            memory_size=self._estimate_memory(pci_dir),
            processing_units=0,  # Unknown without NVML
            pci_address=pci_dir.name,
            iommu_group=iommu_group,
            vfio_device_path=f"/dev/vfio/{iommu_group}",
            pci_vendor_id="10de",
            pci_device_id=pci_device_id,
            model_name=model_name,
        ))

    return devices
```

### IOMMU Group Validation

A GPU can only be safely passed through if its IOMMU group is isolated:

```python
def _validate_iommu_group(self, pci_address: str, iommu_group: int) -> bool:
    """Check that the IOMMU group contains only the GPU and its audio companion."""
    group_path = Path(f"/sys/kernel/iommu_groups/{iommu_group}/devices")
    group_devices = [d.name for d in group_path.iterdir()]

    allowed_companions = set()
    for dev_addr in group_devices:
        if dev_addr == pci_address:
            continue
        dev_class = (Path("/sys/bus/pci/devices") / dev_addr / "class").read_text().strip()
        # 0x0403 = Audio device (NVIDIA GPU audio is commonly co-located)
        # 0x0604 = PCI bridge (acceptable, handled by VFIO)
        if dev_class.startswith("0x0403") or dev_class.startswith("0x0604"):
            allowed_companions.add(dev_addr)
        else:
            log.warning(
                "IOMMU group {} contains non-GPU device {} (class {}), "
                "GPU {} may not be passthrough-safe",
                iommu_group, dev_addr, dev_class, pci_address,
            )
            return False

    return True
```

When multiple GPUs are in separate IOMMU groups, they can be independently assigned to different VMs. When GPUs share an IOMMU group (rare on server hardware, common on consumer motherboards), they must be assigned together or not at all.

### Allocation Map

Same `DiscretePropertyAllocMap` pattern as the existing CUDAPlugin:

```python
async def create_alloc_map(self) -> AbstractAllocMap:
    devices = await self.list_devices()
    return DiscretePropertyAllocMap(
        device_slots={
            dev.device_id: DeviceSlotInfo(
                SlotTypes.COUNT,
                SlotName("cuda.device"),
                Decimal(1),
            )
            for dev in devices
        },
    )
```

The slot name `cuda.device` is the same as the Docker CUDAPlugin — from the scheduler's perspective, a GPU is a GPU regardless of attachment mechanism.
+ +### generate_docker_args() — VFIO Device Config + +Despite the method name (inherited from `AbstractComputePlugin`), this returns Kata-specific VFIO configuration: + +```python +async def generate_docker_args( + self, + docker: Any, # Unused for Kata; None is passed + device_alloc: DeviceAllocation, +) -> Mapping[str, Any]: + vfio_devices = [] + for slot_type, per_device_alloc in device_alloc.items(): + for device_id, alloc in per_device_alloc.items(): + if alloc > 0: + dev = self._device_map[DeviceId(device_id)] + vfio_devices.append({ + "pci_address": dev.pci_address, + "iommu_group": dev.iommu_group, + "vfio_device": dev.vfio_device_path, + "model_name": dev.model_name, + "clique_id": dev.clique_id, # P2P group for Kata VRA topology + "memory_size": dev.memory_size, # GPU memory in bytes + "numa_node": dev.numa_node, # NUMA affinity for CPU pinning + }) + return {"_kata_vfio_devices": vfio_devices} +``` + +`KataKernelCreationContext.apply_accelerator_allocation()` (see [kata-agent-backend.md](kata-agent-backend.md)) reads `_kata_vfio_devices` and translates it into Kata container annotations for VFIO hotplug. 
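How the shim side consumes this payload is documented separately, but one illustrative shape (hypothetical helper `vfio_devices_to_oci`; the real translation lives in `KataKernelCreationContext`) is to derive one OCI `linux.devices` entry per unique IOMMU group, since GPUs sharing a group expose a single `/dev/vfio/<group>` node:

```python
def vfio_devices_to_oci(vfio_devices: list[dict]) -> list[dict]:
    """Illustrative sketch: map the `_kata_vfio_devices` payload to OCI
    `linux.devices` entries, de-duplicating by IOMMU group because devices
    in the same group share one /dev/vfio/<group> character device."""
    seen: set[int] = set()
    entries: list[dict] = []
    for dev in vfio_devices:
        group = dev["iommu_group"]
        if group in seen:
            continue
        seen.add(group)
        entries.append({"path": dev["vfio_device"], "type": "c"})
    return entries
```

The de-duplication mirrors the earlier observation that GPUs sharing an IOMMU group must be assigned together: they arrive at the VM as one VFIO group node.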
### Metrics Collection

Host-side GPU metrics are limited when devices are bound to `vfio-pci`:

**Node-level metrics** (available from sysfs):
- Power consumption: `/sys/bus/pci/devices/{addr}/power/` (limited)
- Temperature: not available from sysfs alone

**Container-level metrics** (from inside the guest VM):
- The guest VM has the `nvidia` driver loaded, and NVML is functional inside the guest
- Under the default CoCo policy, the kata-agent `ExecProcessRequest` API is blocked — running `nvidia-smi` via containerd's exec path does **not** work
- Metrics are exposed by **DCGM Exporter** (`:9400`) and **Node Exporter** (`:9100`) running inside the guest as systemd services (baked into the attested rootfs)
- Prometheus scrapes these exporters over the Calico network — standard HTTP, no kata-agent involvement
- The CoCo policy controls ttRPC API calls (exec, file copy), NOT network traffic

```python
async def gather_container_measures(
    self, stat_ctx, container_ids,
) -> Sequence[ContainerMeasurement]:
    # CoCo: ExecProcessRequest is blocked by kata-agent policy.
    # GPU metrics are collected by Prometheus scraping DCGM Exporter
    # inside the guest over the Calico network (port 9400).
    # This plugin does NOT collect per-container GPU metrics directly.
    # Backend.AI reads GPU metrics from Prometheus/Grafana instead.
    return []
```

### restore_from_container()

When the agent restarts, reconstruct VFIO bindings from running Kata sandboxes:

```python
async def restore_from_container(self, container, alloc_map):
    labels = container.labels
    # Read allocated device info from container labels
    # (KataKernelCreationContext stores PCI addresses in labels at creation time)
    pci_addresses = labels.get("ai.backend.vfio.pci_addresses", "").split(",")
    for pci_addr in pci_addresses:
        if pci_addr:
            device_id = DeviceId(pci_addr)
            alloc_map.apply_allocation({
                SlotName("cuda.device"): {device_id: Decimal(1)},
            })
```

## Interface / API

| Method | Purpose |
|--------|---------|
| `list_devices()` | PCI sysfs scan for NVIDIA GPUs bound to vfio-pci |
| `available_slots()` | Count of passthrough-ready GPUs |
| `create_alloc_map()` | `DiscretePropertyAllocMap` with `cuda.device` slots |
| `generate_docker_args()` | Returns `_kata_vfio_devices` list for Kata shim |
| `gather_node_measures()` | Limited sysfs metrics (no NVML on host) |
| `gather_container_measures()` | Returns empty — CoCo policy blocks exec; metrics via Prometheus scraping DCGM Exporter |
| `restore_from_container()` | Reconstruct allocations from container labels |

## Multi-Node GPU Interconnect: GPUDirect RDMA

### Overview

Multi-node GPU workloads (distributed training, multi-node inference) require **GPUDirect RDMA** — direct memory access between a GPU and an RDMA-capable NIC (InfiniBand HCA or RoCE adapter) without CPU involvement. This enables NCCL-based collective operations across nodes at near-wire-speed InfiniBand bandwidth.

For Kata VMs, this means **both the GPU and the InfiniBand HCA must be passed through via VFIO to the same VM**, and peer-to-peer DMA between them must work across the VM boundary.
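P2P DMA between the two endpoints generally requires that they sit under the same PCIe root complex. A heuristic host-side check (hypothetical helper, assuming only the standard `/sys/devices/pci<domain>:<bus>/...` layout) compares the root segment of each device's resolved sysfs path:

```python
from pathlib import Path


def shares_root_complex(
    addr_a: str,
    addr_b: str,
    sysfs: Path = Path("/sys/bus/pci/devices"),
) -> bool:
    """Heuristic: two PCI functions share a root complex when their resolved
    sysfs device paths live under the same pci<domain>:<bus> root segment."""
    def root_segment(addr: str) -> str:
        for part in (sysfs / addr).resolve().parts:
            if part.startswith("pci"):
                return part
        raise ValueError(f"{addr}: no PCI root segment found")
    return root_segment(addr_a) == root_segment(addr_b)
```

This is a coarse filter only: it catches the cross-socket case but not ACS configuration, which must still be checked separately.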
+ +### Requirements for GPUDirect RDMA in Kata VMs + +| Requirement | Details | +|---|---| +| **GPU + NIC co-passthrough** | Both the NVIDIA GPU and InfiniBand HCA (ConnectX-5+) must be VFIO-assigned to the same Kata VM | +| **IOMMU pass-through mode** | GPUDirect RDMA requires physical addresses to be identical from both devices' perspective. The IOMMU must be in 1:1 pass-through mode (`iommu=pt`), not performing address translation. Inside the guest, vIOMMU (if present) must also be in pass-through or both devices must share an IOMMU domain | +| **Same PCIe root complex** | GPU and NIC must share an upstream PCIe root complex. Cross-socket (QPI/UPI) P2P is unreliable or non-functional | +| **ACS/ATS for Direct Translated P2P** | Access Control Services (ACS) normally forces all DMA through the root complex. Address Translation Services (ATS, available on ConnectX-5+ and Volta+ GPUs) allows endpoints to prefetch IOVA→PA translations and DMA directly to peers, bypassing the IOMMU for P2P | +| **PCIe topology replication** | NVIDIA's driver uses PCIe topology to determine P2P capability. Default virtualization flattens topology (all devices on root bus), which disables P2P. The guest must replicate the host's PCIe switch hierarchy | +| **Guest driver stack** | Full NVIDIA + Mellanox stack inside the guest: `nvidia` driver, `mlx5_core`/`mlx5_ib`, `ib_core`, `ib_uverbs`, `nvidia-peermem` (bridges GPU and IB subsystems), MLNX_OFED | + +### Kata VRA (Virtualization Reference Architecture) + +The Kata project's [VRA design document](https://github.com/kata-containers/kata-containers/blob/main/docs/design/kata-vra.md) explicitly addresses GPUDirect P2P and RDMA with three key mechanisms: + +**1. 
PCIe topology replication via switch ports:**

```toml
# kata configuration.toml
hot_plug_vfio = "switch-port"  # Replicate host PCIe switch topology in guest
pcie_switch_port = 8           # Number of switch downstream ports
```

This creates PCIe switch structures inside the guest that mirror the host topology. Devices that were under the same PCIe switch on the host are placed under the same virtual switch in the guest, enabling the NVIDIA driver to recognize P2P capability.

**2. Clique-ID annotation for P2P device grouping:**

Devices with the same `clique-id` form a P2P group, configured via the Container Device Interface (CDI):

```yaml
# GPU on PCIe switch
- annotations:
    bdf: "41:00.0"
    clique-id: "0"

# InfiniBand HCA on same PCIe switch
- annotations:
    bdf: "42:00.0"
    clique-id: "0"  # Same clique → P2P enabled between GPU and NIC
```

The hypervisor uses clique-id to provide topology metadata that the NVIDIA driver reads to enable GPUDirect P2P, even if the virtualized PCIe topology doesn't perfectly match the host.

**3. Multi-function IOMMU group handling:**

NVIDIA datacenter GPUs often contain multiple functions (GPU + audio + USB) in a single IOMMU group. VRA handles this by selectively attaching:
- GPU: PCIe root/switch port (requires express protocol, consumes 4K IO range)
- Audio/USB companions: PCI bridge (standard PCI, no IO range consumption)

This conserves scarce PCIe IO space (64K total, 4K per port, max ~16 ports).

### NVLink / NVSwitch (Intra-Node GPU-GPU)

NVLink connects GPUs directly, bypassing PCIe. NVSwitch enables all-to-all NVLink in DGX/HGX systems.
When GPUs are passed through to a Kata VM via VFIO:
- NVLink hardware connections remain physically present
- The NVIDIA driver inside the guest detects NVLink via GPU registers accessible through VFIO BAR mapping
- P2P over NVLink works if both GPUs are in the same guest and the driver detects the NVLink topology
- This has been demonstrated in VMware vSphere (GPU passthrough with NVLink), suggesting KVM/VFIO behaves similarly

For Backend.AI multi-container sessions (clusters on a single host), all GPUs can be passed through to a single Kata sandbox (VM) via the containerd Sandbox API, enabling NVLink communication between containers sharing the VM.

### Current Upstream Status

| Component | Status | Reference |
|---|---|---|
| kata-vra.md design | Written, covers GPUDirect P2P + RDMA | [kata-vra.md](https://github.com/kata-containers/kata-containers/blob/main/docs/design/kata-vra.md) |
| GPU VFIO passthrough | Working (A100, H100 with caveats) | [#12723](https://github.com/kata-containers/kata-containers/issues/12723) |
| InfiniBand HCA passthrough | Partially working; BAR allocation fixed in Kata 3.8+ | [#10392](https://github.com/kata-containers/kata-containers/issues/10392) |
| SR-IOV VF passthrough | Broken — "No PCI mapping found" | [#11910](https://github.com/kata-containers/kata-containers/issues/11910) |
| GPUDirect RDMA end-to-end | Not yet demonstrated in Kata | [#10796](https://github.com/kata-containers/kata-containers/issues/10796) (open, high-priority) |

### InfiniBand HCA BAR Allocation Fix

Kata 3.8+ added `getBARsMaxAddressableMemory()` to auto-size PCIe BAR windows for large-BAR devices. However, this function initially detected only NVIDIA GPUs (via an `IsGPU()` check), skipping InfiniBand HCAs. ConnectX-6 requires a 32MB+ BAR0, but the default 2MB allocation caused `"BAR0: no space for [mem size 32MB 64bit pref]"`.
The fix: include Mellanox device detection alongside GPU detection in the BAR sizing logic ([#10392](https://github.com/kata-containers/kata-containers/issues/10392)).

### Impact on CUDAVFIOPlugin

For GPUDirect RDMA support, the `CUDAVFIOPlugin` needs awareness of co-located InfiniBand HCAs:

```python
async def generate_docker_args(self, docker, device_alloc):
    vfio_devices = []
    for slot_type, per_device_alloc in device_alloc.items():
        for device_id, alloc in per_device_alloc.items():
            if alloc > 0:
                dev = self._device_map[DeviceId(device_id)]
                vfio_devices.append({
                    "pci_address": dev.pci_address,
                    "iommu_group": dev.iommu_group,
                    "vfio_device": dev.vfio_device_path,
                    "model_name": dev.model_name,
                    "clique_id": dev.clique_id,  # P2P group identifier (links GPU and NIC)
                })
    return {
        "_kata_vfio_devices": vfio_devices,
        "_kata_rdma_devices": self._get_colocated_rdma_devices(vfio_devices),  # NEW
    }
```

A separate **RDMAVFIOPlugin** (or an extension to `CUDAVFIOPlugin`) would handle InfiniBand HCA discovery, IOMMU group validation, and VFIO device configuration for the NIC side. The `clique_id` links GPU and NIC devices for P2P topology replication.

### Guest VM Base Image Requirements (GPUDirect RDMA)

In addition to storage filesystem clients (see [migration-compatibility.md](migration-compatibility.md)), the guest rootfs must include:

| Component | Purpose |
|---|---|
| NVIDIA GPU driver + CUDA toolkit | GPU compute + `nvidia-peermem` module |
| MLNX_OFED | InfiniBand/RoCE driver stack (`mlx5_core`, `mlx5_ib`) |
| `nvidia-peermem` kernel module | Registers GPU memory with the IB subsystem for GPUDirect RDMA |
| `libibverbs`, `librdmacm` | RDMA userspace libraries |
| NCCL | Multi-GPU collective communication (uses GPUDirect RDMA for inter-node transfers) |
| Guest kernel with `CONFIG_INFINIBAND=y`, `CONFIG_MLX5_CORE=y`, `CONFIG_VFIO=y` | InfiniBand + VFIO kernel support |

**Note:** MLNX_OFED must be installed before or alongside the NVIDIA GPU driver.
If the GPU driver is installed first, it must be reinstalled after OFED to build `nvidia-peermem` correctly. + +## Implementation Notes + +- `PCI_DEVICE_ID_MAP` is a static mapping from PCI device IDs to model names (e.g., `"2684"` → `"NVIDIA RTX 4090"`). This can be sourced from the PCI ID database or maintained as a curated subset for common GPU models. +- VFIO requires `/dev/vfio/{group}` device files to be accessible. The agent process needs `CAP_SYS_RAWIO` or root access. +- When BEP-1016 (Accelerator Interface v2) is implemented, `generate_docker_args()` should be migrated to `create_lifecycle_hook()` returning a `WorkloadConfig` with proper VFIO device entries. +- The `docker` parameter in `generate_docker_args()` is passed as `None` by `KataKernelCreationContext` since there is no Docker client. +- GPU audio companion devices in the same IOMMU group must also be passed through; the plugin handles this transparently.
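To make the first note concrete, a curated `PCI_DEVICE_ID_MAP` subset with the same fallback behavior used in `list_devices()` might look like the following (only the `"2684"` entry comes from this document; extend from the public PCI ID database as models are qualified):

```python
# Curated subset; extend from the PCI ID database for additional models.
PCI_DEVICE_ID_MAP: dict[str, str] = {
    "2684": "NVIDIA RTX 4090",
}


def model_name_for(pci_device_id: str) -> str:
    """Resolve a marketing name, falling back to the raw device ID
    (matching the fallback in list_devices())."""
    return PCI_DEVICE_ID_MAP.get(pci_device_id, f"NVIDIA GPU ({pci_device_id})")
```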