Skip to content

Latest commit

 

History

History
729 lines (553 loc) · 26.8 KB

File metadata and controls

729 lines (553 loc) · 26.8 KB

Why FCSP When GPUs Already Support MIG?

A Deep Dive into Software vs. Hardware GPU Virtualization for Multi-Tenant Workloads

December 2024


Introduction

If you've ever tried to share a GPU between multiple users or workloads in a Kubernetes cluster, you've probably heard of NVIDIA's Multi-Instance GPU (MIG) technology. It's the official, hardware-backed solution for GPU partitioning. But what if I told you there's a compelling software alternative that might actually be better for your use case?

Enter FCSP (Flexible Compute Sharing Platform) – a sophisticated software-based GPU virtualization library that provides MIG-like multi-tenant isolation without the hardware constraints. In this article, we'll explore when and why you might choose FCSP over native MIG, backed by real benchmarks and production considerations.


Understanding the Landscape

What is NVIDIA MIG?

Multi-Instance GPU (MIG) is a hardware feature introduced with NVIDIA Ampere architecture (A100) and continued in Hopper (H100) GPUs. It allows a single physical GPU to be partitioned into up to 7 isolated instances, each with dedicated:

  • Memory: Physically isolated memory regions
  • SMs (Streaming Multiprocessors): Dedicated compute units
  • Memory Controllers: Guaranteed memory bandwidth

MIG partitions are "hard" – once created, they're fixed until explicitly reconfigured. Each MIG instance appears as a separate GPU to CUDA applications.

What is FCSP?

FCSP is a software-based GPU virtualization layer that achieves similar multi-tenant isolation through:

  • LD_PRELOAD interception: Transparent CUDA API interception
  • Token-bucket rate limiting: Fine-grained compute time allocation
  • Memory quota enforcement: Hard/soft memory limits per process
  • Work-conserving scheduling: Unused resources flow to active tenants
  • Intelligent prefetching: UVM-based memory oversubscription

The key difference: FCSP operates entirely in software, works on any CUDA-capable GPU, and provides dynamic, flexible resource allocation.


The Problem with MIG

MIG is excellent technology, but it comes with significant limitations:

1. Hardware Requirements

MIG-Supported GPUs:
├── A100 (40GB/80GB)
├── A30
├── H100
└── H200

That's it. No consumer GPUs. No V100. No T4. No RTX series.

If you're running anything other than expensive data center GPUs, MIG simply isn't an option.

2. Static Partitioning

MIG partitions are fixed at creation time:

# Creating MIG instances requires:
# 1. No running CUDA processes
# 2. MIG mode enabled (requires GPU reset)
# 3. Specific, predefined partition profiles

$ nvidia-smi mig -cgi 9,9,9 -C  # Create 3 MIG instances
# GPU must be idle, process disruption required

Want to change the partition sizes? You need to:

  1. Terminate all workloads
  2. Destroy existing instances
  3. Create new instances
  4. Restart workloads

This is disruptive and operationally expensive.

3. Limited Partition Profiles

MIG doesn't allow arbitrary resource splits. You're limited to predefined profiles:

A100-80GB Profile Memory SMs Instances
1g.10gb 10GB 14 Up to 7
2g.20gb 20GB 28 Up to 3
3g.40gb 40GB 42 Up to 2
7g.80gb 80GB 98 1

Can't do 30%/70% splits. Can't do 5 equal instances. Can't dynamically rebalance.

4. Wasted Resources

With MIG's static allocation, idle instances can't share resources:

Scenario: 3 MIG instances, only Tenant A is active

MIG Behavior:
┌─────────────────────────────────────────────────────────┐
│ GPU                                                     │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐              │
│ │ Tenant A  │ │   IDLE    │ │   IDLE    │              │
│ │  (active) │ │ (wasted)  │ │ (wasted)  │              │
│ │   33%     │ │   33%     │ │   33%     │              │
│ └───────────┘ └───────────┘ └───────────┘              │
└─────────────────────────────────────────────────────────┘

Result: Tenant A limited to 33% even though 67% is idle

How FCSP Solves These Problems

1. Universal GPU Support

FCSP works on any CUDA GPU:

FCSP-Supported GPUs:
├── All Data Center: A100, H100, V100, T4, A10, L4, etc.
├── All Professional: RTX A6000, RTX 4000/5000/6000
├── All Consumer: RTX 3080/3090/4080/4090
└── Legacy: GTX 1080 Ti, Titan V, etc.

Minimum: CUDA Compute Capability 3.0+

Your RTX 3080 can now provide MIG-like isolation. Your V100 cluster gets multi-tenancy.

2. Dynamic Resource Allocation

FCSP partitions can be changed at runtime without disrupting workloads:

# Change tenant limits on the fly
$ export BUD_SM_LIMIT=50    # 50% compute
$ export BUD_MEMORY_LIMIT=8G # 8GB memory

# Or per-device
$ export BUD_SM_LIMIT_DEV0=30
$ export BUD_SM_LIMIT_DEV1=70

# Changes take effect immediately for new allocations

No restarts. No service disruption. Instant rebalancing.

3. Arbitrary Resource Splits

Define any percentage you want:

# FCSP supports any configuration
Tenant A: 25% compute, 4GB memory
Tenant B: 35% compute, 6GB memory
Tenant C: 40% compute, 10GB memory

# Or dynamic fair-share
$ export BUD_ISOLATION_MODE=adaptive
# Resources automatically balance based on demand

4. Work-Conserving Scheduling

This is FCSP's killer feature. Idle resources automatically flow to active tenants:

Scenario: 3 tenants, only Tenant A is active

FCSP Behavior:
┌─────────────────────────────────────────────────────────┐
│ GPU                                                     │
│ ┌─────────────────────────────────────────────────────┐│
│ │              Tenant A (active)                      ││
│ │   Floor: 33%  +  Burst: 67% from idle neighbors     ││
│ │              Total: ~100% utilization               ││
│ └─────────────────────────────────────────────────────┘│
│   (Idle tenants' resources borrowed, returned on demand)│
└─────────────────────────────────────────────────────────┘

Result: Tenant A uses full GPU while alone
        Resources reclaimed when others become active

Our benchmarks show this achieves 143.9% efficiency compared to static allocation (>100% because compute+memory heavy workloads complement each other).


FCSP Architecture Deep Dive

Core Components

┌─────────────────────────────────────────────────────────────┐
│                    FCSP Architecture                         │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Application Process                                         │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  CUDA Application (unchanged)                         │   │
│  └──────────────────────────────────────────────────────┘   │
│                          │ LD_PRELOAD                        │
│                          ▼                                   │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                   libvgpu.so                          │   │
│  │  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐     │   │
│  │  │  Memory     │ │  Compute    │ │   Stream    │     │   │
│  │  │  Manager    │ │  Throttler  │ │  Classifier │     │   │
│  │  └─────────────┘ └─────────────┘ └─────────────┘     │   │
│  │  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐     │   │
│  │  │   NCCL      │ │    UVM      │ │   Graph     │     │   │
│  │  │   Hooks     │ │  Prefetch   │ │  Optimizer  │     │   │
│  │  └─────────────┘ └─────────────┘ └─────────────┘     │   │
│  └──────────────────────────────────────────────────────┘   │
│                          │                                   │
│                          ▼                                   │
│  ┌──────────────────────────────────────────────────────┐   │
│  │              Shared Memory Region                     │   │
│  │  • Per-process resource tracking                      │   │
│  │  • Global utilization metrics                         │   │
│  │  • Token bucket state                                 │   │
│  │  • Burst pool management                              │   │
│  └──────────────────────────────────────────────────────┘   │
│                          │                                   │
│                          ▼                                   │
│  ┌──────────────────────────────────────────────────────┐   │
│  │              SM Observer Thread                       │   │
│  │  • NVML polling (5ms interval)                        │   │
│  │  • Fair share calculation                             │   │
│  │  • Idle tenant detection                              │   │
│  │  • Contention monitoring                              │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Isolation Modes

FCSP provides four isolation modes to match your requirements:

1. None (BUD_ISOLATION_MODE=none)

  • No isolation enforcement
  • Useful for single-tenant scenarios
  • Minimal overhead (~40ns per API call)

2. Balanced (BUD_ISOLATION_MODE=balanced)

  • Default mode
  • Fair sharing with burst capability
  • 20% floor + 40% shared pool + 40% burst pool
  • Best for mixed workloads
Resource Pool Distribution (Balanced):
┌────────────────────────────────────────────────────────┐
│ GPU Resources (100%)                                   │
├──────────────┬──────────────────┬─────────────────────┤
│  Floor Pool  │   Shared Pool    │    Burst Pool       │
│    (20%)     │     (40%)        │      (40%)          │
│  Guaranteed  │  Fair-share      │  From idle tenants  │
│  per-tenant  │  among active    │  work-conserving    │
└──────────────┴──────────────────┴─────────────────────┘

3. Strict (BUD_ISOLATION_MODE=strict)

  • Hard quotas, no bursting
  • MIG-like behavior
  • Maximum isolation, minimum efficiency

4. Adaptive (BUD_ISOLATION_MODE=adaptive)

  • Automatically switches based on contention
  • Low contention → relaxed limits (more throughput)
  • High contention → strict limits (better isolation)
  • Best of both worlds

Rate Limiting Implementation

FCSP uses a sophisticated token bucket algorithm:

// Per-stream token buckets (eliminates CAS contention)
typedef struct {
    _Atomic int64_t tokens;      // Current tokens
    int64_t max_tokens;          // Bucket capacity
    int64_t refill_rate;         // Tokens per second
    uint64_t last_update;        // Last refill timestamp
} token_bucket_t;

// Kernel launch cost = grids × blocks × TOKEN_FACTOR(32)
// Example: 256 grids × 256 blocks = 2,097,152 tokens

The system also includes:

  • Batch token consumption: Thread-local caching reduces atomic operations by 8-16x
  • PID controller: Smooth rate limiting with proportional-integral-derivative control
  • Exponential backoff: Graceful throttling (50ns → 10µs)

Memory Management

FCSP provides flexible memory limits:

# Absolute limit
export BUD_MEMORY_LIMIT=8G

# Percentage of GPU memory
export BUD_MEMORY_LIMIT=50%

# Enforcement modes
export BUD_MEMORY_LIMIT_MODE=hard   # Reject allocations
export BUD_MEMORY_LIMIT_MODE=soft   # Warning + callback

# Per-device limits
export BUD_MEMORY_LIMIT_DEV0=4G
export BUD_MEMORY_LIMIT_DEV1=8G

Plus advanced UVM (Unified Virtual Memory) features:

  • Intelligent prefetching: Predict and prefetch based on access patterns
  • Memory pressure monitoring: Early warning at 70%, 85%, 95% thresholds
  • Automatic eviction: LRU, access-aware, or FIFO policies
  • Oversubscription: Use more GPU memory than physically available

Benchmark Comparison

Let's look at real numbers comparing FCSP to native execution and MIG.

Test Environment

  • GPU: NVIDIA RTX 3080 (10GB, 68 SMs)
  • FCSP: Adaptive isolation mode
  • Iterations: 100 per test

Overhead Metrics

Metric Native FCSP MIG (A100)
Kernel Launch Latency ~3 µs 4.9 µs ~3.5 µs
Memory Alloc Latency ~100 µs 704 µs ~100 µs
API Interception 0 ns 40 ns 0 ns
Rate Limiter N/A 1.5 µs N/A

Analysis: FCSP adds ~2µs overhead per kernel launch – negligible for typical GPU workloads where kernels run for milliseconds. Memory allocation is slower due to tracking, but this is amortized over the allocation lifetime.

Isolation Metrics

Metric FCSP (Balanced) FCSP (Adaptive) MIG
Fairness Index 0.996 0.996 1.0
QoS Consistency (CV) 0.07 0.07 <0.05
Noisy Neighbor Impact 11.5% 4.66% ~3%
Cross-Tenant Isolation 81.4% 81.4% ~95%

Analysis: FCSP achieves near-perfect fairness (0.996 out of 1.0). With adaptive isolation, noisy neighbor impact drops to 4.66% – approaching MIG's ~3%. The tradeoff is flexibility: FCSP allows work-conservation, MIG doesn't.

Efficiency Metrics

Metric FCSP MIG
Affinity Complementary Efficiency 143.9% N/A
Work Conservation Benefit Up to 67% 0%
Resource Utilization (3 idle, 1 active) ~100% 33%

Analysis: FCSP's work-conservation is a game-changer. When tenants are idle, active tenants can use the full GPU. MIG wastes 67% of resources in this scenario.


Feature Comparison Matrix

Feature FCSP MIG
Hardware Support
Consumer GPUs (RTX)
V100, T4, A10
A100, H100
Partitioning
Arbitrary splits
Dynamic resizing
No-disruption changes
Work-conservation
Isolation
Memory isolation Software Hardware
Compute isolation Software Hardware
Fault isolation Partial Complete
Error containment Per-process Per-instance
Performance
Zero overhead option
Burst capability
QoS guarantees Soft Hard
Operations
Kubernetes integration Via device plugin Native
Monitoring NVML + custom NVML
Configuration Env vars nvidia-smi

Use Cases: When to Choose FCSP vs MIG

Choose FCSP When:

1. You Don't Have MIG-Capable GPUs

The most obvious case. If you're running:

  • Consumer GPUs (RTX series)
  • Older data center GPUs (V100, T4, P100)
  • Mixed GPU clusters

FCSP is your only option for multi-tenant isolation.

2. You Need Dynamic Resource Allocation

# Kubernetes scenario: Training job needs more resources during backprop
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  containers:
  - name: trainer
    env:
    - name: BUD_SM_LIMIT
      value: "50"  # Start at 50%
    # FCSP allows runtime adjustment:
    # kubectl exec training-job -- export BUD_SM_LIMIT=80

3. Variable Workload Patterns

Perfect for:

  • Development environments: Burst when testing, idle when coding
  • Batch processing: Jobs that come and go
  • Time-sharing: Different users active at different times
Workday Pattern (Work-Conservation):
┌──────────────────────────────────────────────────────────┐
│ 9:00 AM: All 4 tenants active → 25% each                │
│ 12:00 PM: Tenant B at lunch → A,C,D get 33% each        │
│ 2:00 PM: Tenant D in meeting → A,C get 50% each         │
│ 6:00 PM: Only Tenant A working → A gets 100%            │
└──────────────────────────────────────────────────────────┘
MIG would waste 75% at 6 PM. FCSP uses everything.

4. Complementary Workload Scheduling

FCSP's workload affinity feature can co-schedule compute-heavy and memory-heavy workloads for better-than-isolated performance:

Without Affinity:
├── Compute-heavy job: 50% SM utilization, 10% memory BW
├── Memory-heavy job: 10% SM utilization, 50% memory BW
└── Total GPU utilization: ~55% (many resources idle)

With FCSP Affinity:
├── Both jobs run together
├── Total GPU utilization: 60% SM + 60% memory BW
└── Efficiency: 143.9% vs isolated execution

5. LLM Inference at Scale

LLM inference has unique characteristics:

  • Bursty memory allocation (KV cache)
  • Mixed compute patterns (attention vs FFN)
  • Variable batch sizes

FCSP includes LLM-optimized profiles:

# LLM-optimized UVM configuration
export BUD_UVM_PROFILE=llm_inference
# Enables: larger prefetch (64MB), aggressive pattern detection,
# handles bursty KV cache, optimized for attention patterns

Choose MIG When:

1. Maximum Isolation is Critical

If tenant A crashing absolutely cannot affect tenant B:

  • Financial trading systems
  • Medical imaging
  • Safety-critical applications

MIG provides hardware-level fault isolation. FCSP provides process-level isolation.

2. Regulatory Compliance Requires Hardware Isolation

Some compliance frameworks explicitly require hardware-level isolation:

  • HIPAA (healthcare)
  • PCI-DSS (payment)
  • SOC 2 Type II

MIG's hardware partitioning may satisfy auditors where software isolation doesn't.

3. Guaranteed, Predictable Performance

MIG provides hard QoS guarantees:

MIG Instance: 1g.10gb
├── Guaranteed: 14 SMs (always)
├── Guaranteed: 10GB memory (always)
└── No variability, no neighbors

FCSP (even with strict mode):
├── Target: 14 SMs
├── Actual: 13-15 SMs (software scheduling variance)
└── ~0.07 coefficient of variation

If your SLA requires "exactly 14 SMs, never less", use MIG.

4. You Have A100/H100 and Static Workloads

If you have:

  • MIG-capable hardware (A100, H100)
  • Predictable, always-on workloads
  • No need for resource flexibility

MIG is simpler to operate and has zero runtime overhead.


Configuration Guide

Basic FCSP Setup

# 1. Build FCSP
cd bud_fcsp && mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j8

# 2. Run with FCSP
LD_PRELOAD=/path/to/libvgpu.so python train.py

Multi-Tenant Kubernetes Deployment

# ConfigMap for FCSP settings
apiVersion: v1
kind: ConfigMap
metadata:
  name: fcsp-config
data:
  isolation-mode: "adaptive"
  tenant-floor: "20"
  shared-pool: "40"
  burst-pool: "40"

---
# Pod with FCSP isolation
apiVersion: v1
kind: Pod
metadata:
  name: ml-workload
  labels:
    fcsp.io/tenant: "team-a"
spec:
  containers:
  - name: training
    image: pytorch/pytorch:latest
    env:
    - name: LD_PRELOAD
      value: "/opt/fcsp/libvgpu.so"
    - name: BUD_ISOLATION_MODE
      valueFrom:
        configMapKeyRef:
          name: fcsp-config
          key: isolation-mode
    - name: BUD_SM_LIMIT
      value: "50"
    - name: BUD_MEMORY_LIMIT
      value: "8G"
    volumeMounts:
    - name: fcsp-lib
      mountPath: /opt/fcsp
  volumes:
  - name: fcsp-lib
    hostPath:
      path: /usr/local/lib/fcsp

Production Configuration Profiles

Profile 1: Development Cluster

# Maximize flexibility, allow bursting
export BUD_ISOLATION_MODE=adaptive
export BUD_TENANT_FLOOR_PCT=10    # Low guaranteed floor
export BUD_SHARED_POOL_PCT=30     # Moderate shared pool
export BUD_BURST_MAX_PCT=60       # High burst capability
export BUD_ADAPTIVE_SM_MAX_BURST_PCT=100  # Allow full burst

Profile 2: Production ML Serving

# Balance isolation and efficiency
export BUD_ISOLATION_MODE=balanced
export BUD_TENANT_FLOOR_PCT=40    # High guaranteed floor
export BUD_SHARED_POOL_PCT=30     # Moderate shared pool
export BUD_BURST_MAX_PCT=30       # Limited burst
export BUD_COOPERATIVE_MODE=1     # Smooth throttling

Profile 3: High-Isolation Multi-Tenant

# Maximum isolation, MIG-like behavior
export BUD_ISOLATION_MODE=strict
export BUD_TENANT_FLOOR_PCT=100   # All resources guaranteed
export BUD_SHARED_POOL_PCT=0      # No sharing
export BUD_BURST_MAX_PCT=0        # No bursting

Performance Tuning

Reducing Overhead

# 1. Enable fast paths (default in production)
export BUD_FAST_PATH_ENABLED=1

# 2. Increase batch token size for high-throughput
export BATCH_TOKENS_DEFAULT_SIZE=64  # Default: 16

# 3. Disable unnecessary features
export BUD_ENABLE_METRICS=0          # If not monitoring
export BUD_LOG_LEVEL=error           # Reduce logging

# 4. Use SIMD optimizations (auto-detected)
# FCSP automatically uses AVX2/SSE4.2 for slot finding

Improving Isolation

# 1. Use adaptive mode for dynamic workloads
export BUD_ISOLATION_MODE=adaptive

# 2. Increase floor for better guarantees
export BUD_TENANT_FLOOR_PCT=40

# 3. Enable PID controller for smooth throttling
export BUD_USE_PID_CONTROLLER=1
export PID_DEFAULT_KP=0.8
export PID_DEFAULT_KI=0.2

# 4. Reduce observer interval for faster response
export BUD_OBSERVER_INTERVAL_US=2000  # 2ms instead of 5ms

Memory Oversubscription (UVM)

# Enable UVM oversubscription for large models
export BUD_UVM_ENABLED=1
export BUD_UVM_PROFILE=llm_inference

# Or manual configuration
export BUD_UVM_PRESSURE_WARNING_PCT=70
export BUD_UVM_PRESSURE_HIGH_PCT=85
export BUD_UVM_PRESSURE_CRITICAL_PCT=95
export BUD_PREFETCH_ENABLED=1
export BUD_PREFETCH_MAX_SIZE_MB=64

Limitations and Considerations

FCSP Limitations

  1. Software Isolation Only: A malicious or buggy tenant could potentially bypass isolation (MIG has hardware enforcement)

  2. Overhead: ~2µs per kernel launch, ~600µs for memory allocation (negligible for most workloads, but measurable)

  3. No Hardware Error Isolation: GPU errors (ECC, timeouts) affect all tenants

  4. Requires LD_PRELOAD: Application must use dynamically-linked CUDA

  5. Approximated Metrics: SM utilization is polled (5ms interval), not instantaneous

MIG Limitations

  1. Limited Hardware: Only A100, A30, H100, H200

  2. Fixed Profiles: Cannot create arbitrary partition sizes

  3. Static Allocation: Requires workload termination to resize

  4. No Work Conservation: Idle resources are wasted

  5. Reduced Total Performance: Sum of MIG instances < full GPU performance

  6. Complexity: More infrastructure to manage (MIG instances, CUDA MIG handles)


Conclusion

The choice between FCSP and MIG isn't about which is "better" – it's about which fits your requirements:

Requirement Recommendation
No MIG hardware FCSP (only option)
Dynamic workloads FCSP (work-conservation)
Maximum isolation MIG (hardware guarantee)
Compliance requirements MIG (auditable hardware)
Development clusters FCSP (flexibility)
Predictable production MIG (simplicity)
Mixed GPU types FCSP (universal support)
LLM inference FCSP (optimized profiles)

For most practical scenarios, FCSP provides 80-95% of MIG's isolation benefits with significantly more flexibility and universal hardware support. The work-conservation feature alone can improve cluster utilization by 40-67% in realistic multi-tenant scenarios.

Consider FCSP when you value flexibility, efficiency, and broad hardware support. Choose MIG when you need the absolute guarantee of hardware isolation and have compatible GPUs.


Resources

  • FCSP Source: /home/bud/Desktop/hami/bud_fcsp
  • Benchmarks: /home/bud/Desktop/hami/gpu-virt-bench
  • NVIDIA MIG Documentation: NVIDIA MIG User Guide

This article is based on FCSP v1.0 and NVIDIA MIG as of December 2024. Benchmarks were conducted on an NVIDIA RTX 3080 (10GB, 68 SMs) with CUDA 12.0.