December 2024
If you've ever tried to share a GPU between multiple users or workloads in a Kubernetes cluster, you've probably heard of NVIDIA's Multi-Instance GPU (MIG) technology. It's the official, hardware-backed solution for GPU partitioning. But what if I told you there's a compelling software alternative that might actually be better for your use case?
Enter FCSP (Flexible Compute Sharing Platform) – a sophisticated software-based GPU virtualization library that provides MIG-like multi-tenant isolation without the hardware constraints. In this article, we'll explore when and why you might choose FCSP over native MIG, backed by real benchmarks and production considerations.
Multi-Instance GPU (MIG) is a hardware feature introduced with NVIDIA Ampere architecture (A100) and continued in Hopper (H100) GPUs. It allows a single physical GPU to be partitioned into up to 7 isolated instances, each with dedicated:
- Memory: Physically isolated memory regions
- SMs (Streaming Multiprocessors): Dedicated compute units
- Memory Controllers: Guaranteed memory bandwidth
MIG partitions are "hard" – once created, they're fixed until explicitly reconfigured. Each MIG instance appears as a separate GPU to CUDA applications.
FCSP is a software-based GPU virtualization layer that achieves similar multi-tenant isolation through:
- LD_PRELOAD interception: Transparent CUDA API interception
- Token-bucket rate limiting: Fine-grained compute time allocation
- Memory quota enforcement: Hard/soft memory limits per process
- Work-conserving scheduling: Unused resources flow to active tenants
- Intelligent prefetching: UVM-based memory oversubscription
The key difference: FCSP operates entirely in software, works on any CUDA-capable GPU, and provides dynamic, flexible resource allocation.
MIG is excellent technology, but it comes with significant limitations:
MIG-Supported GPUs:
├── A100 (40GB/80GB)
├── A30
├── H100
└── H200
That's it. No consumer GPUs. No V100. No T4. No RTX series.
If you're running anything other than expensive data center GPUs, MIG simply isn't an option.
MIG partitions are fixed at creation time:
# Creating MIG instances requires:
# 1. No running CUDA processes
# 2. MIG mode enabled (requires GPU reset)
# 3. Specific, predefined partition profiles
$ sudo nvidia-smi mig -cgi 14,14,14 -C # Create 3 MIG instances (profile ID 14 = 2g.20gb on A100-80GB; only two 3g instances, ID 9, fit on one GPU)
# GPU must be idle, process disruption required

Want to change the partition sizes? You need to:
- Terminate all workloads
- Destroy existing instances
- Create new instances
- Restart workloads
This is disruptive and operationally expensive.
MIG doesn't allow arbitrary resource splits. You're limited to predefined profiles:
| A100-80GB Profile | Memory | SMs | Instances |
|---|---|---|---|
| 1g.10gb | 10GB | 14 | Up to 7 |
| 2g.20gb | 20GB | 28 | Up to 3 |
| 3g.40gb | 40GB | 42 | Up to 2 |
| 7g.80gb | 80GB | 98 | 1 |
Can't do 30%/70% splits. Can't do 5 equal instances. Can't dynamically rebalance.
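To make the constraint concrete, here is a minimal sketch that checks whether a set of profiles fits on one A100-80GB. It uses only the numbers from the table above, plus the assumption that the leading digit of a profile name (1g, 2g, 3g, 7g) is its compute-slice count out of 7. The function names are illustrative, and real MIG placement rules are stricter than this simple budget check.

```python
# Compute slices and memory (GB) per profile, copied from the table above.
A100_80GB_PROFILES = {
    "1g.10gb": (1, 10),
    "2g.20gb": (2, 20),
    "3g.40gb": (3, 40),
    "7g.80gb": (7, 80),
}
SMS_PER_PROFILE = {"1g.10gb": 14, "2g.20gb": 28, "3g.40gb": 42, "7g.80gb": 98}
TOTAL_COMPUTE_SLICES = 7
TOTAL_MEMORY_GB = 80
TOTAL_SMS = 98


def fits_on_gpu(requested):
    """Return True if the requested profiles stay within the slice and memory budget."""
    slices = sum(A100_80GB_PROFILES[name][0] for name in requested)
    memory = sum(A100_80GB_PROFILES[name][1] for name in requested)
    return slices <= TOTAL_COMPUTE_SLICES and memory <= TOTAL_MEMORY_GB


def smallest_profile_for(compute_fraction):
    """Return the smallest listed profile whose SM count covers the requested fraction."""
    needed_sms = compute_fraction * TOTAL_SMS
    for name, sms in sorted(SMS_PER_PROFILE.items(), key=lambda item: item[1]):
        if sms >= needed_sms:
            return name
    raise ValueError("request exceeds the full GPU")


# A 30%/70% split rounds up to 3g.40gb + 7g.80gb, which needs 10 of 7 slices.
split = [smallest_profile_for(0.30), smallest_profile_for(0.70)]
print(split, "fits:", fits_on_gpu(split))
```

The rounding step is where the waste comes from: a tenant that needs 30% of the SMs has to take a 3g.40gb instance, and the remaining tenant cannot get 70% at all.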
With MIG's static allocation, idle instances can't share resources:
Scenario: 3 MIG instances, only Tenant A is active
MIG Behavior:
┌─────────────────────────────────────────────────────────┐
│ GPU │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ Tenant A │ │ IDLE │ │ IDLE │ │
│ │ (active) │ │ (wasted) │ │ (wasted) │ │
│ │ 33% │ │ 33% │ │ 33% │ │
│ └───────────┘ └───────────┘ └───────────┘ │
└─────────────────────────────────────────────────────────┘
Result: Tenant A limited to 33% even though 67% is idle
FCSP works on any CUDA GPU:
FCSP-Supported GPUs:
├── All Data Center: A100, H100, V100, T4, A10, L4, etc.
├── All Professional: RTX A6000, RTX 4000/5000/6000
├── All Consumer: RTX 3080/3090/4080/4090
└── Legacy: GTX 1080 Ti, Titan V, etc.
Minimum: CUDA Compute Capability 3.0+
Your RTX 3080 can now provide MIG-like isolation. Your V100 cluster gets multi-tenancy.
FCSP partitions can be changed at runtime without disrupting workloads:
# Change tenant limits on the fly
$ export BUD_SM_LIMIT=50 # 50% compute
$ export BUD_MEMORY_LIMIT=8G # 8GB memory
# Or per-device
$ export BUD_SM_LIMIT_DEV0=30
$ export BUD_SM_LIMIT_DEV1=70
# Changes take effect immediately for new allocations

No restarts. No service disruption. Instant rebalancing.
Define any percentage you want:
# FCSP supports any configuration
Tenant A: 25% compute, 4GB memory
Tenant B: 35% compute, 6GB memory
Tenant C: 40% compute, 10GB memory
# Or dynamic fair-share
$ export BUD_ISOLATION_MODE=adaptive
# Resources automatically balance based on demand

This is FCSP's killer feature. Idle resources automatically flow to active tenants:
Scenario: 3 tenants, only Tenant A is active
FCSP Behavior:
┌─────────────────────────────────────────────────────────┐
│ GPU │
│ ┌─────────────────────────────────────────────────────┐│
│ │ Tenant A (active) ││
│ │ Floor: 33% + Burst: 67% from idle neighbors ││
│ │ Total: ~100% utilization ││
│ └─────────────────────────────────────────────────────┘│
│ (Idle tenants' resources borrowed, returned on demand)│
└─────────────────────────────────────────────────────────┘
Result: Tenant A uses full GPU while alone
Resources reclaimed when others become active
Our benchmarks show this achieves 143.9% efficiency compared to static allocation. The figure exceeds 100% because compute-heavy and memory-heavy workloads complement each other when they share a GPU.
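The difference between static and work-conserving allocation can be sketched in a few lines. This is a simplified model, not FCSP's actual scheduler: it assumes idle quota is handed to active tenants in proportion to their own quota, and the function names are made up for illustration.

```python
def static_shares(quotas, active):
    """Static partitioning: an active tenant gets its quota, idle quota goes unused."""
    return {tenant: (quota if tenant in active else 0.0)
            for tenant, quota in quotas.items()}


def work_conserving_shares(quotas, active):
    """Hand idle tenants' quota to active tenants, weighted by each active tenant's quota."""
    active_total = sum(quotas[tenant] for tenant in active)
    if active_total == 0:
        return {tenant: 0.0 for tenant in quotas}
    total = sum(quotas.values())
    return {tenant: (total * quota / active_total if tenant in active else 0.0)
            for tenant, quota in quotas.items()}


quotas = {"A": 25.0, "B": 35.0, "C": 40.0}
print(static_shares(quotas, {"A"}))           # A is held to its own quota
print(work_conserving_shares(quotas, {"A"}))  # A may use the whole GPU
```

Recomputing the shares whenever the active set changes models the "returned on demand" behavior: as soon as B becomes active again, A's share shrinks back toward its quota.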
┌─────────────────────────────────────────────────────────────┐
│ FCSP Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ Application Process │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ CUDA Application (unchanged) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ LD_PRELOAD │
│ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ libvgpu.so │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Memory │ │ Compute │ │ Stream │ │ │
│ │ │ Manager │ │ Throttler │ │ Classifier │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ NCCL │ │ UVM │ │ Graph │ │ │
│ │ │ Hooks │ │ Prefetch │ │ Optimizer │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Shared Memory Region │ │
│ │ • Per-process resource tracking │ │
│ │ • Global utilization metrics │ │
│ │ • Token bucket state │ │
│ │ • Burst pool management │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ SM Observer Thread │ │
│ │ • NVML polling (5ms interval) │ │
│ │ • Fair share calculation │ │
│ │ • Idle tenant detection │ │
│ │ • Contention monitoring │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
FCSP provides four isolation modes to match your requirements:
- No isolation enforcement
- Useful for single-tenant scenarios
- Minimal overhead (~40ns per API call)
- Balanced: the default mode
- Fair sharing with burst capability
- 20% floor + 40% shared pool + 40% burst pool
- Best for mixed workloads
Resource Pool Distribution (Balanced):
┌────────────────────────────────────────────────────────┐
│ GPU Resources (100%) │
├──────────────┬──────────────────┬─────────────────────┤
│ Floor Pool │ Shared Pool │ Burst Pool │
│ (20%) │ (40%) │ (40%) │
│ Guaranteed │ Fair-share │ From idle tenants │
│ per-tenant │ among active │ work-conserving │
└──────────────┴──────────────────┴─────────────────────┘
- Strict: hard quotas, no bursting
- MIG-like behavior
- Maximum isolation, minimum efficiency
- Adaptive: automatically switches based on contention
- Low contention → relaxed limits (more throughput)
- High contention → strict limits (better isolation)
- Best of both worlds
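As a rough illustration of the floor/shared/burst split, the sketch below validates a pool configuration and derives per-tenant numbers from it. It rests on assumptions that this article does not confirm: that the floor pool is divided equally among all registered tenants, and that the shared pool is divided equally among the tenants that are currently active. `PoolConfig` and both helper functions are hypothetical names, not FCSP APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolConfig:
    floor_pct: float   # guaranteed pool (assumed: split across all tenants)
    shared_pct: float  # fair-share pool (assumed: split across active tenants)
    burst_pct: float   # headroom that active tenants may borrow

    def __post_init__(self):
        total = self.floor_pct + self.shared_pct + self.burst_pct
        if abs(total - 100) > 1e-9:
            raise ValueError(f"pools must sum to 100%, got {total}")


def guaranteed_share(cfg, n_tenants):
    """Share a tenant keeps even when every other tenant is busy."""
    return cfg.floor_pct / n_tenants


def fair_share(cfg, n_tenants, n_active):
    """Guaranteed floor plus an equal slice of the shared pool."""
    return guaranteed_share(cfg, n_tenants) + cfg.shared_pct / n_active


balanced = PoolConfig(floor_pct=20, shared_pct=40, burst_pct=40)
strict = PoolConfig(floor_pct=100, shared_pct=0, burst_pct=0)
print(guaranteed_share(balanced, 4), fair_share(balanced, 4, 2))
print(guaranteed_share(strict, 4))
```

Under these assumptions the strict configuration degenerates into equal fixed partitions, which matches its description as MIG-like.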
FCSP uses a sophisticated token bucket algorithm:
// Per-stream token buckets (eliminates CAS contention)
#include <stdint.h>

typedef struct {
    _Atomic int64_t tokens;   // Current tokens
    int64_t max_tokens;       // Bucket capacity
    int64_t refill_rate;      // Tokens per second
    uint64_t last_update;     // Last refill timestamp
} token_bucket_t;
// Kernel launch cost = grids × blocks × TOKEN_FACTOR(32)
// Example: 256 grids × 256 blocks × 32 = 2,097,152 tokens

The system also includes:
- Batch token consumption: Thread-local caching reduces atomic operations by 8-16x
- PID controller: Smooth rate limiting with proportional-integral-derivative control
- Exponential backoff: Graceful throttling (50ns → 10µs)
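To show how these pieces fit together, here is a single-threaded Python model of the token bucket with the same four fields as the C struct. It is a sketch for reasoning about the algorithm, not FCSP's implementation: there are no atomics or thread-local batching, timestamps are passed in explicitly, and the doubling factor in the backoff helper is an assumption (the text above only gives the 50 ns and 10 µs endpoints).

```python
TOKEN_FACTOR = 32  # from the cost formula above


def kernel_cost(grids, blocks):
    """Token cost of one kernel launch: grids x blocks x TOKEN_FACTOR."""
    return grids * blocks * TOKEN_FACTOR


class TokenBucket:
    def __init__(self, max_tokens, refill_rate, now=0.0):
        self.tokens = max_tokens        # current tokens, bucket starts full
        self.max_tokens = max_tokens    # bucket capacity
        self.refill_rate = refill_rate  # tokens per second
        self.last_update = now          # time of the last refill, in seconds

    def _refill(self, now):
        elapsed = max(0.0, now - self.last_update)
        self.tokens = min(self.max_tokens, self.tokens + elapsed * self.refill_rate)
        self.last_update = max(self.last_update, now)

    def try_consume(self, cost, now):
        """Take `cost` tokens if available; False tells the caller to back off."""
        self._refill(now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def backoff_delays_ns(start_ns=50, cap_ns=10_000):
    """Yield doubling delays from start_ns, ending at the cap."""
    delay = start_ns
    while delay < cap_ns:
        yield delay
        delay *= 2
    yield cap_ns


bucket = TokenBucket(max_tokens=4_000_000, refill_rate=2_000_000)
cost = kernel_cost(256, 256)
print(cost, bucket.try_consume(cost, now=0.0), bucket.try_consume(cost, now=0.0))
```

A launch that fails `try_consume` would sleep for the next delay from `backoff_delays_ns()` and retry, which is what turns a hard rejection into gradual throttling.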
FCSP provides flexible memory limits:
# Absolute limit
export BUD_MEMORY_LIMIT=8G
# Percentage of GPU memory
export BUD_MEMORY_LIMIT=50%
# Enforcement modes
export BUD_MEMORY_LIMIT_MODE=hard # Reject allocations
export BUD_MEMORY_LIMIT_MODE=soft # Warning + callback
# Per-device limits
export BUD_MEMORY_LIMIT_DEV0=4G
export BUD_MEMORY_LIMIT_DEV1=8G

Plus advanced UVM (Unified Virtual Memory) features:
- Intelligent prefetching: Predict and prefetch based on access patterns
- Memory pressure monitoring: Early warning at 70%, 85%, 95% thresholds
- Automatic eviction: LRU, access-aware, or FIFO policies
- Oversubscription: Use more GPU memory than physically available
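The limit formats and enforcement modes above boil down to a small amount of bookkeeping. The sketch below is one way to model it. The helper names are hypothetical, the `G` suffix is assumed to mean GiB, and the real library tracks allocations inside intercepted CUDA calls rather than in Python.

```python
GIB = 1024 ** 3
UNITS = {"K": 1024, "M": 1024 ** 2, "G": GIB}  # assumption: binary units


def parse_limit(spec, total_bytes):
    """Convert a limit such as "8G" or "50%" into bytes."""
    spec = spec.strip()
    if spec.endswith("%"):
        return int(total_bytes * float(spec[:-1]) / 100)
    suffix = spec[-1].upper()
    if suffix in UNITS:
        return int(float(spec[:-1]) * UNITS[suffix])
    return int(spec)  # plain byte count


def check_allocation(used, request, limit, mode="hard"):
    """Return "ok", or "reject" (hard mode) / "warn" (soft mode) when over the limit."""
    if used + request <= limit:
        return "ok"
    return "reject" if mode == "hard" else "warn"


def pressure_level(used, limit):
    """Map usage onto the 70% / 85% / 95% thresholds listed above."""
    pct = 100 * used / limit
    if pct >= 95:
        return "critical"
    if pct >= 85:
        return "high"
    if pct >= 70:
        return "warning"
    return "normal"


limit = parse_limit("8G", total_bytes=10 * GIB)
print(check_allocation(used=7 * GIB, request=2 * GIB, limit=limit, mode="hard"))
print(pressure_level(used=7 * GIB, limit=limit))
```

In soft mode the same over-limit request is reported as a warning and the allocation proceeds, which is the behavior described for `BUD_MEMORY_LIMIT_MODE=soft`.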
Let's look at real numbers comparing FCSP to native execution and MIG.
- GPU: NVIDIA RTX 3080 (10GB, 68 SMs)
- FCSP: Adaptive isolation mode
- Iterations: 100 per test
| Metric | Native | FCSP | MIG (A100) |
|---|---|---|---|
| Kernel Launch Latency | ~3 µs | 4.9 µs | ~3.5 µs |
| Memory Alloc Latency | ~100 µs | 704 µs | ~100 µs |
| API Interception | 0 ns | 40 ns | 0 ns |
| Rate Limiter | N/A | 1.5 µs | N/A |
Analysis: FCSP adds ~2µs overhead per kernel launch – negligible for typical GPU workloads where kernels run for milliseconds. Memory allocation is slower due to tracking, but this is amortized over the allocation lifetime.
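Whether that overhead matters depends on how long your kernels run and how long your buffers live. The following back-of-the-envelope calculation uses the latencies from the table above; the kernel runtimes in the loop are arbitrary examples, not measurements.

```python
NATIVE_LAUNCH_US = 3.0   # from the table above
FCSP_LAUNCH_US = 4.9
NATIVE_ALLOC_US = 100.0
FCSP_ALLOC_US = 704.0


def launch_overhead_pct(kernel_runtime_us):
    """Added launch latency as a percentage of the kernel's runtime."""
    added = FCSP_LAUNCH_US - NATIVE_LAUNCH_US
    return 100 * added / kernel_runtime_us


def alloc_overhead_pct(buffer_lifetime_us):
    """Added allocation latency as a percentage of the buffer's lifetime."""
    added = FCSP_ALLOC_US - NATIVE_ALLOC_US
    return 100 * added / buffer_lifetime_us


for runtime_us in (10, 100, 1_000, 10_000):
    print(f"{runtime_us:>6} µs kernel: +{launch_overhead_pct(runtime_us):.2f}%")
```

For a 1 ms kernel the added launch latency is about 0.19%, but for a 10 µs kernel it is about 19%, so workloads dominated by very short kernels will notice it.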
| Metric | FCSP (Balanced) | FCSP (Adaptive) | MIG |
|---|---|---|---|
| Fairness Index | 0.996 | 0.996 | 1.0 |
| QoS Consistency (CV) | 0.07 | 0.07 | <0.05 |
| Noisy Neighbor Impact | 11.5% | 4.66% | ~3% |
| Cross-Tenant Isolation | 81.4% | 81.4% | ~95% |
Analysis: FCSP achieves near-perfect fairness (0.996 out of 1.0). With adaptive isolation, noisy neighbor impact drops to 4.66% – approaching MIG's ~3%. The tradeoff is flexibility: FCSP allows work-conservation, MIG doesn't.
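If you want to compute comparable numbers for your own cluster, the sketch below shows two standard definitions. The exact formulas behind the table are not spelled out in this article, so treat this as an assumption: fairness is taken to be Jain's index over per-tenant throughput, and QoS consistency the coefficient of variation of repeated measurements.

```python
from statistics import mean, pstdev


def jain_fairness(throughputs):
    """Jain's index: (sum x)^2 / (n * sum x^2); 1.0 means perfectly equal shares."""
    n = len(throughputs)
    total = sum(throughputs)
    squares = sum(x * x for x in throughputs)
    return (total * total) / (n * squares)


def coefficient_of_variation(samples):
    """Population standard deviation divided by the mean."""
    return pstdev(samples) / mean(samples)


# Hypothetical per-tenant throughputs and per-run latencies, for illustration only.
print(round(jain_fairness([98.0, 100.0, 102.0, 100.0]), 4))
print(round(coefficient_of_variation([9.0, 10.0, 11.0]), 4))
```

Jain's index falls to 1/n when one tenant gets everything, so values close to 1.0 indicate that no tenant is being starved.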
| Metric | FCSP | MIG |
|---|---|---|
| Affinity Complementary Efficiency | 143.9% | N/A |
| Work Conservation Benefit | Up to 67% | 0% |
| Resource Utilization (3 tenants, 1 active) | ~100% | 33% |
Analysis: FCSP's work-conservation is a game-changer. When tenants are idle, active tenants can use the full GPU. MIG wastes 67% of resources in this scenario.
| Feature | FCSP | MIG |
|---|---|---|
| Hardware Support | ||
| Consumer GPUs (RTX) | ✅ | ❌ |
| V100, T4, A10 | ✅ | ❌ |
| A100, H100 | ✅ | ✅ |
| Partitioning | ||
| Arbitrary splits | ✅ | ❌ |
| Dynamic resizing | ✅ | ❌ |
| No-disruption changes | ✅ | ❌ |
| Work-conservation | ✅ | ❌ |
| Isolation | ||
| Memory isolation | Software | Hardware |
| Compute isolation | Software | Hardware |
| Fault isolation | Partial | Complete |
| Error containment | Per-process | Per-instance |
| Performance | ||
| Zero overhead option | ✅ | ✅ |
| Burst capability | ✅ | ❌ |
| QoS guarantees | Soft | Hard |
| Operations | ||
| Kubernetes integration | Via device plugin | Native |
| Monitoring | NVML + custom | NVML |
| Configuration | Env vars | nvidia-smi |
The most obvious case. If you're running:
- Consumer GPUs (RTX series)
- Older data center GPUs (V100, T4, P100)
- Mixed GPU clusters
FCSP is your only option for multi-tenant isolation.
# Kubernetes scenario: Training job needs more resources during backprop
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  containers:
  - name: trainer
    env:
    - name: BUD_SM_LIMIT
      value: "50" # Start at 50%
# FCSP allows runtime adjustment:
# kubectl exec training-job -- export BUD_SM_LIMIT=80

Perfect for:
- Development environments: Burst when testing, idle when coding
- Batch processing: Jobs that come and go
- Time-sharing: Different users active at different times
Workday Pattern (Work-Conservation):
┌──────────────────────────────────────────────────────────┐
│ 9:00 AM: All 4 tenants active → 25% each │
│ 12:00 PM: Tenant B at lunch → A,C,D get 33% each │
│ 2:00 PM: Tenant D in meeting → A,C get 50% each │
│ 6:00 PM: Only Tenant A working → A gets 100% │
└──────────────────────────────────────────────────────────┘
MIG would waste 75% at 6 PM. FCSP uses everything.
FCSP's workload affinity feature can co-schedule compute-heavy and memory-heavy workloads for better-than-isolated performance:
Without Affinity:
├── Compute-heavy job: 50% SM utilization, 10% memory BW
├── Memory-heavy job: 10% SM utilization, 50% memory BW
└── Total GPU utilization: ~55% (many resources idle)
With FCSP Affinity:
├── Both jobs run together
├── Total GPU utilization: 60% SM + 60% memory BW
└── Efficiency: 143.9% vs isolated execution
LLM inference has unique characteristics:
- Bursty memory allocation (KV cache)
- Mixed compute patterns (attention vs FFN)
- Variable batch sizes
FCSP includes LLM-optimized profiles:
# LLM-optimized UVM configuration
export BUD_UVM_PROFILE=llm_inference
# Enables: larger prefetch (64MB), aggressive pattern detection,
# handles bursty KV cache, optimized for attention patterns

If tenant A crashing absolutely cannot affect tenant B:
- Financial trading systems
- Medical imaging
- Safety-critical applications
MIG provides hardware-level fault isolation. FCSP provides process-level isolation.
Some compliance regimes, or the auditors who assess against them, may expect hardware-level isolation between tenants:
- HIPAA (healthcare)
- PCI-DSS (payment)
- SOC 2 Type II
MIG's hardware partitioning may satisfy auditors where software isolation doesn't.
MIG provides hard QoS guarantees:
MIG Instance: 1g.10gb
├── Guaranteed: 14 SMs (always)
├── Guaranteed: 10GB memory (always)
└── No variability, no neighbors
FCSP (even with strict mode):
├── Target: 14 SMs
├── Actual: 13-15 SMs (software scheduling variance)
└── ~0.07 coefficient of variation
If your SLA requires "exactly 14 SMs, never less", use MIG.
If you have:
- MIG-capable hardware (A100, H100)
- Predictable, always-on workloads
- No need for resource flexibility
MIG is simpler to operate and has zero runtime overhead.
# 1. Build FCSP
cd bud_fcsp && mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j8
# 2. Run with FCSP
LD_PRELOAD=/path/to/libvgpu.so python train.py

# ConfigMap for FCSP settings
apiVersion: v1
kind: ConfigMap
metadata:
  name: fcsp-config
data:
  isolation-mode: "adaptive"
  tenant-floor: "20"
  shared-pool: "40"
  burst-pool: "40"
---
# Pod with FCSP isolation
apiVersion: v1
kind: Pod
metadata:
  name: ml-workload
  labels:
    fcsp.io/tenant: "team-a"
spec:
  containers:
  - name: training
    image: pytorch/pytorch:latest
    env:
    - name: LD_PRELOAD
      value: "/opt/fcsp/libvgpu.so"
    - name: BUD_ISOLATION_MODE
      valueFrom:
        configMapKeyRef:
          name: fcsp-config
          key: isolation-mode
    - name: BUD_SM_LIMIT
      value: "50"
    - name: BUD_MEMORY_LIMIT
      value: "8G"
    volumeMounts:
    - name: fcsp-lib
      mountPath: /opt/fcsp
  volumes:
  - name: fcsp-lib
    hostPath:
      path: /usr/local/lib/fcsp

# Maximize flexibility, allow bursting
export BUD_ISOLATION_MODE=adaptive
export BUD_TENANT_FLOOR_PCT=10 # Low guaranteed floor
export BUD_SHARED_POOL_PCT=30 # Moderate shared pool
export BUD_BURST_MAX_PCT=60 # High burst capability
export BUD_ADAPTIVE_SM_MAX_BURST_PCT=100 # Allow full burst

# Balance isolation and efficiency
export BUD_ISOLATION_MODE=balanced
export BUD_TENANT_FLOOR_PCT=40 # High guaranteed floor
export BUD_SHARED_POOL_PCT=30 # Moderate shared pool
export BUD_BURST_MAX_PCT=30 # Limited burst
export BUD_COOPERATIVE_MODE=1 # Smooth throttling

# Maximum isolation, MIG-like behavior
export BUD_ISOLATION_MODE=strict
export BUD_TENANT_FLOOR_PCT=100 # All resources guaranteed
export BUD_SHARED_POOL_PCT=0 # No sharing
export BUD_BURST_MAX_PCT=0 # No bursting

# 1. Enable fast paths (default in production)
export BUD_FAST_PATH_ENABLED=1
# 2. Increase batch token size for high-throughput
export BATCH_TOKENS_DEFAULT_SIZE=64 # Default: 16
# 3. Disable unnecessary features
export BUD_ENABLE_METRICS=0 # If not monitoring
export BUD_LOG_LEVEL=error # Reduce logging
# 4. Use SIMD optimizations (auto-detected)
# FCSP automatically uses AVX2/SSE4.2 for slot finding

# 1. Use adaptive mode for dynamic workloads
export BUD_ISOLATION_MODE=adaptive
# 2. Increase floor for better guarantees
export BUD_TENANT_FLOOR_PCT=40
# 3. Enable PID controller for smooth throttling
export BUD_USE_PID_CONTROLLER=1
export PID_DEFAULT_KP=0.8
export PID_DEFAULT_KI=0.2
# 4. Reduce observer interval for faster response
export BUD_OBSERVER_INTERVAL_US=2000 # 2ms instead of 5ms

# Enable UVM oversubscription for large models
export BUD_UVM_ENABLED=1
export BUD_UVM_PROFILE=llm_inference
# Or manual configuration
export BUD_UVM_PRESSURE_WARNING_PCT=70
export BUD_UVM_PRESSURE_HIGH_PCT=85
export BUD_UVM_PRESSURE_CRITICAL_PCT=95
export BUD_PREFETCH_ENABLED=1
export BUD_PREFETCH_MAX_SIZE_MB=64

- Software Isolation Only: A malicious or buggy tenant could potentially bypass isolation (MIG has hardware enforcement)
- Overhead: ~2µs per kernel launch, ~600µs for memory allocation (negligible for most workloads, but measurable)
- No Hardware Error Isolation: GPU errors (ECC, timeouts) affect all tenants
- Requires LD_PRELOAD: Application must use dynamically-linked CUDA
- Approximated Metrics: SM utilization is polled (5ms interval), not instantaneous

- Limited Hardware: Only A100, A30, H100, H200
- Fixed Profiles: Cannot create arbitrary partition sizes
- Static Allocation: Requires workload termination to resize
- No Work Conservation: Idle resources are wasted
- Reduced Total Performance: Sum of MIG instances < full GPU performance
- Complexity: More infrastructure to manage (MIG instances, CUDA MIG handles)
The choice between FCSP and MIG isn't about which is "better" – it's about which fits your requirements:
| Requirement | Recommendation |
|---|---|
| No MIG hardware | FCSP (only option) |
| Dynamic workloads | FCSP (work-conservation) |
| Maximum isolation | MIG (hardware guarantee) |
| Compliance requirements | MIG (auditable hardware) |
| Development clusters | FCSP (flexibility) |
| Predictable production | MIG (simplicity) |
| Mixed GPU types | FCSP (universal support) |
| LLM inference | FCSP (optimized profiles) |
For most practical scenarios, FCSP provides 80-95% of MIG's isolation benefits with significantly more flexibility and universal hardware support. The work-conservation feature alone can improve cluster utilization by 40-67% in realistic multi-tenant scenarios.
Consider FCSP when you value flexibility, efficiency, and broad hardware support. Choose MIG when you need the absolute guarantee of hardware isolation and have compatible GPUs.
- FCSP Source: /home/bud/Desktop/hami/bud_fcsp
- Benchmarks: /home/bud/Desktop/hami/gpu-virt-bench
- NVIDIA MIG Documentation: NVIDIA MIG User Guide
This article is based on FCSP v1.0 and NVIDIA MIG as of December 2024. Benchmarks were conducted on an NVIDIA RTX 3080 (10GB, 68 SMs) with CUDA 12.0.