Skip to content

Latest commit

 

History

History
348 lines (252 loc) · 9.82 KB

File metadata and controls

348 lines (252 loc) · 9.82 KB

fcvm Performance Guide

This document covers performance characteristics, benchmarks, and tuning for fcvm.

Test Environment

All benchmarks run on AWS c7g.metal (bare-metal ARM64):

Component Specification
CPU 64× Neoverse-V1 (Graviton3)
Architecture aarch64
Memory 128GB
Storage btrfs on NVMe
Kernel 6.18+ (nested virtualization enabled for the nested L2 benchmarks; not required for general fcvm use)
Instance c7g.metal

Quick Reference

Run Benchmarks

make bench                    # Full benchmark suite (~5 minutes)
make bench-quick              # Quick iteration (fewer samples)
make bench-throughput         # Parallel I/O throughput only
make bench-operations         # Single-op latency only
make bench-protocol           # Serialization overhead only

Parallel I/O Throughput

Workload: 256 parallel workers, 1024 files × 4KB each

Parallel Reads

FUSE Readers Time vs Host Speedup vs 1 Reader
Host (direct) 7.9ms 1.0×
1 reader 393ms 49.7× slower 1.0×
2 readers 201ms 25.5× slower 2.0×
4 readers 109ms 13.8× slower 3.6×
8 readers 70ms 8.9× slower 5.6×
16 readers 61ms 7.7× slower 6.4×
32 readers 66ms 8.4× slower 6.0×
64 readers 65ms 8.2× slower 6.0×
128 readers 66ms 8.4× slower 6.0×
256 readers 66ms 8.4× slower 6.0×

Optimal for reads: 16 readers (diminishing returns above)

Parallel Writes (with fsync)

FUSE Readers Time vs Host
Host (direct) 161ms 1.0×
1 reader 3.01s 18.7× slower
4 readers 816ms 5.1× slower
16 readers 293ms 1.8× slower
64 readers 165ms 1.02× (matches host!)
256 readers 162ms 1.01×

Optimal for writes: 64 readers (matches host performance)

Key Finding: Serialization Bottleneck

With only 1 FUSE reader, all requests serialize through a single thread:

  • Reads: 49.7× slower
  • Writes: 18.7× slower

Default is 64 readers which balances read/write performance with memory usage.


Single Operation Latency

Individual FUSE operation overhead (256 readers):

Operation Host FUSE Overhead
getattr 791ns 832ns 1.05×
lookup 784ns 832ns 1.06×
read 4KB 853ns 796ns 0.93× (faster!)
write 4KB 1.0µs 119µs 119×
open+close 1.4µs 98µs 68×
readdir 6.0µs 274µs 46×
create+unlink 11.8µs 300µs 25×

Observations:

  • Cached reads are faster than host due to kernel page cache
  • Metadata ops (getattr, lookup) have ~5% overhead
  • Mutating ops (write, create) have significant overhead due to fsync

Wire Protocol Performance

Serialization overhead for fuse-pipe protocol:

Operation Time Throughput
Serialize lookup request 31ns
Deserialize lookup request 100ns
Serialize attr response 35ns
Serialize read response (4KB) 3.2µs 1.19 GiB/s
Serialize read response (64KB) 50.6µs 1.21 GiB/s
Serialize read response (128KB) 101µs 1.21 GiB/s
Roundtrip wire request 105ns
Roundtrip wire response (4KB) 8.1µs 485 MiB/s

Observation: Serialization overhead is negligible (~1% of total latency).


Nested Virtualization

fcvm supports running VMs inside VMs using ARM64 FEAT_NV2. This creates FUSE-over-FUSE:

  • L1: Host → VM (one FUSE layer)
  • L2: Host → L1 → L2 (two FUSE layers)

L1 vs L2 Benchmark Results

Test: 10MB I/O operations, 100 metadata operations

Metric L1 L2 L2/L1 Ratio
Local Write 5ms 7ms 1.4×
Local Read 2ms 5ms 2.5×
FUSE Write (sync) 81ms 568ms 7.0×
FUSE Write (async) 56ms 165ms 2.9×
FUSE Read 46ms 160ms 3.5×
FUSE stat 1.1ms 2.4ms 2.2×
FUSE small read 1.5ms 7.4ms 4.9×
Memory Used 423MB 221MB

Why Sync Writes Are 7× Slower

Each L2 fsync must propagate synchronously through both FUSE layers:

L2 app calls fsync()
  ↓
L2 FUSE kernel → L2 fuse-pipe client → vsock
  ↓
L1 VolumeServer receives, calls fsync() on its FUSE mount
  ↓ (BLOCKS until complete)
L1 FUSE kernel → L1 fuse-pipe client → vsock
  ↓
Host VolumeServer receives, calls fsync() on btrfs
  ↓ (BLOCKS until disk sync)
Response propagates back through all layers

Breakdown:

Component L1 L2
Async data write (10MB) 56ms 165ms (2.9×)
Fsync overhead (10 ops) 25ms total 403ms total
Per-fsync latency 2.5ms 40ms (16×)

The fsync alone is 16× slower because it blocks through two FUSE layers.

Network Performance (iperf3)

Egress/ingress throughput (3-second tests, various block sizes and parallelism):

Direction Block Size Streams L1 L2 Overhead
Egress (VM→Host) 128K 1 42.4 Gbps 11.0 Gbps 3.9x
128K 4 38.0 Gbps 12.8 Gbps 3.0x
1M 1 43.1 Gbps 9.0 Gbps 4.8x
1M 8 33.1 Gbps 12.3 Gbps 2.7x
Ingress (Host→VM) 128K 1 48.7 Gbps 8.4 Gbps 5.8x
128K 4 44.3 Gbps 8.6 Gbps 5.2x
1M 1 53.4 Gbps 11.7 Gbps 4.6x
1M 8 43.0 Gbps 10.4 Gbps 4.1x
  • L1: 40-53 Gbps, L2: 8-13 Gbps (~4-5x overhead from double NAT)
  • Single stream often outperforms parallel (virtio queue contention)

Optimizing L2 Workloads

  1. Avoid fsync when possible - async writes are only 3× slower, not 7×
  2. Batch operations - amortize fsync cost across many writes
  3. Use local storage - L2's local disk (/tmp) is nearly as fast as L1
  4. Reduce FUSE readers - saves memory at deeper nesting levels
# L2 with reduced readers (saves ~400MB virtual memory)
FCVM_FUSE_READERS=8 fcvm podman run ...

FUSE Tracing

Enable per-operation tracing to diagnose latency issues:

# Trace every 100th request (recommended)
FCVM_FUSE_TRACE_RATE=100 fcvm podman run ...

# Trace all requests (high overhead, debugging only)
FCVM_FUSE_TRACE_RATE=1 fcvm podman run ...

Trace Output

[TRACE         read] total=8940µs srv=159µs | fs=149 | to_srv=33 to_cli=1974
[TRACE        fsync] total=70000µs srv=3000µs | fs=2900 | to_srv=? to_cli=?
Field Meaning
total End-to-end client round-trip (always accurate)
srv Server-side processing time (always accurate)
fs Filesystem operation time
to_srv Network latency client→server (may show ? if clocks differ)
to_cli Network latency server→client (may show ? if clocks differ)

Clone Benchmarks

10-Clone Sequential

Metric Average Range
Snapshot load (UFFD) 9.08ms 8.76-9.56ms
VM resume 0.48ms 0.44-0.56ms
Core VM restore ~9.5ms
Full clone cycle 611ms 587-631ms

Individual clone times: 631, 599, 611, 611, 615, 618, 618, 622, 587, 599ms

10-Clone Parallel

All 10 clones launched simultaneously:

Metric Value
Wall clock time 1.03s
Snapshot load (UFFD) 9-11ms (consistent under load)
Individual clone times 743-1024ms

Core restore is ~10ms regardless of sequential or parallel. Single memory server handles 10 concurrent clones. Bottleneck is network teardown and state cleanup under contention.


Memory Efficiency

UFFD Memory Sharing

Multiple VMs cloned from the same snapshot share memory via kernel page cache:

Scenario Expected RAM Actual RAM
1 VM (512MB) 512MB 512MB
10 clones 5.1GB ~550MB
50 clones 25.6GB ~600MB

Memory is only copied on write (true CoW at page level).

btrfs CoW Disk Snapshots

Disk cloning uses btrfs reflinks:

Method Time Space
Regular copy (cp) ~850ms Full duplicate
Reflink copy ~1.5ms Zero until modified

Configuration Reference

Variable Default Purpose
FCVM_FUSE_READERS 64 Number of FUSE reader threads
FCVM_FUSE_TRACE_RATE 0 Trace every Nth request (0=disabled)

Memory Usage

Memory per FUSE mount ≈ readers × 8MB (thread stack)

64 readers = 512MB virtual (RSS much lower due to lazy allocation)
8 readers = 64MB virtual

Reproducing Benchmarks

1. fuse-pipe Benchmarks

# Full benchmark suite
make bench

# Results printed to stdout, graphs in target/criterion/

2. Nested Virtualization Benchmarks

# Build nested kernel (first time, ~20 min)
fcvm setup --kernel-profile nested --build-kernels

# Run L2 test with benchmarks
make test-root FILTER=nested_l2 STREAM=1

3. FUSE Latency Tracing

# Trace L2 operations
FCVM_FUSE_TRACE_RATE=100 make test-root FILTER=nested_l2 STREAM=1 2>&1 | tee trace.log

# Extract traces
grep "TRACE" trace.log

Summary

Scenario Recommendation
General workloads Use default 64 FUSE readers
Read-heavy 16 readers is optimal
Write-heavy 64+ readers needed
Memory constrained Reduce to 8-16 readers
Nested VMs (L2+) Avoid fsync, use local disk
Debug latency Enable FUSE tracing
Many clones Use UFFD memory sharing
Fast disk copies Use btrfs reflinks

Related Documentation