Focus 2: Performance Engineering #32

@rainliu

Focus 2: Performance Engineering

Performance is not an afterthought—it's a core requirement for real-time communication. This focus area encompasses systematic benchmarking, profiling, and optimization across the entire stack.

Benchmarking Infrastructure

Before optimizing, we need to measure. A comprehensive benchmarking infrastructure is essential.

Planned benchmark suite:

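// Using criterion for statistical rigor
use criterion::{criterion_group, criterion_main, BatchSize, BenchmarkId, Criterion, Throughput};
use std::hint::black_box;

// NOTE: `setup_datachannel`, `sample_rtp_bytes`, `sample_rtp_packet`,
// `sample_encrypted_packet`, and `setup_srtp_context` are placeholder
// fixture helpers for this plan, not existing APIs.

fn bench_datachannel_throughput(c: &mut Criterion) {
    let dc = setup_datachannel();
    let message = vec![0u8; 65536]; // large enough for the biggest case below

    let mut group = c.benchmark_group("datachannel");

    for size in [64usize, 1024, 16384, 65536] {
        // Report bytes/sec alongside time per iteration
        group.throughput(Throughput::Bytes(size as u64));
        group.bench_with_input(
            BenchmarkId::new("send", size),
            &size,
            |b, &size| {
                b.iter(|| {
                    dc.send(black_box(&message[..size]))
                });
            },
        );
    }
    group.finish();
}

fn bench_rtp_pipeline(c: &mut Criterion) {
    let packet_bytes = sample_rtp_bytes();
    let packet = sample_rtp_packet();
    let encrypted = sample_encrypted_packet();
    let mut buffer = vec![0u8; 1500];
    let mut context = setup_srtp_context();

    c.bench_function("rtp_parse", |b| {
        b.iter(|| RtpPacket::unmarshal(black_box(&packet_bytes)))
    });

    c.bench_function("rtp_marshal", |b| {
        b.iter(|| packet.marshal_to(&mut buffer))
    });

    // Clone a fresh input per iteration so ciphertext is never re-encrypted
    // and plaintext is never fed to decrypt. Replay protection may still
    // reject repeated sequence numbers and needs handling in the fixture.
    c.bench_function("srtp_encrypt", |b| {
        b.iter_batched(|| packet.clone(), |mut p| context.encrypt_rtp(&mut p), BatchSize::SmallInput)
    });

    c.bench_function("srtp_decrypt", |b| {
        b.iter_batched(|| encrypted.clone(), |mut p| context.decrypt_rtp(&mut p), BatchSize::SmallInput)
    });
}

criterion_group!(benches, bench_datachannel_throughput, bench_rtp_pipeline);
criterion_main!(benches);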

Benchmark categories:

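| Category   | Metrics                 | Tools                    |
|------------|-------------------------|--------------------------|
| Throughput | Messages/sec, Bytes/sec | criterion, custom        |
| Latency    | p50, p99, p999          | criterion, hdr_histogram |
| Memory     | Allocations, peak usage | dhat, heaptrack          |
| CPU        | Cycles per operation    | perf, flamegraph         |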
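
The latency row reports tail percentiles instead of means. As a rough illustration of what p50, p99, and p999 mean, here is a minimal nearest-rank sketch using only the standard library. The `percentile` function and its per-mille argument are assumptions made for this example; a real suite would more likely record into hdr_histogram as listed in the table.

```rust
use std::time::Duration;

/// Nearest-rank percentile over recorded latency samples.
/// `permille` is the percentile in parts per thousand
/// (500 = p50, 990 = p99, 999 = p999). Hypothetical helper.
fn percentile(samples: &mut [Duration], permille: u32) -> Option<Duration> {
    if samples.is_empty() || permille == 0 || permille > 1000 {
        return None;
    }
    samples.sort_unstable();
    // Smallest rank such that at least permille/1000 of samples are <= it.
    // Integer arithmetic avoids floating-point rounding at the boundaries.
    let rank = (permille as usize * samples.len() + 999) / 1000;
    Some(samples[rank - 1])
}
```

With 1000 samples, p999 is determined by the second-slowest measurement, which is why tail percentiles need far more samples than a mean does.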

Profiling and Analysis

Profiling workflow:

┌─────────────────────────────────────────────────────────────────────────────┐
│                         Performance Analysis Workflow                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   1. Baseline         2. Profile           3. Analyze                       │
│   ┌─────────┐        ┌─────────┐          ┌─────────┐                       │
│   │ Run     │───────▶│ Collect │─────────▶│ Generate│                       │
│   │ Bench   │        │ Samples │          │ Reports │                       │
│   └─────────┘        └─────────┘          └─────────┘                       │
│        │                  │                    │                            │
│        ▼                  ▼                    ▼                            │
│   criterion          perf record          flamegraph                        │
│   results            + perf script        + hotspot analysis                │
│                                                                             │
│   4. Optimize         5. Validate          6. Document                      │
│   ┌─────────┐        ┌─────────┐          ┌─────────┐                       │
│   │ Apply   │───────▶│ Re-run  │─────────▶│ Record  │                       │
│   │ Changes │        │ Bench   │          │ Gains   │                       │
│   └─────────┘        └─────────┘          └─────────┘                       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Profiling tools:

  • perf — Linux performance counters, CPU profiling
  • flamegraph — Visualize hot code paths
  • heaptrack — Memory allocation profiling
  • cargo-llvm-lines — Generic code bloat analysis
  • valgrind/cachegrind — Cache behavior analysis

DataChannel Optimization

WebRTC DataChannels are increasingly used for high-throughput applications. Optimization targets:

SCTP layer:

| Optimization       | Description                                  | Expected Impact             |
|--------------------|----------------------------------------------|-----------------------------|
| Chunk batching     | Combine small messages into fewer SCTP chunks | Reduce overhead 20-40%      |
| Zero-copy I/O      | Avoid buffer copies in send/receive path     | Reduce CPU usage            |
| TSN tracking       | Optimize sequence number management          | Reduce memory allocations   |
| Congestion control | Tune SCTP congestion parameters              | Improve throughput stability |
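
To make the batching row concrete, here is a simplified sketch of greedy bundling: several small DATA chunks share one packet and one common header instead of each paying for its own. The header sizes follow the SCTP wire format (12-byte common header, 16-byte DATA chunk header, 4-byte padding), but `bundle`, `chunk_size`, and the packet-size parameter are hypothetical names for this example and do not reflect the existing SCTP implementation.

```rust
const SCTP_COMMON_HEADER: usize = 12; // paid once per packet
const DATA_CHUNK_HEADER: usize = 16; // paid once per DATA chunk

/// On-wire size of one DATA chunk carrying `payload_len` bytes.
fn chunk_size(payload_len: usize) -> usize {
    let unpadded = DATA_CHUNK_HEADER + payload_len;
    (unpadded + 3) & !3 // chunks are padded to a 4-byte boundary
}

/// Greedily bundle message payloads into packets of at most `max_packet`
/// bytes. Returns the message indices placed in each packet. A message too
/// large for one packet gets a packet to itself (fragmentation is out of
/// scope for this sketch).
fn bundle(payload_lens: &[usize], max_packet: usize) -> Vec<Vec<usize>> {
    let mut packets: Vec<Vec<usize>> = Vec::new();
    let mut current: Vec<usize> = Vec::new();
    let mut used = SCTP_COMMON_HEADER;

    for (i, &len) in payload_lens.iter().enumerate() {
        let need = chunk_size(len);
        // Flush the current packet if this chunk would not fit.
        if !current.is_empty() && used + need > max_packet {
            packets.push(std::mem::take(&mut current));
            used = SCTP_COMMON_HEADER;
        }
        current.push(i);
        used += need;
    }
    if !current.is_empty() {
        packets.push(current);
    }
    packets
}
```

Ten 100-byte messages fit in a single 1200-byte packet this way, so the per-packet costs (common header, UDP/DTLS framing, one syscall) are paid once instead of ten times.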

Application layer:

  • Message framing optimization
  • Backpressure handling
  • Buffer pool for allocations
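
As one possible shape for backpressure handling, the sketch below pauses senders once queued bytes reach a high-water mark and resumes them only after the queue drains to a low-water mark, which avoids rapid pause/resume flapping. `SendQueue` and its thresholds are assumptions for illustration, not an existing type.

```rust
use std::collections::VecDeque;

/// Send queue with high/low water marks (hypothetical type).
struct SendQueue {
    queue: VecDeque<Vec<u8>>,
    buffered: usize, // bytes currently queued
    high_water: usize,
    low_water: usize,
    paused: bool,
}

impl SendQueue {
    fn new(high_water: usize, low_water: usize) -> Self {
        Self { queue: VecDeque::new(), buffered: 0, high_water, low_water, paused: false }
    }

    /// Enqueue a message, or hand it back to the caller while paused.
    fn try_send(&mut self, msg: Vec<u8>) -> Result<(), Vec<u8>> {
        if self.paused {
            return Err(msg);
        }
        self.buffered += msg.len();
        self.queue.push_back(msg);
        if self.buffered >= self.high_water {
            self.paused = true;
        }
        Ok(())
    }

    /// Called when the transport is ready for the next message.
    fn pop(&mut self) -> Option<Vec<u8>> {
        let msg = self.queue.pop_front()?;
        self.buffered -= msg.len();
        // Resume only once the queue has drained to the low-water mark.
        if self.paused && self.buffered <= self.low_water {
            self.paused = false;
        }
        Some(msg)
    }
}
```

Returning the rejected message in `Err` lets the caller retry later without cloning the payload.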

Performance targets:

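| Metric                         | Baseline | Target     | Notes          |
|--------------------------------|----------|------------|----------------|
| Throughput (reliable, ordered) | TBD      | > 500 Mbps | Single channel |
| Throughput (unreliable)        | TBD      | > 1 Gbps   | Best-effort    |
| Latency (1KB message)          | TBD      | < 1 ms     | p99            |
| Messages/second                | TBD      | > 100K     | Small messages |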
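
These targets translate into per-message time budgets, which is a useful sanity check when reading benchmark output. The helpers below are a small worked calculation; they assume 1 Mbps = 1,000,000 bits/s, a 1 KB message of 1024 bytes, and a single sending thread.

```rust
/// Messages per second needed to sustain `mbps` with `msg_bytes` payloads.
fn messages_per_second(mbps: f64, msg_bytes: usize) -> f64 {
    (mbps * 1_000_000.0) / (msg_bytes as f64 * 8.0)
}

/// Time available per message, in nanoseconds, on one sending thread.
fn budget_ns(msgs_per_sec: f64) -> f64 {
    1_000_000_000.0 / msgs_per_sec
}
```

Under those assumptions, 500 Mbps with 1024-byte messages needs about 61,000 messages/sec, or roughly 16.4 µs per message end to end. The 100K messages/sec target leaves 10 µs per message.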

RTP/RTCP Pipeline Optimization

Media transport is latency-sensitive and high-volume.

Packet processing:

Incoming RTP Packet
        │
        ▼
┌───────────────┐
│ UDP Receive   │ ← Goal: zero-copy receive
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ SRTP Decrypt  │ ← Goal: hardware AES-NI
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ RTP Parse     │ ← Goal: minimal validation
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ Interceptors  │ ← Goal: inline, no allocations
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ Jitter Buffer │ ← Goal: lock-free, pre-allocated
└───────┬───────┘
        │
        ▼
    Application

Specific optimizations:

  • SIMD parsing — Use SIMD instructions for header parsing where beneficial
  • AES-NI — Ensure hardware acceleration for SRTP
  • Inline interceptors — Compile-time interceptor composition (already implemented via generics)
  • Pre-allocated buffers — Avoid per-packet allocations
  • Branch prediction — Optimize common code paths
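
To illustrate the "minimal validation" and "no per-packet allocation" goals together, here is a sketch of an RTP parser that returns a borrowed view over the receive buffer. It checks only what is needed to locate the payload (version, header length, extension, padding) per the RFC 3550 header layout. `RtpView` and `parse_rtp` are hypothetical names and are not the existing `RtpPacket::unmarshal`.

```rust
/// Borrowed view of an RTP packet; parsing allocates nothing.
#[derive(Debug, PartialEq)]
struct RtpView<'a> {
    marker: bool,
    payload_type: u8,
    sequence_number: u16,
    timestamp: u32,
    ssrc: u32,
    payload: &'a [u8],
}

fn parse_rtp(buf: &[u8]) -> Option<RtpView<'_>> {
    // Fixed header is 12 bytes; version (top two bits) must be 2.
    if buf.len() < 12 || buf[0] >> 6 != 2 {
        return None;
    }
    let has_padding = buf[0] & 0x20 != 0;
    let has_extension = buf[0] & 0x10 != 0;
    let csrc_count = (buf[0] & 0x0f) as usize;

    // Skip the CSRC list (4 bytes per entry).
    let mut offset = 12 + 4 * csrc_count;
    if has_extension {
        // Extension header: 16-bit profile id, then length in 32-bit words.
        let ext = buf.get(offset..offset + 4)?;
        let words = u16::from_be_bytes([ext[2], ext[3]]) as usize;
        offset += 4 + 4 * words;
    }
    let mut end = buf.len();
    if has_padding {
        // The last byte holds the padding length, including itself.
        let pad = *buf.last()? as usize;
        if pad == 0 || offset + pad > end {
            return None;
        }
        end -= pad;
    }
    Some(RtpView {
        marker: buf[1] & 0x80 != 0,
        payload_type: buf[1] & 0x7f,
        sequence_number: u16::from_be_bytes([buf[2], buf[3]]),
        timestamp: u32::from_be_bytes([buf[4], buf[5], buf[6], buf[7]]),
        ssrc: u32::from_be_bytes([buf[8], buf[9], buf[10], buf[11]]),
        // `get` returns None if the computed offsets overrun the buffer.
        payload: buf.get(offset..end)?,
    })
}
```

Because the payload is a slice into the caller's buffer, the same pre-allocated receive buffer can be reused for the next packet once the view is dropped.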

ICE Performance

Connection establishment time directly impacts user experience.

Optimization areas:

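| Phase               | Current | Target   | Approach                 |
|---------------------|---------|----------|--------------------------|
| Candidate gathering | TBD     | < 100 ms | Parallel STUN queries    |
| Connectivity checks | TBD     | < 500 ms | Prioritized pair testing |
| DTLS handshake      | TBD     | < 200 ms | Session resumption       |
| Total time-to-media | TBD     | < 1 s    | Combined optimizations   |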

Techniques:

  • Aggressive candidate nomination
  • Parallel connectivity checks
  • STUN response caching
  • Optimized candidate pair sorting
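
For reference when working on pair sorting, the ordering key is the pair priority defined in RFC 8445. The sketch below computes candidate and pair priorities with those formulas and sorts a check list in descending order. The function names are assumptions for this example and do not mirror the current ICE agent code.

```rust
/// Candidate priority per RFC 8445:
/// 2^24 * type preference + 2^8 * local preference + (256 - component ID).
/// Type preference is at most 126, so the result fits in a u32.
fn candidate_priority(type_pref: u32, local_pref: u32, component_id: u32) -> u32 {
    (type_pref << 24) + (local_pref << 8) + (256 - component_id)
}

/// Pair priority per RFC 8445:
/// 2^32 * MIN(G, D) + 2 * MAX(G, D) + (G > D ? 1 : 0),
/// where G is the controlling agent's candidate priority and D the
/// controlled agent's. Assumes priorities below 2^31, as the candidate
/// formula produces, so the sum cannot overflow a u64.
fn pair_priority(g: u32, d: u32) -> u64 {
    let min = u64::from(g.min(d));
    let max = u64::from(g.max(d));
    (min << 32) + 2 * max + u64::from(g > d)
}

/// Sort (controlling, controlled) priority pairs, highest priority first.
fn sort_pairs(pairs: &mut [(u32, u32)]) {
    pairs.sort_unstable_by_key(|&(g, d)| std::cmp::Reverse(pair_priority(g, d)));
}
```

Since the key is a plain `u64`, it can be computed once when a pair is formed and cached, so re-sorting the check list never recomputes priorities.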

Memory Optimization

Real-time systems benefit from predictable memory behavior.

Goals:

  • Minimize allocations in hot paths
  • Use buffer pools for packet buffers
  • Pre-allocate data structures where possible
  • Reduce memory fragmentation
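
A buffer pool is the simplest way to meet the first three goals at once. The sketch below pre-allocates fixed-capacity packet buffers and recycles them so the steady-state path does not touch the allocator. `BufferPool` is a hypothetical single-threaded type for illustration; a shared pool would need synchronization or per-thread instances.

```rust
/// Pool of reusable packet buffers (hypothetical, single-threaded).
struct BufferPool {
    free: Vec<Vec<u8>>,
    buf_capacity: usize,
}

impl BufferPool {
    /// Pre-allocate `count` buffers of `buf_capacity` bytes each.
    fn with_capacity(count: usize, buf_capacity: usize) -> Self {
        let free = (0..count)
            .map(|_| Vec::with_capacity(buf_capacity))
            .collect();
        Self { free, buf_capacity }
    }

    /// Take a buffer from the pool, allocating only if the pool is empty.
    fn acquire(&mut self) -> Vec<u8> {
        self.free
            .pop()
            .unwrap_or_else(|| Vec::with_capacity(self.buf_capacity))
    }

    /// Return a buffer; `clear` drops the contents but keeps the capacity.
    fn release(&mut self, mut buf: Vec<u8>) {
        buf.clear();
        self.free.push(buf);
    }

    /// Number of buffers currently available without allocating.
    fn available(&self) -> usize {
        self.free.len()
    }
}
```

Falling back to a fresh allocation when the pool runs dry keeps the sketch simple. A stricter design could return `None` instead and treat exhaustion as a backpressure signal.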

Tracking:

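// Example: Using dhat for allocation profiling
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

#[test]
fn test_allocations_in_hot_path() {
    // Build the input before profiling starts so setup allocations are not counted.
    // `sample_rtp_packet` and `process_rtp_packet` stand in for project code.
    let packet = sample_rtp_packet();

    let _profiler = dhat::Profiler::new_heap();

    // Run hot path code
    for _ in 0..10000 {
        process_rtp_packet(&packet);
    }

    // Analyze allocation count and sizes
    let stats = dhat::HeapStats::get();
    println!(
        "allocations: {}, bytes: {}, peak bytes: {}",
        stats.total_blocks, stats.total_bytes, stats.max_bytes
    );
}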

Continuous Performance Monitoring

CI integration:

  • Run benchmarks on every PR
  • Track performance regressions
  • Publish benchmark results
  • Alert on significant regressions
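
The regression check itself can be a simple relative comparison against a stored baseline. The sketch below classifies a benchmark result given a noise threshold. The `Verdict` type, the `compare` function, and the 5% threshold used in the note are assumptions for illustration; criterion's own baseline comparison or a CI benchmarking action could fill the same role.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    Improved,
    Unchanged,
    Regressed,
}

/// Compare mean times in nanoseconds. `threshold` is the relative change
/// treated as noise (for example 0.05 for 5%). Lower time is better.
fn compare(baseline_ns: f64, current_ns: f64, threshold: f64) -> Verdict {
    let change = (current_ns - baseline_ns) / baseline_ns;
    if change > threshold {
        Verdict::Regressed
    } else if change < -threshold {
        Verdict::Improved
    } else {
        Verdict::Unchanged
    }
}
```

With a 5% threshold, a benchmark moving from 100 ns to 110 ns is flagged as a regression, while 100 ns to 102 ns is treated as noise. Shared CI runners are often noisy, so the threshold likely needs tuning per benchmark.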

Planned dashboard metrics:

  • Throughput trends over time
  • Latency percentiles
  • Memory usage patterns
  • CPU efficiency

Labels: p1 (high priority)
