Memory Management

This guide describes memory management techniques for long-running federated learning jobs using Python, PyTorch, and glibc/jemalloc.

Federated learning jobs can run for hours or days. Without proper memory management, RSS (Resident Set Size) can grow continuously due to:

  • Python garbage collection delays
  • glibc memory arena fragmentation
  • PyTorch CUDA cache retention

NVFlare provides utilities and configuration options to manage memory effectively on both server and client sides. The framework automatically detects the memory allocator in use (glibc or jemalloc) and adapts its cleanup strategy accordingly.

NVFlare supports two memory allocators:

glibc (default on most Linux)
Uses malloc_trim() to release free heap pages to the OS. Requires MALLOC_ARENA_MAX for optimal memory behavior.
jemalloc (recommended for PyTorch)
Uses auto-decay for memory management. Configure via MALLOC_CONF. No malloc_trim() calls needed (jemalloc handles this automatically).

NVFlare automatically detects which allocator is in use at runtime.
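To confirm which allocator a given process actually picked up, you can call the detection utility documented later in this guide (a minimal sketch, useful for logging at process startup):

from nvflare.fuel.utils.memory_utils import get_allocator_type

# Prints "glibc", "jemalloc", or "unknown" depending on what the process loaded
print(f"Active memory allocator: {get_allocator_type()}")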

Not all memory management features work on all platforms. The table below summarizes compatibility:

Feature                   Linux/glibc   Linux/musl   macOS
gc.collect()              Yes           Yes          Yes
MALLOC_ARENA_MAX          Yes           No           No
malloc_trim()             Yes           No           No (silently skipped)
torch.cuda.empty_cache()  Yes (CUDA)    Yes (CUDA)   No (no CUDA)

Notes:

  • Linux/glibc: Standard Linux distributions (Ubuntu, RHEL, Debian, etc.)
  • Linux/musl: Alpine Linux and other musl-based distributions
  • macOS: malloc_trim() is silently skipped (safe no-op)

Warning

For maximum memory efficiency, use Linux with glibc. Alpine Linux (musl) and macOS/Windows will still benefit from gc.collect() but cannot release fragmented heap memory back to the OS.

Set these environment variables before starting NVFlare processes.

On clients:

export MALLOC_ARENA_MAX=2

Why: Clients typically have limited CPU memory. Setting MALLOC_ARENA_MAX=2 prevents arena explosion and reduces memory fragmentation.

On the server:

export MALLOC_ARENA_MAX=4

Why: Servers are CPU-memory heavy (4-7× the model size) with multi-threaded networking. MALLOC_ARENA_MAX=4 balances throughput against memory use; use 8 for high parallelism.
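MALLOC_ARENA_MAX is read by glibc when the process starts, so it must be exported before launching NVFlare. A small startup check (an illustrative sketch, not part of NVFlare) can catch a missing setting:

import os

# Hypothetical sanity check: warn if MALLOC_ARENA_MAX was not exported before startup.
arena_max = os.environ.get("MALLOC_ARENA_MAX")
if arena_max is None:
    print("WARNING: MALLOC_ARENA_MAX is not set; glibc may create many arenas and fragment memory")
else:
    print(f"MALLOC_ARENA_MAX={arena_max}")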

The FedAvg controller supports automatic memory cleanup via the server_memory_gc_rounds parameter.

from nvflare.recipe.fedavg import FedAvgRecipe

recipe = FedAvgRecipe(
    name="my_job",
    min_clients=4,
    num_rounds=100,
    train_script="client.py",
    server_memory_gc_rounds=5,  # Cleanup every 5 rounds
)

Values:

  • 0 = Disabled (default for BaseFedAvg-based controllers)
  • 1 = Cleanup every round (default for legacy controllers like ScatterAndGather)
  • 5 = Cleanup every 5 rounds (recommended for server)

When enabled, at the end of every N rounds:

  1. Runs Python garbage collection (gc.collect())
  2. Returns free heap pages to OS (malloc_trim(), Linux/glibc only)
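The round gating itself is simple; the sketch below (illustrative only, not the controller's actual implementation) shows the pattern using the cleanup_memory utility described later in this guide:

from nvflare.fuel.utils.memory_utils import cleanup_memory

def maybe_cleanup(current_round: int, gc_rounds: int) -> None:
    # Run allocator-aware cleanup at the end of every `gc_rounds` rounds; 0 disables it.
    if gc_rounds > 0 and (current_round + 1) % gc_rounds == 0:
        cleanup_memory()  # gc.collect() plus malloc_trim() on Linux/glibc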

Memory cleanup has minimal overhead in typical federated learning workloads:

Operation       Typical Duration   Notes
gc.collect()    10-500 ms          Depends on Python object count
malloc_trim()   < 1 ms             Very fast (page table ops)

Overhead analysis:

  • Training round duration: Typically 30 seconds to 10+ minutes
  • Cleanup duration: 10-500 ms total
  • Overhead per round: Usually < 1%

With server_memory_gc_rounds=5:

  • Cleanup runs once every 5 rounds
  • Total overhead: < 0.2% of training time

Recommendation: server_memory_gc_rounds=5 provides good memory management with negligible performance impact. Only disable cleanup (=0) if you have measured and confirmed that RSS is stable without it.

The FedAvg recipe and ScriptRunner support automatic memory cleanup on clients via client_memory_gc_rounds and cuda_empty_cache parameters.

from nvflare.recipe.fedavg import FedAvgRecipe

recipe = FedAvgRecipe(
    name="my_job",
    min_clients=4,
    num_rounds=100,
    train_script="client.py",

    # Server-side cleanup
    server_memory_gc_rounds=5,

    # Client-side cleanup
    client_memory_gc_rounds=1,   # Cleanup every round
    cuda_empty_cache=True, # Clear GPU cache
)

Swarm Learning uses memory_gc_rounds (not memory_gc_counts) and cuda_empty_cache on SimpleSwarmLearningRecipe:

from nvflare.app_opt.pt.recipes.swarm import SimpleSwarmLearningRecipe

recipe = SimpleSwarmLearningRecipe(
    name="swarm_job",
    model=MyModel(),
    num_rounds=10,
    train_script="train.py",
    memory_gc_rounds=1,   # Cleanup every round on trainer and aggregator roles
    cuda_empty_cache=True,
)

Note

memory_gc_rounds and cuda_empty_cache are top-level Swarm recipe arguments. Do not pass them inside train_args (they are reserved keys).

Parameters:

  • client_memory_gc_rounds: Run cleanup every N rounds on client (0 = disabled)
  • cuda_empty_cache: If True, call torch.cuda.empty_cache() on cleanup
  • memory_gc_rounds (Swarm): Run cleanup every N rounds (0 = disabled)

When enabled, after each flare.send() on the client:

  1. Runs Python garbage collection (gc.collect())
  2. For glibc: Returns free heap pages to OS (malloc_trim())
  3. For jemalloc: Relies on auto-decay (no manual action needed)
  4. Optionally clears PyTorch CUDA cache

Note: The cleanup is transparent to the user's training script. No code changes are required in train.py.
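For reference, a minimal Client API training loop looks like the sketch below (assuming the standard nvflare.client API; the details of your training code will differ). Note that it contains no memory-management calls:

import nvflare.client as flare

flare.init()

while flare.is_running():
    input_model = flare.receive()  # global model for this round

    # ... local training using input_model.params ...

    flare.send(flare.FLModel(params=input_model.params))
    # Cleanup (gc.collect, malloc_trim, optional CUDA cache clear) runs inside
    # the framework after send(); nothing extra is needed here.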

For external process execution (launch_external_process=True), memory settings are passed via environment variables:

  • NVFLARE_CLIENT_MEMORY_GC_ROUNDS: Cleanup interval
  • NVFLARE_CUDA_EMPTY_CACHE: GPU cache cleanup (true/false)
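An externally launched training process can also read these variables itself, for example for logging (an illustrative sketch; the parsing shown is not NVFlare code):

import os

gc_rounds = int(os.environ.get("NVFLARE_CLIENT_MEMORY_GC_ROUNDS", "0"))
cuda_empty = os.environ.get("NVFLARE_CUDA_EMPTY_CACHE", "false").lower() == "true"
print(f"memory cleanup every {gc_rounds} round(s), cuda_empty_cache={cuda_empty}")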
Recommended settings by role:

Role     server_memory_gc_rounds   client_memory_gc_rounds   MALLOC_ARENA_MAX   cuda_empty_cache
Server   5                         N/A                       4                  N/A
Client   N/A                       1                         2                  True (for GPU)

For PyTorch workloads, jemalloc is recommended over glibc malloc. NVFlare's startup scripts automatically detect and use jemalloc if available.

The generated sub_start.sh script includes jemalloc auto-detection:

# Auto-detects jemalloc at standard locations
for JEMALLOC in /usr/lib/x86_64-linux-gnu/libjemalloc.so.2 \
                /usr/lib64/libjemalloc.so.2 \
                /usr/local/lib/libjemalloc.so; do
    if [ -f "$JEMALLOC" ]; then
        export LD_PRELOAD="${LD_PRELOAD:+$LD_PRELOAD:}$JEMALLOC"
        export MALLOC_CONF="${MALLOC_CONF:-dirty_decay_ms:5000,muzzy_decay_ms:5000}"
        break
    fi
done
To install jemalloc:

# Ubuntu/Debian
apt-get install libjemalloc2

# RHEL/CentOS
yum install jemalloc
The cleanup routines are also exposed as utilities that can be called directly from application code:

from nvflare.fuel.utils.memory_utils import cleanup_memory

cleanup_memory(cuda_empty_cache=True)

Signature: cleanup_memory(cuda_empty_cache: bool = False) -> None

Performs allocator-aware memory cleanup:

  1. Runs gc.collect()
  2. For glibc: Calls malloc_trim(0)
  3. For jemalloc: Relies on auto-decay (no action needed)
  4. Optionally calls torch.cuda.empty_cache()
from nvflare.fuel.utils.memory_utils import get_allocator_type

allocator = get_allocator_type()  # "glibc", "jemalloc", or "unknown"

Signature: get_allocator_type() -> str

Detects which memory allocator is in use at runtime. Result is cached.

from nvflare.fuel.utils.memory_utils import try_malloc_trim

result = try_malloc_trim()

Signature: try_malloc_trim() -> Optional[int]

Low-level function to return free heap pages to OS.

Returns:

  • 1 if memory was released
  • 0 if no memory to release
  • None if not available (non-Linux or non-glibc)
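When calling it directly, you may want to branch on the return value (an illustrative sketch):

from nvflare.fuel.utils.memory_utils import try_malloc_trim

result = try_malloc_trim()
if result is None:
    # Not available (non-Linux or non-glibc); nothing can be released this way.
    pass
elif result == 1:
    print("Free heap pages were returned to the OS")
else:
    print("No releasable heap memory at the moment")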
Troubleshooting:

If server RSS keeps growing:

  1. Check that MALLOC_ARENA_MAX is set
  2. Enable server_memory_gc_rounds=5
  3. Consider using jemalloc (LD_PRELOAD)
  4. Monitor with top or htop

If client RSS keeps growing:

  1. Check that MALLOC_ARENA_MAX=2 is set
  2. Enable client_memory_gc_rounds=1
  3. Enable cuda_empty_cache=True for GPU
  4. Consider using jemalloc

If the job runs out of memory:

  1. Reduce batch size
  2. Enable memory cleanup every round (client_memory_gc_rounds=1 or server_memory_gc_rounds=1)
  3. Check for memory leaks in the training code
  4. Use jemalloc with appropriate decay settings