Cherry-pick [2.7] Client-side memory management (#4211) #4290
YuanTingHsieh wants to merge 2 commits into NVIDIA:main
Greptile Summary: This PR introduces client-side memory management for NVFlare federated learning workloads, complementing the existing server-side controls. It adds allocator-aware cleanup (`cleanup_memory()`) triggered after `flare.send()`.
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant R as Recipe / ScriptRunner
    participant E as InProcessClientAPIExecutor
    participant A as InProcessClientAPI (APISpec)
    participant M as memory_utils
    R->>E: __init__(memory_gc_rounds, cuda_empty_cache)
    E->>A: InProcessClientAPI(task_metadata)
    E->>A: configure_memory_management(gc_rounds, cuda_empty_cache)
    loop Each FL Round
        E->>A: set_meta(task_meta)
        A-->>R: receive() → FLModel
        Note over R: Training step
        R->>A: send(model)
        A->>A: _maybe_cleanup_memory()
        Note over A: _round_count += 1<br/>if _round_count % gc_rounds == 0
        A->>M: cleanup_memory(cuda_empty_cache)
        M->>M: gc.collect()
        alt glibc
            M->>M: malloc_trim(0)
        else jemalloc
            M->>M: rely on auto-decay
        end
        opt cuda_empty_cache=True
            M->>M: torch.cuda.empty_cache()
        end
    end
```
Pull request overview
This PR cherry-picks allocator-aware client-side memory management into the 2.7 branch, wiring configurable periodic cleanup through Client APIs, executors, and recipes (including Swarm), with supporting docs and unit tests.
Changes:
- Add allocator detection + allocator-aware `cleanup_memory()` (glibc `malloc_trim` vs jemalloc decay) with optional CUDA cache clearing.
- Trigger client cleanup after `flare.send()` and propagate `memory_gc_rounds`/`client_memory_gc_rounds` + `cuda_empty_cache` through ScriptRunner and recipe APIs (including Swarm cadence control).
- Update startup template + documentation, and add/adjust unit tests for memory behavior and recipe/executor parameter plumbing.
Reviewed changes
Copilot reviewed 42 out of 42 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/unit_test/recipe/swarm_recipe_test.py | Adds Swarm recipe tests for memory_gc_rounds defaults/validation and CUDA passthrough. |
| tests/unit_test/job_config/script_runner_test.py | Adds ScriptRunner tests for new memory-management parameters. |
| tests/unit_test/fuel/utils/memory_utils_test.py | Updates/expands tests for allocator-aware cleanup and new API names. |
| tests/unit_test/client/in_process/api_test.py | Adds tests for in-process client API cleanup cadence. |
| tests/unit_test/client/ex_process/memory_test.py | Adds tests around env-var parsing expectations for ex-process mode (logic-level). |
| tests/unit_test/client/ex_process/init.py | Marks ex_process test directory as a package. |
| tests/unit_test/app_opt/tf/tf_recipe_no_cuda_cache_test.py | Ensures TF recipes reject cuda_empty_cache. |
| tests/unit_test/app_common/statistics/quantile_test.py | Disables TDigest-based tests on darwin. |
| tests/unit_test/app_common/executors/in_process_client_api_executor_test.py | Adds tests for executor memory parameters. |
| tests/unit_test/app_common/ccwf/test_swarm_memory_gc.py | Adds tests for Swarm aggregator GC cadence and CUDA passthrough. |
| nvflare/recipe/fedavg.py | Exposes client_memory_gc_rounds/cuda_empty_cache and wires into ScriptRunner. |
| nvflare/recipe/cyclic.py | Exposes client_memory_gc_rounds/cuda_empty_cache and wires into ScriptRunner. |
| nvflare/lighter/templates/master_template.yml | Adds opt-in jemalloc preload + MALLOC_CONF defaults to startup script template. |
| nvflare/job_config/script_runner.py | Adds memory parameters and forwards them to in-process / launcher executors. |
| nvflare/fuel/utils/memory_utils.py | Implements get_allocator_type() and allocator-aware cleanup_memory(cuda_empty_cache=...). |
| nvflare/client/in_process/api.py | Calls base init and triggers cleanup after send(); adds configuration helper. |
| nvflare/client/ex_process/api.py | Reads memory settings from config with env override; triggers cleanup after send(). |
| nvflare/client/config.py | Adds config keys for memory_gc_rounds and cuda_empty_cache. |
| nvflare/client/api_spec.py | Adds shared memory-management state + _maybe_cleanup_memory() to base client API. |
| nvflare/app_opt/tf/recipes/scaffold.py | Adds client_memory_gc_rounds plumbing (hard-codes cuda_empty_cache=False). |
| nvflare/app_opt/tf/recipes/fedopt.py | Adds client_memory_gc_rounds plumbing (hard-codes cuda_empty_cache=False). |
| nvflare/app_opt/tf/recipes/fedavg.py | Adds client_memory_gc_rounds and forces cuda_empty_cache=False. |
| nvflare/app_opt/tf/recipes/cyclic.py | Adds client_memory_gc_rounds and forces cuda_empty_cache=False. |
| nvflare/app_opt/tf/in_process_client_api_executor.py | Adds memory params to TF in-process executor wrapper plumbing. |
| nvflare/app_opt/tf/client_api_launcher_executor.py | Adds memory params to TF launcher executor wrapper plumbing. |
| nvflare/app_opt/pt/recipes/swarm.py | Adds Swarm memory_gc_rounds/cuda_empty_cache top-level args + reserved-key checks. |
| nvflare/app_opt/pt/recipes/scaffold.py | Adds recipe-level client memory knobs and forwards to ScriptRunner. |
| nvflare/app_opt/pt/recipes/fedopt.py | Adds recipe-level client memory knobs and forwards to ScriptRunner. |
| nvflare/app_opt/pt/recipes/fedeval.py | Adds recipe-level client memory knobs and forwards to ScriptRunner. |
| nvflare/app_opt/pt/recipes/fedavg_he.py | Adds recipe-level client memory knobs and forwards to ScriptRunner. |
| nvflare/app_opt/pt/recipes/fedavg.py | Forwards new client memory knobs to the unified base recipe. |
| nvflare/app_opt/pt/recipes/cyclic.py | Forwards new client memory knobs to the unified base recipe. |
| nvflare/app_opt/pt/in_process_client_api_executor.py | Adds memory params to PT in-process executor wrapper plumbing. |
| nvflare/app_opt/pt/client_api_launcher_executor.py | Adds memory params to PT launcher executor wrapper plumbing. |
| nvflare/app_common/np/recipes/lr/fedavg.py | Adds client memory knobs and forwards to ScriptRunner for RAW framework recipe. |
| nvflare/app_common/np/recipes/fedavg.py | Adds client memory knobs to NumPy recipe surface. |
| nvflare/app_common/np/recipes/cross_site_eval.py | Adds client memory knobs and forwards to ScriptRunner for validation tasks. |
| nvflare/app_common/executors/in_process_client_api_executor.py | Stores memory params and configures InProcessClientAPI accordingly. |
| nvflare/app_common/executors/client_api_launcher_executor.py | Writes memory params into client API config for external processes. |
| nvflare/app_common/ccwf/swarm_client_ctl.py | Makes Swarm aggregator-side cleanup configurable by round cadence + CUDA passthrough. |
| nvflare/app_common/ccwf/ccwf_job.py | Plumbs Swarm client config memory knobs into controller creation. |
| docs/programming_guide/memory_management.rst | Expands docs for allocator support and client-side cleanup configuration. |
This PR aligns the 2.7 branch with allocator-aware **client-side memory
management** capabilities and recipe-level configurability,
complementing server-side memory controls.
In long-running FL workloads, client processes can accumulate memory
over rounds due to:
- delayed Python garbage collection release
- allocator behavior (glibc arena retention vs jemalloc decay)
- PyTorch CUDA cache retention
This is amplified for:
- large models and long jobs
- constrained edge environments
- Swarm-like topologies where clients can have higher memory pressure
1. **Allocator Detection**: `get_allocator_type()` detects active
allocator at runtime
- glibc: uses `malloc_trim()` path
- jemalloc: uses decay behavior (`MALLOC_CONF`)
2. **Transparent Integration**: cleanup is triggered after
`flare.send()`; user training scripts require no code changes
3. **Configurable Frequency**:
- Recipe-side: `client_memory_gc_rounds` (PT/TF/base) or
`memory_gc_rounds` (Swarm)
- ScriptRunner-side: `memory_gc_rounds`
- `0` = disabled, `1` = every round, `N` = every N rounds
4. **GPU Support**: `cuda_empty_cache=True` triggers CUDA cache cleanup
(PyTorch only)
5. **Subprocess Support**: external process settings propagate via:
- `NVFLARE_CLIENT_MEMORY_GC_ROUNDS`
- `NVFLARE_CUDA_EMPTY_CACHE`
- plus allocator env (`MALLOC_ARENA_MAX`, `MALLOC_CONF`)
6. **jemalloc Startup Integration (Opt-in)**: startup template preloads
jemalloc only when `NVFLARE_ENABLE_JEMALLOC_PRELOAD=true`, and sets
recommended `MALLOC_CONF`
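The allocator-aware cleanup described in items 1 and 4 can be sketched as follows. This is a minimal illustration, not the actual `nvflare.fuel.utils.memory_utils` code: the real implementation may detect the allocator differently, and the jemalloc probe via `mallctl` here is an assumption.

```python
import ctypes
import gc


def get_allocator_type():
    """Best-effort detection of the active C allocator.

    Treats the process as jemalloc-backed when a jemalloc-specific
    symbol (mallctl) resolves; otherwise assumes glibc.
    """
    try:
        ctypes.CDLL(None).mallctl  # jemalloc exports mallctl(); glibc does not
        return "jemalloc"
    except (AttributeError, OSError):
        return "glibc"


def cleanup_memory(cuda_empty_cache=False):
    """Collect Python garbage, then try to return freed pages to the OS."""
    gc.collect()
    if get_allocator_type() == "glibc":
        try:
            # malloc_trim(0) asks glibc to release free heap memory to the kernel.
            ctypes.CDLL(None).malloc_trim(0)
        except (AttributeError, OSError):
            pass  # non-glibc platform (e.g. macOS, Windows)
    # jemalloc path: rely on its decay settings (MALLOC_CONF); nothing to call.
    if cuda_empty_cache:
        try:
            import torch

            if torch.cuda.is_available():
                torch.cuda.empty_cache()
        except ImportError:
            pass  # PyTorch not installed; flag is a no-op
```

The CUDA branch is guarded so the same function is safe on CPU-only clients.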
```python
from nvflare.app_opt.pt.recipes import FedAvgRecipe
recipe = FedAvgRecipe(
min_clients=2,
num_rounds=100,
train_script="client.py",
client_memory_gc_rounds=5,
cuda_empty_cache=True,
)
```
```python
from nvflare.app_opt.pt.recipes.swarm import SimpleSwarmLearningRecipe
recipe = SimpleSwarmLearningRecipe(
name="my_swarm",
model=MyModel(),
num_rounds=100,
train_script="client.py",
memory_gc_rounds=5,
cuda_empty_cache=True,
)
```
```python
executor = ScriptRunner(
script="train.py",
memory_gc_rounds=5,
cuda_empty_cache=True,
)
```
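The `memory_gc_rounds` cadence (`0` = disabled, `N` = every N rounds) can be sketched with a simple counter. Class and attribute names here are illustrative; per the sequence diagram, the real client API increments `_round_count` inside `_maybe_cleanup_memory()`, which is invoked after `send()`.

```python
class MemoryManagedClient:
    """Sketch of the per-round cleanup cadence (hypothetical class)."""

    def __init__(self, memory_gc_rounds=0):
        self._gc_rounds = memory_gc_rounds
        self._round_count = 0
        self.cleanups = 0  # instrumentation for this sketch only

    def send(self, model=None):
        # ... transmit the model update to the aggregator ...
        self._maybe_cleanup_memory()

    def _maybe_cleanup_memory(self):
        if self._gc_rounds <= 0:
            return  # 0 disables periodic cleanup entirely
        self._round_count += 1
        if self._round_count % self._gc_rounds == 0:
            self.cleanups += 1  # real code would call cleanup_memory(...)
```

With `memory_gc_rounds=5`, twenty `send()` calls trigger cleanup four times (rounds 5, 10, 15, 20).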
> Note: these are workload-dependent estimates; profile your deployment
for exact numbers.
| Operation | Typical Duration | Notes |
|-----------|------------------|-------|
| `gc.collect()` | 10-500 ms | depends on object graph |
| `malloc_trim()` (glibc) | < 1 ms | typically very fast |
| `torch.cuda.empty_cache()` | < 50 ms | can synchronize CUDA stream |
For round durations in seconds/minutes, cleanup overhead is typically
small.
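Since the durations above are workload-dependent, a tiny timing harness around `gc.collect()` is enough to profile your own deployment (`time_cleanup` is not an NVFlare API, just a sketch):

```python
import gc
import time


def time_cleanup(repeats=3):
    """Return the best-of-N gc.collect() duration in milliseconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        gc.collect()
        durations.append((time.perf_counter() - start) * 1000.0)
    return min(durations)  # min filters out scheduler noise
```

Compare the result against your round duration to decide whether a cadence of 1 is affordable or a larger `memory_gc_rounds` is warranted.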
| Scenario | `server_memory_gc_rounds` | `client_memory_gc_rounds` | `cuda_empty_cache` | Notes |
|----------|---------------------------|---------------------------|--------------------|-------|
| Quick experiments | 0 | 0 | False | minimal overhead |
| Standard training | 5 | 5-10 | False | balanced |
| Long training (100+ rounds) | 5 | 5 | True | prevent gradual growth |
| Large models (10B+ params) | 1-3 | 1-3 | True | aggressive cleanup |
| Memory-constrained edge | 5 | 1 | True | maximize stability |
- **PyTorch**: FedAvg, FedAvg-HE, FedEval, FedOpt, Scaffold, Cyclic,
Swarm
- **TensorFlow**: FedAvg, FedOpt, Scaffold, Cyclic
- **NumPy / classic workflows**: FedAvg, Cross-Site Eval, LR FedAvg
- **Base recipes**: `nvflare/recipe/fedavg.py`,
`nvflare/recipe/cyclic.py`
TF recipes do **not** expose `cuda_empty_cache`.
`torch.cuda.empty_cache()` is PyTorch-only; TF GPU memory is managed
differently. TF subclasses hard-code `cuda_empty_cache=False` to the
parent/ScriptRunner. The parameter will be added back when TF-specific
GPU cache cleanup is implemented.
In swarm learning both the trainer role and the aggregator role run on
the client. The old code unconditionally called `gc.collect()` after
each trainer submission in `Gatherer.gather()`. This PR replaces that
with:
- configurable per-FL-round cadence via `memory_gc_rounds` (default `1`
= every round, preserving legacy behavior)
- full `cleanup_memory()` call (`gc.collect` + `malloc_trim` + optional
CUDA) instead of bare `gc.collect`, giving allocator-aware OS-level
memory return
- `cuda_empty_cache` wired through since the aggregator client may also
hold GPU tensors from training
The parameter is named `memory_gc_rounds` (not
`client_memory_gc_rounds`) in `SimpleSwarmLearningRecipe` because both
roles are client-side.
Documentation was updated to clarify Swarm uses `memory_gc_rounds` and
`cuda_empty_cache` as top-level recipe args (not `train_args`).
| Recipe | Reason |
|--------|--------|
| Sklearn | Small models; Python GC has negligible impact |
| XGBoost | C++-managed memory; Python GC not impactful |
| FedEval | Evaluation-only, single-pass |
| server `cuda_empty_cache` | Servers typically have no GPU |
| TF `cuda_empty_cache` | PyTorch-only; TF cleanup not yet implemented |
The `cuda_empty_cache` setting is exposed via:
- config key: `ConfigKey.CUDA_EMPTY_CACHE`
- env var: `NVFLARE_CUDA_EMPTY_CACHE`
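For subprocess (ex-process) mode, the environment variables listed earlier override config values. A sketch of that resolution logic follows; function and key names are illustrative, and the exact precedence in NVFlare's ex-process API may differ.

```python
import os


def resolve_memory_settings(config, environ=os.environ):
    """Resolve memory settings: env vars take precedence over client config.

    Hypothetical helper; mirrors the env names NVFLARE_CLIENT_MEMORY_GC_ROUNDS
    and NVFLARE_CUDA_EMPTY_CACHE described in this PR.
    """
    gc_rounds = environ.get("NVFLARE_CLIENT_MEMORY_GC_ROUNDS")
    if gc_rounds is None:
        gc_rounds = config.get("memory_gc_rounds", 0)

    cuda = environ.get("NVFLARE_CUDA_EMPTY_CACHE")
    if cuda is None:
        cuda = config.get("cuda_empty_cache", False)
    else:
        # env vars are strings; accept common truthy spellings
        cuda = str(cuda).strip().lower() in ("1", "true", "yes")

    return int(gc_rounds), bool(cuda)
```

Passing `environ` explicitly keeps the helper testable without mutating the process environment.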
In scope:
- Client memory cleanup and allocator-aware behavior
- Client API and executor plumbing
- Recipe API exposure for client memory knobs
- Swarm GC made configurable with per-round cadence
- Documentation and unit tests related to client memory behavior

Out of scope:
- Unrelated recipe semantic refactors
- Unrelated server workflow behavior changes
- fastdigest API upgrade (tracked separately)
- `pytest tests/unit_test/client/ex_process/memory_test.py` ✅ (7 passed)
- `pytest tests/unit_test/fuel/utils/memory_utils_test.py` ✅ (10 passed)
- `pytest tests/unit_test/recipe/server_memory_gc_rounds_test.py` ✅ (9
passed)
- `pytest tests/unit_test/app_common/ccwf/test_swarm_memory_gc.py` ✅ (7
passed)
- `pytest tests/unit_test/recipe/swarm_recipe_test.py` ✅ (15 passed)
- `pytest tests/unit_test/app_opt/tf/tf_recipe_no_cuda_cache_test.py` ✅
(skipped without TF, runs in TF CI)
- [x] Client memory env/API unit tests
- [x] Memory utils unit tests
- [x] Server memory GC rounds recipe tests
- [x] Swarm aggregator GC cadence unit tests
- [x] Swarm recipe memory param tests (default=1, old name rejected,
cuda passthrough)
- [x] TF recipe no-cuda-empty-cache tests (run in TF CI)
- [ ] Run full targeted memory-management unit suite in stable
CI/runtime
- [ ] Run full PR CI
- Baseline reference for PR style/content: NVIDIA#4200 (Add allocator-aware client memory management (glibc/jemalloc/CUDA))
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
This is a cherry-pick; additional comments will be addressed in a different PR.