
Cherry-pick [2.7] Client-side memory management (#4211)#4290

Open
YuanTingHsieh wants to merge 2 commits into NVIDIA:main from YuanTingHsieh:cherry-pick-4211

Conversation

@YuanTingHsieh
Collaborator

This PR aligns the 2.7 branch with allocator-aware client-side memory management capabilities and recipe-level configurability, complementing server-side memory controls.

In long-running FL workloads, client processes can accumulate memory over rounds due to:

  • delayed Python garbage collection release
  • allocator behavior (glibc arena retention vs jemalloc decay)
  • PyTorch CUDA cache retention

This is amplified for:

  • large models and long jobs
  • constrained edge environments
  • Swarm-like topologies where clients can have higher memory pressure
1. **Allocator Detection**: `get_allocator_type()` detects the active allocator at runtime
   - glibc: uses the `malloc_trim()` path
   - jemalloc: relies on decay behavior (`MALLOC_CONF`)

2. **Transparent Integration**: cleanup is triggered after `flare.send()`; user training scripts require no code changes

3. **Configurable Frequency**:
   - Recipe-side: `client_memory_gc_rounds` (PT/TF/base) or `memory_gc_rounds` (Swarm)
   - ScriptRunner-side: `memory_gc_rounds`
   - `0` = disabled, `1` = every round, `N` = every N rounds

4. **GPU Support**: `cuda_empty_cache=True` triggers CUDA cache cleanup (PyTorch only)

5. **Subprocess Support**: external-process settings propagate via:
   - `NVFLARE_CLIENT_MEMORY_GC_ROUNDS`
   - `NVFLARE_CUDA_EMPTY_CACHE`
   - plus allocator env vars (`MALLOC_ARENA_MAX`, `MALLOC_CONF`)

6. **jemalloc Startup Integration (Opt-in)**: the startup template preloads jemalloc only when `NVFLARE_ENABLE_JEMALLOC_PRELOAD=true`, and sets the recommended `MALLOC_CONF`
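The detection-plus-cleanup flow described above can be sketched roughly as follows. This is a hypothetical, best-effort reimplementation for illustration only, not NVFlare's actual `memory_utils` module:

```python
import ctypes
import ctypes.util
import gc
import os
from functools import lru_cache


@lru_cache(maxsize=1)
def detect_allocator() -> str:
    """Best-effort allocator detection (illustrative, cached like the real one)."""
    # jemalloc is typically injected via LD_PRELOAD or tuned via MALLOC_CONF
    if "jemalloc" in os.environ.get("LD_PRELOAD", "") or "MALLOC_CONF" in os.environ:
        return "jemalloc"
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6")
        if hasattr(libc, "malloc_trim"):
            return "glibc"  # glibc can return freed arena pages to the OS
    except OSError:
        pass
    return "unknown"


def cleanup_memory(cuda_empty_cache: bool = False) -> str:
    """gc.collect + allocator-specific trim + optional CUDA cache clear."""
    gc.collect()  # drop unreachable Python objects first
    allocator = detect_allocator()
    if allocator == "glibc":
        libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6")
        libc.malloc_trim(0)  # hand trimmed heap pages back to the OS
    # jemalloc needs no explicit call: decay (MALLOC_CONF) releases memory over time
    if cuda_empty_cache:
        try:
            import torch  # PyTorch-only; TF manages GPU memory differently

            torch.cuda.empty_cache()
        except ImportError:
            pass
    return allocator
```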

```python
from nvflare.app_opt.pt.recipes import FedAvgRecipe

recipe = FedAvgRecipe(
    min_clients=2,
    num_rounds=100,
    train_script="client.py",
    client_memory_gc_rounds=5,
    cuda_empty_cache=True,
)
```

```python
from nvflare.app_opt.pt.recipes.swarm import SimpleSwarmLearningRecipe

recipe = SimpleSwarmLearningRecipe(
    name="my_swarm",
    model=MyModel(),
    num_rounds=100,
    train_script="client.py",
    memory_gc_rounds=5,
    cuda_empty_cache=True,
)
```

```python
executor = ScriptRunner(
    script="train.py",
    memory_gc_rounds=5,
    cuda_empty_cache=True,
)
```

> Note: these are workload-dependent estimates; profile your deployment for exact numbers.

| Operation | Typical Duration | Notes |
|-----------|------------------|-------|
| `gc.collect()` | 10-500 ms | depends on object graph |
| `malloc_trim()` (glibc) | < 1 ms | typically very fast |
| `torch.cuda.empty_cache()` | < 50 ms | can synchronize CUDA stream |

For round durations in seconds/minutes, cleanup overhead is typically small.
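As a quick sanity check of the `gc.collect()` row, a collection pass can be timed directly (an illustrative snippet; actual numbers depend on the live object graph):

```python
import gc
import time


def timed_gc_ms() -> float:
    """Time one full garbage-collection pass in milliseconds."""
    start = time.perf_counter()
    gc.collect()
    return (time.perf_counter() - start) * 1000.0


cost_ms = timed_gc_ms()
# With memory_gc_rounds=5, cleanup runs on 1 round in 5, so the
# amortized per-round overhead is roughly cost_ms / 5.
```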

| Scenario | `server_memory_gc_rounds` | `client_memory_gc_rounds` | `cuda_empty_cache` | Notes |
|----------|---------------------------|---------------------------|--------------------|-------|
| Quick experiments | 0 | 0 | False | minimal overhead |
| Standard training | 5 | 5-10 | False | balanced |
| Long training (100+ rounds) | 5 | 5 | True | prevent gradual growth |
| Large models (10B+ params) | 1-3 | 1-3 | True | aggressive cleanup |
| Memory-constrained edge | 5 | 1 | True | maximize stability |

- **PyTorch**: FedAvg, FedAvg-HE, FedEval, FedOpt, Scaffold, Cyclic, Swarm
- **TensorFlow**: FedAvg, FedOpt, Scaffold, Cyclic
- **NumPy / classic workflows**: FedAvg, Cross-Site Eval, LR FedAvg
- **Base recipes**: `nvflare/recipe/fedavg.py`, `nvflare/recipe/cyclic.py`

TF recipes do **not** expose `cuda_empty_cache`.
`torch.cuda.empty_cache()` is PyTorch-only; TF GPU memory is managed differently. TF subclasses hard-code `cuda_empty_cache=False` to the parent/ScriptRunner. The parameter will be added back when TF-specific GPU cache cleanup is implemented.
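The hard-coding described above amounts to the following pattern (a simplified sketch; the class names and signatures here are illustrative stand-ins, not the actual NVFlare recipe classes):

```python
class BaseFedAvgRecipe:
    """Stand-in for the shared base recipe (hypothetical names)."""

    def __init__(self, client_memory_gc_rounds: int = 0, cuda_empty_cache: bool = False):
        self.client_memory_gc_rounds = client_memory_gc_rounds
        self.cuda_empty_cache = cuda_empty_cache


class TFFedAvgRecipe(BaseFedAvgRecipe):
    """TF subclass: exposes the GC cadence but never the CUDA knob."""

    def __init__(self, client_memory_gc_rounds: int = 0):
        # torch.cuda.empty_cache() is PyTorch-only, so TF forces False
        super().__init__(client_memory_gc_rounds, cuda_empty_cache=False)
```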

In swarm learning both the trainer role and the aggregator role run on the client. The old code unconditionally called `gc.collect()` after each trainer submission in `Gatherer.gather()`. This PR replaces that with:

- configurable per-FL-round cadence via `memory_gc_rounds` (default `1` = every round, preserving legacy behavior)
- full `cleanup_memory()` call (`gc.collect` + `malloc_trim` + optional CUDA) instead of bare `gc.collect`, giving allocator-aware OS-level memory return
- `cuda_empty_cache` wired through since the aggregator client may also hold GPU tensors from training

The parameter is named `memory_gc_rounds` (not `client_memory_gc_rounds`) in `SimpleSwarmLearningRecipe` because both roles are client-side.

Documentation was updated to clarify Swarm uses `memory_gc_rounds` and `cuda_empty_cache` as top-level recipe args (not `train_args`).
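The "every N rounds" cadence used on both the trainer and aggregator sides reduces to a simple modular counter. A minimal sketch (a hypothetical helper, not the actual NVFlare class):

```python
class GCCadence:
    """Cadence semantics: 0 = disabled, 1 = every round, N = every N rounds."""

    def __init__(self, memory_gc_rounds: int):
        self.gc_rounds = memory_gc_rounds
        self.round_count = 0

    def should_cleanup(self) -> bool:
        if self.gc_rounds <= 0:
            return False  # cleanup disabled entirely
        self.round_count += 1
        return self.round_count % self.gc_rounds == 0


cadence = GCCadence(memory_gc_rounds=3)
fired = [cadence.should_cleanup() for _ in range(6)]
# cleanup fires on rounds 3 and 6
```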

| Recipe | Reason |
|--------|--------|
| Sklearn | Small models; Python GC has negligible impact |
| XGBoost | C++-managed memory; Python GC not impactful |
| FedEval | Evaluation-only, single-pass |
| server `cuda_empty_cache` | Servers typically have no GPU |
| TF `cuda_empty_cache` | PyTorch-only; TF cleanup not yet implemented |
- config key: `ConfigKey.CUDA_EMPTY_CACHE`
- env var: `NVFLARE_CUDA_EMPTY_CACHE`

  • Client memory cleanup and allocator-aware behavior

  • Client API and executor plumbing

  • Recipe API exposure for client memory knobs

  • Swarm GC made configurable with per-round cadence

  • Documentation and unit tests related to client memory behavior

  • Unrelated recipe semantic refactors

  • Unrelated server workflow behavior changes

  • fastdigest API upgrade (tracked separately)

- `pytest tests/unit_test/client/ex_process/memory_test.py` ✅ (7 passed)
- `pytest tests/unit_test/fuel/utils/memory_utils_test.py` ✅ (10 passed)
- `pytest tests/unit_test/recipe/server_memory_gc_rounds_test.py` ✅ (9 passed)
- `pytest tests/unit_test/app_common/ccwf/test_swarm_memory_gc.py` ✅ (7 passed)
- `pytest tests/unit_test/recipe/swarm_recipe_test.py` ✅ (15 passed)
- `pytest tests/unit_test/app_opt/tf/tf_recipe_no_cuda_cache_test.py` ✅ (skipped without TF, runs in TF CI)

- [x] Client memory env/API unit tests
- [x] Memory utils unit tests
- [x] Server memory GC rounds recipe tests
- [x] Swarm aggregator GC cadence unit tests
- [x] Swarm recipe memory param tests (default=1, old name rejected, cuda passthrough)
- [x] TF recipe no-cuda-empty-cache tests (run in TF CI)
- [ ] Run full targeted memory-management unit suite in stable CI/runtime
- [ ] Run full PR CI

  • Baseline reference PR style/content: Add allocator-aware client memory management (glibc/jemalloc/CUDA) #4200



Copilot AI review requested due to automatic review settings March 11, 2026 01:19
@greptile-apps
Contributor

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR introduces client-side memory management for NVFlare federated learning workloads, complementing the existing server-side controls. It adds allocator-aware cleanup (glibc via malloc_trim, jemalloc via auto-decay), optional CUDA cache clearing, and configurable per-N-round cadence — wired through recipes, ScriptRunner, and both in-process and external-process client APIs.

Key changes:

  • memory_utils.py: New cleanup_memory() / get_allocator_type() utilities with @lru_cache allocator detection and safe malloc_trim wrapping.
  • APISpec: Shared _maybe_cleanup_memory() base-class method drives round counting and cleanup for all client API implementations.
  • ExProcessClientAPI: Reads memory_gc_rounds / cuda_empty_cache from the job config file (written by ClientAPILauncherExecutor) with NVFLARE_CLIENT_MEMORY_GC_ROUNDS / NVFLARE_CUDA_EMPTY_CACHE env-var overrides.
  • InProcessClientAPI: Explicit configure_memory_management() called by InProcessClientAPIExecutor at START_RUN.
  • Swarm: _end_gather replaces unconditional gc.collect() with configurable cleanup_memory() via a dedicated _aggr_round_count.
  • sub_start_sh template: Opt-in jemalloc preload block added; however, MALLOC_ARENA_MAX defaults to 4 (server recommendation) instead of 2 (client recommendation), contradicting the inline comment.
  • send() cleanup ordering: Both InProcessClientAPI.send() and ExProcessClientAPI.send() call _maybe_cleanup_memory() unconditionally — even when clear_cache=False — which advances the round counter and may trigger GC while the caller is intentionally preserving model state.
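The env-var override path mentioned above (and the unguarded `int()` conversion flagged in the file table below) could be hardened along these lines. This is a sketch; the helper name `resolve_gc_rounds` is hypothetical, only the env-var name comes from the PR:

```python
import os


def resolve_gc_rounds(config_value: int) -> int:
    """Config value with NVFLARE_CLIENT_MEMORY_GC_ROUNDS env override;
    non-numeric env input falls back instead of crashing the client."""
    raw = os.environ.get("NVFLARE_CLIENT_MEMORY_GC_ROUNDS")
    if raw is None:
        return config_value
    try:
        return int(raw)
    except ValueError:
        return config_value  # guarded: bad env value is ignored


os.environ["NVFLARE_CLIENT_MEMORY_GC_ROUNDS"] = "7"
print(resolve_gc_rounds(5))  # prints 7: env override wins
```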

Confidence Score: 3/5

  • PR is mostly safe to merge but contains two correctness issues that may silently degrade the feature's effectiveness in the most common deployments.
  • The MALLOC_ARENA_MAX=4 default in the client startup template contradicts the inline comment and the PR's stated goal of reducing memory on edge/constrained clients — most deployments will silently get the wrong value without any error. The unconditional _maybe_cleanup_memory() call regardless of clear_cache is a subtler semantic issue. The rest of the implementation is well-structured and the new memory_utils module is correct.
  • nvflare/lighter/templates/master_template.yml (wrong MALLOC_ARENA_MAX default for client) and nvflare/client/in_process/api.py / nvflare/client/ex_process/api.py (cleanup fires when clear_cache=False).

Important Files Changed

| Filename | Overview |
|----------|----------|
| `nvflare/fuel/utils/memory_utils.py` | New module providing `get_allocator_type()`, `try_malloc_trim()`, and `cleanup_memory()`. Logic is correct with safe fallbacks; `@lru_cache` on allocator detection is appropriate. |
| `nvflare/client/api_spec.py` | Adds shared `_maybe_cleanup_memory()` to APISpec base class; clean implementation with in-method lazy import and correct modular counter logic. |
| `nvflare/client/ex_process/api.py` | Memory settings correctly read from config with env-var override; `int()` conversion of NVFLARE_CLIENT_MEMORY_GC_ROUNDS is unguarded against non-numeric input (flagged in prior thread). |
| `nvflare/client/in_process/api.py` | Memory cleanup fires unconditionally after `send()` regardless of `clear_cache` flag, advancing the round counter even when callers explicitly preserve model state. |
| `nvflare/lighter/templates/master_template.yml` | sub_start_sh (client startup) hardcodes MALLOC_ARENA_MAX default to 4 (the server recommendation) while the inline comment and docs say clients should use 2; ARM64 jemalloc path also missing (flagged in prior thread). |
| `nvflare/app_common/ccwf/swarm_client_ctl.py` | Aggregator-side GC is now configurable via memory_gc_rounds/cuda_empty_cache; `_aggr_round_count` counter correctly tracked; inline import in hot path flagged in prior thread. |
| `nvflare/app_common/executors/in_process_client_api_executor.py` | Correctly gates `configure_memory_management()` behind `_memory_gc_rounds > 0` and wires parameters through from ScriptRunner. |
| `nvflare/job_config/script_runner.py` | memory_gc_rounds and cuda_empty_cache correctly threaded through ScriptRunner; docstring missing parameter documentation (noted in prior thread). |
| `nvflare/app_opt/pt/recipes/swarm.py` | memory_gc_rounds=1 default preserves legacy gc.collect behavior; correctly plumbed to both ScriptRunner (trainer) and SwarmClientConfig (aggregator). |
| `nvflare/app_opt/tf/recipes/fedavg.py` | Correctly hard-codes cuda_empty_cache=False for TF since torch.cuda.empty_cache is PyTorch-only; exposes client_memory_gc_rounds as documented. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant R as Recipe / ScriptRunner
    participant E as InProcessClientAPIExecutor
    participant A as InProcessClientAPI (APISpec)
    participant M as memory_utils

    R->>E: __init__(memory_gc_rounds, cuda_empty_cache)
    E->>A: InProcessClientAPI(task_metadata)
    E->>A: configure_memory_management(gc_rounds, cuda_empty_cache)

    loop Each FL Round
        E->>A: set_meta(task_meta)
        A-->>R: receive() → FLModel
        Note over R: Training step
        R->>A: send(model)
        A->>A: _maybe_cleanup_memory()
        Note over A: _round_count += 1<br/>if _round_count % gc_rounds == 0
        A->>M: cleanup_memory(cuda_empty_cache)
        M->>M: gc.collect()
        alt glibc
            M->>M: malloc_trim(0)
        else jemalloc
            M->>M: rely on auto-decay
        end
        opt cuda_empty_cache=True
            M->>M: torch.cuda.empty_cache()
        end
    end
```

Comments Outside Diff (1)

  1. nvflare/client/in_process/api.py, line 136-158 (link)

    Memory cleanup fires even when clear_cache=False, causing counter divergence from actual round boundaries

    _maybe_cleanup_memory() (which increments _round_count) is called unconditionally at the end of send(), regardless of the clear_cache flag. The clear_cache=False path is specifically intended for callers that want to preserve the in-flight model state (e.g. for multi-task scripts), so incrementing the internal round counter and potentially triggering a full gc.collect() / malloc_trim / cuda.empty_cache() at that point can be surprising and can interfere with the model that was intentionally kept alive.

    The same pattern exists in ExProcessClientAPI.send(). Consider gating the cleanup on whether clear_cache is True, or at least documenting that the round counter always advances:

    ```python
    if clear_cache:
        self.fl_model = None
        self.receive_called = False
        # Perform memory cleanup if configured
        self._maybe_cleanup_memory()
    ```

Last reviewed commit: afe3f3d

Contributor

Copilot AI left a comment


Pull request overview

This PR cherry-picks allocator-aware client-side memory management into the 2.7 branch, wiring configurable periodic cleanup through Client APIs, executors, and recipes (including Swarm), with supporting docs and unit tests.

Changes:

  • Add allocator detection + allocator-aware cleanup_memory() (glibc malloc_trim vs jemalloc decay) with optional CUDA cache clearing.
  • Trigger client cleanup after flare.send() and propagate memory_gc_rounds / client_memory_gc_rounds + cuda_empty_cache through ScriptRunner and recipe APIs (including Swarm cadence control).
  • Update startup template + documentation, and add/adjust unit tests for memory behavior and recipe/executor parameter plumbing.

Reviewed changes

Copilot reviewed 42 out of 42 changed files in this pull request and generated 6 comments.

| File | Description |
|------|-------------|
| `tests/unit_test/recipe/swarm_recipe_test.py` | Adds Swarm recipe tests for memory_gc_rounds defaults/validation and CUDA passthrough. |
| `tests/unit_test/job_config/script_runner_test.py` | Adds ScriptRunner tests for new memory-management parameters. |
| `tests/unit_test/fuel/utils/memory_utils_test.py` | Updates/expands tests for allocator-aware cleanup and new API names. |
| `tests/unit_test/client/in_process/api_test.py` | Adds tests for in-process client API cleanup cadence. |
| `tests/unit_test/client/ex_process/memory_test.py` | Adds tests around env-var parsing expectations for ex-process mode (logic-level). |
| `tests/unit_test/client/ex_process/__init__.py` | Marks ex_process test directory as a package. |
| `tests/unit_test/app_opt/tf/tf_recipe_no_cuda_cache_test.py` | Ensures TF recipes reject cuda_empty_cache. |
| `tests/unit_test/app_common/statistics/quantile_test.py` | Disables TDigest-based tests on darwin. |
| `tests/unit_test/app_common/executors/in_process_client_api_executor_test.py` | Adds tests for executor memory parameters. |
| `tests/unit_test/app_common/ccwf/test_swarm_memory_gc.py` | Adds tests for Swarm aggregator GC cadence and CUDA passthrough. |
| `nvflare/recipe/fedavg.py` | Exposes client_memory_gc_rounds/cuda_empty_cache and wires into ScriptRunner. |
| `nvflare/recipe/cyclic.py` | Exposes client_memory_gc_rounds/cuda_empty_cache and wires into ScriptRunner. |
| `nvflare/lighter/templates/master_template.yml` | Adds opt-in jemalloc preload + MALLOC_CONF defaults to startup script template. |
| `nvflare/job_config/script_runner.py` | Adds memory parameters and forwards them to in-process / launcher executors. |
| `nvflare/fuel/utils/memory_utils.py` | Implements `get_allocator_type()` and allocator-aware `cleanup_memory(cuda_empty_cache=...)`. |
| `nvflare/client/in_process/api.py` | Calls base init and triggers cleanup after send(); adds configuration helper. |
| `nvflare/client/ex_process/api.py` | Reads memory settings from config with env override; triggers cleanup after send(). |
| `nvflare/client/config.py` | Adds config keys for memory_gc_rounds and cuda_empty_cache. |
| `nvflare/client/api_spec.py` | Adds shared memory-management state + `_maybe_cleanup_memory()` to base client API. |
| `nvflare/app_opt/tf/recipes/scaffold.py` | Adds client_memory_gc_rounds plumbing (hard-codes cuda_empty_cache=False). |
| `nvflare/app_opt/tf/recipes/fedopt.py` | Adds client_memory_gc_rounds plumbing (hard-codes cuda_empty_cache=False). |
| `nvflare/app_opt/tf/recipes/fedavg.py` | Adds client_memory_gc_rounds and forces cuda_empty_cache=False. |
| `nvflare/app_opt/tf/recipes/cyclic.py` | Adds client_memory_gc_rounds and forces cuda_empty_cache=False. |
| `nvflare/app_opt/tf/in_process_client_api_executor.py` | Adds memory params to TF in-process executor wrapper plumbing. |
| `nvflare/app_opt/tf/client_api_launcher_executor.py` | Adds memory params to TF launcher executor wrapper plumbing. |
| `nvflare/app_opt/pt/recipes/swarm.py` | Adds Swarm memory_gc_rounds/cuda_empty_cache top-level args + reserved-key checks. |
| `nvflare/app_opt/pt/recipes/scaffold.py` | Adds recipe-level client memory knobs and forwards to ScriptRunner. |
| `nvflare/app_opt/pt/recipes/fedopt.py` | Adds recipe-level client memory knobs and forwards to ScriptRunner. |
| `nvflare/app_opt/pt/recipes/fedeval.py` | Adds recipe-level client memory knobs and forwards to ScriptRunner. |
| `nvflare/app_opt/pt/recipes/fedavg_he.py` | Adds recipe-level client memory knobs and forwards to ScriptRunner. |
| `nvflare/app_opt/pt/recipes/fedavg.py` | Forwards new client memory knobs to the unified base recipe. |
| `nvflare/app_opt/pt/recipes/cyclic.py` | Forwards new client memory knobs to the unified base recipe. |
| `nvflare/app_opt/pt/in_process_client_api_executor.py` | Adds memory params to PT in-process executor wrapper plumbing. |
| `nvflare/app_opt/pt/client_api_launcher_executor.py` | Adds memory params to PT launcher executor wrapper plumbing. |
| `nvflare/app_common/np/recipes/lr/fedavg.py` | Adds client memory knobs and forwards to ScriptRunner for RAW framework recipe. |
| `nvflare/app_common/np/recipes/fedavg.py` | Adds client memory knobs to NumPy recipe surface. |
| `nvflare/app_common/np/recipes/cross_site_eval.py` | Adds client memory knobs and forwards to ScriptRunner for validation tasks. |
| `nvflare/app_common/executors/in_process_client_api_executor.py` | Stores memory params and configures InProcessClientAPI accordingly. |
| `nvflare/app_common/executors/client_api_launcher_executor.py` | Writes memory params into client API config for external processes. |
| `nvflare/app_common/ccwf/swarm_client_ctl.py` | Makes Swarm aggregator-side cleanup configurable by round cadence + CUDA passthrough. |
| `nvflare/app_common/ccwf/ccwf_job.py` | Plumbs Swarm client config memory knobs into controller creation. |
| `docs/programming_guide/memory_management.rst` | Expands docs for allocator support and client-side cleanup configuration. |


@YuanTingHsieh
Collaborator Author

This is a cherry-pick; additional comments will be addressed in a different PR.
