Add functionality for a balanced device selection across clients #3661

ZiyueXu77 · 2025-09-05T18:12:01Z

Improvement over device-sampling, not tie to release.

Description

Current behavior is random selection over available device list, treating devices from any clients the same, With this PR, we can select between totally random and balanced selection (default to balanced which will lead to less randomness for simulation).
Balanced selection try to select and dispatch global model to devices such that each leaf client will have relatively similar number of active devices at any time point.

Types of changes

Non-breaking change (fix or new feature that would not break existing functionality).
Breaking change (fix or new feature that would cause existing functionality to change).
New tests added to cover the changes.
Quick tests passed locally by running ./runtest.sh.
In-line docstrings updated.
Documentation updated.

…andom

Copilot

Pull Request Overview

This PR introduces balanced device selection functionality for edge federated learning, allowing the system to select devices evenly across clients instead of purely random selection.

Adds balanced device sampling strategy as the default selection method
Implements timeout and monitoring capabilities for device availability
Refactors device management to track client-to-device mappings for balanced distribution

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
nvflare/edge/tools/edge_fed_buff_recipe.py	Adds device sampling strategy configuration, simulation environment processing, and class rename
nvflare/edge/assessors/model_update.py	Implements device wait timeout functionality and enhanced device status logging
nvflare/edge/assessors/device_manager.py	Updates abstract method signature to check both devices and clients
nvflare/edge/assessors/buff_device_manager.py	Implements balanced device sampling algorithm and client tracking
examples/advanced/edge/jobs/pt_job_adv.py	Updates example to use new initial_min_client_num parameter

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

nvflare/edge/assessors/device_manager.py

nvflare/edge/assessors/model_update.py

nvflare/edge/assessors/buff_device_manager.py

nvflare/edge/tools/edge_fed_buff_recipe.py

greptile-apps

Greptile Overview

Greptile Summary

Added balanced device selection strategy to improve fairness across clients in edge federated learning. Previously, device selection was purely random, treating all devices equally regardless of their client origin. This change introduces a "balanced" sampling strategy (now the default) that distributes device selection more evenly across leaf clients.

Key Changes:

Added device_sampling_strategy parameter to BuffDeviceManager ("balanced" or "random")
Added initial_min_client_num parameter to wait for minimum number of clients before starting
Implemented _balanced_device_sampling() method that uses round-robin distribution across clients
Renamed abstract method has_enough_devices() to has_enough_devices_and_clients() to reflect new client-level checks
Updated configuration classes and example job to use new parameters

Issues Found:

CRITICAL: Two method calls in model_update.py still use the old method name has_enough_devices() instead of the renamed has_enough_devices_and_clients(), which will cause AttributeError at runtime

Confidence Score: 1/5

This PR contains critical bugs that will cause runtime failures
Score reflects two critical method naming errors that were missed during refactoring. The abstract method has_enough_devices was renamed to has_enough_devices_and_clients in the base class and implementation, but two call sites in model_update.py (lines 192 and 238) were not updated. This will cause AttributeError exceptions at runtime when these code paths are executed. The balanced sampling logic itself appears sound, but the incomplete refactoring makes this PR unsafe to merge.
Pay immediate attention to nvflare/edge/assessors/model_update.py - contains two broken method calls that must be fixed before merging

Important Files Changed

File Analysis

Filename	Score	Overview
nvflare/edge/assessors/buff_device_manager.py	3/5	Added balanced device sampling strategy with new parameters `initial_min_client_num` and `device_sampling_strategy`. Implementation looks sound but needs testing with edge cases (empty client mappings, uneven device distribution).
nvflare/edge/assessors/device_manager.py	5/5	Renamed abstract method `has_enough_devices` to `has_enough_devices_and_clients` to reflect enhanced functionality. Clean interface change.
nvflare/edge/tools/edge_fed_buff_recipe.py	5/5	Added new configuration parameters (`initial_min_client_num`, `device_sampling_strategy`) to `DeviceManagerConfig` and properly wired them through to `BuffDeviceManager`. Clean configuration extension.
examples/advanced/edge/jobs/pt_job_adv.py	5/5	Updated example to use new `initial_min_client_num` parameter set to `num_leaf_nodes` for waiting for all clients before starting. Minor whitespace cleanup included.

Sequence Diagram

sequenceDiagram
    participant Client as Client/Leaf Node
    participant Assessor as ModelUpdateAssessor
    participant DevMgr as BuffDeviceManager
    participant ModelMgr as BuffModelManager
    
    Client->>Assessor: StateUpdateReport (available_devices)
    Assessor->>DevMgr: update_available_devices()
    DevMgr->>DevMgr: Build device_client_map
    
    Assessor->>DevMgr: has_enough_devices_and_clients()
    DevMgr->>DevMgr: Check usable devices >= holes
    DevMgr->>DevMgr: Check unique clients >= initial_min_client_num
    DevMgr-->>Assessor: boolean result
    
    alt Enough devices and clients
        Assessor->>DevMgr: should_fill_selection()
        DevMgr-->>Assessor: num_holes >= min_hole_to_fill
        
        alt Should fill selection
            Assessor->>ModelMgr: generate_new_model() (if initial)
            Assessor->>DevMgr: fill_selection()
            
            alt Balanced Strategy
                DevMgr->>DevMgr: _balanced_device_sampling()
                DevMgr->>DevMgr: Count devices per client
                DevMgr->>DevMgr: Distribute holes evenly across clients
                DevMgr->>DevMgr: Update current_selection & used_devices
            else Random Strategy
                DevMgr->>DevMgr: Random selection from usable devices
                DevMgr->>DevMgr: Update current_selection & used_devices
            end
            
            DevMgr-->>Assessor: Selection filled
        end
    else Waiting for devices
        Assessor->>Assessor: Start/continue wait timer
    end
    
    Assessor->>ModelMgr: get_current_model()
    ModelMgr-->>Assessor: Current model DXO
    
    Assessor->>DevMgr: get_selection()
    DevMgr-->>Assessor: Device selection dict
    
    Assessor-->>Client: StateUpdateReply (model, device_selection)

Additional Comments (2)

nvflare/edge/assessors/model_update.py, line 192 (link)

syntax: method renamed to has_enough_devices_and_clients in this PR but call not updated
nvflare/edge/assessors/model_update.py, line 238 (link)

syntax: method renamed to has_enough_devices_and_clients but call not updated

_{4 files reviewed, 2 comments}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps

Greptile Overview

Greptile Summary

This PR renames the method has_enough_devices() to has_enough_devices_and_clients() in two call sites within model_update.py. This change improves code clarity by making the function name accurately reflect its behavior - it checks both device availability and client count requirements before proceeding with device selection.

Updated function calls on lines 192 and 238 to use the new method name
No logic changes, purely a refactoring for better code readability
Aligns with the broader balanced device selection feature that distributes devices evenly across clients

Confidence Score: 5/5

This PR is safe to merge with minimal risk
The changes are purely cosmetic - renaming a method call to better reflect its functionality. No logic was modified, no new code was introduced, and the renamed method already exists in the codebase with the correct implementation. The changes are consistent across both call sites.
No files require special attention

Important Files Changed

File Analysis

Filename	Score	Overview
nvflare/edge/assessors/model_update.py	5/5	Renamed function calls from `has_enough_devices()` to `has_enough_devices_and_clients()` for improved clarity (lines 192, 238)

Sequence Diagram

sequenceDiagram
    participant Client
    participant Assessor as ModelUpdateAssessor
    participant DevMgr as BuffDeviceManager
    
    Client->>Assessor: process_child_update(StateUpdateReport)
    Assessor->>DevMgr: update_available_devices(devices)
    DevMgr-->>DevMgr: Update device_client_map
    
    alt Waiting for devices
        Assessor->>DevMgr: has_enough_devices_and_clients()
        DevMgr-->>DevMgr: Check usable devices >= holes
        DevMgr-->>DevMgr: Check unique_clients >= initial_min_client_num
        DevMgr-->>Assessor: True/False
        
        alt Enough devices and clients
            Assessor-->>Assessor: Reset device_wait_start_time
        else Not enough
            Assessor-->>Assessor: Continue waiting or timeout
        end
    end
    
    Assessor->>DevMgr: should_fill_selection()
    DevMgr-->>Assessor: True if holes >= min_hole_to_fill
    
    alt Should fill selection
        Assessor->>DevMgr: has_enough_devices_and_clients()
        DevMgr-->>Assessor: True
        
        Assessor->>DevMgr: fill_selection(model_version)
        
        alt device_sampling_strategy == "balanced"
            DevMgr-->>DevMgr: _balanced_device_sampling()
            DevMgr-->>DevMgr: Distribute devices evenly across clients
        else device_sampling_strategy == "random"
            DevMgr-->>DevMgr: Random selection from usable devices
        end
        
        DevMgr-->>Assessor: Selection filled
    end
    
    Assessor-->>Client: StateUpdateReply with device_selection

_{1 file reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

ZiyueXu77 · 2025-11-06T17:42:47Z

/build

add functionality to balance device selection across clients beyond r…

7eaf9d1

…andom

ZiyueXu77 requested review from Copilot and yanchengnv and removed request for Copilot September 5, 2025 18:12

address PR conflicts

9fad481

Copilot AI review requested due to automatic review settings September 15, 2025 14:56

Copilot AI reviewed Sep 15, 2025

View reviewed changes

ZiyueXu77 and others added 4 commits September 15, 2025 10:59

Merge branch 'main' into device_select

cfb4c10

Merge branch 'NVIDIA:main' into device_select

c4c1029

Merge branch 'NVIDIA:main' into device_select

19d74b3

Merge branch 'main' into device_select

3a37dcc

greptile-apps bot reviewed Nov 6, 2025

View reviewed changes

update func name

404837f

greptile-apps bot reviewed Nov 6, 2025

View reviewed changes

ZiyueXu77 requested a review from YuanTingHsieh November 7, 2025 16:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add functionality for a balanced device selection across clients #3661

Add functionality for a balanced device selection across clients #3661

Uh oh!

ZiyueXu77 commented Sep 5, 2025 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps bot left a comment •

edited

Loading

Uh oh!

greptile-apps bot left a comment

Uh oh!

ZiyueXu77 commented Nov 6, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Add functionality for a balanced device selection across clients #3661

Are you sure you want to change the base?

Add functionality for a balanced device selection across clients #3661

Uh oh!

Conversation

ZiyueXu77 commented Sep 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Types of changes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Greptile Overview

Greptile Summary

Confidence Score: 1/5

Important Files Changed

Sequence Diagram

Additional Comments (2)

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Greptile Overview

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

ZiyueXu77 commented Nov 6, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ZiyueXu77 commented Sep 5, 2025 •

edited

Loading

greptile-apps bot left a comment •

edited

Loading