Skip to content

Conversation

@ZiyueXu77
Copy link
Collaborator

@ZiyueXu77 ZiyueXu77 commented Sep 5, 2025

Improvement over device-sampling, not tie to release.

Description

Current behavior is random selection over available device list, treating devices from any clients the same, With this PR, we can select between totally random and balanced selection (default to balanced which will lead to less randomness for simulation).
Balanced selection try to select and dispatch global model to devices such that each leaf client will have relatively similar number of active devices at any time point.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

@ZiyueXu77 ZiyueXu77 requested review from Copilot and yanchengnv and removed request for Copilot September 5, 2025 18:12
Copilot AI review requested due to automatic review settings September 15, 2025 14:56
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces balanced device selection functionality for edge federated learning, allowing the system to select devices evenly across clients instead of purely random selection.

  • Adds balanced device sampling strategy as the default selection method
  • Implements timeout and monitoring capabilities for device availability
  • Refactors device management to track client-to-device mappings for balanced distribution

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
nvflare/edge/tools/edge_fed_buff_recipe.py Adds device sampling strategy configuration, simulation environment processing, and class rename
nvflare/edge/assessors/model_update.py Implements device wait timeout functionality and enhanced device status logging
nvflare/edge/assessors/device_manager.py Updates abstract method signature to check both devices and clients
nvflare/edge/assessors/buff_device_manager.py Implements balanced device sampling algorithm and client tracking
examples/advanced/edge/jobs/pt_job_adv.py Updates example to use new initial_min_client_num parameter

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

Added balanced device selection strategy to improve fairness across clients in edge federated learning. Previously, device selection was purely random, treating all devices equally regardless of their client origin. This change introduces a "balanced" sampling strategy (now the default) that distributes device selection more evenly across leaf clients.

Key Changes:

  • Added device_sampling_strategy parameter to BuffDeviceManager ("balanced" or "random")
  • Added initial_min_client_num parameter to wait for minimum number of clients before starting
  • Implemented _balanced_device_sampling() method that uses round-robin distribution across clients
  • Renamed abstract method has_enough_devices() to has_enough_devices_and_clients() to reflect new client-level checks
  • Updated configuration classes and example job to use new parameters

Issues Found:

  • CRITICAL: Two method calls in model_update.py still use the old method name has_enough_devices() instead of the renamed has_enough_devices_and_clients(), which will cause AttributeError at runtime

Confidence Score: 1/5

  • This PR contains critical bugs that will cause runtime failures
  • Score reflects two critical method naming errors that were missed during refactoring. The abstract method has_enough_devices was renamed to has_enough_devices_and_clients in the base class and implementation, but two call sites in model_update.py (lines 192 and 238) were not updated. This will cause AttributeError exceptions at runtime when these code paths are executed. The balanced sampling logic itself appears sound, but the incomplete refactoring makes this PR unsafe to merge.
  • Pay immediate attention to nvflare/edge/assessors/model_update.py - contains two broken method calls that must be fixed before merging

Important Files Changed

File Analysis

Filename Score Overview
nvflare/edge/assessors/buff_device_manager.py 3/5 Added balanced device sampling strategy with new parameters initial_min_client_num and device_sampling_strategy. Implementation looks sound but needs testing with edge cases (empty client mappings, uneven device distribution).
nvflare/edge/assessors/device_manager.py 5/5 Renamed abstract method has_enough_devices to has_enough_devices_and_clients to reflect enhanced functionality. Clean interface change.
nvflare/edge/tools/edge_fed_buff_recipe.py 5/5 Added new configuration parameters (initial_min_client_num, device_sampling_strategy) to DeviceManagerConfig and properly wired them through to BuffDeviceManager. Clean configuration extension.
examples/advanced/edge/jobs/pt_job_adv.py 5/5 Updated example to use new initial_min_client_num parameter set to num_leaf_nodes for waiting for all clients before starting. Minor whitespace cleanup included.

Sequence Diagram

sequenceDiagram
    participant Client as Client/Leaf Node
    participant Assessor as ModelUpdateAssessor
    participant DevMgr as BuffDeviceManager
    participant ModelMgr as BuffModelManager
    
    Client->>Assessor: StateUpdateReport (available_devices)
    Assessor->>DevMgr: update_available_devices()
    DevMgr->>DevMgr: Build device_client_map
    
    Assessor->>DevMgr: has_enough_devices_and_clients()
    DevMgr->>DevMgr: Check usable devices >= holes
    DevMgr->>DevMgr: Check unique clients >= initial_min_client_num
    DevMgr-->>Assessor: boolean result
    
    alt Enough devices and clients
        Assessor->>DevMgr: should_fill_selection()
        DevMgr-->>Assessor: num_holes >= min_hole_to_fill
        
        alt Should fill selection
            Assessor->>ModelMgr: generate_new_model() (if initial)
            Assessor->>DevMgr: fill_selection()
            
            alt Balanced Strategy
                DevMgr->>DevMgr: _balanced_device_sampling()
                DevMgr->>DevMgr: Count devices per client
                DevMgr->>DevMgr: Distribute holes evenly across clients
                DevMgr->>DevMgr: Update current_selection & used_devices
            else Random Strategy
                DevMgr->>DevMgr: Random selection from usable devices
                DevMgr->>DevMgr: Update current_selection & used_devices
            end
            
            DevMgr-->>Assessor: Selection filled
        end
    else Waiting for devices
        Assessor->>Assessor: Start/continue wait timer
    end
    
    Assessor->>ModelMgr: get_current_model()
    ModelMgr-->>Assessor: Current model DXO
    
    Assessor->>DevMgr: get_selection()
    DevMgr-->>Assessor: Device selection dict
    
    Assessor-->>Client: StateUpdateReply (model, device_selection)
Loading

Additional Comments (2)

  1. nvflare/edge/assessors/model_update.py, line 192 (link)

    syntax: method renamed to has_enough_devices_and_clients in this PR but call not updated

  2. nvflare/edge/assessors/model_update.py, line 238 (link)

    syntax: method renamed to has_enough_devices_and_clients but call not updated

4 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR renames the method has_enough_devices() to has_enough_devices_and_clients() in two call sites within model_update.py. This change improves code clarity by making the function name accurately reflect its behavior - it checks both device availability and client count requirements before proceeding with device selection.

  • Updated function calls on lines 192 and 238 to use the new method name
  • No logic changes, purely a refactoring for better code readability
  • Aligns with the broader balanced device selection feature that distributes devices evenly across clients

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The changes are purely cosmetic - renaming a method call to better reflect its functionality. No logic was modified, no new code was introduced, and the renamed method already exists in the codebase with the correct implementation. The changes are consistent across both call sites.
  • No files require special attention

Important Files Changed

File Analysis

Filename Score Overview
nvflare/edge/assessors/model_update.py 5/5 Renamed function calls from has_enough_devices() to has_enough_devices_and_clients() for improved clarity (lines 192, 238)

Sequence Diagram

sequenceDiagram
    participant Client
    participant Assessor as ModelUpdateAssessor
    participant DevMgr as BuffDeviceManager
    
    Client->>Assessor: process_child_update(StateUpdateReport)
    Assessor->>DevMgr: update_available_devices(devices)
    DevMgr-->>DevMgr: Update device_client_map
    
    alt Waiting for devices
        Assessor->>DevMgr: has_enough_devices_and_clients()
        DevMgr-->>DevMgr: Check usable devices >= holes
        DevMgr-->>DevMgr: Check unique_clients >= initial_min_client_num
        DevMgr-->>Assessor: True/False
        
        alt Enough devices and clients
            Assessor-->>Assessor: Reset device_wait_start_time
        else Not enough
            Assessor-->>Assessor: Continue waiting or timeout
        end
    end
    
    Assessor->>DevMgr: should_fill_selection()
    DevMgr-->>Assessor: True if holes >= min_hole_to_fill
    
    alt Should fill selection
        Assessor->>DevMgr: has_enough_devices_and_clients()
        DevMgr-->>Assessor: True
        
        Assessor->>DevMgr: fill_selection(model_version)
        
        alt device_sampling_strategy == "balanced"
            DevMgr-->>DevMgr: _balanced_device_sampling()
            DevMgr-->>DevMgr: Distribute devices evenly across clients
        else device_sampling_strategy == "random"
            DevMgr-->>DevMgr: Random selection from usable devices
        end
        
        DevMgr-->>Assessor: Selection filled
    end
    
    Assessor-->>Client: StateUpdateReply with device_selection
Loading

1 file reviewed, no comments

Edit Code Review Agent Settings | Greptile

@ZiyueXu77
Copy link
Collaborator Author

/build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants