-
Notifications
You must be signed in to change notification settings - Fork 223
Add functionality for a balanced device selection across clients #3661
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces balanced device selection functionality for edge federated learning, allowing the system to select devices evenly across clients instead of purely random selection.
- Adds balanced device sampling strategy as the default selection method
- Implements timeout and monitoring capabilities for device availability
- Refactors device management to track client-to-device mappings for balanced distribution
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| nvflare/edge/tools/edge_fed_buff_recipe.py | Adds device sampling strategy configuration, simulation environment processing, and class rename |
| nvflare/edge/assessors/model_update.py | Implements device wait timeout functionality and enhanced device status logging |
| nvflare/edge/assessors/device_manager.py | Updates abstract method signature to check both devices and clients |
| nvflare/edge/assessors/buff_device_manager.py | Implements balanced device sampling algorithm and client tracking |
| examples/advanced/edge/jobs/pt_job_adv.py | Updates example to use new initial_min_client_num parameter |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
Added balanced device selection strategy to improve fairness across clients in edge federated learning. Previously, device selection was purely random, treating all devices equally regardless of their client origin. This change introduces a "balanced" sampling strategy (now the default) that distributes device selection more evenly across leaf clients.
Key Changes:
- Added
device_sampling_strategyparameter toBuffDeviceManager("balanced" or "random") - Added
initial_min_client_numparameter to wait for minimum number of clients before starting - Implemented
_balanced_device_sampling()method that uses round-robin distribution across clients - Renamed abstract method
has_enough_devices()tohas_enough_devices_and_clients()to reflect new client-level checks - Updated configuration classes and example job to use new parameters
Issues Found:
- CRITICAL: Two method calls in
model_update.pystill use the old method namehas_enough_devices()instead of the renamedhas_enough_devices_and_clients(), which will causeAttributeErrorat runtime
Confidence Score: 1/5
- This PR contains critical bugs that will cause runtime failures
- Score reflects two critical method naming errors that were missed during refactoring. The abstract method
has_enough_deviceswas renamed tohas_enough_devices_and_clientsin the base class and implementation, but two call sites inmodel_update.py(lines 192 and 238) were not updated. This will causeAttributeErrorexceptions at runtime when these code paths are executed. The balanced sampling logic itself appears sound, but the incomplete refactoring makes this PR unsafe to merge. - Pay immediate attention to
nvflare/edge/assessors/model_update.py- contains two broken method calls that must be fixed before merging
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| nvflare/edge/assessors/buff_device_manager.py | 3/5 | Added balanced device sampling strategy with new parameters initial_min_client_num and device_sampling_strategy. Implementation looks sound but needs testing with edge cases (empty client mappings, uneven device distribution). |
| nvflare/edge/assessors/device_manager.py | 5/5 | Renamed abstract method has_enough_devices to has_enough_devices_and_clients to reflect enhanced functionality. Clean interface change. |
| nvflare/edge/tools/edge_fed_buff_recipe.py | 5/5 | Added new configuration parameters (initial_min_client_num, device_sampling_strategy) to DeviceManagerConfig and properly wired them through to BuffDeviceManager. Clean configuration extension. |
| examples/advanced/edge/jobs/pt_job_adv.py | 5/5 | Updated example to use new initial_min_client_num parameter set to num_leaf_nodes for waiting for all clients before starting. Minor whitespace cleanup included. |
Sequence Diagram
sequenceDiagram
participant Client as Client/Leaf Node
participant Assessor as ModelUpdateAssessor
participant DevMgr as BuffDeviceManager
participant ModelMgr as BuffModelManager
Client->>Assessor: StateUpdateReport (available_devices)
Assessor->>DevMgr: update_available_devices()
DevMgr->>DevMgr: Build device_client_map
Assessor->>DevMgr: has_enough_devices_and_clients()
DevMgr->>DevMgr: Check usable devices >= holes
DevMgr->>DevMgr: Check unique clients >= initial_min_client_num
DevMgr-->>Assessor: boolean result
alt Enough devices and clients
Assessor->>DevMgr: should_fill_selection()
DevMgr-->>Assessor: num_holes >= min_hole_to_fill
alt Should fill selection
Assessor->>ModelMgr: generate_new_model() (if initial)
Assessor->>DevMgr: fill_selection()
alt Balanced Strategy
DevMgr->>DevMgr: _balanced_device_sampling()
DevMgr->>DevMgr: Count devices per client
DevMgr->>DevMgr: Distribute holes evenly across clients
DevMgr->>DevMgr: Update current_selection & used_devices
else Random Strategy
DevMgr->>DevMgr: Random selection from usable devices
DevMgr->>DevMgr: Update current_selection & used_devices
end
DevMgr-->>Assessor: Selection filled
end
else Waiting for devices
Assessor->>Assessor: Start/continue wait timer
end
Assessor->>ModelMgr: get_current_model()
ModelMgr-->>Assessor: Current model DXO
Assessor->>DevMgr: get_selection()
DevMgr-->>Assessor: Device selection dict
Assessor-->>Client: StateUpdateReply (model, device_selection)
Additional Comments (2)
4 files reviewed, 2 comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This PR renames the method has_enough_devices() to has_enough_devices_and_clients() in two call sites within model_update.py. This change improves code clarity by making the function name accurately reflect its behavior - it checks both device availability and client count requirements before proceeding with device selection.
- Updated function calls on lines 192 and 238 to use the new method name
- No logic changes, purely a refactoring for better code readability
- Aligns with the broader balanced device selection feature that distributes devices evenly across clients
Confidence Score: 5/5
- This PR is safe to merge with minimal risk
- The changes are purely cosmetic - renaming a method call to better reflect its functionality. No logic was modified, no new code was introduced, and the renamed method already exists in the codebase with the correct implementation. The changes are consistent across both call sites.
- No files require special attention
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| nvflare/edge/assessors/model_update.py | 5/5 | Renamed function calls from has_enough_devices() to has_enough_devices_and_clients() for improved clarity (lines 192, 238) |
Sequence Diagram
sequenceDiagram
participant Client
participant Assessor as ModelUpdateAssessor
participant DevMgr as BuffDeviceManager
Client->>Assessor: process_child_update(StateUpdateReport)
Assessor->>DevMgr: update_available_devices(devices)
DevMgr-->>DevMgr: Update device_client_map
alt Waiting for devices
Assessor->>DevMgr: has_enough_devices_and_clients()
DevMgr-->>DevMgr: Check usable devices >= holes
DevMgr-->>DevMgr: Check unique_clients >= initial_min_client_num
DevMgr-->>Assessor: True/False
alt Enough devices and clients
Assessor-->>Assessor: Reset device_wait_start_time
else Not enough
Assessor-->>Assessor: Continue waiting or timeout
end
end
Assessor->>DevMgr: should_fill_selection()
DevMgr-->>Assessor: True if holes >= min_hole_to_fill
alt Should fill selection
Assessor->>DevMgr: has_enough_devices_and_clients()
DevMgr-->>Assessor: True
Assessor->>DevMgr: fill_selection(model_version)
alt device_sampling_strategy == "balanced"
DevMgr-->>DevMgr: _balanced_device_sampling()
DevMgr-->>DevMgr: Distribute devices evenly across clients
else device_sampling_strategy == "random"
DevMgr-->>DevMgr: Random selection from usable devices
end
DevMgr-->>Assessor: Selection filled
end
Assessor-->>Client: StateUpdateReply with device_selection
1 file reviewed, no comments
|
/build |
Improvement over device-sampling, not tie to release.
Description
Current behavior is random selection over available device list, treating devices from any clients the same, With this PR, we can select between totally random and balanced selection (default to balanced which will lead to less randomness for simulation).
Balanced selection try to select and dispatch global model to devices such that each leaf client will have relatively similar number of active devices at any time point.
Types of changes
./runtest.sh.