Add huggingface example to hierarchical #3780
base: main
Conversation
Pull Request Overview
This PR adds HuggingFace language model training support to the hierarchical federated learning system and addresses latency issues by optimizing model serialization. The changes include removing inefficient .tolist() conversions that were causing performance bottlenecks in model transmission, adding model size logging for better monitoring, and providing a complete HuggingFace SFT example for hierarchical FL.
- Optimized model serialization by keeping numpy arrays instead of converting to lists
- Added comprehensive HuggingFace SFT example with task processor, job configuration, and data preprocessing utilities
- Enhanced logging with model size information for better monitoring and debugging
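The serialization optimization described above is easy to see in a standalone sketch. This is not NVFlare's actual FOBS serializer; `pickle` is used here only as a stand-in to illustrate why keeping numpy arrays is smaller and faster than converting them to Python lists:

```python
# Illustrative sketch (not NVFlare's FOBS serializer): compare the serialized
# size of a model dict kept as numpy arrays vs. converted to Python lists.
import pickle

import numpy as np

# hypothetical model: two float32 weight tensors
model = {
    "layer1.weight": np.random.rand(256, 256).astype(np.float32),
    "layer2.weight": np.random.rand(256, 10).astype(np.float32),
}

# keep numpy arrays: raw 4-byte float32 buffers are serialized directly
as_numpy = pickle.dumps(model)

# convert to lists (the removed .tolist() path): each element becomes a
# boxed Python float, inflating size and serialization time
as_lists = pickle.dumps({k: v.tolist() for k, v in model.items()})

print(f"numpy arrays: {len(as_numpy) / 1024:.1f} KB")
print(f"python lists: {len(as_lists) / 1024:.1f} KB")
```

The list form is several times larger because every float32 element is re-encoded as an individual Python float object instead of a contiguous binary buffer.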
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| nvflare/edge/assessors/model_update.py | Added model size logging with numpy import |
| nvflare/edge/assessors/buff_model_manager.py | Removed tolist() conversion and added model size logging |
| examples/tutorials/.../hf_sft_model.py (3 files) | Added model_name_or_path attribute to CausalLMModel class |
| examples/advanced/edge/utils/preprocess_dolly.py | New data preprocessing utility for Dolly dataset |
| examples/advanced/edge/jobs/processors/models/hf_sft_model.py | New HuggingFace model wrapper for edge processing |
| examples/advanced/edge/jobs/processors/hf_sft_task_processor.py | New comprehensive task processor for HuggingFace SFT training |
| examples/advanced/edge/jobs/processors/cifar10_pt_task_processor.py | Refactored to move setup logic from training method |
| examples/advanced/edge/jobs/hf_sft_job.py | New job configuration script for HuggingFace SFT federated learning |
| examples/advanced/edge/README.md | Updated documentation with HuggingFace example instructions |
Co-authored-by: Copilot <[email protected]>
chesterxgchen left a comment:
For newly added examples, please follow this convention:
- No need to add a src directory; just keep it flat.
- For the model file, no need to name it something like hf_sft_model.py; you are already inside the example's directory. Just call it model.py.
- The client-side training script should be called client.py.
- The job code should be called job.py (instead of hf_sft_job.py).
- Try to use one folder per job, and do not share code between jobs. Sharing code may look efficient to the example creator, but it is confusing for the user trying to figure out which parts are relevant. The example then does not give the end user the simplest way to understand the concept.

All new examples should ideally have a code structure like this:
client.py
server.py (may be missing for built-in FL algorithms)
model.py
job.py
download_data.py (optional)
prepare_data.py (optional)

Unless we make this consistent, it is very hard to do automated testing, and it is confusing to users as well. We name our files too freely: cifar10_fl_job.py, fl_job.py, cifar10_fedavt_train.py, etc. Let's be consistent.
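The convention above can be sketched as a directory tree (the layout is taken from the comment; `my_example/` is a placeholder name, and `server.py` and the data scripts are optional):

```
my_example/
├── client.py         # client-side training script
├── server.py         # optional: custom server logic (omit for built-in FL algorithms)
├── model.py          # model definition
├── job.py            # job configuration / submission
├── download_data.py  # optional
└── prepare_data.py   # optional
```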
```diff
+ # print model size in MB
+ model_size = sum([v.nbytes for v in new_model.values()]) / (1024 * 1024)
+ self.log_info(fl_ctx, f"new model size: {model_size:.2f} MB")
```

```diff
  # update the current model
- # convert new_model items from numpy arrays to lists for serialization
- new_model = {k: v.tolist() if isinstance(v, np.ndarray) else v for k, v in new_model.items()}
+ # Keep numpy arrays for efficient FOBS serialization through the hierarchy
+ # If converting to list, the model size will be much larger and slower to serialize
+ # Note that for cases where the device expects list (e.g. ExecuTorch simulation),
```
This could break the current ET examples with mobile; maybe we should merge this PR after the release.
Good point! Let's do this PR after the release to avoid breaking anything.
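The truncated comment in the diff above notes that some devices (e.g. an ExecuTorch simulation) still expect plain lists. One way to reconcile that with keeping numpy end-to-end is to convert only at the device boundary. This is a hedged sketch with a hypothetical helper, not NVFlare API:

```python
# Hypothetical helper (illustration only, not NVFlare API): keep numpy arrays
# throughout the hierarchy for efficient serialization, and convert to lists
# only when a device (e.g. an ExecuTorch simulation) requires them.
import numpy as np


def to_device_payload(model: dict, device_expects_lists: bool) -> dict:
    if not device_expects_lists:
        # keep numpy arrays: smallest, fastest serialization path
        return model
    # convert at the boundary only, leaving non-array values untouched
    return {k: v.tolist() if isinstance(v, np.ndarray) else v for k, v in model.items()}
```

This keeps the fast path the default while preserving compatibility with list-expecting devices.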
```python
from transformers import AutoModelForCausalLM


class CausalLMModel(torch.nn.Module):
```
Why is this class under src, and why can't it be named model.py? Similar to other tutorial notebooks: is this because of the video?
It definitely can; I was just avoiding making too many changes in one PR. Shall we do a holistic pass to update all examples to the new convention (after the release)?
Fixes # .
Description
Add an example of HuggingFace language model training to hierarchical FL; addresses the latency issue by removing the tolist() conversion.
Types of changes
./runtest.sh.