
Conversation

@ZiyueXu77
Collaborator

Fixes # .

Description

Adds an example of hierarchical FL for HuggingFace language model training, and addresses the latency issue by removing the tolist() conversion.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

Copilot AI review requested due to automatic review settings October 13, 2025 17:28
Contributor

Copilot AI left a comment


Pull Request Overview

This PR adds HuggingFace language model training support to the hierarchical federated learning system and addresses latency issues by optimizing model serialization. The changes include removing inefficient .tolist() conversions that were causing performance bottlenecks in model transmission, adding model size logging for better monitoring, and providing a complete HuggingFace SFT example for hierarchical FL.

  • Optimized model serialization by keeping numpy arrays instead of converting to lists
  • Added comprehensive HuggingFace SFT example with task processor, job configuration, and data preprocessing utilities
  • Enhanced logging with model size information for better monitoring and debugging
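To illustrate why keeping the binary buffer matters, here is a minimal, dependency-free sketch. It uses the stdlib `array` module as a stand-in for the numpy arrays in the PR (the variable names and sizes are invented for illustration; the actual code serializes via FOBS, not pickle):

```python
import pickle
from array import array

# Stand-in for one model weight tensor: 100k distinct float32 values.
# (The PR keeps numpy arrays; `array` has a similarly compact binary buffer.)
weights = array("f", (i * 0.001 for i in range(100_000)))

compact = pickle.dumps(weights)           # serializes the raw 4-byte-per-float buffer
as_list = pickle.dumps(weights.tolist())  # serializes 100k individual Python floats

print(f"compact buffer: {len(compact) / 1024:.1f} KiB")
print(f"after tolist(): {len(as_list) / 1024:.1f} KiB")
```

The list form pays per-element encoding overhead, so the payload roughly doubles; across a hierarchy of aggregators that cost is paid on every hop.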

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| nvflare/edge/assessors/model_update.py | Added model size logging with numpy import |
| nvflare/edge/assessors/buff_model_manager.py | Removed tolist() conversion and added model size logging |
| examples/tutorials/.../hf_sft_model.py (3 files) | Added model_name_or_path attribute to CausalLMModel class |
| examples/advanced/edge/utils/preprocess_dolly.py | New data preprocessing utility for the Dolly dataset |
| examples/advanced/edge/jobs/processors/models/hf_sft_model.py | New HuggingFace model wrapper for edge processing |
| examples/advanced/edge/jobs/processors/hf_sft_task_processor.py | New task processor for HuggingFace SFT training |
| examples/advanced/edge/jobs/processors/cifar10_pt_task_processor.py | Refactored to move setup logic out of the training method |
| examples/advanced/edge/jobs/hf_sft_job.py | New job configuration script for HuggingFace SFT federated learning |
| examples/advanced/edge/README.md | Updated documentation with HuggingFace example instructions |


Collaborator

@chesterxgchen chesterxgchen left a comment


For newly added examples, please follow these conventions:

  1. No need to add a src directory; just keep it flat.
  2. For the model file, no need to name it something like hf_sft_model.py; you are already in that example's directory, so just call it model.py.
  3. The client-side training script should be called client.py.
  4. The job code should be called job.py (instead of hf_sft_job.py).
  5. Try to use one folder per job rather than sharing code between jobs. Sharing code may look efficient to the example creator, but it confuses users trying to figure out which pieces are relevant. The example then fails to offer the simplest path for the end user to understand the concept.

All new examples should ideally have a code structure like this:

client.py 
server.py (may be absent for built-in FL algorithms) 
model.py 
job.py 

download_data.py (optional) 
prepare_data.py (optional) 

Unless we make this consistent, it is very hard to do automated testing, and it is confusing for users as well.
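A convention like this could even be checked mechanically in CI. The sketch below is hypothetical (check_example_layout, REQUIRED, and OPTIONAL are invented for illustration; this is not an existing NVFlare utility):

```python
from pathlib import Path

# Proposed per-example layout, per the review comment above.
REQUIRED = {"client.py", "model.py", "job.py"}
OPTIONAL = {"server.py", "download_data.py", "prepare_data.py"}


def check_example_layout(folder: str) -> list[str]:
    """Return a list of problems found in `folder`; an empty list means compliant."""
    root = Path(folder)
    problems = []
    present = {p.name for p in root.glob("*.py")}
    for name in sorted(REQUIRED - present):
        problems.append(f"missing required file: {name}")
    for name in sorted(present - REQUIRED - OPTIONAL):
        problems.append(f"unexpected file name: {name}")
    if (root / "src").is_dir():
        problems.append("examples should stay flat: remove the src/ directory")
    return problems
```

Running a checker like this over examples/ would catch ad-hoc names such as cifar10_fl_job.py before merge.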

We name our files too freely: cifar10_fl_job.py, fl_job.py, cifar10_fedavt_train.py, etc.

Let's be consistent.

Comment on lines +146 to +153
```diff
  # print model size in MB
  model_size = sum([v.nbytes for v in new_model.values()]) / (1024 * 1024)
  self.log_info(fl_ctx, f"new model size: {model_size:.2f} MB")

  # update the current model
- # convert new_model items from numpy arrays to lists for serialization
- new_model = {k: v.tolist() if isinstance(v, np.ndarray) else v for k, v in new_model.items()}
+ # Keep numpy arrays for efficient FOBS serialization through the hierarchy
+ # If converting to list, the model size will be much larger and slower to serialize
+ # Note that for cases where the device expects list (e.g. ExecuTorch simulation),
```
Collaborator


This could break the current ET examples with mobile; we may want to merge this PR after the release.

Collaborator Author


Good point! Let's do this PR after the release to avoid any broken pieces.

```python
from transformers import AutoModelForCausalLM


class CausalLMModel(torch.nn.Module):
```
Collaborator

@chesterxgchen chesterxgchen Oct 15, 2025


Why is this class under src, and why can't it be named model.py?

Similar to other tutorial notebooks. Is this because of the video?

Collaborator Author


It definitely can; I was just avoiding making too many changes in one PR. Shall we do a holistic pass to update all examples to the new convention (after the release)?


3 participants