Conversation

@holgerroth
Collaborator

Fixes # .

Description

Multi-Node Distributed Training Support for NVFlare LLM Fine-tuning
This PR enables multi-node distributed training with NVFlare for LLM fine-tuning across SLURM clusters.

Key Changes:

  • Fixed rank vs local_rank distinction for proper multi-node coordination (only global rank 0 communicates with FL server)
  • Created wrapper script (run_multinode_training.sh) to handle srun + torchrun coordination across nodes
  • Fixed CUDA device mapping to use local_rank (0-7 per node) instead of global rank
  • Added WandB integration with conditional enabling via environment variable
  • Updated SLURM script for production mode deployment
  • Added comprehensive documentation (MULTINODE.md)

Impact:

  • Enables training across 2+ nodes with 8 GPUs per node (16+ total GPUs)
  • Supports InfiniBand with GPUDirect RDMA for optimal performance
  • Maintains single FL client per site while distributing training across nodes

Tested on: 2 nodes × 8 GPUs (16 total GPUs) with SLURM + InfiniBand
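The rank versus local_rank distinction in the Key Changes can be illustrated with a minimal sketch. It assumes the processes are launched by torchrun, which exports `RANK`, `LOCAL_RANK` and `WORLD_SIZE` to each worker; the helper names `resolve_ranks` and `describe_worker` are hypothetical and are not taken from client.py.

```python
import os


def resolve_ranks(env=None):
    """Read the rank variables that torchrun exports for each worker process."""
    env = os.environ if env is None else env
    rank = int(env.get("RANK", "0"))              # global rank across all nodes
    local_rank = int(env.get("LOCAL_RANK", "0"))  # index of the process on its own node
    world_size = int(env.get("WORLD_SIZE", "1"))
    return rank, local_rank, world_size


def describe_worker(env=None):
    """Summarize which GPU a worker uses and whether it handles FL communication."""
    rank, local_rank, world_size = resolve_ranks(env)
    return {
        "cuda_device": f"cuda:{local_rank}",  # GPU indices restart on every node
        "talks_to_fl_server": rank == 0,      # true for exactly one process in the job
        "world_size": world_size,
    }


# Hypothetical worker: second node of a 2 x 8 GPU job, third GPU on that node.
example = describe_worker({"RANK": "10", "LOCAL_RANK": "2", "WORLD_SIZE": "16"})
print(example)
```

With two nodes, `LOCAL_RANK` equals 0 once per node, so any work that must happen only once per job (FL communication, one-time logging) would test the global rank instead.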

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

@holgerroth holgerroth requested a review from ZiyueXu77 October 7, 2025 21:03
@holgerroth holgerroth changed the title [HuggingFace] Multinode client support [HuggingFace] Multi-node client support Oct 7, 2025
Collaborator

@ZiyueXu77 ZiyueXu77 left a comment

Looks good! Some minor comments.

@holgerroth holgerroth marked this pull request as ready for review October 8, 2025 19:42
@holgerroth
Collaborator Author

/build

@holgerroth holgerroth requested a review from ZiyueXu77 October 8, 2025 19:42
Collaborator

@chesterxgchen chesterxgchen left a comment

Added a few suggestions.

@holgerroth holgerroth marked this pull request as draft October 15, 2025 08:50
@holgerroth
Collaborator Author

Added a separate PR for file restructuring in LLM tutorial #3850 @chesterxgchen

@holgerroth holgerroth marked this pull request as ready for review November 19, 2025 22:56
@holgerroth
Collaborator Author

/build

@greptile-apps
Contributor

greptile-apps bot commented Nov 19, 2025

Greptile Overview

Greptile Summary

This PR enables multi-node distributed training for NVFlare LLM fine-tuning across SLURM clusters. The implementation properly distinguishes between global rank (for FL operations) and local_rank (for CUDA devices), and introduces a wrapper script pattern to coordinate srun + torchrun across nodes.

Critical Issues Found:

  • Deadlock bug in client.py:257: The while flare.is_running(): check only executes on rank 0, causing ranks 1-15 to hang indefinitely when the FL session ends. The is_running status must be broadcast to all ranks before the loop condition.
  • Race condition in client.py:143: Directory cleanup uses rank == 0 check, which is correct, but the comment/logic should be verified for consistency.
  • GPU detection bug in client_wrapper.sh:28: Fails with range notation in CUDA_VISIBLE_DEVICES (e.g., "0-7" gives NGPUS=1 instead of 8).
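The GPU counting issue can be sketched as follows. This is an illustration in Python rather than the shell used by client_wrapper.sh, whose contents are not shown here; the function name `count_visible_gpus` is hypothetical. It treats the range input cited in the review ("0-7") as eight devices and makes no claim about which notations the CUDA runtime itself accepts.

```python
def count_visible_gpus(cuda_visible_devices):
    """Count devices in a CUDA_VISIBLE_DEVICES-style string.

    Comma-separated entries count as one device each; an entry of the
    form "A-B" with integer bounds is expanded as an inclusive range.
    """
    count = 0
    for token in cuda_visible_devices.split(","):
        token = token.strip()
        if not token:
            continue  # ignore empty entries such as a trailing comma
        start, sep, end = token.partition("-")
        if sep and start.isdigit() and end.isdigit():
            count += int(end) - int(start) + 1  # inclusive numeric range
        else:
            count += 1  # plain index, or a UUID-style identifier
    return count


print(count_visible_gpus("0-7"))      # range notation from the review comment
print(count_visible_gpus("0,1,2,3"))  # plain comma-separated list
```

Counting commas alone would report one device for "0-7", which matches the failure described above.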

Other Issues:

  • Documentation in MULTINODE.md doesn't mention the deadlock bug, creating a false sense that the implementation is complete.
  • Hardcoded NVIDIA-specific values in nvflare.slurm reduce portability.

Architecture is Sound:
The overall design is well thought out - using a wrapper script to handle multi-node coordination, keeping FL client on rank 0 only, and properly using local_rank for GPU mapping. However, the synchronization implementation is incomplete and will cause production failures.

Confidence Score: 0/5

  • This PR has a critical deadlock bug that will cause all multi-node training runs to hang indefinitely when FL sessions end
  • The deadlock in client.py:257 is a show-stopper - non-rank-0 processes will never know when to exit the training loop because flare.is_running() is only checked on rank 0 without broadcasting the result. This means every multi-node training run will hang with 15 out of 16 processes stuck in an infinite loop. Additionally, the GPU detection bug in the wrapper script could cause silent failures. While the overall architecture is solid, these bugs must be fixed before merge.
  • client.py requires immediate attention for the deadlock bug, client_wrapper.sh needs GPU detection fix, and MULTINODE.md should document the synchronization requirements

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| examples/advanced/llm_hf/client.py | 0/5 | Critical deadlock bug: non-rank-0 processes hang indefinitely when FL session ends (line 257), plus directory cleanup race condition (line 143) |
| examples/advanced/llm_hf/client_wrapper.sh | 3/5 | GPU detection breaks with range notation in CUDA_VISIBLE_DEVICES (line 28), otherwise solid wrapper implementation |
| examples/advanced/llm_hf/MULTINODE.md | 2/5 | Documentation is comprehensive but misleading: it describes a broadcast synchronization pattern that isn't actually implemented in client.py |
| examples/advanced/llm_hf/nvflare.slurm | 4/5 | Hardcoded NVIDIA-specific account/partition names need placeholders for portability, otherwise solid SLURM configuration |

Sequence Diagram

sequenceDiagram
    participant SLURM as SLURM Master Node
    participant Server as NVFlare Server
    participant Client as NVFlare Client (Rank 0)
    participant Wrapper as client_wrapper.sh
    participant Node0 as Node 0 (Ranks 0-7)
    participant Node1 as Node 1 (Ranks 8-15)
    
    SLURM->>Server: Start NVFlare server
    SLURM->>Client: Start NVFlare client
    SLURM->>Client: Submit FL job via job.py
    
    loop Each FL Round
        Client->>Wrapper: Execute: bash client_wrapper.sh client.py
        Wrapper->>Wrapper: Detect multi-node setup (NNODES=2)
        Wrapper->>Node0: srun launches torchrun --node_rank=0
        Wrapper->>Node1: srun launches torchrun --node_rank=1
        
        Node0->>Node0: Spawn 8 processes (ranks 0-7)
        Node1->>Node1: Spawn 8 processes (ranks 8-15)
        
        Note over Node0,Node1: Only Rank 0 calls flare.receive()
        Server->>Client: Send global model
        Client->>Node0: Rank 0 receives model
        Node0->>Node0: Rank 0 broadcasts model to ranks 1-7
        Node0->>Node1: Rank 0 broadcasts model to ranks 8-15
        
        Note over Node0,Node1: All 16 ranks train via PyTorch DDP
        Node0->>Node0: Training with NCCL P2P/CUMEM
        Node1->>Node1: Training with NCCL P2P/CUMEM
        Node0->>Node1: Cross-node sync via InfiniBand RDMA
        
        Note over Node0,Node1: Only Rank 0 calls flare.send()
        Node0->>Client: Rank 0 sends trained model
        Client->>Server: Submit model updates
        
        Note over Node0,Node1: CRITICAL BUG: Non-rank-0 processes<br/>don't know when to exit loop<br/>(flare.is_running() not broadcast)
    end
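
The per-node launch step in the diagram (srun starting one torchrun per node with its own `--node_rank`) can be sketched as below. This is a hedged illustration in Python, not the contents of client_wrapper.sh: the function name `build_torchrun_cmd`, the default port, and the way the master address is supplied are assumptions. It relies on the SLURM-provided variables `SLURM_NNODES` and `SLURM_NODEID` and on standard torchrun flags.

```python
def build_torchrun_cmd(env, script, master_addr, nproc_per_node=8, master_port=29500):
    """Build the torchrun command for the node described by the SLURM environment."""
    nnodes = int(env.get("SLURM_NNODES", "1"))     # number of nodes in the allocation
    node_rank = int(env.get("SLURM_NODEID", "0"))  # relative ID of this node
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,
    ]


# Hypothetical environment for the second node of a two-node job.
cmd = build_torchrun_cmd({"SLURM_NNODES": "2", "SLURM_NODEID": "1"}, "client.py", "node0")
print(" ".join(cmd))
```

Each node would run the same command with a different `--node_rank`, which is how 8 local processes per node end up as global ranks 0-7 and 8-15.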

Contributor

@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. examples/advanced/llm_hf/client.py, line 164-171 (link)

    style: dataset info printing and logging_steps calculation only happen on local_rank == 0, but should use rank == 0 for multi-node consistency (otherwise both nodes print)

10 files reviewed, 4 comments


Collaborator

@YuanTingHsieh YuanTingHsieh left a comment

Good solution and reference for users; added some suggestions.

Contributor

@greptile-apps greptile-apps bot left a comment

10 files reviewed, 2 comments


Contributor

@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. examples/advanced/llm_hf/client.py, line 257-374 (link)

    logic: critical deadlock: non-rank-0 processes hang forever when FL session ends

    flare.is_running() only returns meaningful values on rank 0 (since flare.init(rank=rank) makes it no-op on other ranks). When the FL server signals completion, only rank 0 exits the loop - all other processes continue waiting at line 260's flare.receive() indefinitely.

    need to broadcast loop continuation signal before checking flare.is_running():

    while True:
        # broadcast whether to continue (must happen before any rank-0-only operations)
        if rank == 0:
            should_continue = flare.is_running()
        else:
            should_continue = None
        
        if dist.is_initialized():
            continue_obj = [should_continue]
            dist.broadcast_object_list(continue_obj, src=0)
            should_continue = continue_obj[0]
        
        if not should_continue:
            break
        
        # rest of existing loop code...
        if rank == 0:
            input_model = flare.receive(timeout=600)
            # ... etc
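
The same pattern can be factored into a small helper that is testable without a GPU cluster. The sketch below is an assumption-laden illustration, not code from this PR: `sync_should_continue` and `fake_broadcast` are hypothetical names, and the collective is injected as a callable so the example runs in a single process.

```python
def sync_should_continue(rank, is_running, broadcast):
    """Return the same continue/stop decision on every rank.

    rank       -- global rank of the calling process
    is_running -- zero-argument callable, only invoked on rank 0
    broadcast  -- callable that fills a one-element list on all ranks from rank 0
    """
    flag = [is_running() if rank == 0 else None]
    broadcast(flag)
    return bool(flag[0])


# Single-process stand-in for the collective: two simulated FL rounds, then stop.
answers = iter([True, True, False])
shared = {}


def fake_broadcast(obj_list):
    # Rank 0 publishes its value; other ranks read it back.
    if obj_list[0] is not None:
        shared["value"] = obj_list[0]
    obj_list[0] = shared["value"]


rounds = 0
while sync_should_continue(0, lambda: next(answers), fake_broadcast):
    # A non-zero rank sees the value rank 0 just published.
    assert sync_should_continue(1, lambda: True, fake_broadcast) is True
    rounds += 1
print(rounds)
```

In a real multi-node run, `broadcast` would presumably wrap `torch.distributed.broadcast_object_list(objs, src=0)`, and every rank would have to call the helper at the same point in the loop so the collective does not block.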
    

10 files reviewed, 2 comments


try srun job on client

use bash running script

running but hanging

successful multinode training

consolidate documentation

increase flare.init timeout

add wandb

use singleton job

update doc

update docs

add multinode readme

rename files

enable multi-gpu

use fallback tensorboard logging

fix simulator run

Signed-off-by: Holger Roth <[email protected]>

restore client name based on user id

Fix federated Stats Advanced folders (#3753)

1) Clean up Advanced Federated-statistics to streamline the folder
structure
2) df_stats no longer repeats hello-tabulare-stats; instead, it focuses
more on the implementation and configuration options
3) image_stats: new download data scripts
4) For both, replace the implementation with recipes. Also add
adult.json and image_statistics.json to the demo folders so that the
visualization notebook won't fail the unit tests

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

add wandb license

formatting; print client name

hello-pt: restore requirements installation and handle Colab (#3760)

Fixes # .

A few sentences describing the changes proposed in this pull request.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Fix code that overwrites the original notebook. Fix Colab issue for hello-pt (#3761)

Fixes # .
1) conftest.py overwrote the original notebook and, in the process,
skipped some cells as well. Fix that.
2) hello-pt support on Colab was broken.

A few sentences describing the changes proposed in this pull request.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Add colab support 3 (#3762)

Fixes # .

A few sentences describing the changes proposed in this pull request.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Update the swarm learning example. (#3759)

Update the swarm learning example under
examples/advanced/swarm_learning.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [x] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Co-authored-by: Chester Chen <[email protected]>

Reset task_data in client after executor is run. (#3763)

This fixes https://nvbugspro.nvidia.com/bug/5570625

The client relies on task_data in executor. This change clears the
task_data from FLContext after the executor is run.

The default file streaming size is changed to 0 so it has the same
behavior as 2.6.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Chester Chen <[email protected]>

Fix SNPAuthorizer issue (#3764)

The AMD KDS has a rate limitation for the "fetch" endpoint.

So we need to cache the ARK/ASK AND the VCEK as well.

We also added exponential backoff to avoid hitting this rate limit.

- Added a mechanism to cache the AMD VCEK based on the Chip ID and
Reported TCB info
- Added exponential backoff
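
The caching and backoff described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the SNPAuthorizer implementation: `RateLimited`, `fetch_with_backoff`, `VcekCache`, the delay values, and the fake fetch function are all hypothetical, and no real AMD KDS request is made.

```python
import time


class RateLimited(Exception):
    """Raised by the fetch callable when the service asks the caller to slow down."""


def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), doubling the wait after each rate-limited attempt."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * (2 ** attempt))


class VcekCache:
    """Cache certificates keyed by (chip_id, reported_tcb)."""

    def __init__(self, fetch_vcek, sleep=time.sleep):
        self._fetch_vcek = fetch_vcek
        self._sleep = sleep
        self._cache = {}

    def get(self, chip_id, reported_tcb):
        key = (chip_id, reported_tcb)
        if key not in self._cache:
            self._cache[key] = fetch_with_backoff(
                lambda: self._fetch_vcek(chip_id, reported_tcb), sleep=self._sleep
            )
        return self._cache[key]


# Fake endpoint: rate-limits the first two calls, then answers.
calls = {"n": 0}
delays = []


def fake_fetch(chip_id, reported_tcb):
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RateLimited()
    return f"vcek-for-{chip_id}-{reported_tcb}"


cache = VcekCache(fake_fetch, sleep=delays.append)
first = cache.get("chip-a", "tcb-1")
second = cache.get("chip-a", "tcb-1")  # served from the cache, no new fetch
print(first, calls["n"], delays)
```

Keying the cache on both values means a new reported TCB triggers a fresh fetch, while repeated attestations of the same chip and TCB do not touch the rate-limited endpoint.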

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

fix image_stats.ipynb (#3768)

Fixes # .

A few sentences describing the changes proposed in this pull request.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Fixed XGB Plugin Compiling Errors (#3767)

g++ 12 or newer cleaned up headers and the XGB plugin doesn't compile
anymore. Added <cstdint> headers.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Co-authored-by: Chester Chen <[email protected]>

Updates on llm and xgb examples (#3765)

Fixes # .

Updates for new transformer / peft / trl versions, some changed arg
names, etc.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

fix notebooks in advanced directory 4 (#3769)

Fixes # .

A few sentences describing the changes proposed in this pull request.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Update CC docs [skip ci] (#3756)

Update CC docs

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Chester Chen <[email protected]>

FedStats: Improve error messages (#3770)

Fixes # .

Adding some warnings that make it easier to debug in case stats are
missing or have mismatching names across clients.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Fix Conftest.py (#3771)

Made a mistake in the last commit to conftest.py:
I assumed the notebook is only read from disk once.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Remove production board text [skip ci] (#3772)

Remove production board text

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Add missing init in tf recipes (#3773)

Add missing init in tf recipes

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Co-authored-by: Chester Chen <[email protected]>

Hello-world documentation update [skip ci] (#3778)

Read Me and Hello-world rst update

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Fix colab commands in hello-pt (#3781)

Fixes # .

Adjust the order of commands so the notebook can run directly on Colab.
Fix download command to not require user prompt.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Lower the log severity on timeout.  No change in logic. (#3782)

When timeout happens, the log message is recorded at `ERROR` level.
However, later logic may recover from timeout. Therefore, this log
should be in `WARNING.`

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [x] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Misc Notebook updates (#3777)

1. Add Kaggle download library support
2. Make sure xgboost can run the linked notebooks in the one
notebook
3. Tutorials
    3.1 Fix Job Recipe Notebook: POCEnv ==> PocEnv
    3.2 Skip ProdEnv execution
    3.3 Delete unit test file for notebooks folders
    3.4 Fix a bug in security notebook tutorial (file path); change the job CLI to FLARE API
    3.5 Add clean-up section for KeyCloak example (stop docker, clean up POC)
    3.6 Misc changes to default values in the notebook

A few sentences describing the changes proposed in this pull request.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Force new token generation when checking a new job (#3787)

Fixes FLARE-2671.

This change ensures a new token is always generated when a new job
arrives.

Previously, if a job completed quickly and the existing token hadn’t yet
expired (token expiration is set to 100 seconds), the same token could
be reused for the next job. This would cause the nonce check to fail, as
the nonce had already been seen.

By forcing token regeneration for each new job, we avoid reuse and
ensure the nonce remains unique.
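
The failure mode and the fix can be sketched in a simplified form. The code below is an illustration only, assuming the token itself acts as the nonce and leaving out expiry handling; `NonceVerifier` and `TokenSource` are hypothetical names, not NVFlare classes.

```python
import secrets


class NonceVerifier:
    """Accept each nonce once and reject any repeat."""

    def __init__(self):
        self._seen = set()

    def check(self, nonce):
        if nonce in self._seen:
            return False
        self._seen.add(nonce)
        return True


class TokenSource:
    """Hand out a cached token unless the caller forces regeneration."""

    def __init__(self):
        self._cached = None

    def get(self, force_new=False):
        if force_new or self._cached is None:
            self._cached = secrets.token_hex(16)
        return self._cached


verifier = NonceVerifier()
source = TokenSource()

job1_ok = verifier.check(source.get())                # first job: fresh token
reused_ok = verifier.check(source.get())              # quick second job reuses the token
forced_ok = verifier.check(source.get(force_new=True))  # regenerated token passes
print(job1_ok, reused_ok, forced_ok)
```

The middle check fails because the cached token has already been seen, which mirrors the situation described above; forcing a new token per job avoids it.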

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Make admin server timeout configurable (#3786)

Fixes #3730 .

Make admin server timeout configurable via "admin_timeout" in
"local/resources.json" inside a startup kit, for example:

```
$ cat ./workspace/example_project/prod_00/server1/local/resources.json.default
{
    "format_version": 2,
    "servers": [
        {
            "admin_storage": "transfer",
            "max_num_clients": 100,
            "heart_beat_timeout": 600,
            "download_job_url": "http://download.server.com/",
            "admin_timeout": 10.0
        }
    ],
```
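
As a hedged illustration of how such a setting might be consumed, the sketch below reads `admin_timeout` from a resources document and falls back to a default when the key is absent. The function name `read_admin_timeout`, the fallback value, and the shortened sample document are assumptions made for this example; they are not taken from the NVFlare source.

```python
import json

DEFAULT_ADMIN_TIMEOUT = 5.0  # assumed fallback for illustration only


def read_admin_timeout(resources_text, default=DEFAULT_ADMIN_TIMEOUT):
    """Return the admin timeout of the first server entry, or the default."""
    config = json.loads(resources_text)
    servers = config.get("servers") or [{}]
    return float(servers[0].get("admin_timeout", default))


# Shortened, self-contained sample; a real resources.json has more fields.
sample = json.dumps({"format_version": 2, "servers": [{"admin_timeout": 10.0}]})
print(read_admin_timeout(sample))
```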

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Chester Chen <[email protected]>

Do not generate start_all.sh and use async grpc by default (#3783)

Fixes https://jirasw.nvidia.com/browse/FLARE-2677

Made 2 changes to provisioning,

1. By default, it will not generate start_all.sh. Use -s option to
generate it.
2. Use synch grpc driver for both client and server by default.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

[BioNeMo] Use decomposer register widget (#3784)

update other bionemo examples

Fixes # .

Update bionemo examples and tutorial for 2.7 with decomposer register
widget

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Signed-off-by: Holger Roth <[email protected]>
Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

Include a section describing how to build KBS docker images [skip ci] (#3785)

The build process for KBS requires a few steps and is error-prone.
This PR adds a section to the deployment guide that describes how to
build KBS docker images directly.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [x] Documentation updated.

Co-authored-by: Chester Chen <[email protected]>

Fixed a certificate issue with newer OpenSSL (#3775)

Newer OpenSSL (from Ubuntu 25.04) doesn't accept the certs generated by
provision.
This PR fixed the problem by adding Authority Key ID and Key Usage to
the cert extensions.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Isaac Yang <[email protected]>

Add diagrams and docs improvements [skip ci] (#3788)

Add diagrams and docs improvements

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Fix broken links (#3789)

Fix broken links

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Co-authored-by: Chester Chen <[email protected]>

Documentation structure and what is new updates [skip ci] (#3779)

1. Clean up the documents, fix missing references
2. Update What's New to make sure it reflects key points
3. Update best practices to indicate that "best" is for the lower-level API
4. Remove duplicate files
5. Update programming guides

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Fix image_stats integration test (#3790)

Fix CI after changes in #3753

Add the old job into CI for testing.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Add System architecture and Security Architecture documentations [skip ci] (#3794)

1. Add system architecture
2. Add cellnet architecture
3. Add Security Architecture

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Updated CC related User Guide [skip ci] (#3795)

Updated CC related user guide to match the current system.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Co-authored-by: Chester Chen <[email protected]>

Add allow_out_ports (#3796)

Follow changes in the CVM builder

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Architecture documentation fix [skip ci] (#3797)

Fix some mistakes.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

[HE Tutorial] add missing codes (#3793)

Upgrade the nvflare version and add missing code.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Signed-off-by: Holger Roth <[email protected]>

fix requirements bug (#3798)

* FLARE-2676

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Remove HA from docs (#3799)

Remove references to HA from docs since it has been removed already.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Remove duplicate toctree (#3800)

Remove duplicate toctree

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Copilot <[email protected]>
Co-authored-by: Chester Chen <[email protected]>

UPDATE Confidential computing documentation [skip ci] (#3801)

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

update CC documentation (#3802)

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Documents fixes [skip ci] (#3803)

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Update documentation (#3804)

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

add Deployment Guide for SecureAI reference (#3805)

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Tweak the documentation (#3809)

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Fix webpage links (#3812)

Fix broken links and issues for the web page.

Fix website link consistency: Replace hardcoded documentation paths with
version-aware template literals

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Bump vite from 6.3.6 to 6.4.1 in /web (#3807)

Bumps [vite](https://github.com/vitejs/vite/tree/HEAD/packages/vite)
from 6.3.6 to 6.4.1.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/vitejs/vite/releases">vite's
releases</a>.</em></p>
<blockquote>
<h2>[email protected]</h2>
<p>Please refer to <a
href="https://github.com/vitejs/vite/blob/[email protected]/packages/create-vite/CHANGELOG.md">CHANGELOG.md</a>
for details.</p>
<h2>v6.4.1</h2>
<p>Please refer to <a
href="https://github.com/vitejs/vite/blob/v6.4.1/packages/vite/CHANGELOG.md">CHANGELOG.md</a>
for details.</p>
<h2>[email protected]</h2>
<p>Please refer to <a
href="https://github.com/vitejs/vite/blob/[email protected]/packages/create-vite/CHANGELOG.md">CHANGELOG.md</a>
for details.</p>
<h2>v6.4.0</h2>
<p>Please refer to <a
href="https://github.com/vitejs/vite/blob/v6.4.0/packages/vite/CHANGELOG.md">CHANGELOG.md</a>
for details.</p>
<h2>v6.3.7</h2>
<p>Please refer to <a
href="https://github.com/vitejs/vite/blob/v6.3.7/packages/vite/CHANGELOG.md">CHANGELOG.md</a>
for details.</p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/vitejs/vite/commit/50034340401b4043bb0b158f18ffb7ae1b7f5c86"><code>5003434</code></a>
fix(preview): use host url to open browser (<a
href="https://github.com/vitejs/vite/tree/HEAD/packages/vite/issues/19836">#19836</a>)</li>
<li><a
href="https://github.com/vitejs/vite/commit/bf9728e79e8df905de457e15001e65e33cf13f0e"><code>bf9728e</code></a>
release: v6.3.0-beta.2</li>
<li><a
href="https://github.com/vitejs/vite/commit/380c10e665e78ef732a8d7b6c8f60a1226fc4c3b"><code>380c10e</code></a>
fix(hmr): run HMR handler sequentially (<a
href="https://github.com/vitejs/vite/tree/HEAD/packages/vite/issues/19793">#19793</a>)</li>
<li><a
href="https://github.com/vitejs/vite/commit/8bed1de5710f2a097af0e22a196545446d98f988"><code>8bed1de</code></a>
fix: addWatchFile doesn't work if base is specified (fixes <a
href="https://github.com/vitejs/vite/tree/HEAD/packages/vite/issues/19792">#19792</a>)
(<a
href="https://github.com/vitejs/vite/tree/HEAD/packages/vite/issues/19794">#19794</a>)</li>
<li><a
href="https://github.com/vitejs/vite/commit/0a0c50a7ed38017469ed6dcec941c2d8d0efd0d0"><code>0a0c50a</code></a>
refactor: simplify pluginFilter implementation (<a
href="https://github.com/vitejs/vite/tree/HEAD/packages/vite/issues/19828">#19828</a>)</li>
<li><a
href="https://github.com/vitejs/vite/commit/59d0b35b30f3a38be33c0a9bdc177945b6f7eb1b"><code>59d0b35</code></a>
perf(css): avoid constructing <code>renderedModules</code> (<a
href="https://github.com/vitejs/vite/tree/HEAD/packages/vite/issues/19775">#19775</a>)</li>
<li><a
href="https://github.com/vitejs/vite/commit/175a83909f02d3b554452a7bd02b9f340cdfef70"><code>175a839</code></a>
fix: reject requests with <code>#</code> in request-target (<a
href="https://github.com/vitejs/vite/tree/HEAD/packages/vite/issues/19830">#19830</a>)</li>
<li><a
href="https://github.com/vitejs/vite/commit/e2e11b15a6083777ee521e26a3f79c3859abd411"><code>e2e11b1</code></a>
fix(module-runner): allow already resolved id as entry (<a
href="https://github.com/vitejs/vite/tree/HEAD/packages/vite/issues/19768">#19768</a>)</li>
<li><a
href="https://github.com/vitejs/vite/commit/7200deec91a501fb84734e23906f80808734540c"><code>7200dee</code></a>
fix: correct the behavior when multiple transform filter options are
specifie...</li>
<li><a
href="https://github.com/vitejs/vite/commit/b1251720d47f15615ea354991cdaa90d9a94aae5"><code>b125172</code></a>
fix(css): remove empty chunk imports correctly when chunk file name
contained...</li>
<li>Additional commits viewable in <a
href="https://github.com/vitejs/vite/commits/[email protected]/packages/vite">compare
view</a></li>
</ul>
</details>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Update Hello-PT and Stats (#3810)

Fix a few things in the hello-pt example and notebook. Improve the
TensorBoard logs by logging the loss at the end of each epoch.
Remove a duplicated cell in df_stats and install the quantile requirement.
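
As a rough illustration of per-epoch loss logging (not the actual hello-pt code), the sketch below averages batch losses and writes one scalar per epoch. `RecordingWriter`, `log_epoch_losses`, and the tag `train/epoch_loss` are hypothetical names; the stand-in writer only mimics the `add_scalar(tag, value, step)` call shape of a TensorBoard `SummaryWriter`. Whether the example logs the mean or the last batch loss is an assumption here.

```python
class RecordingWriter:
    """Hypothetical stand-in for a TensorBoard writer; stores calls in memory."""

    def __init__(self):
        self.records = []

    def add_scalar(self, tag, value, step):
        self.records.append((tag, value, step))


def log_epoch_losses(batch_losses_per_epoch, writer):
    """Log the mean batch loss once per epoch instead of once per batch."""
    for epoch, batch_losses in enumerate(batch_losses_per_epoch):
        running_loss = 0.0
        for loss in batch_losses:  # in real training, each loss comes from a forward pass
            running_loss += loss
        writer.add_scalar("train/epoch_loss", running_loss / len(batch_losses), epoch)
```

With a real writer, the same function would produce one point per epoch on the loss curve, which keeps the plot readable for longer runs.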

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Signed-off-by: Holger Roth <[email protected]>

Bump astro from 5.13.2 to 5.14.4 in /web (#3776)

Bumps
[astro](https://github.com/withastro/astro/tree/HEAD/packages/astro)
from 5.13.2 to 5.14.4.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/withastro/astro/releases">astro's
releases</a>.</em></p>
<blockquote>
<h2>[email protected]</h2>
<h3>Patch Changes</h3>
<ul>
<li><a
href="https://redirect.github.com/withastro/astro/pull/14509">#14509</a>
<a
href="https://github.com/withastro/astro/commit/7e04caf9a4a75c75f06c4207fae601a5fd251735"><code>7e04caf</code></a>
Thanks <a
href="https://github.com/ArmandPhilippot"><code>@​ArmandPhilippot</code></a>!
- Fixes an error in the docs that specified an incorrect version for the
<code>security.allowedDomains</code> release.</li>
</ul>
<h2>[email protected]</h2>
<h3>Patch Changes</h3>
<ul>
<li>
<p><a
href="https://redirect.github.com/withastro/astro/pull/14505">#14505</a>
<a
href="https://github.com/withastro/astro/commit/28b2a1db4f3f265632f280b0dbc4c5f241c387e2"><code>28b2a1d</code></a>
Thanks <a
href="https://github.com/matthewp"><code>@​matthewp</code></a>! - Fixes
<code>Cannot set property manifest</code> error in test utilities by
adding a protected setter for the manifest property</p>
</li>
<li>
<p><a
href="https://redirect.github.com/withastro/astro/pull/14235">#14235</a>
<a
href="https://github.com/withastro/astro/commit/c4d84bb654c9a5064b243e971c3b5b280e2b3791"><code>c4d84bb</code></a>
Thanks <a href="https://github.com/toxeeec"><code>@​toxeeec</code></a>!
- Fixes a bug where the &quot;tap&quot; prefetch strategy worked only on
the first clicked link with view transitions enabled</p>
</li>
</ul>
<h2>[email protected]</h2>
<h3>Patch Changes</h3>
<ul>
<li><a
href="https://redirect.github.com/withastro/astro/pull/14440">#14440</a>
<a
href="https://github.com/withastro/astro/commit/a3e16ab6dd0bef9ab6259f23bfeebed747e27497"><code>a3e16ab</code></a>
Thanks <a
href="https://github.com/florian-lefebvre"><code>@​florian-lefebvre</code></a>!
- Fixes a case where the URLs generated by the experimental Fonts API
would be incorrect in dev</li>
</ul>
<h2>[email protected]</h2>
<h3>Minor Changes</h3>
<ul>
<li>
<p><a
href="https://redirect.github.com/withastro/astro/pull/13520">#13520</a>
<a
href="https://github.com/withastro/astro/commit/a31edb8daad8632bacd1861adf6ac720695f7173"><code>a31edb8</code></a>
Thanks <a
href="https://github.com/openscript"><code>@​openscript</code></a>! -
Adds a new property <code>routePattern</code> available to
<code>GetStaticPathsOptions</code></p>
<p>This provides the original, dynamic segment definition in a routing
file path (e.g. <code>/[...locale]/[files]/[slug]</code>) from the Astro
render context that would not otherwise be available within the scope of
<code>getStaticPaths()</code>. This can be useful to calculate the
<code>params</code> and <code>props</code> for each page route.</p>
<p>For example, you can now localize your route segments and return an
array of static paths by passing <code>routePattern</code> to a custom
<code>getLocalizedData()</code> helper function. The <code>params</code>
object will be set with explicit values for each route segment (e.g.
<code>locale</code>, <code>files</code>, and <code>slug</code>). Then,
these values will be used to generate the routes and can be used in your
page template via <code>Astro.params</code>.</p>
<pre lang="astro"><code>// src/pages/[...locale]/[files]/[slug].astro
import { getLocalizedData } from &quot;../../../utils/i18n&quot;;

export async function getStaticPaths({ routePattern }) {
  const response = await fetch('...');
  const data = await response.json();

  console.log(routePattern); // [...locale]/[files]/[slug]

  // Call your custom helper with `routePattern` to generate the static paths
  return data.flatMap((file) =&gt; getLocalizedData(file, routePattern));
}

const { locale, files, slug } = Astro.params;
</code></pre>
<p>For more information about this advanced routing pattern, see Astro's
<a
href="https://docs.astro.build/en/reference/routing-reference/#routepattern">routing
reference</a>.</p>
</li>
<li>
<p><a
href="https://redirect.github.com/withastro/astro/pull/13651">#13651</a>
<a
href="https://github.com/withastro/astro/commit/dcfbd8c9d5dc798d1bcb9b36531c2eded301050d"><code>dcfbd8c</code></a>
Thanks <a href="https://github.com/ADTC"><code>@​ADTC</code></a>! - Adds
a new <code>SvgComponent</code> type</p>
<p>You can now more easily enforce type safety for your
<code>.svg</code> assets by directly importing <code>SvgComponent</code>
from <code>astro/types</code>:</p>
<pre lang="astro"><code>---
// src/components/Logo.astro
import type { SvgComponent } from 'astro/types';
import HomeIcon from './Home.svg';
interface Link {
  url: string;
  text: string;
</code></pre>
</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/withastro/astro/blob/main/packages/astro/CHANGELOG.md">astro's
changelog</a>.</em></p>
<blockquote>
<h2>5.14.4</h2>
<h3>Patch Changes</h3>
<ul>
<li><a
href="https://redirect.github.com/withastro/astro/pull/14509">#14509</a>
<a
href="https://github.com/withastro/astro/commit/7e04caf9a4a75c75f06c4207fae601a5fd251735"><code>7e04caf</code></a>
Thanks <a
href="https://github.com/ArmandPhilippot"><code>@​ArmandPhilippot</code></a>!
- Fixes an error in the docs that specified an incorrect version for the
<code>security.allowedDomains</code> release.</li>
</ul>
<h2>5.14.3</h2>
<h3>Patch Changes</h3>
<ul>
<li>
<p><a
href="https://redirect.github.com/withastro/astro/pull/14505">#14505</a>
<a
href="https://github.com/withastro/astro/commit/28b2a1db4f3f265632f280b0dbc4c5f241c387e2"><code>28b2a1d</code></a>
Thanks <a
href="https://github.com/matthewp"><code>@​matthewp</code></a>! - Fixes
<code>Cannot set property manifest</code> error in test utilities by
adding a protected setter for the manifest property</p>
</li>
<li>
<p><a
href="https://redirect.github.com/withastro/astro/pull/14235">#14235</a>
<a
href="https://github.com/withastro/astro/commit/c4d84bb654c9a5064b243e971c3b5b280e2b3791"><code>c4d84bb</code></a>
Thanks <a href="https://github.com/toxeeec"><code>@​toxeeec</code></a>!
- Fixes a bug where the &quot;tap&quot; prefetch strategy worked only on
the first clicked link with view transitions enabled</p>
</li>
</ul>
<h2>5.14.2</h2>
<h3>Patch Changes</h3>
<ul>
<li>
<p><a
href="https://redirect.github.com/withastro/astro/pull/14459">#14459</a>
<a
href="https://github.com/withastro/astro/commit/916f9c2e094f19562cfe722ca0a5fafb0f313c2e"><code>916f9c2</code></a>
Thanks <a
href="https://github.com/florian-lefebvre"><code>@​florian-lefebvre</code></a>!
- Improves font files URLs in development when using the experimental
fonts API by showing the subset if present</p>
</li>
<li>
<p><a
href="https://github.com/withastro/astro/commit/b8ca69b97149becefaf89bf21853de9c905cdbb7"><code>b8ca69b</code></a>
Thanks <a
href="https://github.com/ascorbic"><code>@​ascorbic</code></a>! - Aligns
dev image server file base with Vite rules</p>
</li>
<li>
<p><a
href="https://redirect.github.com/withastro/astro/pull/14469">#14469</a>
<a
href="https://github.com/withastro/astro/commit/1c090b00c1f5c3d8e938ac873fc63ab2f1ae37f1"><code>1c090b0</code></a>
Thanks <a href="https://github.com/delucis"><code>@​delucis</code></a>!
- Updates <code>tinyexec</code> dependency</p>
</li>
<li>
<p><a
href="https://redirect.github.com/withastro/astro/pull/14460">#14460</a>
<a
href="https://github.com/withastro/astro/commit/008dc75d860eadbb394e86dac68c7f4962e40489"><code>008dc75</code></a>
Thanks <a
href="https://github.com/florian-lefebvre"><code>@​florian-lefebvre</code></a>!
- Fixes a case where <code>astro:config/server</code> values typed as
URLs would be serialized as strings</p>
</li>
<li>
<p><a
href="https://redirect.github.com/withastro/astro/pull/13730">#13730</a>
<a
href="https://github.com/withastro/astro/commit/72603676818d1c433ac2751843a8a9b0cc9b48c9"><code>7260367</code></a>
Thanks <a
href="https://github.com/razonyang"><code>@​razonyang</code></a>! -
Fixes a bug in i18n, where Astro caused an infinite loop when a locale
that doesn't have an index, and Astro falls back to the index of the
default locale.</p>
</li>
<li>
<p><a
href="https://github.com/withastro/astro/commit/6ee63bfac4856f21b4d4633021b3d2ee059e553f"><code>6ee63bf</code></a>
Thanks <a
href="https://github.com/matthewp"><code>@​matthewp</code></a>! - Adds
<code>security.allowedDomains</code> configuration to validate
<code>X-Forwarded-Host</code> headers in SSR</p>
<p>The <code>X-Forwarded-Host</code> header will now only be trusted if
it matches one of the configured allowed host patterns. This prevents <a
href="https://owasp.org/www-project-web-security-testing-guide/latest/4-Web_Application_Security_Testing/07-Input_Validation_Testing/17-Testing_for_Host_Header_Injection">host
header injection attacks</a> that can lead to cache poisoning and other
security vulnerabilities.</p>
<p>Configure allowed host patterns to enable
<code>X-Forwarded-Host</code> support:</p>
<pre lang="js"><code>// astro.config.mjs
export default defineConfig({
  output: 'server',
  adapter: node(),
  security: {
    allowedDomains: [
      { hostname: 'example.com' },
      { hostname: '*.example.com' },
      { hostname: 'cdn.example.com', port: '443' },
    ],
  },
});
</code></pre>
<p>The patterns support wildcards (<code>*</code> and <code>**</code>)
for flexible hostname matching and can optionally specify protocol and
port.</p>
</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/withastro/astro/commit/3412859d37b3282a967278eba86f22cdb373eac7"><code>3412859</code></a>
[ci] release (<a
href="https://github.com/withastro/astro/tree/HEAD/packages/astro/issues/14510">#14510</a>)</li>
<li><a
href="https://github.com/withastro/astro/commit/7e04caf9a4a75c75f06c4207fae601a5fd251735"><code>7e04caf</code></a>
docs: fix <code>security.allowedDomains</code> version (<a
href="https://github.com/withastro/astro/tree/HEAD/packages/astro/issues/14509">#14509</a>)</li>
<li><a
href="https://github.com/withastro/astro/commit/fe1d35cc950b16a6462102b98b48753d27395e03"><code>fe1d35c</code></a>
[ci] release (<a
href="https://github.com/withastro/astro/tree/HEAD/packages/astro/issues/14507">#14507</a>)</li>
<li><a
href="https://github.com/withastro/astro/commit/7926882013c2f493aeb2fe9b162e515e65e68e81"><code>7926882</code></a>
[ci] format</li>
<li><a
href="https://github.com/withastro/astro/commit/c4d84bb654c9a5064b243e971c3b5b280e2b3791"><code>c4d84bb</code></a>
fix(prefetch): Fix &quot;tap&quot; prefetch strategy when view
transitions are enabled ...</li>
<li><a
href="https://github.com/withastro/astro/commit/3bb14b7dbbc236f55096631401703a290321031e"><code>3bb14b7</code></a>
[ci] release (<a
href="https://github.com/withastro/astro/tree/HEAD/packages/astro/issues/14466">#14466</a>)</li>
<li><a
href="https://github.com/withastro/astro/commit/7a5aafff7b6d424164bf76d25c231d8860a26e25"><code>7a5aaff</code></a>
[ci] format</li>
<li><a
href="https://github.com/withastro/astro/commit/28b2a1db4f3f265632f280b0dbc4c5f241c387e2"><code>28b2a1d</code></a>
Fix failing x-forwarded-host tests (<a
href="https://github.com/withastro/astro/tree/HEAD/packages/astro/issues/14505">#14505</a>)</li>
<li><a
href="https://github.com/withastro/astro/commit/ec307b02e3e866fa53ea6715b5f6f05dbb323953"><code>ec307b0</code></a>
[ci] format</li>
<li><a
href="https://github.com/withastro/astro/commit/6ee63bfac4856f21b4d4633021b3d2ee059e553f"><code>6ee63bf</code></a>
Merge commit from fork</li>
<li>Additional commits viewable in <a
href="https://github.com/withastro/astro/commits/[email protected]/packages/astro">compare
view</a></li>
</ul>
</details>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Use Python 3.9 typing (#3611)

Use Python 3.9 typing syntax. Most of the changes implement
[PEP 585](https://peps.python.org/pep-0585/), that is, replacing `Dict`
with `dict`, `List` with `list`, and `Tuple` with
`tuple`.
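
For reference, a minimal before/after sketch of the PEP 585 rewrite. The functions `summarize_old` and `summarize_new` are illustrative only and do not come from the NVFlare code base.

```python
from typing import Dict, List, Tuple  # legacy aliases, imported only for comparison


def summarize_old(scores: Dict[str, List[float]]) -> Tuple[str, float]:
    """Pre-3.9 style: generic aliases imported from `typing`."""
    best = max(scores, key=lambda name: sum(scores[name]))
    return best, sum(scores[best])


def summarize_new(scores: dict[str, list[float]]) -> tuple[str, float]:
    """Python 3.9+ style (PEP 585): built-in collections used as generics."""
    best = max(scores, key=lambda name: sum(scores[name]))
    return best, sum(scores[best])
```

Both functions behave identically at runtime; only the annotations differ, and the second form no longer needs the `typing` import.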

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).

Signed-off-by: cyy <[email protected]>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <[email protected]>

Fix docs and add missing diagram (#3815)

Fix docs and add missing diagram, polish hello-flower example.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Simplify quantization code (#3612)

Refactor the quantization code to prepare for a new
quantization scheme. The use of `QuantState.from_dict` and
`QuantState.as_dict` assumes the latest bitsandbytes version.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).

---------

Signed-off-by: cyy <[email protected]>
Co-authored-by: Chester Chen <[email protected]>
Co-authored-by: Ziyue Xu <[email protected]>

Adjust supported minimum Python versions to 3.9 (#3665)

This PR lifts the minimum Python version to 3.9.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [x] Documentation updated.

Signed-off-by: Yuanyuan Chen <[email protected]>

Enhance MLflow receiver (#3657)

Previously, if users did not explicitly specify a `tracking_uri`, MLflow
would default to using `./mlruns`, which is a local directory on the FL
server. This made it difficult for users to access logged metrics and
artifacts, as the default path was not exposed or retrievable outside
the server environment.

This PR introduces an enhancement to set a more accessible default
tracking_uri when none is provided by the user. Specifically, the
default is now set to:

```
file://[workspace]/[job_id]/mlflow
```

This change enables users to retrieve the logged metrics and artifacts
using the FlareAPI, as they are stored in a job-specific, accessible
path within the workspace.

- This PR depends on #3655.
- Ensures MLflow logs are stored in a consistent, retrievable location
tied to the job and workspace.
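
The fallback described above could look roughly like the sketch below. `resolve_tracking_uri` and its parameters are hypothetical names chosen for illustration, not the receiver's actual API; only the `file://[workspace]/[job_id]/mlflow` layout comes from this description.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_tracking_uri(tracking_uri: Optional[str], workspace: str, job_id: str) -> str:
    """Return the user-supplied URI, or a job-specific default under the workspace."""
    if tracking_uri:
        # An explicit setting always wins.
        return tracking_uri
    # Assumed default layout: file://[workspace]/[job_id]/mlflow
    return f"file://{PurePosixPath(workspace) / job_id / 'mlflow'}"
```

Because the default lives inside the job's workspace directory, it is downloaded along with the other job results.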

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Fix edge simulator (#3813)

When the task response is "RETRY", the simulator should keep retrying
instead of returning and shutting down.
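
A minimal sketch of the intended behavior, assuming the response is a dict with a `status` field. `fetch_task`, `get_task`, `retry_delay`, and `max_retries` are hypothetical names and do not reflect the simulator's real interface.

```python
import time
from typing import Callable, Optional

RETRY = "RETRY"  # status string taken from the description above


def fetch_task(
    get_task: Callable[[], dict],
    retry_delay: float = 1.0,
    max_retries: Optional[int] = None,
) -> dict:
    """Keep polling while the server answers RETRY; return the first other response."""
    attempts = 0
    while True:
        response = get_task()
        if response.get("status") != RETRY:
            return response
        attempts += 1
        if max_retries is not None and attempts >= max_retries:
            raise TimeoutError(f"still receiving {RETRY} after {attempts} attempts")
        time.sleep(retry_delay)  # back off briefly before asking again
```

The optional `max_retries` cap is an addition for safety in tests; leaving it as `None` matches the "keep retrying" behavior described here.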

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Copilot <[email protected]>

Deployment guide of confidential ACI [skip ci] (#3816)

Detailed steps on deploying confidential ACI.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [x] Documentation updated.

Co-authored-by: Chester Chen <[email protected]>

Add Azure CVM deployment guide [skip ci] (#3820)

Add a document on deploying an Azure CVM and performing attestation.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [x] Documentation updated.

Update release notes (#3818)

Fix a syntax error in ACI doc [skip ci] (#3823)

…nt [skip ci]

Fix minor document format issue.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [x] Documentation updated.

Update sub_start.sh with one additional option (#3821)

Add a `--once` option to sub_start.sh. When invoked as
`sub_start.sh --once`, the script starts NVFlare directly. If the start
fails, for example because of missing dependencies, sub_start.sh exits
with the exit code returned by python.
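
The exit-code behavior can be illustrated in Python. This is a sketch of the idea only, not the shell script itself; `start_once` is a hypothetical name and the command passed to it stands in for the real NVFlare start command.

```python
import subprocess
import sys
from typing import List


def start_once(cmd: List[str]) -> int:
    """Run the start command a single time and return its exit code unchanged."""
    return subprocess.run(cmd).returncode


# Stand-in for a failing start: a Python process that exits with code 3.
failing_cmd = [sys.executable, "-c", "import sys; sys.exit(3)"]
```

A caller would pass the result to `sys.exit(...)` so that the wrapper's own exit status matches the one returned by python.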

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [x] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Update cc docs (#3824)

Fixes FLARE-2689

Update CC docs

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Add FedNCA publication (#3806)

Adds our publication to the list of publications and talks.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [ ] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [x] Documentation updated.

---------

Co-authored-by: Holger Roth <[email protected]>

CC document updates and others (#3826)

Updates to the CC documentation, the CC provisioning tool, and the
Azure authorizers.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [x] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Add Tensor Stream component for efficient safetensors-based model tensor streaming (#3741)

Add GPU CC docs (#3825)

Add GPU CC docs

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Copilot <[email protected]>
Co-authored-by: Zhihong Zhang <[email protected]>

Update FAQ (#3832)

Update the FAQ with more up-to-date BYOC information.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Holger Roth <[email protected]>
Co-authored-by: Copilot <[email protected]>

Android app enhancements for state management and status on UI (#3819)

Add missing statuses, fix state handling, and improve the status
display on the UI to show training progress.

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Chester Chen <[email protected]>

address comments

address comments

run on slurm

address comments

formatting

New CC token verification mechanism (#3829)

This PR introduces a new confidential computing token verification
mechanism to replace the previous job-based verification approach.

Previously, the verification mechanism was tied to specific jobs, which
required generating a new set of tokens for each new job. This approach
was inefficient and error-prone. The new mechanism provides a
persistent, cross-site token validation system that ensures secure and
consistent communication between components.

1. Client Registration

When a client sends a registration request to the server:

  - The client includes its token in the request.

  - The server validates the client’s token.

  - The server responds with its own token.

  - The client validates the server’s token.

2. Periodic Cross-Site Validation

Each site (server or client) periodically triggers a cross-site token
validation event (e.g., every 5–10 minutes):

  - The initiating site (e.g., siteA) starts the validation event.

  - All sites, including siteA, generate new tokens for this event.

  - siteA validates tokens from all participating sites.

3. Failure Handling

If any token validation fails:

  - The affected site will shut itself down.

  - Optionally, it may attempt to trigger a system-wide shutdown to prevent
    inconsistent states.

4. Benefits

  - Removes dependency on per-job token generation.

  - Enables periodic, automated validation to detect and isolate
    compromised sites.
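
The flow in steps 1 to 3 can be sketched as below. Everything here is an assumption made for illustration: the `Site` class, the dictionary token format, and the `trusted` flag (a stand-in for a real attestation result) are hypothetical. The sketch also assumes that the "affected site" is the one whose token failed validation, which the description does not spell out.

```python
import secrets
from dataclasses import dataclass
from typing import List


@dataclass
class Site:
    name: str
    trusted: bool = True   # stand-in for a real attestation outcome (assumption)
    running: bool = True

    def new_token(self) -> dict:
        # A fresh token per event; the format is purely illustrative.
        return {"site": self.name, "nonce": secrets.token_hex(8), "valid": self.trusted}

    def validate(self, token: dict) -> bool:
        return bool(token.get("valid"))

    def shutdown(self) -> None:
        self.running = False


def register(client: Site, server: Site) -> bool:
    """Step 1: server checks the client's token, then the client checks the server's."""
    if not server.validate(client.new_token()):
        return False
    return client.validate(server.new_token())


def cross_site_validation(initiator: Site, sites: List[Site]) -> List[str]:
    """Steps 2 and 3: every site issues a new token; sites that fail are shut down."""
    failed = []
    for site in sites:  # the list includes the initiator itself
        if not initiator.validate(site.new_token()):
            site.shutdown()
            failed.append(site.name)
    return failed
```

In a real deployment the periodic trigger (every 5 to 10 minutes in the description) would call something like `cross_site_validation` on a timer; the timer is omitted to keep the sketch short.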

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Add ADAQUANT quantization scheme (#3628)

This PR adds a new quantization scheme: ADAQUANT, as introduced in the
paper [Opportunistic Block Dropout for Efficiently Training Large-scale
Neural Networks through Federated
Learning](https://www.ijcai.org/proceedings/2023/0394.pdf).

ADAQUANT converts float tensors into integer tensors. Combined with an
additional compression step that packs the low-bit integers, it can reach
a compression ratio of nearly 10x, as the following test results indicate:
```
2025-08-24 11:10:17,096 - INFO - Quantized 147/147 params. Before quantization: 5716.26 MB. After quantization: 0.00 MB with meta: 602.34 MB.
2025-08-24 11:12:25,513 - INFO - Dequantized 147/147 params. Before dequantization: 5716.26 MB with meta: 602.34 MB. After dequantization: 5716.26 MB.
```
These results were obtained by running the example under
`NVFlare/examples/advanced/llm_hf` with the command
```
python3 llm_hf_fl_job.py --client_ids dolly --data_path ${PWD}/dataset --workspace_dir ${PWD}/workspace/hf_sft_nf4 --job_dir ${PWD}/workspace/jobs/hf_sft_nf4 --train_mode SFT --quantize_mode adaquant
```
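
As a quick check of the "nearly 10x" figure, the ratio can be computed from the sizes in the log above. The variable names are illustrative, and the calculation assumes the "with meta" size is the full quantized payload:

```python
# Sizes taken from the quantization log lines above, in MB.
before_mb = 5716.26
after_mb_with_meta = 602.34

# Ratio of original size to quantized size (including metadata).
ratio = before_mb / after_mb_with_meta
print(f"compression ratio: {ratio:.2f}x")
```

Under that assumption the ratio works out to roughly 9.5x.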
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [X] Quick tests passed locally by running `./runtest.sh`.
- [X] In-line docstrings updated.
- [ ] Documentation updated.

---------

Signed-off-by: cyy <[email protected]>
Signed-off-by: Yuanyuan Chen <[email protected]>

Add CC GPU notes (#3840)

Fixes FLARE-2688.

Add GPU passthrough CVM instructions

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line do…
Contributor

@greptile-apps greptile-apps bot left a comment

10 files reviewed, 3 comments

@holgerroth
Collaborator Author

/build

Contributor

@greptile-apps greptile-apps bot left a comment

10 files reviewed, 1 comment

Contributor

@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. examples/advanced/llm_hf/client.py, line 143-148 (link)

    logic: use a rank == 0 check instead of just a local_rank == 0 check. In a multi-node setup, every node has a process with local_rank == 0, which causes a race condition on the shared filesystem
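
    The distinction can be sketched as follows. `torchrun` sets the `RANK` and `LOCAL_RANK` environment variables for each process; the helper name `should_write_shared_files` and the example environments are hypothetical.

```python
import os
from typing import Mapping


def should_write_shared_files(env: Mapping[str, str] = os.environ) -> bool:
    """Gate on the global rank so exactly one process in the whole job writes."""
    return int(env.get("RANK", "0")) == 0


# Two nodes with 8 GPUs each: the first process on each node has LOCAL_RANK 0,
# but only one process in the job has RANK 0.
node0_first = {"RANK": "0", "LOCAL_RANK": "0"}
node1_first = {"RANK": "8", "LOCAL_RANK": "0"}
```

    Gating on `LOCAL_RANK` would let both example processes through, while gating on `RANK` admits only the first one.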

10 files reviewed, 4 comments

@holgerroth
Collaborator Author

/build

Collaborator

@YuanTingHsieh YuanTingHsieh left a comment

LGTM!

@YuanTingHsieh YuanTingHsieh merged commit ce7aefd into NVIDIA:main Nov 20, 2025
20 checks passed
@holgerroth holgerroth deleted the multinode branch November 20, 2025 18:38