How does NVFLARE manage resources and concurrent jobs? #3883

virginiafdez · 2025-12-09T14:38:17Z


In our legacy FL system set-up, we needed to run multiple jobs on a network of one server and multiple clients, with the assurance that no running job would error out for lack of pre-allocated resources. That required a set-up like the following:

That is, there were multiple nets with replicated server, client and admin containers, and the clients had dedicated GPUs. We wondered whether NVFLARE's current infrastructure would support a single set of admin, server and client containers where, if a client starts a job using N GPUs, that compute is guaranteed to stay available throughout the job regardless of other jobs that start later.

alcatraz7698 · 2025-12-09T23:39:37Z


Does the client instance (or physical server) have multiple GPUs?

1. You can pin GPUs per container by specifying devices in the Docker Compose file, using the NVIDIA Container Toolkit syntax. If you assign GPU 0 (cuda0) to client1-1 on client instance 1 and GPU 1 (cuda1) to client1-2, each container only uses the GPU it owns, so there is no interference.
2. You can configure this in NVFLARE's resources.json file on the client, where you can set the number of GPUs, the memory per GPU, etc.
3. Alternatively, you can specify resources when creating a job.
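
As a rough sketch of the Docker Compose approach, the snippet below builds a Compose-style service definition that pins each client container to one GPU. The image name, service names and GPU ids are assumptions for illustration only; the `deploy.resources.reservations.devices` layout follows the Docker Compose GPU reservation syntax. The result is printed as JSON for inspection and would need to be written out as YAML in a real compose file.

```python
import json


def gpu_service(image, gpu_ids):
    """Return a Compose service entry that reserves only the given GPU ids."""
    return {
        "image": image,
        "deploy": {
            "resources": {
                "reservations": {
                    "devices": [
                        {
                            "driver": "nvidia",
                            "device_ids": list(gpu_ids),  # e.g. ["0"] for GPU 0
                            "capabilities": ["gpu"],
                        }
                    ]
                }
            }
        },
    }


# Hypothetical image and service names; replace with your own.
compose = {
    "services": {
        "client1-1": gpu_service("my-nvflare-client:latest", ["0"]),
        "client1-2": gpu_service("my-nvflare-client:latest", ["1"]),
    }
}

print(json.dumps(compose, indent=2))
```

With this layout, client1-1 only sees GPU 0 and client1-2 only sees GPU 1, so the two containers cannot take each other's device.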
If two containers on a single instance both try to use all the resources and no upper layer enforces an allocation, the process scheduling of the OS layer decides who gets what.
NVFLARE also has a feature for managing concurrently running jobs (I haven't used it).
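
To make the idea concrete, here is a toy model of reserve-before-run accounting: a job's GPUs are reserved before it starts and held until it finishes, so a later job either fits into what is left or waits. This is only an illustration of the general idea, not NVFLARE's implementation; the class and method names are made up.

```python
class ToyGpuPool:
    """Toy reserve-before-run GPU accounting (illustrative, not NVFLARE code)."""

    def __init__(self, num_gpus):
        self.free = set(range(num_gpus))  # GPU ids nobody holds
        self.by_job = {}                  # job id -> list of reserved GPU ids

    def try_start(self, job_id, num_gpus):
        """Reserve GPUs for a job, or return None if too few are free."""
        if num_gpus > len(self.free):
            return None  # the job would stay queued
        granted = sorted(self.free)[:num_gpus]
        self.free -= set(granted)
        self.by_job[job_id] = granted
        return granted

    def finish(self, job_id):
        """Release the GPUs held by a finished job."""
        self.free |= set(self.by_job.pop(job_id))


pool = ToyGpuPool(num_gpus=4)
first = pool.try_start("job-a", 3)   # reserves three of the four GPUs
second = pool.try_start("job-b", 2)  # only one GPU left, so this is refused
pool.finish("job-a")
retry = pool.try_start("job-b", 2)   # succeeds once job-a has released its GPUs
print(first, second, retry)
```

The point is that job-a keeps its three GPUs for its whole lifetime, no matter what is submitted afterwards.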
However, in your current architecture, net-1 and net-2 are completely separate sets. Each one uses its own API server, FL server, clients, network group, root certificate, etc. Simply put, since net-1 and net-2 are independent, neither knows what the other is doing, so they can't schedule around each other. Each can only schedule processes within its own allocated resource range.
I'm not sure why the two networks need to be separated, but here's a suggestion:
Use only one API server and one FL server.
Split each client into two instances, such as client1-1 and client1-2.
(These are all groups provisioned from the same project.)

Create both the "client1-1" and "client1-2" containers on each client instance.
When creating the containers, use methods 1 and 2 above to divide the resources between them.

Then, when running a job, configure the participant list so that only one container per instance participates, such as "client1-1", "client2-1", and "client3-1".
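
A minimal sketch of such a participant list, assuming the job is described by a meta.json with a `deploy_map` that maps an app folder to the sites it is deployed to. The job name, the app folder name "app" and the helper function are assumptions for illustration.

```python
import json


def build_meta(job_name, participants, app_name="app"):
    """Build a job meta dict that deploys the app only to the listed clients."""
    return {
        "name": job_name,
        # The server plus exactly one container per client instance.
        "deploy_map": {app_name: ["server"] + list(participants)},
        "min_clients": len(participants),
    }


# Hypothetical job that uses only the "-1" container of each instance.
meta = build_meta("job_on_gpu0", ["client1-1", "client2-1", "client3-1"])
print(json.dumps(meta, indent=2))
```

A second job could list "client1-2", "client2-2" and "client3-2" instead, so the two jobs never land in the same container.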

Otherwise, just keep a single net and allocate resources to each job as in step 3, referring to the following:
https://nvflare.readthedocs.io/en/2.7.1/programming_guide/resource_manager_and_consumer.html
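
A job-level request along those lines might look like the sketch below, which adds a `resource_spec` entry per site to the job's meta.json. The site names, numbers and helper function are assumptions; the `num_of_gpus` and `mem_per_gpu_in_GiB` keys are the ones I understand the GPU resource manager to read, so check them against the page above for your version.

```python
import json


def with_resources(meta, sites, num_of_gpus, mem_per_gpu_in_gib):
    """Return a copy of the job meta with a per-site GPU request added."""
    spec = {
        site: {
            "num_of_gpus": num_of_gpus,
            "mem_per_gpu_in_GiB": mem_per_gpu_in_gib,
        }
        for site in sites
    }
    return {**meta, "resource_spec": spec}


# Hypothetical job that asks for 2 GPUs with 8 GiB each on every client.
base_meta = {"name": "two_gpu_job", "deploy_map": {"app": ["@ALL"]}}
job_meta = with_resources(base_meta, ["client1", "client2"], 2, 8)
print(json.dumps(job_meta, indent=2))
```

The client's resource manager must be configured with at least that many GPUs, otherwise the request can never be satisfied on that site.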
