fix: Fix for Ray deployment in nemo-run script #266

athitten · 2025-10-03T20:16:30Z

Fixes the nemo-run script so that nemo-fw checkpoints can be deployed with the Ray backend. Deployment works for the single-node, multi-replica case and for the multi-node, single-replica case. The multi-node, multi-replica case does not work yet and will be addressed in a follow-up PR.
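
The support matrix described above can be written as a small guard. This is only a sketch: `is_supported_ray_topology` and its parameter names are hypothetical and are not part of the script, and the single-node, single-replica case is assumed to work even though the description does not list it explicitly.

```python
def is_supported_ray_topology(num_nodes: int, num_replicas: int) -> bool:
    """Return True for topologies the PR description reports as working with Ray.

    Hypothetical helper for illustration; not part of evaluation_with_nemo_run.py.
    """
    if num_nodes < 1 or num_replicas < 1:
        raise ValueError("num_nodes and num_replicas must be positive")
    # Multi-node combined with multiple replicas is the one case the
    # description lists as not working yet.
    # Assumption: single node with a single replica also works.
    return not (num_nodes > 1 and num_replicas > 1)


print(is_supported_ray_topology(num_nodes=1, num_replicas=8))  # single node, 8 replicas
print(is_supported_ray_topology(num_nodes=2, num_replicas=1))  # 2 nodes, 1 replica
print(is_supported_ray_topology(num_nodes=2, num_replicas=2))  # not supported yet
```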

Accuracy numbers (GSM8k, 10%) obtained with this script using the Ray Serve backend:

| Model (nemo 2.0 ckpt) | Config | Accuracy | Baseline |
| --- | --- | --- | --- |
| Llama 3.1 8B Instruct | Single node, 8 replicas (1 replica per GPU) | 73.48% | 71.97% (stderr: 3.9) |
| Llama 3.1 405B | 1 replica, TP=8, PP=2, 2 nodes | 90.91% | 89.39% |

Signed-off-by: Abhishree <[email protected]>

copy-pr-bot · 2025-10-03T20:16:34Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

athitten · 2025-11-04T03:21:12Z

scripts/evaluation_with_nemo_run.py

-  --tensor_parallelism_size {tensor_model_parallel_size} \
-  --pipeline_parallelism_size {pipeline_model_parallel_size} \
+  --tensor_model_parallel_size {tensor_model_parallel_size} \
+  --pipeline_model_parallel_size {pipeline_model_parallel_size} \


Use the same argument names for TP and PP in both the PyTriton and Ray scripts. The corresponding PR in Export-Deploy: NVIDIA-NeMo/Export-Deploy#501
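
As a rough illustration of the renamed arguments, the sketch below builds the flag fragment once so that both backends receive the same names. The two flag names come from the diff above; `parallelism_flags` is a hypothetical helper, not a function in the script.

```python
import shlex


def parallelism_flags(tensor_model_parallel_size: int,
                      pipeline_model_parallel_size: int) -> str:
    """Format the TP/PP flags using the names shared by both deploy scripts.

    Hypothetical helper for illustration only.
    """
    flags = [
        "--tensor_model_parallel_size", str(tensor_model_parallel_size),
        "--pipeline_model_parallel_size", str(pipeline_model_parallel_size),
    ]
    # shlex.join quotes each token safely for use in a shell command line.
    return shlex.join(flags)


print(parallelism_flags(8, 2))
```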

athitten · 2025-11-04T05:23:14Z

scripts/evaluation_with_nemo_run.py

-        "max_batch_size": args.batch_size,
-        "devices": args.devices,
+        "max_batch_size": args.batch_size, #TODO check in llama 405B run
+        "num_gpus": args.devices if args.serving_backend == "pytriton" else args.devices * args.nodes,


For multi-node deployments, Ray requires num_gpus to be the total number of GPUs across all nodes.
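
The expression in the diff can be pulled out into a standalone function to make the two cases explicit. This is a minimal sketch that mirrors the diff; `resolve_num_gpus` is a hypothetical name, and treating `devices` as GPUs per node is an assumption based on the comment above.

```python
def resolve_num_gpus(serving_backend: str, devices: int, nodes: int) -> int:
    """Mirror the num_gpus expression from the diff.

    Hypothetical helper; assumes `devices` is the number of GPUs per node.
    """
    if serving_backend == "pytriton":
        # PyTriton branch of the diff: pass the device count through unchanged.
        return devices
    # Every other backend (Ray in this PR): total GPUs across all nodes.
    return devices * nodes


print(resolve_num_gpus("pytriton", devices=8, nodes=1))  # 8
print(resolve_num_gpus("ray", devices=8, nodes=2))       # 16
```

Assuming 8 GPUs per node, the second call corresponds to the Llama 3.1 405B row in the description (TP=8, PP=2 on 2 nodes).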

athitten · 2025-11-05T20:28:09Z

/ok to test 522994a

athitten · 2025-11-06T03:45:10Z

/ok to test a4d4956

Add use_with_ray_cluster metadata

a9d7f07

Signed-off-by: Abhishree <[email protected]>

athitten and others added 2 commits October 3, 2025 13:24

Add comment on using GitArchivePackager in scripts dir to prevent serialization errors

d93c79b

Signed-off-by: Abhishree Thittenamane <[email protected]>

Add additional fixes

42bb7b9

Signed-off-by: Abhishree <[email protected]>

github-actions bot added the scripts label Nov 4, 2025

athitten added 2 commits November 5, 2025 12:25

Merge branch 'main' into athitten/nemo_run_ray

7d194b8

Signed-off-by: Abhishree Thittenamane <[email protected]>

Remove comment for max_batch_size

522994a

Signed-off-by: Abhishree Thittenamane <[email protected]>

athitten changed the title ~~Fix for Ray deployment in nemo-run script~~ fix: Fix for Ray deployment in nemo-run script Nov 5, 2025

Minor fixes

a4d4956

Signed-off-by: Abhishree <[email protected]>

copy-pr-bot bot temporarily deployed to test November 6, 2025 03:45 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci November 6, 2025 03:45 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci November 6, 2025 03:47 Inactive
