Contributor

@athitten athitten commented Oct 3, 2025

Adds fixes to the nemo-run script to enable nemo-fw checkpoint deployment with the Ray backend. It works for single-node, multi-replica and for multi-node, single-replica deployments. It does not yet work for the multi-node, multi-replica case; this will be addressed in a follow-up PR.
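
The support matrix described above can be written down as a small pre-flight check. This is only a sketch: the function name `is_supported_topology` is hypothetical and not part of the PR, and it assumes that the single-node, single-replica case also works, which the description does not state explicitly.

```python
def is_supported_topology(nodes: int, replicas: int) -> bool:
    """Return True if the (nodes, replicas) combination is described as working.

    Hypothetical helper, not part of the PR. Per the description, only the
    multi-node, multi-replica combination is unsupported for now.
    """
    if nodes < 1 or replicas < 1:
        raise ValueError("nodes and replicas must both be >= 1")
    # Assumption: single node with a single replica is supported as well.
    return not (nodes > 1 and replicas > 1)


print(is_supported_topology(nodes=1, replicas=8))  # single node, 8 replicas
print(is_supported_topology(nodes=2, replicas=1))  # 2 nodes, 1 replica
print(is_supported_topology(nodes=2, replicas=2))  # not supported yet
```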

Accuracy numbers (GSM8K, 10%) obtained with this script using the Ray Serve backend:

| Model (NeMo 2.0 checkpoint) | Config | Accuracy | Baseline |
| --- | --- | --- | --- |
| Llama 3.1 8B Instruct | Single node, 8 replicas (1 replica per GPU) | 73.48% | 71.97% (stderr: 3.9) |
| Llama 3.1 405B | 1 replica, TP=8, PP=2, 2 nodes | 90.91% | 89.39% |
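
To put the 8B numbers in context, the sketch below checks whether the measured accuracy falls within one standard error of the baseline. It assumes the evaluation used 132 samples (10% of the 1319-problem GSM8K test set, rounded) and a plain binomial standard error; neither assumption is stated in the PR, and the evaluation harness may compute its stderr slightly differently.

```python
import math


def binomial_stderr(accuracy: float, n: int) -> float:
    """Standard error of an accuracy estimated from n independent samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


n_samples = 132      # assumption: 10% of the GSM8K test set
baseline = 0.7197    # baseline accuracy from the table (8B model)
measured = 0.7348    # Ray Serve accuracy from the table (8B model)

stderr = binomial_stderr(baseline, n_samples)
within_one_stderr = abs(measured - baseline) <= stderr

print(f"stderr: {stderr * 100:.1f} percentage points")
print(f"difference: {(measured - baseline) * 100:.2f} percentage points")
print(f"within one stderr: {within_one_stderr}")
```

Under these assumptions the computed standard error matches the 3.9 reported for the baseline, and the 8B result differs from the baseline by less than one standard error.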

copy-pr-bot bot commented Oct 3, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

athitten and others added 2 commits October 3, 2025 13:24
…ialization errors

Signed-off-by: Abhishree Thittenamane <[email protected]>
Signed-off-by: Abhishree <[email protected]>
- --tensor_parallelism_size {tensor_model_parallel_size} \
- --pipeline_parallelism_size {pipeline_model_parallel_size} \
+ --tensor_model_parallel_size {tensor_model_parallel_size} \
+ --pipeline_model_parallel_size {pipeline_model_parallel_size} \
Contributor Author

Use the same argument names for TP and PP in the PyTriton and Ray scripts. The corresponding Export-Deploy PR: NVIDIA-NeMo/Export-Deploy#501
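
As a rough illustration of the renaming, the sketch below builds the parallelism flags once so the same string can be passed to either deploy script. The flag names come from the diff above; the helper name `parallelism_flags` and the example values are hypothetical and not part of the PR.

```python
def parallelism_flags(tensor_model_parallel_size: int,
                      pipeline_model_parallel_size: int) -> str:
    """Build the TP/PP command-line flags shared by both deploy scripts.

    Hypothetical helper: with identical flag names, no per-backend
    translation of the argument names is needed.
    """
    return (
        f"--tensor_model_parallel_size {tensor_model_parallel_size} "
        f"--pipeline_model_parallel_size {pipeline_model_parallel_size}"
    )


# Example values matching the 405B configuration in the PR description.
print(parallelism_flags(8, 2))
```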

- "max_batch_size": args.batch_size,
- "devices": args.devices,
+ "max_batch_size": args.batch_size,  # TODO: check in Llama 405B run
+ "num_gpus": args.devices if args.serving_backend == "pytriton" else args.devices * args.nodes,
Contributor Author

Ray requires num_gpus to be the total number of GPUs across all nodes in the multi-node case.
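
The rule in the diff can be isolated into a small function for clarity. This is a sketch only: `resolve_num_gpus` is a hypothetical name, and the example assumes 8 GPUs per node for the 405B run, which the PR does not state explicitly.

```python
def resolve_num_gpus(devices: int, nodes: int, serving_backend: str) -> int:
    """Mirror the expression from the diff.

    PyTriton gets the per-node device count; for any other backend (Ray in
    this PR) the total number of GPUs across all nodes is returned.
    """
    if serving_backend == "pytriton":
        return devices
    return devices * nodes


# Assumed setup for the 405B run: 2 nodes with 8 GPUs each.
total_gpus = resolve_num_gpus(devices=8, nodes=2, serving_backend="ray")
print(total_gpus)

# With a single replica, TP * PP GPUs are needed for one model copy.
tp, pp = 8, 2
print(tp * pp == total_gpus)
```

With TP=8 and PP=2, one replica spans 16 GPUs, which is consistent with passing the cross-node total to Ray rather than the per-node count.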

Signed-off-by: Abhishree Thittenamane <[email protected]>
athitten changed the title from "Fix for Ray deployment in nemo-run script" to "fix: Fix for Ray deployment in nemo-run script" on Nov 5, 2025
Contributor Author

athitten commented Nov 5, 2025

/ok to test 522994a

Signed-off-by: Abhishree <[email protected]>
Contributor Author

athitten commented Nov 6, 2025

/ok to test a4d4956
