Describe the bug
When I deploy my Megatron Bridge (MB) checkpoint with scripts/deploy/nlp/deploy_inframework_triton.py, the server returns an error for every request sent to it.
Steps/Code to reproduce bug
- Start the NeMo nightly container. In my case this was on a SLURM cluster:
srun --container-image nvcr.io/nvidian/nemo:nightly --container-mounts /lustre:/lustre --overlap --pty /bin/bash
- Deploy the model
export HF_HOME=/lustre/fsw/coreai_dlalgo_compeval/martas/models/hf/ # requires access to meta-llama/Llama-3.1-8B-Instruct
python \
/opt/Export-Deploy/scripts/deploy/nlp/deploy_inframework_triton.py \
--megatron_checkpoint /lustre/fsw/coreai_dlalgo_ci/nemo_export_deploy_eval_checkpoints/mbridge/meta-llama/Llama-3.1-8B-Instruct/iter_0000000 \
--model_format megatron \
--triton_model_name megatron_model \
--server_address 0.0.0.0 \
--server_port 8886 \
--num_gpus 1 \
--num_nodes 1 \
--tensor_model_parallel_size 1 \
--pipeline_model_parallel_size 1 \
--context_parallel_size 1 \
--expert_model_parallel_size 1 \
--max_batch_size 2 \
--triton_port 8000 \
--triton_http_address 0.0.0.0 \
--inference_max_seq_length 16384
- Send a test request
export FULL_ENDPOINT_URL="http://0.0.0.0:8886/v1/completions/"
export MODEL_NAME="megatron_model"
curl -X POST ${FULL_ENDPOINT_URL} -H "Content-Type: application/json" -d '{
"prompt": "Write Python code that can add a list of numbers together.",
"model": "'"$MODEL_NAME"'",
"temperature": 0.6,
"top_p": 0.95,
"max_tokens": 256,
"stream": false
}'
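For completeness, the same request can also be sent from Python (a quick sketch assuming the requests package is available; the payload mirrors the curl call above):
import requests

# Same endpoint and payload as the curl command above
url = "http://0.0.0.0:8886/v1/completions/"
payload = {
    "prompt": "Write Python code that can add a list of numbers together.",
    "model": "megatron_model",
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 256,
    "stream": False,
}
response = requests.post(url, json=payload, timeout=120)
print(response.status_code, response.text)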
Either way, this results in the following error:
pytriton.client.exceptions.PyTritonClientInferenceServerError: Error occurred during inference request. Message: Failed to process the request(s) for model 'megatron_model_0_0', message: TritonModelException: Model execute error: Traceback (most recent call last):
File "/tmp/folderO0M00P/1/model.py", line 492, in execute
raise triton_responses_ors_error
c_python_backend_utils.TritonModelException: Traceback (most recent call last):
File "/opt/venv/lib/python3.12/site-packages/pytriton/proxy/inference.py", line 393, in _handle_requests
async for responses in self._model_callable(requests):
File "/opt/venv/lib/python3.12/site-packages/pytriton/proxy/inference.py", line 85, in _callable
yield inference_callable(requests)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/pytriton/decorators.py", line 213, in batch
outputs = wrapped(*args, **new_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/pytriton/decorators.py", line 672, in wrapper
return wrapped(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/Export-Deploy/nemo_deploy/llm/megatronllm_deployable.py", line 390, in triton_infer_fn
output_infer = self._infer_fn(
^^^^^^^^^^^^^^^
File "/opt/Export-Deploy/nemo_deploy/llm/megatronllm_deployable.py", line 472, in _infer_fn
results = self.generate(prompts, inference_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/Export-Deploy/nemo_deploy/llm/megatronllm_deployable.py", line 243, in generate
results = self.mcore_engine.generate(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/megatron-lm/megatron/core/inference/engines/static_engine.py", line 318, in generate
return self.generate_using_dynamic_engine(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/megatron-lm/megatron/core/inference/engines/static_engine.py", line 227, in generate_using_dynamic_engine
return self.dynamic_engine.generate(prompts=prompts, sampling_params=sampling_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/megatron-lm/megatron/core/inference/engines/dynamic_engine.py", line 945, in generate
result = self.step_modern()
^^^^^^^^^^^^^^^^^^
File "/opt/megatron-lm/megatron/core/inference/engines/dynamic_engine.py", line 914, in step_modern
return self._loop.run_until_complete(self.async_step(verbose=verbose))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/base_events.py", line 663, in run_until_complete
self._check_running()
File "/usr/lib/python3.12/asyncio/base_events.py", line 624, in _check_running
raise RuntimeError(
RuntimeError: Cannot run the event loop while another loop is running
Expected behavior
The request should be processed without issues and the server should return a response.
Additional context
The problem is not present in the 25.11 Docker container; I encountered it only with the nvcr.io/nvidian/nemo:nightly image.
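For what it's worth, the final RuntimeError is asyncio's standard complaint when run_until_complete() is called on one event loop while a different loop is already running in the same thread. Judging from the traceback, dynamic_engine.step_modern() drives its own loop (self._loop) while the PyTriton proxy is already inside an async handler. A minimal standalone sketch (not the NeMo/Megatron code, just the same pattern) reproduces the identical error:
import asyncio

# The synchronous engine API owns a private event loop...
private_loop = asyncio.new_event_loop()

async def do_work():
    return 42

def sync_engine_step():
    # ...and drives it with run_until_complete(). This raises
    # "RuntimeError: Cannot run the event loop while another loop is running"
    # whenever it is called from code already running on a different loop.
    return private_loop.run_until_complete(do_work())

async def request_handler():
    # e.g. an async request handler calling into the synchronous generate() path
    return sync_engine_step()

asyncio.run(request_handler())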