
Inference with Megatron Bridge checkpoint deployed with PyTriton fails with "RuntimeError: Cannot run the event loop while another loop is running" #529

@marta-sd

Description

Describe the bug

When I deploy my Megatron Bridge checkpoint with scripts/deploy/nlp/deploy_inframework_triton.py, I get an error when sending any request to the server.

Steps/Code to reproduce bug

  1. Start the NeMo nightly container. In my case it was on a SLURM cluster:
srun --container-image nvcr.io/nvidian/nemo:nightly --container-mounts /lustre:/lustre --overlap --pty /bin/bash
  2. Deploy the model
export HF_HOME=/lustre/fsw/coreai_dlalgo_compeval/martas/models/hf/  # requires access to meta-llama/Llama-3.1-8B-Instruct

python \
  /opt/Export-Deploy/scripts/deploy/nlp/deploy_inframework_triton.py \
  --megatron_checkpoint /lustre/fsw/coreai_dlalgo_ci/nemo_export_deploy_eval_checkpoints/mbridge/meta-llama/Llama-3.1-8B-Instruct/iter_0000000 \
  --model_format megatron \
  --triton_model_name megatron_model \
  --server_address 0.0.0.0 \
  --server_port 8886 \
  --num_gpus 1 \
  --num_nodes 1 \
  --tensor_model_parallel_size 1 \
  --pipeline_model_parallel_size 1 \
  --context_parallel_size 1 \
  --expert_model_parallel_size 1 \
  --max_batch_size 2 \
  --triton_port 8000 \
  --triton_http_address 0.0.0.0 \
  --inference_max_seq_length 16384
  3. Send a test request
export FULL_ENDPOINT_URL="http://0.0.0.0:8886/v1/completions/"
export MODEL_NAME="megatron_model"
curl -X POST ${FULL_ENDPOINT_URL} -H "Content-Type: application/json" -d '{
  "prompt": "Write Python code that can add a list of numbers together.",
  "model": "'"$MODEL_NAME"'",
  "temperature": 0.6,
  "top_p": 0.95,
  "max_tokens": 256,
  "stream": false
}'
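
For convenience, the same request can also be sent from Python. This is just an equivalent of the curl call above (it assumes the requests package is available in the environment) and fails in the same way:

import requests

payload = {
    "prompt": "Write Python code that can add a list of numbers together.",
    "model": "megatron_model",
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 256,
    "stream": False,
}
# POST to the FastAPI endpoint started by deploy_inframework_triton.py
resp = requests.post("http://0.0.0.0:8886/v1/completions/", json=payload)
print(resp.status_code)
print(resp.text)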

This results in the following error:

pytriton.client.exceptions.PyTritonClientInferenceServerError: Error occurred during inference request. Message: Failed to process the request(s) for model 'megatron_model_0_0', message: TritonModelException: Model execute error: Traceback (most recent call last):
  File "/tmp/folderO0M00P/1/model.py", line 492, in execute
    raise triton_responses_ors_error
c_python_backend_utils.TritonModelException: Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/pytriton/proxy/inference.py", line 393, in _handle_requests
    async for responses in self._model_callable(requests):
  File "/opt/venv/lib/python3.12/site-packages/pytriton/proxy/inference.py", line 85, in _callable
    yield inference_callable(requests)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/pytriton/decorators.py", line 213, in batch
    outputs = wrapped(*args, **new_kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/pytriton/decorators.py", line 672, in wrapper
    return wrapped(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/Export-Deploy/nemo_deploy/llm/megatronllm_deployable.py", line 390, in triton_infer_fn
    output_infer = self._infer_fn(
                   ^^^^^^^^^^^^^^^
  File "/opt/Export-Deploy/nemo_deploy/llm/megatronllm_deployable.py", line 472, in _infer_fn
    results = self.generate(prompts, inference_params)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/Export-Deploy/nemo_deploy/llm/megatronllm_deployable.py", line 243, in generate
    results = self.mcore_engine.generate(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/megatron-lm/megatron/core/inference/engines/static_engine.py", line 318, in generate
    return self.generate_using_dynamic_engine(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/megatron-lm/megatron/core/inference/engines/static_engine.py", line 227, in generate_using_dynamic_engine
    return self.dynamic_engine.generate(prompts=prompts, sampling_params=sampling_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/megatron-lm/megatron/core/inference/engines/dynamic_engine.py", line 945, in generate
    result = self.step_modern()
             ^^^^^^^^^^^^^^^^^^
  File "/opt/megatron-lm/megatron/core/inference/engines/dynamic_engine.py", line 914, in step_modern
    return self._loop.run_until_complete(self.async_step(verbose=verbose))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 663, in run_until_complete
    self._check_running()
  File "/usr/lib/python3.12/asyncio/base_events.py", line 624, in _check_running
    raise RuntimeError(
RuntimeError: Cannot run the event loop while another loop is running
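
Reading the traceback, my understanding (an assumption, not verified against the Megatron-LM source) is that dynamic_engine.step_modern() drives its own private asyncio loop with run_until_complete() from synchronous code, while PyTriton's proxy already has an event loop running in the same thread when the inference callable is invoked. A minimal standalone sketch that reproduces the same RuntimeError without any NeMo or PyTriton code:

import asyncio

def sync_inference_callable():
    # Stands in for dynamic_engine.step_modern(): synchronous code driving
    # a private event loop with run_until_complete().
    private_loop = asyncio.new_event_loop()
    try:
        return private_loop.run_until_complete(asyncio.sleep(0, result="done"))
    finally:
        private_loop.close()

async def handler():
    # Stands in for PyTriton's async request handler: a loop is already
    # running in this thread when the synchronous callable is invoked.
    return sync_inference_callable()

asyncio.run(handler())
# -> RuntimeError: Cannot run the event loop while another loop is running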

Expected behavior

The request should be processed without issues and the server should return a response.

Additional context

The problem is not present in the 25.11 Docker container; I encountered it only with the nvcr.io/nvidian/nemo:nightly image.
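
To help narrow down which component changed between 25.11 and nightly, the installed versions can be compared inside both containers. The distribution names below are my guess at how megatron-core and PyTriton are packaged in the image, so adjust if they differ:

import importlib.metadata as md

for pkg in ("megatron-core", "nvidia-pytriton"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        # e.g. /opt/megatron-lm may be on PYTHONPATH rather than pip-installed
        print(pkg, "not found as an installed distribution")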
