Deadlock when BLS model requires resources #8358

@zeruniverse

Description


The rate limiter only checks whether the highest-priority request in its queue can be allocated. This creates a deadlock when the highest-priority request (say, for model C) requires resources that are held by a running BLS model (say, model B), and B issues a BLS call to a lower-priority model (say, model A, even one that requires no resources at all): A's request is queued behind C's and is blocked forever, because B never finishes until A responds.

I made a PR that might solve this problem: triton-inference-server/core#448
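The cycle can be reproduced in miniature without Triton at all. The sketch below is a toy model of a limiter that only ever examines the head of its wait queue (my reading of the behavior described above); the class, names, and timeouts are invented for illustration:

```python
import threading
import time

# Toy model (NOT Triton code) of a rate limiter that only ever examines the
# head of its wait queue before allocating. Resource "R2" has 2 units,
# mirroring the repro config below.
class HeadOfLineLimiter:
    def __init__(self, total):
        self.available = total
        self.cv = threading.Condition()
        self.wait_queue = []  # FIFO of (demand, grant_event)

    def acquire(self, demand):
        granted = threading.Event()
        with self.cv:
            self.wait_queue.append((demand, granted))
            self._dispatch()
        granted.wait()

    def release(self, demand):
        with self.cv:
            self.available += demand
            self._dispatch()

    def _dispatch(self):
        # Only the queue head is considered: if it does not fit, everything
        # behind it waits -- even requests that need no resources at all.
        while self.wait_queue and self.wait_queue[0][0] <= self.available:
            demand, granted = self.wait_queue.pop(0)
            self.available -= demand
            granted.set()

limiter = HeadOfLineLimiter(total=2)
a_ran = threading.Event()

def model_a():
    limiter.acquire(0)  # like aaa: no resource demand at all
    a_ran.set()
    limiter.release(0)

limiter.acquire(2)  # a bbb instance grabs both units of R2 and starts running
threading.Thread(target=lambda: limiter.acquire(2), daemon=True).start()
time.sleep(0.1)     # let a second bbb request reach the head of the queue

# The running bbb now issues its BLS call to aaa and waits for the answer:
threading.Thread(target=model_a, daemon=True).start()
deadlocked = not a_ran.wait(timeout=2.0)  # aaa is stuck behind the queued bbb
limiter.release(2)  # break the cycle so the interpreter can exit cleanly
print("deadlocked:", deadlocked)
```

With the head-of-line check, aaa's zero-demand request can never be granted while the queued bbb sits in front of it, even though granting it would let the running bbb finish and free R2.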

Triton Information
What version of Triton are you using?

25.07

To Reproduce
Steps to reproduce the behavior.

Two Python models, aaa and bbb.

aaa config.pbtxt

name: "aaa"
backend: "python"
max_batch_size: 0
input [
  { name: "X", data_type: TYPE_INT32, dims: [ 1 ] }
]
output [
  { name: "Y", data_type: TYPE_INT32, dims: [ 1 ] }
]

instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]

aaa model.py

import time
import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def initialize(self, args):
        pass

    def execute(self, requests):
        responses = []
        for request in requests:
            in_tensor = pb_utils.get_input_tensor_by_name(request, "X")
            if in_tensor is None:
                raise pb_utils.TritonModelException("Missing input tensor 'X'")

            x = in_tensor.as_numpy().reshape(-1)[0]
            x_val = int(x)
            print("[A] Received request with x = {}".format(x_val), flush=True)

            # Sleep so the nested call from bbb stays in flight long enough
            # to observe the hang.
            time.sleep(1)

            y = np.array([x_val + 1], dtype=np.int32)
            out_tensor = pb_utils.Tensor("Y", y)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses

    def finalize(self):
        pass

bbb config.pbtxt

name: "bbb"
backend: "python"
max_batch_size: 0

input  [{ name: "X", data_type: TYPE_INT32, dims: [ 1 ] }]
output [{ name: "Y", data_type: TYPE_INT32, dims: [ 1 ] }]

instance_group [
  {
    count: 5
    kind: KIND_CPU
    rate_limiter {
      resources [
        {
          name: "R2"
          global: True
          count: 2
        }
      ]
    }
  }
]

bbb model.py

import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def initialize(self, args):
        self.a_model = "aaa"
        self.pref_mem_cpu = pb_utils.PreferredMemory(
            pb_utils.TRITONSERVER_MEMORY_CPU, 0
        )

    def execute(self, requests):
        responses = []
        for request in requests:
            x_tensor = pb_utils.get_input_tensor_by_name(request, "X")
            if x_tensor is None:
                raise pb_utils.TritonModelException("Missing input tensor 'X'")

            x_val = int(x_tensor.as_numpy().reshape(-1)[0])
            print(f"[B] Received X = {x_val}", flush=True)

            if x_val < 0:
                raise pb_utils.TritonModelException("X must be non-negative")

            current = np.array([x_val], dtype=np.int32)

            for i in range(x_val):
                if request.is_cancelled():
                    print(f"[B] Cancelled at iteration {i}/{x_val}", flush=True)
                    raise pb_utils.TritonModelException("Request was cancelled by client")

                infer_request = pb_utils.InferenceRequest(
                    model_name=self.a_model,
                    requested_output_names=["Y"],
                    inputs=[pb_utils.Tensor("X", current)],
                    preferred_memory=self.pref_mem_cpu,
                )
                print("B: inferring", flush=True)
                infer_response = infer_request.exec()
                if infer_response.has_error():
                    raise pb_utils.TritonModelException(
                        infer_response.error().message()
                    )
                out_tensor = pb_utils.get_output_tensor_by_name(infer_response, "Y")
                current = out_tensor.as_numpy().astype(np.int32)

            responses.append(
                pb_utils.InferenceResponse(
                    output_tensors=[pb_utils.Tensor("Y", current)]
                )
            )
        return responses

    def finalize(self):
        pass

Use the following command to start Triton:

tritonserver --model-repository=/models --exit-timeout-secs=0 --rate-limit execution_count --rate-limit-resource=R2:2 --log-verbose=2

Make 5 concurrent calls to model bbb and you will see them get stuck. The last log line looks something like:

I0817 12:03:00.485035 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING"
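The five concurrent calls can be scripted with only the standard library. The sketch below assumes Triton's default HTTP port (8000) and the KServe v2 inference endpoint; the URL and helper names are otherwise hypothetical:

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumes Triton's default HTTP endpoint; adjust for your deployment.
TRITON_URL = "http://localhost:8000/v2/models/bbb/infer"

def build_infer_payload(x: int) -> dict:
    """KServe v2 request body for bbb's single INT32 input X."""
    return {
        "inputs": [
            {"name": "X", "shape": [1], "datatype": "INT32", "data": [x]}
        ]
    }

def call_bbb(x: int) -> bytes:
    req = urllib.request.Request(
        TRITON_URL,
        data=json.dumps(build_infer_payload(x)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

if __name__ == "__main__":
    # Five concurrent calls: with R2:2 and each bbb instance demanding
    # 2 units, the server wedges instead of finishing.
    with ThreadPoolExecutor(max_workers=5) as pool:
        for result in pool.map(call_bbb, [3, 3, 3, 3, 3]):
            print(result)
```

When the deadlock hits, none of the five calls return and the script hangs, matching the PENDING log line above.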

Expected behavior

Deadlocks should be avoided in this situation; I proposed a possible solution in the PR above. It slightly changes the scheduling logic, so a high-priority job requiring more resources might wait longer, but overall resource utilization should improve.
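For illustration only, here is one scheduling policy in that spirit: grant any queued request whose demand currently fits instead of stalling on the head. This is a toy sketch, not the actual core#448 change, and all names are invented:

```python
import threading
import time

# Sketch of an alternative policy: scan the whole wait queue and grant any
# request whose demand fits, rather than stopping at the first one that
# does not. Illustrative only.
class BackfillLimiter:
    def __init__(self, total):
        self.available = total
        self.cv = threading.Condition()
        self.wait_queue = []  # FIFO of (demand, grant_event)

    def acquire(self, demand):
        granted = threading.Event()
        with self.cv:
            self.wait_queue.append((demand, granted))
            self._dispatch()
        granted.wait()

    def release(self, demand):
        with self.cv:
            self.available += demand
            self._dispatch()

    def _dispatch(self):
        progressed = True
        while progressed:
            progressed = False
            for i, (demand, granted) in enumerate(self.wait_queue):
                if demand <= self.available:  # backfill past blocked heads
                    self.available -= demand
                    granted.set()
                    del self.wait_queue[i]
                    progressed = True
                    break

limiter = BackfillLimiter(total=2)
limiter.acquire(2)  # a running bbb holds both units of R2
threading.Thread(target=lambda: limiter.acquire(2), daemon=True).start()
time.sleep(0.1)     # a second bbb queues ahead of aaa's request

a_done = threading.Event()
def model_a():
    limiter.acquire(0)  # zero demand: granted despite the blocked bbb ahead
    a_done.set()
    limiter.release(0)

threading.Thread(target=model_a, daemon=True).start()
finished = a_done.wait(timeout=2.0)
limiter.release(2)  # the running bbb completes; the queued bbb goes next
print("A completed while B held R2:", finished)
```

The trade-off is exactly the one described above: a queued high-demand request can be repeatedly overtaken by smaller ones, so it may wait longer, but zero-demand BLS callees are never wedged behind it.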
