Deadlock when BLS model requires resources #8358

@zeruniverse

Description


The rate limiter only checks whether the highest-priority request in its queue can be allocated. This creates a deadlock when the highest-priority request (say, for model C) requires resources that are held by a running BLS model (say, model B), and B issues a BLS call to a lower-priority model (say, model A, even one that requires no resources at all): A's request is queued behind C's and is blocked forever, because B never finishes until A responds.

I made a PR that might solve this problem: triton-inference-server/core#448
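The cycle can be reproduced in miniature without Triton at all. The sketch below is a toy model of a limiter that only ever examines the head of its wait queue (my reading of the behavior described above); the class, names, and timeouts are invented for illustration:

```python
import threading
import time

# Toy model (NOT Triton code) of a rate limiter that only ever examines the
# head of its wait queue before allocating. Resource "R2" has 2 units,
# mirroring the repro config below.
class HeadOfLineLimiter:
    def __init__(self, total):
        self.available = total
        self.cv = threading.Condition()
        self.wait_queue = []  # FIFO of (demand, grant_event)

    def acquire(self, demand):
        granted = threading.Event()
        with self.cv:
            self.wait_queue.append((demand, granted))
            self._dispatch()
        granted.wait()

    def release(self, demand):
        with self.cv:
            self.available += demand
            self._dispatch()

    def _dispatch(self):
        # Only the queue head is considered: if it does not fit, everything
        # behind it waits -- even requests that need no resources at all.
        while self.wait_queue and self.wait_queue[0][0] <= self.available:
            demand, granted = self.wait_queue.pop(0)
            self.available -= demand
            granted.set()

limiter = HeadOfLineLimiter(total=2)
a_ran = threading.Event()

def model_a():
    limiter.acquire(0)  # like aaa: no resource demand at all
    a_ran.set()
    limiter.release(0)

limiter.acquire(2)  # a bbb instance grabs both units of R2 and starts running
threading.Thread(target=lambda: limiter.acquire(2), daemon=True).start()
time.sleep(0.1)     # let a second bbb request reach the head of the queue

# The running bbb now issues its BLS call to aaa and waits for the answer:
threading.Thread(target=model_a, daemon=True).start()
deadlocked = not a_ran.wait(timeout=2.0)  # aaa is stuck behind the queued bbb
limiter.release(2)  # break the cycle so the interpreter can exit cleanly
print("deadlocked:", deadlocked)
```

With the head-of-line check, aaa's zero-demand request can never be granted while the queued bbb sits in front of it, even though granting it would let the running bbb finish and free R2.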

Triton Information
What version of Triton are you using?

25.07

To Reproduce
Steps to reproduce the behavior.

Two Python models, aaa and bbb.

aaa config.pbtxt

name: "aaa"
backend: "python"
max_batch_size: 0
input [
  { name: "X", data_type: TYPE_INT32, dims: [ 1 ] }
]
output [
  { name: "Y", data_type: TYPE_INT32, dims: [ 1 ] }
]

instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]

aaa model.py

import time
import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def initialize(self, args):
        pass

    def execute(self, requests):
        responses = []
        for request in requests:
            in_tensor = pb_utils.get_input_tensor_by_name(request, "X")
            if in_tensor is None:
                raise pb_utils.TritonModelException("Missing input tensor 'X'")

            x = in_tensor.as_numpy().reshape(-1)[0]
            x_val = int(x)
            print("[A] Received request with x = {}".format(x_val), flush=True)

            # Sleep so the nested call from bbb stays in flight long enough
            # to observe the hang.
            time.sleep(1)

            y = np.array([x_val + 1], dtype=np.int32)
            out_tensor = pb_utils.Tensor("Y", y)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses

    def finalize(self):
        pass

bbb config.pbtxt

name: "bbb"
backend: "python"
max_batch_size: 0

input  [{ name: "X", data_type: TYPE_INT32, dims: [ 1 ] }]
output [{ name: "Y", data_type: TYPE_INT32, dims: [ 1 ] }]

instance_group [
  {
    count: 5
    kind: KIND_CPU
    rate_limiter {
      resources [
        {
          name: "R2"
          global: True
          count: 2
        }
      ]
    }
  }
]

bbb model.py

import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def initialize(self, args):
        self.a_model = "aaa"
        self.pref_mem_cpu = pb_utils.PreferredMemory(
            pb_utils.TRITONSERVER_MEMORY_CPU, 0
        )

    def execute(self, requests):
        responses = []
        for request in requests:
            x_tensor = pb_utils.get_input_tensor_by_name(request, "X")
            if x_tensor is None:
                raise pb_utils.TritonModelException("Missing input tensor 'X'")

            x_val = int(x_tensor.as_numpy().reshape(-1)[0])
            print(f"[B] Received X = {x_val}", flush=True)

            if x_val < 0:
                raise pb_utils.TritonModelException("X must be non-negative")

            current = np.array([x_val], dtype=np.int32)

            for i in range(x_val):
                if request.is_cancelled():
                    print(f"[B] Cancelled at iteration {i}/{x_val}", flush=True)
                    raise pb_utils.TritonModelException("Request was cancelled by client")

                infer_request = pb_utils.InferenceRequest(
                    model_name=self.a_model,
                    requested_output_names=["Y"],
                    inputs=[pb_utils.Tensor("X", current)],
                    preferred_memory=self.pref_mem_cpu,
                )
                print("B: inferring", flush=True)
                infer_response = infer_request.exec()
                if infer_response.has_error():
                    raise pb_utils.TritonModelException(
                        infer_response.error().message()
                    )
                out_tensor = pb_utils.get_output_tensor_by_name(infer_response, "Y")
                current = out_tensor.as_numpy().astype(np.int32)

            responses.append(
                pb_utils.InferenceResponse(
                    output_tensors=[pb_utils.Tensor("Y", current)]
                )
            )
        return responses

    def finalize(self):
        pass

Use the following command to start Triton:

tritonserver --model-repository=/models --exit-timeout-secs=0 --rate-limit execution_count --rate-limit-resource=R2:2 --log-verbose=2

Make 5 concurrent calls to model bbb and you will see them get stuck. The last log line looks something like:

I0817 12:03:00.485035 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING"
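The five concurrent calls can be scripted with only the standard library. The sketch below assumes Triton's default HTTP port (8000) and the KServe v2 inference endpoint; the URL and helper names are otherwise hypothetical:

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumes Triton's default HTTP endpoint; adjust for your deployment.
TRITON_URL = "http://localhost:8000/v2/models/bbb/infer"

def build_infer_payload(x: int) -> dict:
    """KServe v2 request body for bbb's single INT32 input X."""
    return {
        "inputs": [
            {"name": "X", "shape": [1], "datatype": "INT32", "data": [x]}
        ]
    }

def call_bbb(x: int) -> bytes:
    req = urllib.request.Request(
        TRITON_URL,
        data=json.dumps(build_infer_payload(x)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

if __name__ == "__main__":
    # Five concurrent calls: with R2:2 and each bbb instance demanding
    # 2 units, the server wedges instead of finishing.
    with ThreadPoolExecutor(max_workers=5) as pool:
        for result in pool.map(call_bbb, [3, 3, 3, 3, 3]):
            print(result)
```

When the deadlock hits, none of the five calls return and the script hangs, matching the PENDING log line above.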

Expected behavior

Deadlocks should be avoided in this situation; I proposed a possible solution in the PR above. It slightly changes the scheduling logic, so a high-priority job requiring more resources might wait longer, but overall resource utilization should improve.
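For illustration only, here is one scheduling policy in that spirit: grant any queued request whose demand currently fits instead of stalling on the head. This is a toy sketch, not the actual core#448 change, and all names are invented:

```python
import threading
import time

# Sketch of an alternative policy: scan the whole wait queue and grant any
# request whose demand fits, rather than stopping at the first one that
# does not. Illustrative only.
class BackfillLimiter:
    def __init__(self, total):
        self.available = total
        self.cv = threading.Condition()
        self.wait_queue = []  # FIFO of (demand, grant_event)

    def acquire(self, demand):
        granted = threading.Event()
        with self.cv:
            self.wait_queue.append((demand, granted))
            self._dispatch()
        granted.wait()

    def release(self, demand):
        with self.cv:
            self.available += demand
            self._dispatch()

    def _dispatch(self):
        progressed = True
        while progressed:
            progressed = False
            for i, (demand, granted) in enumerate(self.wait_queue):
                if demand <= self.available:  # backfill past blocked heads
                    self.available -= demand
                    granted.set()
                    del self.wait_queue[i]
                    progressed = True
                    break

limiter = BackfillLimiter(total=2)
limiter.acquire(2)  # a running bbb holds both units of R2
threading.Thread(target=lambda: limiter.acquire(2), daemon=True).start()
time.sleep(0.1)     # a second bbb queues ahead of aaa's request

a_done = threading.Event()
def model_a():
    limiter.acquire(0)  # zero demand: granted despite the blocked bbb ahead
    a_done.set()
    limiter.release(0)

threading.Thread(target=model_a, daemon=True).start()
finished = a_done.wait(timeout=2.0)
limiter.release(2)  # the running bbb completes; the queued bbb goes next
print("A completed while B held R2:", finished)
```

The trade-off is exactly the one described above: a queued high-demand request can be repeatedly overtaken by smaller ones, so it may wait longer, but zero-demand BLS callees are never wedged behind it.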
