Description
The rate limiter only checks whether the highest-priority instance can be allocated. This can create a deadlock when the highest-priority instance (say model C) requires resources that are held by another running BLS model (say model B), and B calls a lower-priority model (say model A, even one that requires no resources at all): A is blocked behind C forever, and B never finishes because it is waiting on A.
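To make the failure mode concrete, here is a minimal toy simulation of a head-of-queue-only allocation policy (illustrative only; the names, function, and data layout are my own and are not Triton's actual scheduler code). A payload that needs no resources is stuck behind a blocked higher-priority payload:

```python
# Toy model of a rate limiter that only ever inspects the head of the
# priority queue before dispatching anything behind it.

def dispatch_head_only(queue, available):
    """Dispatch payloads in priority order, but stop as soon as the head
    payload cannot be allocated. Returns the payloads that actually ran."""
    ran = []
    while queue:
        name, needed = queue[0]
        if needed > available:
            break  # head is blocked -> nothing behind it is even considered
        queue.pop(0)
        available -= needed
        ran.append(name)
    return ran

# State mirroring the repro below: a running bbb instance holds both units
# of R2, a second bbb payload (needs 2) sits at the head of the queue, and
# the nested aaa call (needs 0 resources) waits behind it.
queue = [("bbb#2", 2), ("aaa", 0)]
available_r2 = 0  # everything is held by the running bbb instance

print(dispatch_head_only(queue, available_r2))  # [] -> aaa never runs: deadlock
```

Even though aaa needs no resources at all, it is never dispatched, so the running bbb instance (which is waiting on aaa) never releases R2.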
I made a PR that might solve this problem: triton-inference-server/core#448
Triton Information
What version of Triton are you using?
25.07
To Reproduce
Create two Python models, aaa and bbb.
aaa config.pbtxt
name: "aaa"
backend: "python"
max_batch_size: 0
input [
{ name: "X", data_type: TYPE_INT32, dims: [ 1 ] }
]
output [
{ name: "Y", data_type: TYPE_INT32, dims: [ 1 ] }
]
instance_group [
{
count: 1
kind: KIND_CPU
}
]
aaa model.py
import time
import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def initialize(self, args):
        pass

    def execute(self, requests):
        responses = []
        for request in requests:
            in_tensor = pb_utils.get_input_tensor_by_name(request, "X")
            if in_tensor is None:
                raise pb_utils.TritonModelException("Missing input tensor 'X'")
            x = in_tensor.as_numpy().reshape(-1)[0]
            x_val = int(x)
            print("[A] Received request with x = {}".format(x_val), flush=True)
            time.sleep(1)
            y = np.array([x_val + 1], dtype=np.int32)
            out_tensor = pb_utils.Tensor("Y", y)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses

    def finalize(self):
        pass

bbb config.pbtxt
name: "bbb"
backend: "python"
max_batch_size: 0
input [{ name: "X", data_type: TYPE_INT32, dims: [ 1 ] }]
output [{ name: "Y", data_type: TYPE_INT32, dims: [ 1 ] }]
instance_group [
{
count: 5
kind: KIND_CPU
rate_limiter {
resources [
{
name: "R2"
global: True
count: 2
}
]
}
}
]
bbb model.py
import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def initialize(self, args):
        self.a_model = "aaa"
        self.pref_mem_cpu = pb_utils.PreferredMemory(
            pb_utils.TRITONSERVER_MEMORY_CPU, 0
        )

    def execute(self, requests):
        responses = []
        for request in requests:
            x_tensor = pb_utils.get_input_tensor_by_name(request, "X")
            if x_tensor is None:
                raise pb_utils.TritonModelException("Missing input tensor 'X'")
            x_val = int(x_tensor.as_numpy().reshape(-1)[0])
            print(f"[B] Received X = {x_val}", flush=True)
            if x_val < 0:
                raise pb_utils.TritonModelException("X must be non-negative")
            current = np.array([x_val], dtype=np.int32)
            for i in range(x_val):
                if request.is_cancelled():
                    print(f"[B] Cancelled at iteration {i}/{x_val}", flush=True)
                    raise pb_utils.TritonModelException("Request was cancelled by client")
                infer_request = pb_utils.InferenceRequest(
                    model_name=self.a_model,
                    requested_output_names=["Y"],
                    inputs=[pb_utils.Tensor("X", current)],
                    preferred_memory=self.pref_mem_cpu,
                )
                print("B: inferring", flush=True)
                infer_response = infer_request.exec()
                if infer_response.has_error():
                    raise pb_utils.TritonModelException(
                        infer_response.error().message()
                    )
                out_tensor = pb_utils.get_output_tensor_by_name(infer_response, "Y")
                current = out_tensor.as_numpy().astype(np.int32)
            responses.append(
                pb_utils.InferenceResponse(
                    output_tensors=[pb_utils.Tensor("Y", current)]
                )
            )
        return responses

    def finalize(self):
        pass

Use the following command to start Triton:
tritonserver --model-repository=/models --exit-timeout-secs=0 --rate-limit execution_count --rate-limit-resource=R2:2 --log-verbose=2
Make 5 concurrent calls to model bbb and you will see them get stuck. The last log line shows something like:
I0817 12:03:00.485035 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING"
Expected behavior
Deadlocks should be avoided in this situation. I proposed a possible solution in triton-inference-server/core#448. The PR slightly changes the scheduling logic, so a high-priority job requiring more resources might wait longer, but overall resource utilization should improve.
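One way to picture the proposed change (this is my own paraphrase of the idea as a toy sketch, not the actual core code): instead of stopping at a blocked head payload, scan past it and dispatch any queued payload whose resource demand can currently be met, leaving blocked payloads queued with their priority intact:

```python
# Toy model of the proposed scheduling change: scan the whole queue for
# allocatable payloads instead of blocking on the highest-priority one.

def dispatch_scan(queue, available):
    """Dispatch every queued payload whose resources are available, even if
    a higher-priority payload ahead of it is still blocked. Returns the
    payloads that ran and the payloads that remain queued."""
    ran = []
    remaining = []
    for name, needed in queue:
        if needed <= available:
            available -= needed
            ran.append(name)
        else:
            remaining.append(name)  # stays queued, keeps its priority
    return ran, remaining

# Same state as the repro: a blocked bbb payload (needs 2 units of R2) at
# the head, the nested aaa call (needs 0) behind it, no R2 units free.
ran, remaining = dispatch_scan([("bbb#2", 2), ("aaa", 0)], 0)
print(ran, remaining)  # ['aaa'] ['bbb#2']
```

Here aaa proceeds despite the blocked bbb payload ahead of it, so the running bbb instance can finish and release R2, breaking the cycle. The trade-off, as noted above, is that a high-priority payload with a large resource demand may wait behind smaller payloads for longer.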