Description
When deploying an ONNX model with the Triton Inference Server ONNX Runtime backend, CPU inference is noticeably slower than running the same model directly through the ONNX Runtime Python API. The discrepancy was observed under identical conditions: same hardware, same model, and same input data.
Triton Information
TRITON_VERSION <= 24.09
To Reproduce
Model used:

```bash
wget -O model.onnx https://github.com/onnx/models/raw/refs/heads/main/validated/vision/classification/densenet-121/model/densenet-12.onnx
```
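The report does not show the model repository layout or the server launch command. A minimal sketch of the standard setup, assuming Docker, the 24.09 container, and a mapping of Triton's default gRPC port (8001) to the 9178 used by the client below; these specifics are assumptions, not from the original report:

```bash
# Standard Triton model repository layout: <repo>/<model_name>/<version>/model.onnx
mkdir -p model_repository/test_densenet/1
mv model.onnx model_repository/test_densenet/1/model.onnx
# The config.pbtxt from the next section goes next to the version directory:
#   model_repository/test_densenet/config.pbtxt

# Image tag and port mapping are illustrative.
docker run --rm -p 9178:8001 \
  -v "$PWD/model_repository:/models" \
  nvcr.io/nvidia/tritonserver:24.09-py3 \
  tritonserver --model-repository=/models
```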
Triton server (ONNX Runtime backend)

config.pbtxt:

```
name: "test_densenet"
platform: "onnxruntime_onnx"
```
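Not part of the original report: the minimal config above leaves the backend's CPU threading at its defaults, which is one plausible source of the gap. A sketch of a config.pbtxt that makes those settings explicit, assuming the `intra_op_thread_count` / `inter_op_thread_count` parameters exposed by the ONNX Runtime backend (values here are illustrative):

```
name: "test_densenet"
platform: "onnxruntime_onnx"

# Keep a single CPU model instance so threads are not oversubscribed.
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]

# ONNX Runtime backend threading parameters (values must be strings).
parameters { key: "intra_op_thread_count" value: { string_value: "4" } }
parameters { key: "inter_op_thread_count" value: { string_value: "1" } }
```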
Python clients
Triton client (timed in a Jupyter cell):

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Connect to the Triton server's gRPC endpoint.
triton_client = grpcclient.InferenceServerClient(url="localhost:9178")

# Use the public InferInput class rather than the private
# tritonclient.grpc._infer_input module, and don't shadow the module name.
infer_input = grpcclient.InferInput("data_0", [1, 3, 224, 224], "FP32")
infer_input.set_data_from_numpy(np.zeros((1, 3, 224, 224), dtype=np.float32))
```

```python
%%timeit
res = triton_client.infer(model_name="test_densenet", inputs=[infer_input])
```

Results: 473 ms ± 87.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
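To help localize the difference (not from the original report), Triton's per-model statistics separate server-side queue and compute time from the client round trip, so gRPC/serialization overhead can be ruled in or out. A minimal sketch reusing the `triton_client` above; the field layout follows the statistics message returned with `as_json=True` and may need adjusting:

```python
# Cumulative per-model statistics; durations are reported in nanoseconds.
stats = triton_client.get_inference_statistics(
    model_name="test_densenet", as_json=True
)

infer_stats = stats["model_stats"][0]["inference_stats"]
count = int(infer_stats["success"]["count"])

queue_ms = int(infer_stats["queue"]["ns"]) / count / 1e6
compute_ms = int(infer_stats["compute_infer"]["ns"]) / count / 1e6
print(f"avg queue: {queue_ms:.2f} ms, avg compute: {compute_ms:.2f} ms")
```

If the average compute time is close to the standalone ONNX Runtime number, the gap is mostly transport and serialization; if it is close to 473 ms, the backend session itself is slower.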
ONNX Runtime (timed in a Jupyter cell):

```python
import numpy as np
import onnxruntime as ort

# Default session: CPUExecutionProvider with ONNX Runtime's default thread settings.
ort_sess = ort.InferenceSession("model.onnx")
test_inputs = {"data_0": np.zeros((1, 3, 224, 224), dtype=np.float32)}
```

```python
%%timeit
ort_sess.run(["fc6_1"], test_inputs)
```

Results: 159 ms ± 23.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
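For an apples-to-apples comparison (again, not from the original report), the standalone session's threading and optimization settings can be pinned explicitly with `ort.SessionOptions`, since a default `InferenceSession` may choose different intra-op thread counts than the session created inside the Triton backend. A sketch with illustrative values:

```python
import numpy as np
import onnxruntime as ort

# Pin threading and graph optimization so both setups can be configured identically.
so = ort.SessionOptions()
so.intra_op_num_threads = 4   # illustrative; match whatever the Triton backend uses
so.inter_op_num_threads = 1
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

ort_sess = ort.InferenceSession(
    "model.onnx", sess_options=so, providers=["CPUExecutionProvider"]
)

test_inputs = {"data_0": np.zeros((1, 3, 224, 224), dtype=np.float32)}
outputs = ort_sess.run(["fc6_1"], test_inputs)
```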