
Python TensorBuffer.read() is very slow #5755

@laclouis5

Description


Environment

Computer: MacBook Pro M1 Pro
macOS version: 26.2 (Tahoe)
Python version: 3.11
LiteRT version: 2.1.2

This issue has also been observed on a x86_64 Ubuntu 24 CUDA environment and is likely present on other platforms using the Python API.

Issue

The Python TensorBuffer.read() method is very slow, likely due to unoptimized type conversion in the Python wrapper.

A call to this method triggers a call to the C++ function BuildPyListFromFloat() (or one of its counterparts for the other supported dtypes).

PyObject* TensorBufferWrapper::ReadTensor(PyObject* buffer_capsule,
                                          int num_elements,
                                          const std::string& dtype) {
  if (!PyCapsule_CheckExact(buffer_capsule)) {
    return ReportError("ReadTensor: invalid capsule");
  }
  void* ptr = PyCapsule_GetPointer(
      buffer_capsule, litert_wrapper_utils::kLiteRtTensorBufferName.data());
  if (!ptr) {
    return ReportError("ReadTensor: null pointer in capsule");
  }
  TensorBuffer tb = TensorBuffer::WrapCObject(
      static_cast<LiteRtTensorBuffer>(ptr), OwnHandle::kNo);
  if (dtype == "float32") {
    std::vector<float> data(num_elements, 0.f);
    if (auto status = tb.Read<float>(absl::MakeSpan(data)); !status)
      return ConvertErrorToPyExc(status.Error());
    return BuildPyListFromFloat(data);
  }
  if (dtype == "int32") {
    std::vector<int32_t> data(num_elements, 0);
    if (auto status = tb.Read<int32_t>(absl::MakeSpan(data)); !status)
      return ConvertErrorToPyExc(status.Error());
    return BuildPyListFromInt32(data);
  }
  if (dtype == "int8") {
    std::vector<int8_t> data(num_elements, 0);
    if (auto status = tb.Read<int8_t>(absl::MakeSpan(data)); !status)
      return ConvertErrorToPyExc(status.Error());
    return BuildPyListFromInt8(data);
  }
  return ReportError("ReadTensor: unsupported dtype '" + dtype + "'");
}

This function is likely the bottleneck: it converts each element to a Python float and inserts it into a Python list, one element at a time.

PyObject* BuildPyListFromFloat(absl::Span<const float> data) {
  PyObject* py_list = PyList_New(data.size());
  for (size_t i = 0; i < data.size(); i++) {
    PyList_SetItem(py_list, i, PyFloat_FromDouble(data[i]));
  }
  return py_list;
}
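To see why this dominates, here is a quick micro-benchmark (independent of LiteRT; the array is synthetic) comparing per-element boxing, which mirrors what BuildPyListFromFloat() does, against a single bulk reinterpretation of the same bytes:

```python
import timeit

import numpy as np

# Synthetic stand-in for a ~4 MB float32 tensor.
n = 1_000_000
arr = np.random.rand(n).astype(np.float32)

# Per-element path: box every value into a Python float and collect the
# results in a Python list, mirroring BuildPyListFromFloat().
t_list = timeit.timeit(lambda: [float(x) for x in arr], number=1)

# Bulk path: reinterpret the raw bytes in one shot.
t_bulk = timeit.timeit(
    lambda: np.frombuffer(arr.tobytes(), dtype=np.float32), number=1)

print(f"per-element: {t_list:.3f} s, bulk: {t_bulk:.4f} s")
```

On typical hardware the per-element path is orders of magnitude slower, consistent with the timings reported below.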

Finally, the Python wrapper converts the Python list back into a NumPy array, making the intermediate list construction unnecessary work.

def read(self, num_elements: int, output_dtype):
  """Reads data from this tensor buffer.

  Args:
    num_elements: Number of elements to read.
    output_dtype: NumPy dtype for the output (e.g., np.float32, np.int8).

  Returns:
    A NumPy array containing the tensor data.

  Example:
    # Get output as NumPy array
    output_array = tensor_buffer.read(4, np.float32).reshape((1, 4))

  Raises:
    ValueError: If output_dtype is not a NumPy dtype or is not supported.
  """
  if not isinstance(output_dtype, type) or not hasattr(
      np, output_dtype.__name__
  ):
    raise ValueError("output_dtype must be a NumPy dtype (e.g., np.float32)")
  dtype_str = self._dtype_to_str(output_dtype)
  data_list = _tb.ReadTensor(self._capsule, num_elements, dtype_str)
  return np.array(data_list, dtype=output_dtype)

This issue makes reading data from a TensorBuffer in Python orders of magnitude slower than inference itself on some models. Some read operations took more than 100 ms even for relatively small tensors (around 20 MB).
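One possible direction for a fix, sketched at the Python level: if the C++ layer returned the tensor's raw bytes instead of a Python list, the wrapper could hand them to NumPy directly. The raw-bytes source below is simulated with synthetic data; a function like ReadTensorBytes() does not exist in LiteRT today and would need to be added on the C++ side.

```python
import numpy as np


def read_via_frombuffer(raw_bytes: bytes, num_elements: int,
                        output_dtype) -> np.ndarray:
  """Hypothetical fast path: reinterpret raw tensor bytes with NumPy.

  Assumes the C++ wrapper exposed the buffer contents as a bytes object
  (or anything supporting the buffer protocol) instead of a Python list.
  """
  arr = np.frombuffer(raw_bytes, dtype=output_dtype, count=num_elements)
  return arr.copy()  # detach the result from the transient bytes object


# Demo with synthetic data standing in for the tensor buffer contents:
raw = np.arange(4, dtype=np.float32).tobytes()
out = read_via_frombuffer(raw, 4, np.float32)
print(out)  # [0. 1. 2. 3.]
```

np.frombuffer performs a single O(n) reinterpretation with no per-element Python object creation, which is why it avoids the cost measured above.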
