Description
Background:
I previously deployed ColBERT in Python using the fastembed library with GPU support.
During this deployment, I observed that it utilized only 2 GB of GPU memory out of the 16 GB available on my GPU.
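Roughly, that deployment looked like the following (a minimal sketch; the exact initialization arguments may differ from what I used, and GPU execution assumes onnxruntime-gpu / fastembed-gpu is installed):

```python
from fastembed import LateInteractionTextEmbedding

# Sketch of the fastembed ColBERT deployment; provider/arguments are approximate.
model = LateInteractionTextEmbedding(
    "colbert-ir/colbertv2.0",
    providers=["CUDAExecutionProvider"],  # GPU execution via onnxruntime-gpu
)

documents = ["ColBERT is a late-interaction retrieval model."]
# embed() yields one (num_tokens, embedding_dim) array per document
embeddings = list(model.embed(documents))
print(embeddings[0].shape)
```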
Current Deployment:
To address this limited memory usage, I redeployed ColBERT on Triton Inference Server using the ONNX backend, expecting better GPU memory utilization.
However, I still observe that the deployment only utilizes approximately 2 GB of GPU memory, leaving most of the GPU memory unused.
Issue:
It appears that neither the fastembed deployment nor the Triton deployment fully utilizes the available GPU memory.
I suspect there might be specific settings, configurations, or optimizations that could allow ColBERT to use more GPU memory.
Questions:
- Are there specific settings in Triton Inference Server, the ONNX backend, or the ColBERT configuration that would increase GPU memory usage?
- Could this behavior be related to batch size, ONNX graph optimization, or other resource allocation parameters? (The onnxruntime sketch after this list shows the kind of parameters I mean.)
- Is this limited memory usage expected for ColBERT models, or could it indicate a bottleneck in deployment?
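To make the graph-optimization and resource-allocation part of the question concrete: when I load the same model.onnx directly with onnxruntime in Python, these are the kinds of knobs I have in mind (a minimal sketch; I am not sure which of these the Triton ONNX Runtime backend exposes, and the memory limit below is only an example value):

```python
import onnxruntime as ort

# Standard onnxruntime session options; graph optimization level is one of the
# parameters mentioned in my question above.
session_options = ort.SessionOptions()
session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# CUDA execution provider options that affect GPU memory allocation.
# The gpu_mem_limit value here is only illustrative.
cuda_provider = (
    "CUDAExecutionProvider",
    {
        "arena_extend_strategy": "kNextPowerOfTwo",
        "gpu_mem_limit": 8 * 1024 * 1024 * 1024,  # 8 GB, example value
    },
)

session = ort.InferenceSession(
    "model.onnx",
    sess_options=session_options,
    providers=[cuda_provider, "CPUExecutionProvider"],
)
```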
I am using the model.onnx file available on Hugging Face for ColBERT.
Here is my config.pbtxt file:
```
name: "colbert-ir_colbertv2.0"
platform: "onnxruntime_onnx"
backend: "onnxruntime"
max_batch_size: 25

input [
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [-1]
  },
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [-1]
  }
]

output [
  {
    name: "contextual"
    data_type: TYPE_FP32
    dims: [-1, -1]
  }
]

optimization {
  priority: PRIORITY_DEFAULT
  input_pinned_memory {
    enable: true
  }
  output_pinned_memory {
    enable: true
  }
}

dynamic_batching {
  preferred_batch_size: [4]
  max_queue_delay_microseconds: 0
}

instance_group [
  {
    name: "colbert-ir_colbertv2.0"
    kind: KIND_GPU
    count: 1
    gpus: [0]
  }
]

default_model_filename: "model.onnx"
```
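For completeness, requests are sent to this model roughly as follows (a minimal sketch using the Python tritonclient; the server URL, tokenizer, and query text are placeholders, and tokenization happens on the client side):

```python
import numpy as np
import tritonclient.http as httpclient
from transformers import AutoTokenizer  # assumption: client-side tokenization

# Placeholder server URL and tokenizer; adjust to the actual setup.
client = httpclient.InferenceServerClient(url="localhost:8000")
tokenizer = AutoTokenizer.from_pretrained("colbert-ir/colbertv2.0")

batch = tokenizer(["what is colbert?"], return_tensors="np", padding=True)
input_ids = batch["input_ids"].astype(np.int64)
attention_mask = batch["attention_mask"].astype(np.int64)

# Input names, shapes, and datatypes match the config.pbtxt above.
inputs = [
    httpclient.InferInput("input_ids", list(input_ids.shape), "INT64"),
    httpclient.InferInput("attention_mask", list(attention_mask.shape), "INT64"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)

result = client.infer(
    "colbert-ir_colbertv2.0",
    inputs,
    outputs=[httpclient.InferRequestedOutput("contextual")],
)
print(result.as_numpy("contextual").shape)  # (batch, num_tokens, dim)
```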