Thanks for the awesome work!
Currently, I've been struggling with an issue while working with Speedster, which I'll lay out below:
1. I've been able to optimize an ONNX model (from Hugging Face, based on Donut: https://github.com/clovaai/donut).
Code used:
import numpy as np
import torch
from speedster import optimize_model, save_model

# Provide input data for the model
input_data = [
    (
        (
            np.array(torch.randn(5, 3), dtype=np.int64),
            np.array(torch.randn(5, 3, 1024), dtype=np.float32),
        ),
        torch.tensor([0, 1, 0, 1, 1]),
    )
    for _ in range(100)
]

# Run Speedster optimization
optimized_model = optimize_model(
    "./models/onnx/decoder_model.onnx",
    input_data=input_data,
    optimization_time="unconstrained",
    device="gpu:0",
    metric_drop_ths=0.8,
)
save_model(optimized_model, "./models/speedster")
Output:
2023-07-19 14:22:43 | INFO | Running Speedster on GPU:0
2023-07-19 14:25:33 | INFO | Benchmark performance of original model
2023-07-19 14:26:10 | INFO | Original model latency: 0.023933820724487305 sec/iter
2023-07-19 14:26:11 | INFO | [1/1] Running ONNX Optimization Pipeline
2023-07-19 14:26:11 | INFO | Optimizing with ONNXCompiler and q_type: None.
2023-07-19 14:26:14 | WARNING | TensorrtExecutionProvider for onnx is not available. If you want to use it, please add the path to tensorrt to the LD_LIBRARY_PATH environment variable. CUDA provider will be used instead.
2023-07-19 14:26:16 | INFO | Optimized model latency: 0.02505326271057129 sec/iter
2023-07-19 14:26:16 | INFO | Optimizing with ONNXCompiler and q_type: QuantizationType.HALF.
2023-07-19 14:26:44 | INFO | Optimized model latency: 0.3438906669616699 sec/iter
2023-07-19 14:26:44 | INFO | Optimizing with ONNXTensorRTCompiler and q_type: None.
2023-07-19 14:28:18 | INFO | Optimized model latency: 0.004456996917724609 sec/iter
2023-07-19 14:28:18 | INFO | Optimizing with ONNXTensorRTCompiler and q_type: QuantizationType.HALF.
2023-07-19 14:28:51 | INFO | Optimized model latency: 0.003861665725708008 sec/iter
2023-07-19 14:28:51 | INFO | Optimizing with ONNXTensorRTCompiler and q_type: QuantizationType.STATIC.
2023-07-19 14:33:56 | INFO | Optimized model latency: 0.004480838775634766 sec/iter
[Speedster results on Tesla V100-SXM2-16GB]
Metric       Original Model    Optimized Model    Improvement
-----------  ----------------  -----------------  -------------
backend      NUMPY             TensorRT
latency      0.0239 sec/batch  0.0039 sec/batch   6.20x
throughput   208.91 data/sec   1294.78 data/sec   6.20x
model size   743.98 MB         254.43 MB          -65%
metric drop                    0.5291
techniques                     fp16
2. However, I'm hitting a wall when trying to perform inference.
Code used:
import torch
from speedster import load_model

optimized_model = load_model("../opt/models/speedster/")
print("speedster onnx model loaded")

device = "cuda" if torch.cuda.is_available() else "cpu"
dummy_input = torch.randn(1, 3, 300, 400, dtype=torch.float).to(device)
print(type(dummy_input))

# Use the accelerated version of the ONNX model
output = optimized_model(dummy_input)
print(output)
Observation:
2023-07-19 14:35:43 | WARNING | Debug: Got extra keywords in NvidiaInferenceLearner::from_engine_path: {'class_name': 'NumpyONNXTensorRTInferenceLearner', 'module_name': 'nebullvm.operations.inference_learners.tensor_rt'}
speedster onnx model loaded
<class 'torch.Tensor'>
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-9-ea33d0034b2d> in <cell line: 20>()
18
19 # Use the accelerated version of your ONNX model in production
---> 20 output = optimized_model(dummy_input)
21 print(output)
5 frames
/usr/local/lib/python3.10/dist-packages/polygraphy/cuda/cuda.py in dtype(self, new)
296 def dtype(self, new):
297 self._dtype = new
--> 298 self.itemsize = np.dtype(new).itemsize
299
300 @property
TypeError: Cannot interpret 'torch.float32' as a data type
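If I'm reading the traceback correctly, Polygraphy ends up calling np.dtype() on the input's dtype, and numpy cannot interpret torch dtypes, so passing a torch.Tensor fails at that point. A minimal snippet that reproduces the same TypeError outside of Speedster (just to illustrate my reading of the error, not part of my pipeline):

import numpy as np
import torch

# np.dtype() understands numpy dtypes...
print(np.dtype(np.float32).itemsize)  # prints 4

# ...but not torch dtypes, which raises the same error as in the traceback:
# TypeError: Cannot interpret 'torch.float32' as a data type
try:
    np.dtype(torch.float32)
except TypeError as e:
    print(e)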
So my question is: what types of inputs am I supposed to pass to the optimized_model() call here? Previously, I've been passing the following to the original model to get it working:
def run_prediction(test_sample, model=model, processor=processor):
    pixel_values = processor(test_sample, return_tensors="pt").pixel_values
    task_prompt = "<s>"
    decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
    outputs = model.generate(
        pixel_values.to(device),
        decoder_input_ids=decoder_input_ids.to(device),
        max_length=model.decoder.config.max_position_embeddings,
        early_stopping=True,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=False,
        num_beams=1,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
        return_dict_in_generate=True,
    )
    prediction = processor.batch_decode(outputs.sequences)[0]
    prediction = processor.token2json(prediction)
    return prediction
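Based on the error, and the fact that the saved learner is reported as NumpyONNXTensorRTInferenceLearner, my current guess is that the optimized decoder wants numpy arrays with the same dtypes/shapes as the tuples in the input_data I used during optimization (int64 decoder input ids plus float32 encoder hidden states), rather than torch tensors. This is roughly what I plan to try next; it is an untested sketch, and the variable names and shapes are just my assumptions mirroring the optimization inputs:

import numpy as np
from speedster import load_model

optimized_model = load_model("../opt/models/speedster/")

# Numpy inputs with the same dtypes/shapes as the tuples in input_data above
decoder_input_ids = np.zeros((5, 3), dtype=np.int64)                    # token ids (assumed)
encoder_hidden_states = np.random.randn(5, 3, 1024).astype(np.float32)  # encoder output (assumed)

output = optimized_model(decoder_input_ids, encoder_hidden_states)
print(type(output), output)

But I'm not sure whether the learner takes the inputs positionally in this order, or whether torch tensors should be supported at all, hence this issue.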
Please let me know if you require additional information. Thanks!