
How to generate and perform inference for an ONNX model #350

@ghost

Description

Thanks for the awesome work!
I've been struggling with an issue while working with Speedster, which I will lay out below:
1. I was able to optimize an ONNX model (exported from a Hugging Face model based on Donut, https://github.com/clovaai/donut).

code used:

import numpy as np
import torch
from speedster import optimize_model, save_model

# Provide input data for the model:
# (decoder_input_ids as int64, encoder_hidden_states as float32) plus dummy labels
input_data = [((np.array(torch.randn(5, 3), dtype=np.int64), np.array(torch.randn(5, 3, 1024), dtype=np.float32), ), torch.tensor([0, 1, 0, 1, 1])) for _ in range(100)]

# Run Speedster optimization
optimized_model = optimize_model(
    "./models/onnx/decoder_model.onnx",
    input_data=input_data,
    optimization_time="unconstrained",
    device="gpu:0",
    metric_drop_ths=0.8
)

save_model(optimized_model, "./models/speedster")

output:

2023-07-19 14:22:43 | INFO     | Running Speedster on GPU:0
2023-07-19 14:25:33 | INFO     | Benchmark performance of original model
2023-07-19 14:26:10 | INFO     | Original model latency: 0.023933820724487305 sec/iter
2023-07-19 14:26:11 | INFO     | [1/1] Running ONNX Optimization Pipeline
2023-07-19 14:26:11 | INFO     | Optimizing with ONNXCompiler and q_type: None.
2023-07-19 14:26:14 | WARNING  | TensorrtExecutionProvider for onnx is not available. If you want to use it, please  add the path to tensorrt to the LD_LIBRARY_PATH environment variable. CUDA provider will be used instead. 
2023-07-19 14:26:16 | INFO     | Optimized model latency: 0.02505326271057129 sec/iter
2023-07-19 14:26:16 | INFO     | Optimizing with ONNXCompiler and q_type: QuantizationType.HALF.
2023-07-19 14:26:44 | INFO     | Optimized model latency: 0.3438906669616699 sec/iter
2023-07-19 14:26:44 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: None.
2023-07-19 14:28:18 | INFO     | Optimized model latency: 0.004456996917724609 sec/iter
2023-07-19 14:28:18 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: QuantizationType.HALF.
2023-07-19 14:28:51 | INFO     | Optimized model latency: 0.003861665725708008 sec/iter
2023-07-19 14:28:51 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: QuantizationType.STATIC.
2023-07-19 14:33:56 | INFO     | Optimized model latency: 0.004480838775634766 sec/iter

[Speedster results on Tesla V100-SXM2-16GB]
Metric       Original Model    Optimized Model    Improvement
-----------  ----------------  -----------------  -------------
backend      NUMPY             TensorRT
latency      0.0239 sec/batch  0.0039 sec/batch   6.20x
throughput   208.91 data/sec   1294.78 data/sec   6.20x
model size   743.98 MB         254.43 MB          -65%
metric drop                    0.5291
techniques                     fp16
2. I am hitting a wall when trying to perform inference.
code used:
import torch
from speedster import load_model

# Load the saved Speedster model
optimized_model = load_model("../opt/models/speedster/")
print('speedster onnx model loaded')

device = "cuda" if torch.cuda.is_available() else "cpu"
dummy_input = torch.randn(1, 3, 300, 400, dtype=torch.float).to(device)
print(type(dummy_input))

output = optimized_model(dummy_input)
print(output)

observation:

2023-07-19 14:35:43 | WARNING  | Debug: Got extra keywords in NvidiaInferenceLearner::from_engine_path: {'class_name': 'NumpyONNXTensorRTInferenceLearner', 'module_name': 'nebullvm.operations.inference_learners.tensor_rt'}
speedster onnx model loaded
<class 'torch.Tensor'>
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-ea33d0034b2d> in <cell line: 20>()
     18 
     19 # Use the accelerated version of your ONNX model in production
---> 20 output = optimized_model(dummy_input)
     21 print(output)

5 frames
/usr/local/lib/python3.10/dist-packages/polygraphy/cuda/cuda.py in dtype(self, new)
    296     def dtype(self, new):
    297         self._dtype = new
--> 298         self.itemsize = np.dtype(new).itemsize
    299 
    300     @property

TypeError: Cannot interpret 'torch.float32' as a data type
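If it helps narrow things down, the failure seems to reduce to polygraphy calling np.dtype() on a torch dtype, which NumPy cannot interpret. A minimal reproduction of just that error (my own sketch, independent of Speedster):

import numpy as np
import torch

# np.dtype() understands NumPy dtypes...
print(np.dtype(np.float32))  # float32

# ...but not torch dtypes, which appears to be what polygraphy's buffer
# setter receives when I pass a torch.Tensor to the optimized model.
try:
    np.dtype(torch.float32)
except TypeError as err:
    print(err)  # Cannot interpret 'torch.float32' as a data type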

So my question is: what types of arguments should I be passing to the optimized_model() call here? (My best guess is sketched after the snippet below.) Previously, I was passing the following to the original model and it worked:

def run_prediction(test_sample, model=model, processor=processor):
    pixel_values = processor(test_sample, return_tensors="pt").pixel_values
    task_prompt = "<s>"
    decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
    outputs = model.generate(
        pixel_values.to(device),
        decoder_input_ids=decoder_input_ids.to(device),
        max_length=model.decoder.config.max_position_embeddings,
        early_stopping=True,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=False,
        num_beams=1,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
        return_dict_in_generate=True,
    )
    prediction = processor.batch_decode(outputs.sequences)[0]
    prediction = processor.token2json(prediction)
    return prediction 
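Based on the error above, my current guess is that the model saved from an ONNX input is a NumpyONNXTensorRTInferenceLearner, so it expects NumPy arrays passed positionally (with the same dtypes/shapes used during optimization) rather than torch tensors. A sketch of what I would expect to work; the (5, 3) and (5, 3, 1024) shapes are just the placeholders from my optimization samples, not the real Donut decoder inputs:

import numpy as np
from speedster import load_model

optimized_model = load_model("./models/speedster")

# NumPy inputs mirroring the optimization samples:
# decoder_input_ids as int64, encoder_hidden_states as float32
decoder_input_ids = np.zeros((5, 3), dtype=np.int64)
encoder_hidden_states = np.random.randn(5, 3, 1024).astype(np.float32)

# Is this the expected calling convention for the optimized decoder?
output = optimized_model(decoder_input_ids, encoder_hidden_states)
print(output)

If that is not the right calling convention, an example of how the optimized decoder should be fed (and how to plug it back into generate()) would be much appreciated.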

Please let me know if you need any additional information. Thanks.
