Commit 81b1179

update to tensorrt 10.3
1 parent f2b9015 commit 81b1179

7 files changed: +281 -240 lines changed

README.md (+2 -2)

@@ -6,8 +6,8 @@ My implementation of [BiSeNetV1](https://arxiv.org/abs/1808.00897) and [BiSeNetV
 mIOUs and fps on cityscapes val set:
 | none | ss | ssc | msf | mscf | fps(fp32/fp16/int8) | link |
 |------|:--:|:---:|:---:|:----:|:---:|:----:|
-| bisenetv1 | 75.44 | 76.94 | 77.45 | 78.86 | 25/78/141 | [download](https://github.com/CoinCheung/BiSeNet/releases/download/0.0.0/model_final_v1_city_new.pth) |
-| bisenetv2 | 74.95 | 75.58 | 76.53 | 77.08 | 26/67/95 | [download](https://github.com/CoinCheung/BiSeNet/releases/download/0.0.0/model_final_v2_city.pth) |
+| bisenetv1 | 75.44 | 76.94 | 77.45 | 78.86 | 112/239/435 | [download](https://github.com/CoinCheung/BiSeNet/releases/download/0.0.0/model_final_v1_city_new.pth) |
+| bisenetv2 | 74.95 | 75.58 | 76.53 | 77.08 | 103/161/198 | [download](https://github.com/CoinCheung/BiSeNet/releases/download/0.0.0/model_final_v2_city.pth) |
 
 mIOUs on cocostuff val2017 set:
 | none | ss | ssc | msf | mscf | link |

tensorrt/CMakeLists.txt (+3 -3)

@@ -2,8 +2,8 @@ CMAKE_MINIMUM_REQUIRED(VERSION 3.17)
 
 PROJECT(segment)
 
-set(CMAKE_CXX_FLAGS "-std=c++14 -O2")
-set(CMAKE_NVCC_FLAGS "-std=c++14 -O2")
+set(CMAKE_CXX_FLAGS "-std=c++17 -O2")
+set(CMAKE_NVCC_FLAGS "-std=c++20 -O2")
 
 
 link_directories(/usr/local/cuda/lib64)
@@ -21,7 +21,7 @@ add_executable(segment segment.cpp trt_dep.cpp read_img.cpp)
 target_include_directories(
     segment PUBLIC ${CUDA_INCLUDE_DIRS} ${CUDNN_INCLUDE_DIRS} ${OpenCV_INCLUDE_DIRS})
 target_link_libraries(
-    segment -lnvinfer -lnvinfer_plugin -lnvparsers -lnvonnxparser -lkernels
+    segment -lnvinfer -lnvinfer_plugin -lnvonnxparser -lkernels
     ${CUDA_LIBRARIES}
     ${OpenCV_LIBRARIES})

tensorrt/README.md (+15 -20)
@@ -17,12 +17,12 @@ Then we can use either c++ or python to compile the model and run inference.
 
 #### 1. My platform
 
-* ubuntu 18.04
-* nvidia Tesla T4 gpu, driver newer than 450.80
-* cuda 11.3, cudnn 8
-* cmake 3.22.0
+* ubuntu 22.04
+* nvidia A40 gpu, driver newer than 555.42.06
+* cuda 12.1, cudnn 8
+* cmake 3.22.1
 * opencv built from source
-* tensorrt 8.2.5.1
+* tensorrt 10.3.0.26
 
 
@@ -39,14 +39,14 @@ This would generate a `./segment` in the `tensorrt/build` directory.
 
 #### 3. Convert onnx to tensorrt model
 If you can successfully compile the source code, you can parse the onnx model to tensorrt model with one of the following commands.
-For fp32, command is:
-```
-$ ./segment compile /path/to/onnx.model /path/to/saved_model.trt
-```
-If your gpu support acceleration with fp16 inferenece, you can add a `--fp16` option to in this step:
+For fp32/fp16/bf16, the command is:
 ```
+$ ./segment compile /path/to/onnx.model /path/to/saved_model.trt --fp32
 $ ./segment compile /path/to/onnx.model /path/to/saved_model.trt --fp16
+$ ./segment compile /path/to/onnx.model /path/to/saved_model.trt --bf16
 ```
+Make sure that your gpu supports fp16/bf16 acceleration when you set these options.<br>
+
 Building an int8 engine is also supported. Firstly, you should make sure your gpu supports int8 inference, or your model will not be faster than fp16/fp32. Then you should prepare a certain amount of images for int8 calibration. In this example, I use the train set of cityscapes for calibration. The command is like this:
 ```
 $ rm calibrate_int8 # delete this if exists
@@ -72,26 +72,21 @@ $ ./segment test /path/to/saved_model.trt
 
 
 #### 6. Tips:
-1. ~Since tensorrt 7.0.0 cannot parse well the `bilinear interpolation` op exported from pytorch, I replace them with pytorch `nn.PixelShuffle`, which would bring some performance overhead(more flops and parameters), and make inference a bit slower. Also due to the `nn.PixelShuffle` op, you **must** export the onnx model with input size to be *n* times of 32.~
-If you are using 7.2.3.4 or newer versions, you should not have problem with `interpolate` anymore.
 
-2. ~There would be some problem for tensorrt 7.0.0 to parse the `nn.AvgPool2d` op from pytorch with onnx opset11. So I use opset10 to export the model.~
-Likewise, you do not need to worry about this anymore with version newer than 7.2.3.4.
+The speed(fps) is tested on a single nvidia A40 gpu with `batchsize=1` and `cropsize=(1024,2048)`, so the numbers might differ on your platform and with your settings; evaluate the speed with your own platform and cropsize in mind. Also note that performance suffers if your gpu is concurrently working on other tasks, so make sure no other program is running on your gpu when you test the speed.
 
-3. The speed(fps) is tested on a single nvidia Tesla T4 gpu with `batchsize=1` and `cropsize=(1024,2048)`. Please note that T4 gpu is almost 2 times slower than 2080ti, you should evaluate the speed considering your own platform and cropsize. Also note that the performance would be affected if your gpu is concurrently working on other tasks. Please make sure no other program is running on your gpu when you test the speed.
 
-4. On my platform, after compiling with tensorrt, the model size of bisenetv1 is 29Mb(fp16) and 128Mb(fp32), and the size of bisenetv2 is 16Mb(fp16) and 42Mb(fp32). However, the fps of bisenetv1 is 68(fp16) and 23(fp32), while the fps of bisenetv2 is 59(fp16) and 21(fp32). It is obvious that bisenetv2 has fewer parameters than bisenetv1, but the speed is otherwise. I am not sure whether it is because tensorrt has worse optimization strategy in some ops used in bisenetv2(such as depthwise convolution) or because of the limitation of the gpu on different ops. Please tell me if you have better idea on this.
 
-5. int8 mode is not always greatly faster than fp16 mode. For example, I tested with bisenetv1-cityscapes and tensorrt 8.2.5.1. With v100 gpu and driver 515.65, the fp16/int8 fps is 185.89/186.85, while with t4 gpu and driver 450.80, it is 78.77/142.31.
+### Using python (this is not updated to tensorrt 10.3)
 
+You can also use a python script to compile and run inference of your model.<br>
 
-### Using python
-
-You can also use python script to compile and run inference of your model.
+The following is still the usage for tensorrt 8.2.<br>
 
 
 #### 1. Compile model to onnx
 
+
 With this command:
 ```
 $ cd BiSeNet/tensorrt
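
The new `--fp32/--fp16/--bf16` options presumably map onto TensorRT builder precision flags somewhere inside `trt_dep.cpp` (not shown in this commit). A minimal sketch of that mapping, with a hypothetical helper name:

```
#include <string>
#include <NvInfer.h>

// Hypothetical helper: translate the CLI precision option into a TensorRT
// builder flag. fp32 needs no flag because it is the builder's default.
void set_precision_flag(nvinfer1::IBuilderConfig* config, const std::string& quant) {
    if (quant == "fp16")      config->setFlag(nvinfer1::BuilderFlag::kFP16);
    else if (quant == "bf16") config->setFlag(nvinfer1::BuilderFlag::kBF16);
    else if (quant == "int8") config->setFlag(nvinfer1::BuilderFlag::kINT8);
}
```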

tensorrt/batch_stream.hpp (-1)
@@ -52,7 +52,6 @@ class BatchStream : public IBatchStream
 
     void reset(int firstBatch) override
     {
-        cout << "mBatchCount: " << mBatchCount << endl;
         mBatchCount = firstBatch;
     }
 
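The deleted `cout` was a leftover debug print: `reset(firstBatch)` only rewinds the batch counter so the int8 calibrator can re-stream the data. A sketch of the iteration contract the calibrator relies on, assuming the sample-style `IBatchStream` interface this header follows:

```
// Minimal stand-in for the IBatchStream interface (assumed, following the
// TensorRT sample BatchStream); the real one lives in batch_stream.hpp.
struct IBatchStreamLike {
    virtual void reset(int firstBatch) = 0;  // rewind to a batch index
    virtual bool next() = 0;                 // advance; false when exhausted
    virtual float* getBatch() = 0;           // data of the current batch
    virtual ~IBatchStreamLike() = default;
};

// One calibration pass: rewind, then visit every batch once.
inline void calibration_pass(IBatchStreamLike& stream) {
    stream.reset(0);
    while (stream.next()) {
        float* batch = stream.getBatch();
        (void)batch;  // ... hand the batch to the int8 calibrator ...
    }
}
```
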
tensorrt/segment.cpp (+54 -45)
@@ -13,6 +13,7 @@
 #include <array>
 #include <sstream>
 #include <random>
+#include <unordered_map>
 
 #include "trt_dep.hpp"
 #include "read_img.hpp"
@@ -27,8 +28,7 @@ using nvinfer1::IBuilderConfig;
 using nvinfer1::IRuntime;
 using nvinfer1::IExecutionContext;
 using nvinfer1::ILogger;
-using nvinfer1::Dims3;
-using nvinfer1::Dims2;
+using nvinfer1::Dims;
 using Severity = nvinfer1::ILogger::Severity;
 
 using std::string;
@@ -39,6 +39,7 @@ using std::vector;
 using std::cout;
 using std::endl;
 using std::array;
+using std::stringstream;
 
 using cv::Mat;
 
@@ -53,81 +54,84 @@ void test_speed(vector<string> args);
 
 
 int main(int argc, char* argv[]) {
-    if (argc < 3) {
-        cout << "usage is ./segment compile/run/test\n";
-        std::abort();
-    }
+    CHECK (argc >= 3, "usage is ./segment compile/run/test");
 
     vector<string> args;
     for (int i{1}; i < argc; ++i) args.emplace_back(argv[i]);
 
     if (args[0] == "compile") {
-        if (argc < 4) {
-            cout << "usage is: ./segment compile input.onnx output.trt [--fp16|--fp32]\n";
-            cout << "or ./segment compile input.onnx output.trt --int8 /path/to/data_root /path/to/ann_file\n";
-            std::abort();
-        }
+        stringstream ss;
+        ss << "usage is: ./segment compile input.onnx output.trt [--fp16|--fp32|--bf16|--fp8]\n"
+           << "or ./segment compile input.onnx output.trt --int8 /path/to/data_root /path/to/ann_file\n";
+        CHECK (argc >= 5, ss.str());
         compile_onnx(args);
     } else if (args[0] == "run") {
-        if (argc < 5) {
-            cout << "usage is ./segment run ./xxx.trt input.jpg result.jpg\n";
-            std::abort();
-        }
+        CHECK (argc >= 5, "usage is ./segment run ./xxx.trt input.jpg result.jpg");
         run_with_trt(args);
     } else if (args[0] == "test") {
-        if (argc < 3) {
-            cout << "usage is ./segment test ./xxx.trt\n";
-            std::abort();
-        }
+        CHECK (argc >= 3, "usage is ./segment test ./xxx.trt");
         test_speed(args);
+    } else {
+        CHECK (false, "usage is ./segment compile/run/test");
     }
 
     return 0;
 }
 
 
 void compile_onnx(vector<string> args) {
+
     string quant("fp32");
     string data_root("none");
     string data_file("none");
-    if ((args.size() >= 4)) {
-        if (args[3] == "--fp32") {
-            quant = "fp32";
-        } else if (args[3] == "--fp16") {
-            quant = "fp16";
-        } else if (args[3] == "--int8") {
-            quant = "int8";
-            data_root = args[4];
-            data_file = args[5];
-        } else {
-            cout << "invalid args of quantization: " << args[3] << endl;
-            std::abort();
-        }
-    }
+    int opt_bsize = 1;
+
+    std::unordered_map<string, string> quant_map{
+        {"--fp32", "fp32"},
+        {"--fp16", "fp16"},
+        {"--bf16", "bf16"},
+        {"--fp8", "fp8"},
+        {"--int8", "int8"},
+    };
+    CHECK (quant_map.find(args[3]) != quant_map.end(),
+            "invalid args of quantization: " + args[3]);
+    quant = quant_map[args[3]];
+    if (quant == "int8") {
+        data_root = args[4];
+        data_file = args[5];
+    }
+
+    if (args[3] == "--int8") {
+        if (args.size() > 6) opt_bsize = std::stoi(args[6]);
+    } else {
+        if (args.size() > 4) opt_bsize = std::stoi(args[4]);
+    }
 
-    TrtSharedEnginePtr engine = parse_to_engine(args[1], quant, data_root, data_file);
-    serialize(engine, args[2]);
+    SemanticSegmentTrt ss_trt;
+    ss_trt.set_opt_batch_size(opt_bsize);
+    ss_trt.parse_to_engine(args[1], quant, data_root, data_file);
+    ss_trt.serialize(args[2]);
 }
 
 
 void run_with_trt(vector<string> args) {
 
-    TrtSharedEnginePtr engine = deserialize(args[1]);
+    SemanticSegmentTrt ss_trt;
+    ss_trt.deserialize(args[1]);
 
-    Dims3 i_dims = static_cast<Dims3&&>(
-        engine->getBindingDimensions(engine->getBindingIndex("input_image")));
-    Dims3 o_dims = static_cast<Dims3&&>(
-        engine->getBindingDimensions(engine->getBindingIndex("preds")));
-    const int iH{i_dims.d[2]}, iW{i_dims.d[3]};
-    const int oH{o_dims.d[2]}, oW{o_dims.d[3]};
+    vector<int> i_dims = ss_trt.get_input_shape();
+    vector<int> o_dims = ss_trt.get_output_shape();
+
+    const int iH{i_dims[2]}, iW{i_dims[3]};
+    const int oH{o_dims[2]}, oW{o_dims[3]};
 
     // prepare image and resize
     vector<float> data; data.resize(iH * iW * 3);
     int orgH, orgW;
     read_data(args[2], &data[0], iH, iW, orgH, orgW);
 
     // call engine
-    vector<int> res = infer_with_engine(engine, data);
+    vector<int> res = ss_trt.inference(data);
 
     // generate colored out
     vector<vector<uint8_t>> color_map = get_color_map();
@@ -166,6 +170,11 @@ vector<vector<uint8_t>> get_color_map() {
 
 
 void test_speed(vector<string> args) {
-    TrtSharedEnginePtr engine = deserialize(args[1]);
-    test_fps_with_engine(engine);
+    int opt_bsize = 1;
+    if (args.size() > 2) opt_bsize = std::stoi(args[2]);
+
+    SemanticSegmentTrt ss_trt;
+    ss_trt.set_opt_batch_size(opt_bsize);
+    ss_trt.deserialize(args[1]);
+    ss_trt.test_speed_fps();
 }
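
The refactor replaces the repeated `if (...) { cout ...; std::abort(); }` blocks with a `CHECK(cond, msg)` helper. Its definition is not part of this diff (it presumably lives in `trt_dep.hpp`); a minimal sketch consistent with the call sites above:

```
#include <cstdlib>
#include <iostream>

// Assumed CHECK helper: print the message and abort when the condition fails.
// Works with both string literals and std::string (e.g. ss.str()).
#define CHECK(cond, msg)                        \
    do {                                        \
        if (!(cond)) {                          \
            std::cout << (msg) << std::endl;    \
            std::abort();                       \
        }                                       \
    } while (0)
```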

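Likewise, the free functions `parse_to_engine`, `serialize`, `deserialize`, `infer_with_engine` and `test_fps_with_engine` are folded into a `SemanticSegmentTrt` class declared in `trt_dep.hpp`. The diff only shows the call sites, so the interface below is reconstructed from them; member signatures are assumptions:

```
#include <string>
#include <vector>

// Interface implied by the call sites in segment.cpp (sketch; the real
// declaration lives in trt_dep.hpp).
class SemanticSegmentTrt {
public:
    // kOPT batch size used when building the optimization profile.
    void set_opt_batch_size(int bsize);

    // Build an engine from onnx; data_root/ann_file are only used for --int8.
    void parse_to_engine(const std::string& onnx_path, const std::string& quant,
                         const std::string& data_root, const std::string& ann_file);

    void serialize(const std::string& save_path);   // save engine to disk
    void deserialize(const std::string& trt_path);  // load engine from disk

    std::vector<int> get_input_shape();   // NCHW of "input_image"
    std::vector<int> get_output_shape();  // NCHW of "preds"

    std::vector<int> inference(std::vector<float>& data);  // per-pixel labels
    void test_speed_fps();  // benchmark forward passes
};
```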