Optimize inference performance of ERNIE INT8 on CPU

Currently, Paddle ERNIE fp32 inference performance on CPU is as follows:
single thread: 251.464 ms
20 threads: 29.8818 ms
Our goal is to show that, with real INT8 kernels, ERNIE achieves a measurable performance gain over this fp32 baseline.
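
To make "performance gain" concrete, here is a minimal sketch of how the comparison could be computed once INT8 latencies are available. The helper name `speedup` and the dictionary `FP32_LATENCY_MS` are hypothetical, introduced only for illustration, and both fp32 figures are assumed to be in milliseconds. No INT8 numbers are included because none have been reported yet.

```python
def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    if baseline_ms <= 0 or candidate_ms <= 0:
        raise ValueError("latencies must be positive")
    return baseline_ms / candidate_ms


# fp32 latencies reported above, keyed by thread count (assumed to be ms).
FP32_LATENCY_MS = {1: 251.464, 20: 29.8818}

# How well fp32 scales from 1 to 20 threads; this is not an INT8 result.
thread_scaling = speedup(FP32_LATENCY_MS[1], FP32_LATENCY_MS[20])
print(f"fp32 scaling from 1 to 20 threads: {thread_scaling:.2f}x")
```

When an INT8 latency is measured at the same thread count, calling `speedup(FP32_LATENCY_MS[n], int8_latency_ms)` would give the gain over fp32; a value above 1.0 means INT8 is faster.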
@Sand3r- @wojtuss Please update your benchmark progress here.
@wzzju @luotao1 Please track the status here.