How much time does one forward and backward pass take with a mini-batch of 64?

My experiment environment is CUDA 8.0.61, cuDNN 6, and a TITAN X (Pascal).
In my implementation, DenseNet-BC (L=100, k=12) takes 0.216 s per mini-batch on CIFAR-10 with batch_size 64, whereas the original implementation takes 0.153 s per mini-batch in training.
I wonder whether this gap is due to my implementation or to TensorFlow. Could you tell me your time cost?
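
So that the numbers are comparable, a per-mini-batch time could be measured along these lines. This is only a minimal sketch: `time_per_batch` and `step_fn` are hypothetical names, and it assumes `step_fn` runs exactly one forward and backward pass on a batch of 64 and returns only after the step has finished (for example because it fetches the loss value). The first few calls are discarded as warm-up, since they often include one-time setup cost.

```python
import time


def time_per_batch(step_fn, num_warmup=5, num_timed=20):
    """Return the mean wall-clock seconds per call of step_fn.

    step_fn is a hypothetical zero-argument callable that performs one
    training step (forward + backward) and blocks until it is done.
    """
    # Warm-up calls are not timed (graph setup, autotuning, caching).
    for _ in range(num_warmup):
        step_fn()

    start = time.perf_counter()
    for _ in range(num_timed):
        step_fn()
    elapsed = time.perf_counter() - start

    return elapsed / num_timed


# Hypothetical usage with a TensorFlow 1.x session (names are assumptions):
#   seconds = time_per_batch(lambda: sess.run([train_op, loss], feed_dict=batch))
```

If data loading happens inside `step_fn`, it is included in the measured time, so it is worth stating whether a reported figure covers the input pipeline or only the GPU step.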