My experimental environment is CUDA 8.0.61, cuDNN 6, and a TITAN X (Pascal).
In my implementation, training DenseNet-BC (L=100, k=12) on CIFAR-10 with batch_size 64 takes 0.216 s per mini-batch, whereas the original implementation takes 0.153 s.
I wonder whether this gap is due to my implementation or to TensorFlow, so could you tell me your time per mini-batch?
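
To make the numbers comparable, here is a minimal sketch of how the time per mini-batch could be measured. `train_step` is a hypothetical stand-in for one training iteration (for example, a function wrapping a `sess.run` call on the train op), and `dummy_step`, the warm-up count, and the number of timed batches are assumptions, not part of my actual code. It also assumes `train_step` blocks until the step has finished.

```python
import time


def time_per_batch(train_step, num_batches=100, warmup=10):
    """Return the mean wall-clock seconds per call to train_step."""
    # Warm-up iterations are excluded so one-time setup costs
    # (graph building, cuDNN autotuning, etc.) do not skew the average.
    for _ in range(warmup):
        train_step()

    start = time.perf_counter()
    for _ in range(num_batches):
        train_step()
    elapsed = time.perf_counter() - start
    return elapsed / num_batches


def dummy_step():
    # Hypothetical placeholder workload; replace with a real training step.
    sum(i * i for i in range(10000))


if __name__ == "__main__":
    print("%.4f s/mini-batch" % time_per_batch(dummy_step))
```

If you measure your own run the same way (same batch size, warm-up excluded), the two figures should be directly comparable.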