The FD gradient computation can easily be changed to get a 2x speedup. Parallelization might give an additional speedup as well.
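
The comment does not say where the 2x comes from, so the following is only a sketch of one plausible reading. It assumes "FD" means finite differences, that the current code uses central differences (2n function evaluations for n parameters), and that forward-difference accuracy is acceptable. Under those assumptions, evaluating f(x) once and reusing it needs only n + 1 evaluations, which is roughly a 2x saving for large n. The names `fd_gradient_central`, `fd_gradient_forward`, and the `executor` parameter are hypothetical and do not come from the code under discussion.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def fd_gradient_central(f, x, h=1e-6):
    """Central differences: 2 * n evaluations of f (assumed baseline)."""
    x = np.asarray(x, dtype=float)
    grad = np.empty_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2.0 * h)
    return grad


def fd_gradient_forward(f, x, h=1e-6, executor=None):
    """Forward differences: n + 1 evaluations of f, with f(x) computed once."""
    x = np.asarray(x, dtype=float)
    f0 = f(x)  # shared base value, evaluated a single time

    def partial(i):
        step = np.zeros_like(x)
        step[i] = h
        return (f(x + step) - f0) / h

    # The n perturbed evaluations are independent of each other, so they can
    # be handed to an executor (hypothetical hook) instead of the built-in map.
    mapper = executor.map if executor is not None else map
    return np.array(list(mapper(partial, range(x.size))))


def quadratic(x):
    """Toy objective with known gradient 2 * x, used only for illustration."""
    return float(np.sum(x ** 2))


x0 = np.array([1.0, -2.0, 3.0])
with ThreadPoolExecutor(max_workers=4) as pool:
    grad = fd_gradient_forward(quadratic, x0, executor=pool)
```

Forward differences trade accuracy for speed: the truncation error is O(h) rather than O(h^2), so the saving only applies if that is tolerable. A thread pool helps mainly when `f` releases the GIL or waits on an external solver; for pure-Python objectives a process pool would be the more likely fit, provided the mapped function is picklable.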