Description
Torch autograd's jacobian (torch.autograd.functional.jacobian), used by LearnableCostFunction, treats the whole batch as a single input and therefore computes cross-batch gradients, which is undesirable: the result has shape (batch, out_dim, batch, in_dim), and the off-diagonal cross-batch blocks are all zeros we still pay to compute. I haven't seen an out-of-the-box solution, so we might need to combine manual backward() calls with proper use of vmap() to compute the jacobian on our own.
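
A minimal sketch of one possible direction (along the vmap lines suggested above), using torch.func from PyTorch 2.x: wrapping a per-sample function in jacrev and vmapping over the batch yields per-sample jacobians with no cross-batch terms. Here cost_fn is a hypothetical stand-in for whatever LearnableCostFunction actually differentiates, not the real implementation.

```python
import torch
from torch.func import jacrev, vmap

in_dim, out_dim, batch = 3, 2, 4
weight = torch.randn(in_dim, out_dim)

# Hypothetical stand-in for the cost function: maps one unbatched
# input of shape (in_dim,) to an output of shape (out_dim,).
def cost_fn(x):
    return torch.tanh(x @ weight)

x = torch.randn(batch, in_dim)

# torch.autograd.functional.jacobian treats the batch as one input,
# producing a (batch, out_dim, batch, in_dim) tensor whose off-diagonal
# (cross-batch) blocks are all zeros -- the behavior described above.
full_jac = torch.autograd.functional.jacobian(cost_fn, x)
print(full_jac.shape)  # torch.Size([4, 2, 4, 3])

# vmap(jacrev(...)) differentiates each sample independently, giving a
# (batch, out_dim, in_dim) per-sample jacobian with no cross terms.
per_sample_jac = vmap(jacrev(cost_fn))(x)
print(per_sample_jac.shape)  # torch.Size([4, 2, 3])

# Sanity check: per-sample jacobians match the diagonal blocks of the
# full jacobian. diagonal() moves the batch dim last, so permute it back.
diag_blocks = full_jac.diagonal(dim1=0, dim2=2).permute(2, 0, 1)
assert torch.allclose(per_sample_jac, diag_blocks)
```

If torch.func doesn't fit (e.g. the cost function has in-place ops vmap can't handle), the fallback would be the manual route: loop over output components, call backward() with one-hot grad_outputs, and stack the resulting gradients.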