Description
I think the dash onsite demonstrated that the GradSim method is slow for large models. This is because, currently, PyTorch and TensorFlow don't let you compute gradients per instance within a batch, which gradient similarity requires. We can precompute and store the gradients ahead of time, but this becomes infeasible for large models. Note that partial solutions include: a) using a subset of the model weights, such as the final layer, to decrease the memory overhead, or b) reducing the dataset you're comparing against using something like ProtoSelect. Both of these are user-level interventions, though. I think our focus should be figuring out how to batch the gradient computations; one candidate direction is sketched below.
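For reference, here is a minimal sketch of both the current bottleneck (a Python loop of one backward pass per instance) and one candidate for batching it. The model, shapes, and loss below are toy placeholders, and the batched version assumes `torch.func` (`vmap` + `grad`, available from PyTorch 2.0), which may or may not be acceptable as a dependency for us; treat this as something to evaluate, not a confirmed fix:

```python
import torch
from torch.func import functional_call, grad, vmap

# Toy stand-ins for the real model/data (illustrative only).
model = torch.nn.Sequential(
    torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2)
)
X = torch.randn(64, 10)          # batch of training instances
Y = torch.randint(0, 2, (64,))   # labels


def loss_fn(logits, y):
    return torch.nn.functional.cross_entropy(logits, y)


# Today's bottleneck: one full backward pass per training instance.
def per_instance_grads_loop():
    grads = []
    for x, y in zip(X, Y):
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads.append(torch.cat([p.grad.flatten() for p in model.parameters()]))
    return torch.stack(grads)    # (batch, n_params)


# Candidate batched version: vmap a functional per-sample gradient.
# Partial mitigation (a) above would just shrink this dict to the final
# layer, e.g. {k: v for k, v in params.items() if k.startswith("2.")}.
params = {k: v.detach() for k, v in model.named_parameters()}


def sample_loss(params, x, y):
    logits = functional_call(model, params, (x.unsqueeze(0),))
    return loss_fn(logits, y.unsqueeze(0))


# One vectorised call instead of a Python loop of backward passes.
per_sample_grads = vmap(grad(sample_loss), in_dims=(None, 0, 0))(params, X, Y)

# Flatten to (batch, n_params) and score against a test-point gradient,
# which is the cosine similarity that GradSim needs.
flat = torch.cat([g.flatten(1) for g in per_sample_grads.values()], dim=1)
x_test, y_test = torch.randn(10), torch.tensor(1)
test_grads = grad(sample_loss)(params, x_test, y_test)
test_flat = torch.cat([g.flatten() for g in test_grads.values()])
sims = torch.nn.functional.cosine_similarity(flat, test_flat.unsqueeze(0), dim=1)
```

If the vmapped version holds up on larger models, the per-instance gradients could be computed batch-by-batch at query time instead of being precomputed and stored for the whole training set, which is exactly the memory problem described above.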