Measure variation in seeded evals

Currently we use `np.random.RandomState(seed)` to control which docs are sampled and `torch.manual_seed(seed)` to make inference deterministic.

To check that the choice of seed doesn't materially change eval results:
- [ ] Re-run evals with a different seed for np.random (doc selection)
- [ ] Re-run evals with a different torch seed (next-token selection)
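One way to structure the two re-runs is to vary one seed at a time while holding the other fixed, then compare the spread of scores. The sketch below is only an illustration: `seed_sweep`, `summarize`, and `fake_run_eval` are hypothetical names, and `fake_run_eval` is a stand-in that would need to be replaced by the real eval entry point (the one that builds `np.random.RandomState(doc_seed)` and calls `torch.manual_seed(torch_seed)`).

```python
import random
import statistics
from typing import Callable, Dict, Iterable, List


def summarize(scores: List[float]) -> Dict[str, float]:
    """Mean, sample standard deviation, and range of a list of scores."""
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # needs at least two scores
        "min": min(scores),
        "max": max(scores),
    }


def seed_sweep(
    run_eval: Callable[..., float],
    doc_seeds: Iterable[int],
    torch_seeds: Iterable[int],
    base_seed: int = 0,
) -> Dict[str, Dict[str, float]]:
    """Vary one seed at a time, holding the other fixed at base_seed."""
    doc_scores = [run_eval(doc_seed=s, torch_seed=base_seed) for s in doc_seeds]
    torch_scores = [run_eval(doc_seed=base_seed, torch_seed=s) for s in torch_seeds]
    return {
        "doc_seed": summarize(doc_scores),
        "torch_seed": summarize(torch_scores),
    }


def fake_run_eval(doc_seed: int, torch_seed: int) -> float:
    """Hypothetical stand-in for the real eval; returns a made-up score.

    In the real eval, doc_seed would go to np.random.RandomState and
    torch_seed to torch.manual_seed.
    """
    doc_rng = random.Random(f"doc-{doc_seed}")
    tok_rng = random.Random(f"torch-{torch_seed}")
    return 0.7 + doc_rng.uniform(-0.05, 0.05) + tok_rng.uniform(-0.01, 0.01)


report = seed_sweep(fake_run_eval, doc_seeds=range(5), torch_seeds=range(5))
for source, stats in report.items():
    print(source, stats)
```

Comparing the two `stdev` values would indicate whether doc selection or next-token selection contributes more to the variation; the number of seeds (five here) is an arbitrary choice for the sketch.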