Nano MDM trainer with DDP generation (perplexity / diversity tradeoff with reference model?)
Sweep learning rate
Sweep width and see muP happening
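
A minimal sketch of how the learning-rate and width sweeps could be laid out as one grid. It is not tied to any existing trainer code. The names `BASE_WIDTH`, `WIDTHS`, `BASE_LRS`, `hidden_lr`, and `sweep_configs` are hypothetical, and the values are placeholders. It assumes Adam and the common muP recipe in which the learning rate of hidden (matrix-like) weights scales as base_width / width; embedding and readout handling differs between muP formulations and is left out.

```python
from itertools import product

# All values below are hypothetical placeholders, not tuned settings.
BASE_WIDTH = 256                               # width at which base_lr is tuned
WIDTHS = [128, 256, 512, 1024]                 # width sweep
BASE_LRS = [2.0 ** -k for k in range(6, 12)]   # log2-spaced learning-rate sweep


def hidden_lr(base_lr, width, base_width=BASE_WIDTH):
    """Adam LR for hidden (matrix-like) weights, scaled as base_width / width.

    Assumption: the usual muP rule for Adam; other parameter groups
    (embeddings, readout) are not covered here.
    """
    return base_lr * base_width / width


def sweep_configs(widths=WIDTHS, base_lrs=BASE_LRS):
    """Yield one config dict per (width, base_lr) pair in the grid."""
    for width, base_lr in product(widths, base_lrs):
        yield {
            "width": width,
            "base_lr": base_lr,
            "hidden_lr": hidden_lr(base_lr, width),
        }


configs = list(sweep_configs())
```

If muP is behaving as expected, the best `base_lr` should stay roughly the same across widths, so plotting final loss against `base_lr` with one curve per width is a quick check.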