#### Problem

* We need to establish canonical Datakit parameters (e.g. deduplication strategy)
* We should provide a reusable, reasonably efficient, and continuously updated testbed for data experimentation (e.g. mixing)

#### Solution

* Use [RFC-0: Datakit](https://docs.google.com/document/d/1kDSzONg32zv2VnCO4FJiMP0fcjRSjgP0uTDpI4_C4O0) pipelines to define a roughly 1T-token Datakit Testbed dataset, sampled proportionally by provenance from up-to-date raw inputs (see the budgeting sketch after this list)
  * 1T because we want at least 500B tokens left after deduplication
  * Recomputed whenever the canonical raw sources are updated
* An MoE experiment harness with simulated epoching and proportional mixing, reusing the Grug MoE recipe
* Baseline (deliberately trivial):
  * No deduplication
  * Single constant quality score
  * Topics assigned by provenance

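As an illustration of the proportional-by-provenance budgeting above, here is a minimal sketch in Python. The source names and token counts are hypothetical, and this is not Datakit's pipeline API; it only shows how the roughly 1T-token budget could be split across raw sources.

```python
# Illustrative sketch (not Datakit's actual API): deriving a proportional-by-provenance
# token budget for the ~1T-token Testbed from raw source sizes.
# All source names and numbers below are hypothetical.

TESTBED_BUDGET = 1_000_000_000_000  # ~1T tokens, so >=500B should survive deduplication

raw_token_counts = {                 # hypothetical raw-source sizes, in tokens
    "web_crawl": 9_000_000_000_000,
    "code":      1_500_000_000_000,
    "papers":      400_000_000_000,
    "books":       100_000_000_000,
}

total_raw = sum(raw_token_counts.values())

# Proportional (by provenance) sampling: each source contributes in proportion
# to its share of the raw pool, scaled down to the 1T Testbed budget.
per_source_budget = {
    source: round(TESTBED_BUDGET * count / total_raw)
    for source, count in raw_token_counts.items()
}

for source, budget in per_source_budget.items():
    share = budget / TESTBED_BUDGET
    print(f"{source:>10}: {budget / 1e9:8.1f}B tokens ({share:.1%})")

print(f"total sampled: {sum(per_source_budget.values()) / 1e9:.1f}B tokens")
```
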
The Testbed dataset will enable many experiments. The immediate questions are:

* What is the canonical deduplication strategy? (see the sketch after this list for one candidate)
* What is the canonical topic clustering?
* How many quality scores should we use, and which ones?

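For the deduplication question, one common candidate is MinHash-based near-duplicate detection. The sketch below is illustrative only: it is not the existing Dedup component, and the shingle size and signature length are arbitrary choices of exactly the kind the Testbed experiments would be comparing.

```python
# Illustrative only: a tiny MinHash near-duplicate check, NOT the existing Dedup
# component. Shingle size and signature length are arbitrary for the example.
import hashlib

NUM_HASHES = 128   # signature length
SHINGLE_N = 5      # word shingle size


def shingles(text: str, n: int = SHINGLE_N) -> set[str]:
    """Overlapping word n-grams used as the document's feature set."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def minhash_signature(features: set[str], num_hashes: int = NUM_HASHES) -> list[int]:
    """For each of num_hashes salted hash functions, keep the minimum hash value."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.blake2b(f"{seed}:{f}".encode(), digest_size=8).digest(), "big")
            for f in features
        ))
    return sig


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
    doc_b = "the quick brown fox jumps over the lazy dog near the river"
    sim = estimated_jaccard(minhash_signature(shingles(doc_a)),
                            minhash_signature(shingles(doc_b)))
    print(f"estimated Jaccard similarity: {sim:.2f}")  # near-duplicates score high
```
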
#### Implementation

* Reuse existing pieces:
  * Proportional mixing
  * Grug MoE recipe
  * Dedup
  * Datakit ferry
* New work:
  * Revive the existing simulated epoching and wire it through the Grug MoE recipe (see the sketch after this list)
  * A by-provenance sampler on the normalized data

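One reading of simulated epoching here is computing how often each source must repeat when its share of the training budget exceeds its unique token count. The sketch below assumes hypothetical per-source token counts and mixing weights; it is not the existing harness code.

```python
# Illustrative sketch, not the existing harness: "simulated epoching" taken here as
# computing how many times each source is repeated when its share of the
# training-token budget exceeds its unique (post-dedup) token count.

# Hypothetical numbers: unique tokens available per source and its mixing weight.
sources = {
    #  name        (unique_tokens,  mix_weight)
    "web_crawl": (450_000_000_000, 0.70),
    "code":      ( 80_000_000_000, 0.20),
    "papers":    ( 20_000_000_000, 0.07),
    "books":     (  5_000_000_000, 0.03),
}

TRAIN_BUDGET = 1_000_000_000_000  # total training tokens for the run (hypothetical)

for name, (unique_tokens, weight) in sources.items():
    assigned = TRAIN_BUDGET * weight   # tokens this source must supply under the mix
    epochs = assigned / unique_tokens  # >1.0 means the source is repeated (epoched)
    print(f"{name:>10}: {assigned / 1e9:6.1f}B tokens -> {epochs:.2f} simulated epochs")
```
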
#### Experiment protocol

* Configuration ranking: train each configuration at a single compute-optimal baseline point and rank on the Paloma macro average (see the ranking sketch below)
* Canonical confirmation: run an IsoFLOP sweep on the winning configuration and the baseline

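A minimal sketch of the configuration-ranking step, assuming the eval harness already produces per-domain Paloma perplexities. The domain names and numbers are made up, and "Paloma macro" is taken here as an unweighted (log-space) average over domains; the exact aggregation is a protocol choice.

```python
# Illustrative ranking sketch (hypothetical numbers): rank candidate configurations
# by an unweighted macro average of per-domain Paloma perplexities.
import math

# config -> {paloma_domain: perplexity}; values are made up for the example.
results = {
    "baseline_no_dedup": {"c4_en": 14.2, "mc4": 18.9, "wikitext_103": 11.5},
    "exact_dedup":       {"c4_en": 13.8, "mc4": 18.1, "wikitext_103": 11.2},
    "minhash_dedup_0.8": {"c4_en": 13.5, "mc4": 17.9, "wikitext_103": 11.3},
}


def macro_avg_ppl(per_domain: dict[str, float]) -> float:
    """Average log-perplexities (a geometric mean) so every domain counts equally,
    regardless of its token count; the aggregation choice is part of the protocol."""
    return math.exp(sum(math.log(p) for p in per_domain.values()) / len(per_domain))


ranking = sorted(results, key=lambda cfg: macro_avg_ppl(results[cfg]))
for rank, cfg in enumerate(ranking, start=1):
    print(f"{rank}. {cfg}: macro ppl = {macro_avg_ppl(results[cfg]):.2f}")
```
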
#### Open Questions

* Should we use (or start with) dense models to reduce complexity?
  * Will: MoEs will probably show more data-dependent effects at the same FLOP count, so I don't think this is necessary!
* Is a perplexity-based evaluation metric sufficient, or should we also use downstream task evals?
  * Will: Uncheatable is a good starting point; MMLU logprob is not bad either
* How should we handle distribution drift when the Testbed is updated with new raw sources?
  * Test whether the canonical configuration survives Testbed re-sampling?