
Commit 7bdd7ab

ravwojdyla and claude committed
docs: add datakit testbed RFC
Drops the RFC-1 design doc into `docs/design/` so the plan behind this PR's testbed pipeline has a home next to the other in-repo design docs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent b62c3bb commit 7bdd7ab

1 file changed

Lines changed: 46 additions & 0 deletions

File tree

docs/design/datakit_testbed.md

#### Problem

* We need to establish canonical Datakit parameters (e.g. deduplication strategy)
* We should provide a reusable, reasonably efficient, and continuously updated testbed for data experimentation (e.g. mixing)
#### Solution

* Use [RFC-0: Datakit](https://docs.google.com/document/d/1kDSzONg32zv2VnCO4FJiMP0fcjRSjgP0uTDpI4_C4O0) pipelines to define a roughly 1T-token Datakit Testbed dataset, sampled proportionally (by provenance) from up-to-date raw inputs (see the sampling sketch after this list)
  * 1T because we want at least 500B tokens of output after deduplication
  * Recomputed on update to the canonical raw sources
* MoE experiment harness with simulated epoching and proportional mixing, reusing the Grug MoE recipe
* Baseline (deliberately trivial):
  * No deduplication
  * Single constant quality score
  * Topic by provenance
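For concreteness, a minimal sketch of the by-provenance proportional sampling step, assuming per-provenance token counts are available from the normalized raw inputs; the names, the budget constant, and the dict-based plan are illustrative stand-ins, not Datakit API:

```python
# Minimal sketch, not Datakit API: derive per-provenance token targets that
# preserve the raw mixture proportions under a ~1T-token budget.
TARGET_TOKENS = 1_000_000_000_000  # ~1T in, so >=500B should survive deduplication

def sampling_plan(provenance_tokens: dict[str, int]) -> dict[str, int]:
    """Map provenance -> token target, proportional to its raw share."""
    total = sum(provenance_tokens.values())
    rate = min(1.0, TARGET_TOKENS / total)  # one global rate keeps proportions
    return {prov: int(count * rate) for prov, count in provenance_tokens.items()}
```

For example, with 8T raw tokens the global rate is 1/8, so a source holding 2T raw tokens contributes roughly 250B tokens to the Testbed; re-running the plan against updated raw counts is what "recomputed on update" amounts to.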
The Testbed dataset will enable many experiments. The immediate questions:
* What is the canonical deduplication strategy?
* What is the canonical topic clustering?
* How many and what quality scores should we use?
#### Implementation
* Reuse existing pieces:
  * Proportional mixing
  * Grug MoE recipe
  * Dedup
  * Datakit ferry
* New work:
  * Revive and wire the existing simulated epoching through the Grug MoE (a minimal sketch follows this list)
  * By-provenance sampler on the normalized data
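A minimal sketch of what simulated epoching under proportional mixing could look like, assuming in-memory document streams keyed by provenance; the function name and the `random.choices`-based mixer are illustrative stand-ins for the Grug harness, not its actual API:

```python
import itertools
import random

def mixed_stream(sources: dict[str, list[str]], weights: dict[str, float], seed: int = 0):
    """Yield (provenance, document) pairs at fixed mixture weights.

    Each source is cycled independently, so a small source re-epochs
    (repeats) while the mixture proportions stay constant -- the
    simulated-epoching behaviour the harness needs to reproduce.
    """
    rng = random.Random(seed)
    iters = {name: itertools.cycle(docs) for name, docs in sources.items()}
    names = list(weights)
    probs = [weights[name] for name in names]
    while True:
        name = rng.choices(names, weights=probs)[0]
        yield name, next(iters[name])
```

Tracking how often each cycle wraps gives the per-source epoch count, which is the quantity the epoching experiments would vary.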
#### Experiment protocol
* Configuration ranking: train each configuration at a single compute-optimal baseline point; rank on the Paloma macro average (see the sketch after this list)
* Canonical confirmation: run IsoFLOP on the winning configuration and the baseline
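Assuming "Paloma macro" means a Paloma-style macro average, i.e. perplexity averaged uniformly over evaluation domains rather than weighted by token count, the ranking metric could be as simple as this sketch (`domain_nll` is a hypothetical input shape):

```python
import math

def paloma_macro_ppl(domain_nll: dict[str, tuple[float, int]]) -> float:
    """Macro-average perplexity: every domain counts equally.

    `domain_nll` maps domain -> (summed token NLL in nats, token count).
    """
    per_domain = [math.exp(nll / tokens) for nll, tokens in domain_nll.values()]
    return sum(per_domain) / len(per_domain)
```

Configurations would then be ranked by ascending macro perplexity at the single compute-optimal point, with the IsoFLOP confirmation reserved for the winner and the baseline.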
#### Open Questions
* Should we use (or start with) dense models to reduce complexity?
  * Will: MoEs will probably show more data-dependent effects at the same FLOP count, so I don't think this is necessary!
* Is a perplexity evaluation metric sufficient, or should we also use downstream task evals?
  * Will: Uncheatable is a good starting point; MMLU logprob is not bad either
* How do we handle distribution drift when the Testbed gets updated with new raw sources?
  * Test whether the canonical configuration survives Testbed re-sampling
