Commit 2693852

[evals] Add raw capability eval datasets

1 parent: 6de9e5f

4 files changed: 922 additions & 0 deletions

docs/tutorials/run-lm-evals.md (23 additions & 0 deletions)
```diff
@@ -190,3 +190,26 @@ For deeper dives, see:
 - `docs/explanations/evaluation.md`
 - `experiments/evals/task_configs.py`
 - `experiments/evals/evals.py`
+
+## Raw Perplexity Gap Datasets
+
+The raw perplexity-gap workflow uses `default_raw_validation_sets()` from `experiments/defaults.py`. That bundle now includes:
+
+- Paloma
+- Uncheatable Eval
+- Curated capability-family slices for:
+  - `chat/wildchat`
+  - `agent_traces/openhands_swe_rebench`
+  - `reasoning_icl/gsm8k_main`
+  - `reasoning_icl/global_mgsm_en`
+
+These capability datasets are first normalized into reusable OpenAI-chat JSONL artifacts under each step's `oai/` output. Consumers that want Levanter chat tokenization can use `capability_chat_validation_components()`, which wraps those rows in `ChatLmDatasetFormat` with `MARIN_CHAT_TEMPLATE`. The raw gap finder still consumes plain `text`, so the same step also writes a derived `raw_text/` projection using Marin's chat-token surface. OpenHands traces keep the full system/user/tool conversation in the OAI artifact, while the raw-text projection scores only assistant-generated trace targets and final patches.
+
+The curated default uses modest, reproducible slices for the larger structured corpora rather than mirroring whole Hugging Face datasets into GCS. That keeps cost and executor output size bounded while still giving useful coverage for base-model PPL comparisons.
+
+If you want the gated chat sources as well, use `extended_raw_validation_sets()` instead of `default_raw_validation_sets()`. That currently adds:
+
+- `chat/lima_train`
+- `chat/lmsys_chat_1m`
+
+Those opt-in datasets stay out of the default bundle because access and licensing are more restrictive.
```
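The `oai/` to `raw_text/` projection described in the doc can be sketched roughly as follows. This is a minimal, hypothetical illustration (the helper name `chat_rows_to_raw_text` and the newline-join are assumptions, not the commit's code): the real step renders rows through Marin's chat-token surface, and for OpenHands traces it keeps only assistant-generated trace targets and final patches.

```python
import json


def chat_rows_to_raw_text(jsonl_lines, assistant_only=True):
    """Project OpenAI-chat JSONL rows into plain `text` records.

    Hypothetical sketch of the oai/ -> raw_text/ projection: keep only
    assistant turns so base-model PPL is scored on generated content,
    not on system prompts or user/tool messages.
    """
    for line in jsonl_lines:
        row = json.loads(line)
        parts = [
            message["content"]
            for message in row["messages"]
            if not assistant_only or message["role"] == "assistant"
        ]
        yield {"text": "\n".join(parts)}


rows = [
    '{"messages": [{"role": "user", "content": "Hi"},'
    ' {"role": "assistant", "content": "Hello!"}]}'
]
projected = list(chat_rows_to_raw_text(rows))
```

Under this sketch, `projected` holds one record whose `text` contains only the assistant turn, which is the shape the raw gap finder consumes.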

experiments/defaults.py (11 additions & 0 deletions)
```diff
@@ -307,9 +307,20 @@ def default_validation_sets(tokenizer: str, base_path: str = "tokenized/") -> di
 @lru_cache
 def default_raw_validation_sets() -> dict[str, Any]:
     from experiments.evals.exp1600_uncheatable_evals import uncheatable_eval_raw_validation_sets
+    from experiments.evals.raw_capability_eval_sets import capability_raw_validation_sets
 
     validation_sets = dict(paloma_raw_validation_sets())
     validation_sets.update(uncheatable_eval_raw_validation_sets())
+    validation_sets.update(capability_raw_validation_sets())
+    return validation_sets
+
+
+@lru_cache
+def extended_raw_validation_sets() -> dict[str, Any]:
+    from experiments.evals.raw_capability_eval_sets import opt_in_capability_raw_validation_sets
+
+    validation_sets = dict(default_raw_validation_sets())
+    validation_sets.update(opt_in_capability_raw_validation_sets())
     return validation_sets
 
 
```
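The layering between the two bundles follows a simple cached dict-merge pattern: the extended bundle copies the default one and overlays the opt-in sources, so it is always a superset. A self-contained sketch with stand-in helpers (the string values are placeholders; the real helpers return dataset steps from `experiments/evals/*`):

```python
from functools import lru_cache
from typing import Any


# Stand-ins for the per-source helpers; values are placeholders here.
def paloma_raw_validation_sets() -> dict[str, Any]:
    return {"paloma/c4_en": "paloma-step"}


def capability_raw_validation_sets() -> dict[str, Any]:
    return {"chat/wildchat": "wildchat-step"}


def opt_in_capability_raw_validation_sets() -> dict[str, Any]:
    return {"chat/lima_train": "lima-step"}


@lru_cache
def default_raw_validation_sets() -> dict[str, Any]:
    validation_sets = dict(paloma_raw_validation_sets())
    validation_sets.update(capability_raw_validation_sets())
    return validation_sets


@lru_cache
def extended_raw_validation_sets() -> dict[str, Any]:
    # Copy the cached default bundle before overlaying, so the
    # lru_cache'd dict returned by default_raw_validation_sets()
    # is never mutated in place.
    validation_sets = dict(default_raw_validation_sets())
    validation_sets.update(opt_in_capability_raw_validation_sets())
    return validation_sets
```

Note the `dict(...)` copy in `extended_raw_validation_sets()`: because `lru_cache` returns the same mutable dict on every call, updating it directly would silently leak the gated sources into the default bundle.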