Commit 2693852

[evals] Add raw capability eval datasets

1 parent: 6de9e5f

4 files changed: 922 additions & 0 deletions

docs/tutorials/run-lm-evals.md (23 additions & 0 deletions)
```diff
@@ -190,3 +190,26 @@ For deeper dives, see:
 - `docs/explanations/evaluation.md`
 - `experiments/evals/task_configs.py`
 - `experiments/evals/evals.py`
+
+## Raw Perplexity Gap Datasets
+
+The raw perplexity-gap workflow uses `default_raw_validation_sets()` from `experiments/defaults.py`. That bundle now includes:
+
+- Paloma
+- Uncheatable Eval
+- Curated capability-family slices for:
+  - `chat/wildchat`
+  - `agent_traces/openhands_swe_rebench`
+  - `reasoning_icl/gsm8k_main`
+  - `reasoning_icl/global_mgsm_en`
+
+These capability datasets are first normalized into reusable OpenAI-chat JSONL artifacts under each step's `oai/` output. Consumers that want Levanter chat tokenization can use `capability_chat_validation_components()`, which wraps those rows in `ChatLmDatasetFormat` with `MARIN_CHAT_TEMPLATE`. The raw gap finder still consumes plain `text`, so the same step also writes a derived `raw_text/` projection using Marin's chat-token surface. OpenHands traces keep the full system/user/tool conversation in the OAI artifact, while the raw-text projection scores only assistant-generated trace targets and final patches.
+
+The curated default uses modest, reproducible slices for the larger structured corpora rather than mirroring whole Hugging Face datasets into GCS. That keeps cost and executor output size bounded while still giving useful coverage for base-model PPL comparisons.
+
+If you want the gated chat sources as well, use `extended_raw_validation_sets()` instead of `default_raw_validation_sets()`. That currently adds:
+
+- `chat/lima_train`
+- `chat/lmsys_chat_1m`
+
+Those opt-in datasets stay out of the default bundle because access and licensing are more restrictive.
```
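The `oai/` to `raw_text/` projection described in the doc can be sketched roughly as follows. This is a minimal, hypothetical illustration (the helper name `chat_rows_to_raw_text` and the newline-join are assumptions, not the commit's code): the real step renders rows through Marin's chat-token surface, and for OpenHands traces it keeps only assistant-generated trace targets and final patches.

```python
import json


def chat_rows_to_raw_text(jsonl_lines, assistant_only=True):
    """Project OpenAI-chat JSONL rows into plain `text` records.

    Hypothetical sketch of the oai/ -> raw_text/ projection: keep only
    assistant turns so base-model PPL is scored on generated content,
    not on system prompts or user/tool messages.
    """
    for line in jsonl_lines:
        row = json.loads(line)
        parts = [
            message["content"]
            for message in row["messages"]
            if not assistant_only or message["role"] == "assistant"
        ]
        yield {"text": "\n".join(parts)}


rows = [
    '{"messages": [{"role": "user", "content": "Hi"},'
    ' {"role": "assistant", "content": "Hello!"}]}'
]
projected = list(chat_rows_to_raw_text(rows))
```

Under this sketch, `projected` holds one record whose `text` contains only the assistant turn, which is the shape the raw gap finder consumes.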

experiments/defaults.py (11 additions & 0 deletions)
```diff
@@ -307,9 +307,20 @@ def default_validation_sets(tokenizer: str, base_path: str = "tokenized/") -> di
 @lru_cache
 def default_raw_validation_sets() -> dict[str, Any]:
     from experiments.evals.exp1600_uncheatable_evals import uncheatable_eval_raw_validation_sets
+    from experiments.evals.raw_capability_eval_sets import capability_raw_validation_sets
 
     validation_sets = dict(paloma_raw_validation_sets())
     validation_sets.update(uncheatable_eval_raw_validation_sets())
+    validation_sets.update(capability_raw_validation_sets())
+    return validation_sets
+
+
+@lru_cache
+def extended_raw_validation_sets() -> dict[str, Any]:
+    from experiments.evals.raw_capability_eval_sets import opt_in_capability_raw_validation_sets
+
+    validation_sets = dict(default_raw_validation_sets())
+    validation_sets.update(opt_in_capability_raw_validation_sets())
     return validation_sets
 
 
```
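The layering between the two bundles follows a simple cached dict-merge pattern: the extended bundle copies the default one and overlays the opt-in sources, so it is always a superset. A self-contained sketch with stand-in helpers (the string values are placeholders; the real helpers return dataset steps from `experiments/evals/*`):

```python
from functools import lru_cache
from typing import Any


# Stand-ins for the per-source helpers; values are placeholders here.
def paloma_raw_validation_sets() -> dict[str, Any]:
    return {"paloma/c4_en": "paloma-step"}


def capability_raw_validation_sets() -> dict[str, Any]:
    return {"chat/wildchat": "wildchat-step"}


def opt_in_capability_raw_validation_sets() -> dict[str, Any]:
    return {"chat/lima_train": "lima-step"}


@lru_cache
def default_raw_validation_sets() -> dict[str, Any]:
    validation_sets = dict(paloma_raw_validation_sets())
    validation_sets.update(capability_raw_validation_sets())
    return validation_sets


@lru_cache
def extended_raw_validation_sets() -> dict[str, Any]:
    # Copy the cached default bundle before overlaying, so the
    # lru_cache'd dict returned by default_raw_validation_sets()
    # is never mutated in place.
    validation_sets = dict(default_raw_validation_sets())
    validation_sets.update(opt_in_capability_raw_validation_sets())
    return validation_sets
```

Note the `dict(...)` copy in `extended_raw_validation_sets()`: because `lru_cache` returns the same mutable dict on every call, updating it directly would silently leak the gated sources into the default bundle.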