Conversation
Token boundary rendering for literal examples was happening while scanning every short segment, making long-document gap reports CPU-bound. Keep compact token span metadata in the hot path and render boundaries only for literals that survive into the final report.
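A minimal sketch of the deferred-rendering idea (all names here are illustrative stand-ins, not the actual Marin implementation): keep only compact `(start, end)` token spans while scanning segments, and expand them into boundary strings only for the literals that survive into the final report.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TokenSpan:
    # Compact metadata kept in the hot path: no string rendering here.
    start: int
    end: int

def scan_segments(tokens, literals):
    """Collect spans for candidate literals without rendering boundaries."""
    spans = []
    for lit in literals:
        n = len(lit)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == lit:
                spans.append(TokenSpan(i, i + n))
    return spans

def render_boundaries(tokens, spans):
    """Render boundary markers only for spans that made it into the report."""
    return ["|".join(tokens[s.start:s.end]) for s in spans]

tokens = ["per", "plex", "ity", "gap"]
spans = scan_segments(tokens, [["per", "plex"], ["ity", "gap"]])
print(render_boundaries(tokens, spans))  # → ['per|plex', 'ity|gap']
```

The point of the split is that `scan_segments` does O(1) work per match, and the string building in `render_boundaries` runs only over the (usually much smaller) surviving set.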
Claude finished @dlwh's task in 3m 15s (Code review). Summary of changes: Issues found (1). Nothing else flagged.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 15a2585136
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```python
assert isinstance(dataset.input_path, str)
assert "/deu_Latn/test/*.parquet" in dataset.input_path
assert dataset.split == "validation"
```
Stop asserting a missing `split` field on raw eval datasets

`fineweb2_multilingual_raw_validation_sets()` returns `RawTextEvaluationDataset` objects created by `raw_text_dataset()`, and that dataclass has no `split` attribute (it only carries `input_path`, HF IDs, `text_key`, and `tags`). This assertion raises `AttributeError` whenever the test runs, so the new test cannot validate the intended behavior and will fail immediately in CI.
```python
assert isinstance(dataset.input_path, str)
assert "/deu_Latn/test/*.parquet" in dataset.input_path
assert dataset.split == "validation"
```
`RawTextEvaluationDataset` has no `split` field, so this assertion raises `AttributeError` and the test fails before it can verify anything else. The frozen dataclass only defines `input_path`, `hf_dataset_id`, `hf_dataset_name`, `text_key`, and `tags` (see marin/lib/marin/src/marin/evaluation/perplexity_gap.py, lines 37 to 44 at 15a2585).

`fineweb2_multilingual_raw_validation_sets` likewise never sets a `split`: it only calls `raw_text_dataset(...)`, which fills `input_path`, `text_key`, and `tags`. The "validation" designation for perplexity-gap comes later, when `_to_dataset_component` maps `input_path` into `UrlDatasetSourceConfig.validation_urls` (same file, lines 180 to 199 at 15a2585).

Either drop this line or replace it with a check that exercises the actual contract, e.g. call `_to_dataset_component(dataset)` and assert the resulting `UrlDatasetSourceConfig` has an empty `train_urls` and the pattern in `validation_urls`.
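A hedged sketch of that contract check, using stand-in types so it runs in isolation (the real `RawTextEvaluationDataset` and `_to_dataset_component` live in marin/lib/marin/src/marin/evaluation/perplexity_gap.py; the field names follow the comment above, while the mapping body, the example path, and the HF IDs are illustrative assumptions):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class RawTextEvaluationDataset:
    # Fields per the review comment; note: no `split` attribute exists.
    input_path: str
    hf_dataset_id: str
    hf_dataset_name: str
    text_key: str
    tags: tuple = ()

@dataclass
class UrlDatasetSourceConfig:
    train_urls: list = field(default_factory=list)
    validation_urls: list = field(default_factory=list)

def _to_dataset_component(ds: RawTextEvaluationDataset) -> UrlDatasetSourceConfig:
    # Stand-in for the real mapping: input_path lands in validation_urls only.
    return UrlDatasetSourceConfig(validation_urls=[ds.input_path])

dataset = RawTextEvaluationDataset(
    input_path="gs://example-bucket/fineweb2/deu_Latn/test/*.parquet",  # illustrative path
    hf_dataset_id="HuggingFaceFW/fineweb-2",
    hf_dataset_name="deu_Latn",
    text_key="text",
)
component = _to_dataset_component(dataset)
assert component.train_urls == []                                # nothing goes to training
assert "/deu_Latn/test/*.parquet" in component.validation_urls[0]  # pattern reaches validation
```

This tests the behavior the PR actually cares about (the raw path ends up as a validation URL with no train URLs) without asserting on an attribute the dataclass does not have.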
Add raw FineWeb2 multilingual datasets for perplexity-gap reports and a durable runner that compares Marin 8B against Llama 3.1 8B and Qwen3 8B. The runner keeps Paloma and uncheatable in the same report so multilingual regressions are visible next to broad raw PPL.
Part of #5005