
[evals] Add FineWeb2 multilingual gap rerun #5074

Open
dlwh wants to merge 4 commits into main from codex/fineweb2-multilingual-gap-reruns
Conversation

Member

@dlwh dlwh commented Apr 22, 2026

Add raw FineWeb2 multilingual datasets for perplexity-gap reports and a durable runner that compares Marin 8B against Llama 3.1 8B and Qwen3 8B. The runner keeps Paloma and uncheatable in the same report so multilingual regressions are visible next to broad raw PPL.

Part of #5005

dlwh added 3 commits April 21, 2026 16:21
Token boundary rendering for literal examples was happening while scanning every short segment, making long-document gap reports CPU-bound. Keep compact token span metadata in the hot path and render boundaries only for literals that survive into the final report.
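The deferred-rendering pattern that commit describes — record compact span metadata during the scan, render token boundaries only for literals that survive into the final report — can be sketched roughly as follows. All names here are illustrative stand-ins, not the actual implementation:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenSpan:
    # Compact metadata kept in the hot path: offsets only, no string building.
    start: int
    end: int


def render_boundaries(text: str, span: TokenSpan) -> str:
    # The expensive formatting step, now done only for survivors.
    return f"{text[:span.start]}|{text[span.start:span.end]}|{text[span.end:]}"


def build_report(text: str, spans: list[TokenSpan], keep: set[int]) -> list[str]:
    # Scan phase recorded spans; rendering happens only for literals
    # that made it into the final report.
    return [render_boundaries(text, s) for i, s in enumerate(spans) if i in keep]
```

The point of the split is that the per-segment scan stays allocation-light, while the string formatting cost scales with the (much smaller) report, not the document.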
@dlwh dlwh added the agent-generated Created by automation/agent label Apr 22, 2026
@claude
Contributor

claude Bot commented Apr 22, 2026

Claude finished @dlwh's task in 3m 15s.


Code review

  • Check if PR should be reviewed
  • Find relevant CLAUDE.md/AGENTS.md files
  • Get PR summary
  • Review changes for bugs and CLAUDE.md compliance
  • Validate flagged issues
  • Post review comment

Summary of changes

Adds fineweb2_multilingual_raw_validation_sets() to experiments/evals/fineweb2_multilingual.py (returns dict[str, RawTextEvaluationDataset] keyed as fineweb2_multilingual/<config>), a new runner experiments/exp_model_perplexity_gap_fineweb2_multilingual.py that compares Marin‑8B‑base against Llama‑3.1‑8B and Qwen3‑8B over default_raw_validation_sets() ∪ fineweb2_multilingual_raw_validation_sets() on v5p‑8 / us-central1, and a unit test in tests/evals/test_fineweb2_multilingual.py.

Issues found (1)

  1. Test will fail at runtime — dataset.split does not exist. In tests/evals/test_fineweb2_multilingual.py:14, the assertion dataset.split == "validation" raises AttributeError because RawTextEvaluationDataset is a frozen dataclass whose only fields are input_path, hf_dataset_id, hf_dataset_name, text_key, and tags (perplexity_gap.py#L37-L44). fineweb2_multilingual_raw_validation_sets never sets a split — the "validation" designation is established later by _to_dataset_component via UrlDatasetSourceConfig.validation_urls, not on the dataset object itself. Inline suggestion posted to drop the line; if a split check is desired, exercise _to_dataset_component(dataset) instead.

Nothing else flagged: ResourceConfig.with_tpu(..., regions=["us-central1"]) is valid (regions: Sequence[str] | None is a real field on ResourceConfig), the runner's dataset-dict merge has no key collisions (paloma/uncheatable prefixes vs. fineweb2_multilingual/*), and the new helper matches the existing raw_text_dataset contract.
• Branch: codex/fineweb2-multilingual-gap-reruns
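The collision-free merge noted above relies on each group keying its datasets under a distinct prefix. A minimal sketch of a merge that makes that invariant explicit (the helper name and example keys are hypothetical, not from the PR):

```python
def merge_eval_sets(*groups: dict[str, object]) -> dict[str, object]:
    """Merge keyed dataset groups, refusing silent overwrites."""
    merged: dict[str, object] = {}
    for group in groups:
        overlap = merged.keys() & group.keys()
        if overlap:
            raise ValueError(f"duplicate eval set keys: {sorted(overlap)}")
        merged.update(group)
    return merged


# Prefixed keys ("paloma/...", "fineweb2_multilingual/...") keep groups disjoint.
paloma = {"paloma/c4_en": object()}
fineweb2 = {"fineweb2_multilingual/deu_Latn": object()}
combined = merge_eval_sets(paloma, fineweb2)
```

A plain `{**a, **b}` merge would silently drop colliding entries, so an explicit overlap check is a cheap guard when dataset groups come from independent helpers.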


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 15a2585136

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".


assert isinstance(dataset.input_path, str)
assert "/deu_Latn/test/*.parquet" in dataset.input_path
assert dataset.split == "validation"


P1: Stop asserting missing split field on raw eval datasets

fineweb2_multilingual_raw_validation_sets() returns RawTextEvaluationDataset objects created by raw_text_dataset(), and that dataclass has no split attribute (it only carries input_path/HF IDs/text key/tags). This assertion will raise AttributeError whenever the test runs, so the new test cannot validate the intended behavior and will fail immediately in CI environments where tests execute.



assert isinstance(dataset.input_path, str)
assert "/deu_Latn/test/*.parquet" in dataset.input_path
assert dataset.split == "validation"
Contributor


RawTextEvaluationDataset has no split field, so this assertion raises AttributeError and the test fails before it can verify anything else. The frozen dataclass only defines input_path, hf_dataset_id, hf_dataset_name, text_key, and tags:

@dataclass(frozen=True)
class RawTextEvaluationDataset:
    input_path: str | InputName | ExecutorStep | None = None
    hf_dataset_id: str | None = None
    hf_dataset_name: str | None = None
    text_key: str = "text"
    tags: tuple[str, ...] = ()

fineweb2_multilingual_raw_validation_sets likewise never sets a split — it only calls raw_text_dataset(...), which fills input_path, text_key, and tags. The "validation" designation for perplexity-gap comes later, when _to_dataset_component maps input_path into UrlDatasetSourceConfig.validation_urls rather than setting a dataset-level attribute:

def _to_dataset_component(config: RawTextEvaluationDataset) -> DatasetComponent:
    dataset_format = TextLmDatasetFormat(text_key=config.text_key)
    if config.hf_dataset_id is not None:
        source = HfDatasetSourceConfig(
            id=config.hf_dataset_id,
            name=config.hf_dataset_name,
            format=dataset_format,
        )
    else:
        if config.input_path is None:
            raise ValueError("RawTextEvaluationDataset requires either input_path or hf_dataset_id.")
        input_path = config.input_path
        if isinstance(input_path, ExecutorStep):
            input_path = input_path.as_input_name()
        source = UrlDatasetSourceConfig(
            train_urls=[],
            validation_urls=[input_path],  # type: ignore[list-item]
            format=dataset_format,
        )
    return DatasetComponent(source=source, format=dataset_format, tags=list(config.tags))

So there is nothing on the returned object to assert against here.

Either drop this line or replace it with a check that exercises the actual contract (e.g. call _to_dataset_component(dataset) and assert the resulting UrlDatasetSourceConfig has an empty train_urls and the pattern in validation_urls).

Suggested change (remove this line):
assert dataset.split == "validation"


Labels

agent-generated Created by automation/agent
