
[evals] Add FineWeb2 multilingual gap rerun #5074

Open
dlwh wants to merge 4 commits into main from codex/fineweb2-multilingual-gap-reruns
Conversation

Member

@dlwh dlwh commented Apr 22, 2026

Add raw FineWeb2 multilingual datasets for perplexity-gap reports and a durable runner that compares Marin 8B against Llama 3.1 8B and Qwen3 8B. The runner keeps Paloma and uncheatable in the same report so multilingual regressions are visible next to broad raw PPL.

Part of #5005

dlwh added 3 commits April 21, 2026 16:21
Token boundary rendering for literal examples was happening while scanning every short segment, making long-document gap reports CPU-bound. Keep compact token span metadata in the hot path and render boundaries only for literals that survive into the final report.
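The deferred-rendering pattern that commit describes — record compact span metadata during the scan, render token boundaries only for literals that survive into the final report — can be sketched roughly as follows. All names here are illustrative stand-ins, not the actual implementation:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenSpan:
    # Compact metadata kept in the hot path: offsets only, no string building.
    start: int
    end: int


def render_boundaries(text: str, span: TokenSpan) -> str:
    # The expensive formatting step, now done only for survivors.
    return f"{text[:span.start]}|{text[span.start:span.end]}|{text[span.end:]}"


def build_report(text: str, spans: list[TokenSpan], keep: set[int]) -> list[str]:
    # Scan phase recorded spans; rendering happens only for literals
    # that made it into the final report.
    return [render_boundaries(text, s) for i, s in enumerate(spans) if i in keep]
```

The point of the split is that the per-segment scan stays allocation-light, while the string formatting cost scales with the (much smaller) report, not the document.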
@dlwh dlwh added the agent-generated Created by automation/agent label Apr 22, 2026
@claude
Contributor

claude Bot commented Apr 22, 2026

Claude finished @dlwh's task in 3m 15s.


Code review

  • Check if PR should be reviewed
  • Find relevant CLAUDE.md/AGENTS.md files
  • Get PR summary
  • Review changes for bugs and CLAUDE.md compliance
  • Validate flagged issues
  • Post review comment

Summary of changes

Adds fineweb2_multilingual_raw_validation_sets() to experiments/evals/fineweb2_multilingual.py (returns dict[str, RawTextEvaluationDataset] keyed as fineweb2_multilingual/<config>), a new runner experiments/exp_model_perplexity_gap_fineweb2_multilingual.py that compares Marin‑8B‑base against Llama‑3.1‑8B and Qwen3‑8B over default_raw_validation_sets() ∪ fineweb2_multilingual_raw_validation_sets() on v5p‑8 / us-central1, and a unit test in tests/evals/test_fineweb2_multilingual.py.

Issues found (1)

  1. Test will fail at runtime — dataset.split does not exist. In tests/evals/test_fineweb2_multilingual.py:14, the assertion dataset.split == "validation" raises AttributeError because RawTextEvaluationDataset is a frozen dataclass whose only fields are input_path, hf_dataset_id, hf_dataset_name, text_key, and tags (perplexity_gap.py#L37-L44). fineweb2_multilingual_raw_validation_sets never sets a split — the "validation" designation is established later by _to_dataset_component via UrlDatasetSourceConfig.validation_urls, not on the dataset object itself. Inline suggestion posted to drop the line; if a split check is desired, exercise _to_dataset_component(dataset) instead.

Nothing else flagged: ResourceConfig.with_tpu(..., regions=["us-central1"]) is valid (regions: Sequence[str] | None is a real field on ResourceConfig), the runner's dataset-dict merge has no key collisions (paloma/uncheatable prefixes vs. fineweb2_multilingual/*), and the new helper matches the existing raw_text_dataset contract.
• Branch: codex/fineweb2-multilingual-gap-reruns
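The collision-free merge noted above relies on each group keying its datasets under a distinct prefix. A minimal sketch of a merge that makes that invariant explicit (the helper name and example keys are hypothetical, not from the PR):

```python
def merge_eval_sets(*groups: dict[str, object]) -> dict[str, object]:
    """Merge keyed dataset groups, refusing silent overwrites."""
    merged: dict[str, object] = {}
    for group in groups:
        overlap = merged.keys() & group.keys()
        if overlap:
            raise ValueError(f"duplicate eval set keys: {sorted(overlap)}")
        merged.update(group)
    return merged


# Prefixed keys ("paloma/...", "fineweb2_multilingual/...") keep groups disjoint.
paloma = {"paloma/c4_en": object()}
fineweb2 = {"fineweb2_multilingual/deu_Latn": object()}
combined = merge_eval_sets(paloma, fineweb2)
```

A plain `{**a, **b}` merge would silently drop colliding entries, so an explicit overlap check is a cheap guard when dataset groups come from independent helpers.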


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 15a2585136

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".


assert isinstance(dataset.input_path, str)
assert "/deu_Latn/test/*.parquet" in dataset.input_path
assert dataset.split == "validation"


P1: Stop asserting missing split field on raw eval datasets

fineweb2_multilingual_raw_validation_sets() returns RawTextEvaluationDataset objects created by raw_text_dataset(), and that dataclass has no split attribute (it only carries input_path/HF IDs/text key/tags). This assertion will raise AttributeError whenever the test runs, so the new test cannot validate the intended behavior and will fail immediately in CI environments where tests execute.



assert isinstance(dataset.input_path, str)
assert "/deu_Latn/test/*.parquet" in dataset.input_path
assert dataset.split == "validation"
Contributor


RawTextEvaluationDataset has no split field, so this assertion raises AttributeError and the test fails before it can verify anything else. The frozen dataclass only defines input_path, hf_dataset_id, hf_dataset_name, text_key, and tags:

@dataclass(frozen=True)
class RawTextEvaluationDataset:
    input_path: str | InputName | ExecutorStep | None = None
    hf_dataset_id: str | None = None
    hf_dataset_name: str | None = None
    text_key: str = "text"
    tags: tuple[str, ...] = ()

fineweb2_multilingual_raw_validation_sets likewise never sets a split — it only calls raw_text_dataset(...), which fills input_path, text_key, and tags. The "validation" designation for perplexity-gap comes later, when _to_dataset_component maps input_path into UrlDatasetSourceConfig.validation_urls rather than setting a dataset-level attribute:

def _to_dataset_component(config: RawTextEvaluationDataset) -> DatasetComponent:
    dataset_format = TextLmDatasetFormat(text_key=config.text_key)
    if config.hf_dataset_id is not None:
        source = HfDatasetSourceConfig(
            id=config.hf_dataset_id,
            name=config.hf_dataset_name,
            format=dataset_format,
        )
    else:
        if config.input_path is None:
            raise ValueError("RawTextEvaluationDataset requires either input_path or hf_dataset_id.")
        input_path = config.input_path
        if isinstance(input_path, ExecutorStep):
            input_path = input_path.as_input_name()
        source = UrlDatasetSourceConfig(
            train_urls=[],
            validation_urls=[input_path],  # type: ignore[list-item]
            format=dataset_format,
        )
    return DatasetComponent(source=source, format=dataset_format, tags=list(config.tags))

So there is nothing on the returned object to assert against here.

Either drop this line or replace it with a check that exercises the actual contract (e.g. call _to_dataset_component(dataset) and assert the resulting UrlDatasetSourceConfig has an empty train_urls and the pattern in validation_urls).

Suggested change (remove this line):
assert dataset.split == "validation"


Labels

agent-generated Created by automation/agent
