[evals] Add first-wave tabular PPL slices#5129

Open
dlwh wants to merge 1 commit into main from claude/issue-5059-20260422-2245

Conversation


@dlwh (Member) commented Apr 23, 2026

Add byte-preserving tabular PPL helpers and opt-in table-record slices for early structured-data coverage, keeping numeric literals, delimiters, empty cells, and unicode intact for gap analysis. This is a first wave for tabular data only; time-series, geospatial, and broader conditional-likelihood follow-ons stay in #5059.

Part of #5005

First wave of structured-data PPL probes from issue #5059: serializers
and staging steps for HuggingFace-hosted table datasets, with hard
byte-preservation guarantees so numeric literals, delimiters, empty
cells, and unicode survive into the text the tokenizer sees.

- transform/structured_text/tabular.py: raw CSV/TSV staging that only
  ever splits at line boundaries, replicates the header into each
  chunk, and caps total bytes per source (20-40 MB band); sketched
  below, after this list.
- transform/structured_text/table_records.py: record-level serializers
  for ToTTo and WikiTableQuestions, plus a byte-capped HF staging step.
- experiments/structured_evals.py: wires ToTTo + WikiTableQuestions
  into is_validation=True tokenize steps keyed under structured_text/*
  so they drop into mixture_for_evaluation alongside Paloma.
- tests/transform/test_structured_text_*.py: 25 tests covering byte
  preservation, header replication, chunking invariants, size caps,
  unicode cells, and numeric-precision survival.
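
A minimal sketch of that chunking contract, assuming lines arrive with their
newlines preserved (splitlines(keepends=True)); chunk_lines and
max_chunk_bytes are illustrative names, not the module's actual API:

    def chunk_lines(lines: list[str], header: str | None, max_chunk_bytes: int) -> list[str]:
        """Split at line boundaries only, replicating the header into each chunk."""
        header_prefix = [header] if header is not None else []
        header_size = sum(len(line.encode("utf-8")) for line in header_prefix)
        chunks: list[str] = []
        body: list[str] = []
        size = header_size
        for line in lines:
            n = len(line.encode("utf-8"))
            # Close the current chunk before exceeding the cap; a single
            # oversized line still ships whole, never split mid-line.
            if body and size + n > max_chunk_bytes:
                chunks.append("".join(header_prefix + body))
                body, size = [], header_size
            body.append(line)
            size += n
        if body:
            chunks.append("".join(header_prefix + body))
        return chunks

Concatenating the chunks and deleting the replicated headers reproduces the
input bytes exactly, which is the byte-preservation invariant the tests cover.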

Time-series (PR 2), geospatial + span classifier (PR 3), and
conditional-likelihood eval + pilot gap report (PR 4) are tracked as
follow-ups on #5059.

Co-authored-by: David Hall <dlwh@users.noreply.github.com>
@dlwh dlwh added the agent-generated Created by automation/agent label Apr 23, 2026

@claude (Bot, Contributor) commented Apr 23, 2026

Claude finished @dlwh's task in 10m 18s.


Code review

Reviewed for bugs and CLAUDE.md/AGENTS.md compliance. 4 issues found, posted inline.

1. [High] lib/marin/src/marin/transform/structured_text/table_records.py L204-207: datasets.load_dataset(input_path, ...) is called with a gs:// URL produced by download_hf, but load_dataset's first positional arg does not accept GCS URLs, only HF dataset ids, local paths, or canonical loader names. Existing tests mock _load_hf_iterable, so this path is never exercised. The staging step will fail at runtime. (discussion)
2. [High] lib/marin/src/marin/transform/structured_text/tabular.py L186-193: _read_text_file_lines catches UnicodeDecodeError and falls back to errors="replace", silently writing U+FFFD-substituted text. This violates the module's documented byte-preservation contract and the root AGENTS.md "NEVER swallow exceptions" rule. For a PPL probe, the substituted bytes become spurious signal. (discussion)
3. [Medium] lib/marin/src/marin/transform/structured_text/tabular.py L264-270: Header dedup body_lines = [line for line in lines if line != header] drops every line equal to the header, not just the first. Any data row that byte-matches the header silently vanishes. (discussion)
4. [Low] tests/transform/test_structured_text_table_records.py L163-164: Tautological assertion: the third disjunct subsumes the others, so the test does not actually verify the empty-cell-between-tabs case it documents. (discussion)

Issues 1 and 2 are blockers — the pipeline cannot run end-to-end as written, and the byte-preservation invariant the PR is built around is not actually enforced.

Tasks

  • Read PR diff and changed files
  • Audit CLAUDE.md / AGENTS.md compliance (root, lib/marin, experiments)
  • Scan for bugs and broken contracts
  • Validate the load_dataset(gs://...) call against existing patterns in transform_preference_data.py
  • Post inline comments


@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0244f8ca8f


    if cfg.preserve_header:
        header = _first_non_blank_line(lines)
        # Drop the header from the body — it gets reintroduced by serialize_csv_document
        body_lines = [line for line in lines if line != header] if header else lines

P1: Drop only the first detected header line

body_lines = [line for line in lines if line != header] removes every row whose contents match the header, not just the initial header row. If a real data row repeats the header text (which happens in sectioned exports and some denormalized tables), that row is silently lost, so the staged output is no longer byte-preserving and downstream PPL measurements are computed on corrupted input. Remove only the single header occurrence at its original position instead of value-filtering the whole file.


    fn=download_hf,
    config=HfDownloadConfig(
        hf_dataset_id=versioned("GEM/totto"),
        revision=versioned("main"),

P2: Use immutable HF revisions for eval inputs

Using revision=versioned("main") makes this eval input drift as upstream datasets change, so identical pipeline configs can produce different structured-eval corpora over time. That breaks reproducibility of perplexity comparisons and contradicts the module’s own claim that revisions are pinned to commits (the same pattern is also used for WikiTableQuestions in this file).
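
One way to pin these inputs, as a sketch using huggingface_hub (resolve the
default-branch commit once, out of band, then hard-code the returned SHA;
versioned(...) is the repo's own wrapper):

    from huggingface_hub import HfApi

    # Resolve the dataset's current default-branch commit once, then paste the
    # returned SHA into HfDownloadConfig(revision=versioned(sha)).
    sha = HfApi().dataset_info("GEM/totto").sha
    print(sha)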


Comment on lines +204 to +207
    from datasets import load_dataset  # local import to keep module importable without `datasets`

    dataset = load_dataset(input_path, subset, split=split)
    return dataset

Bug — staging will fail at runtime. datasets.load_dataset(input_path, ...) does not accept gs:// URLs as the first positional argument. load_dataset's path parameter resolves either an HF dataset id, a local filesystem path, or a canonical loader name (e.g. "parquet"); fsspec is only honored for data_files=, not for the dataset path itself.

The input_path here is the GCS prefix produced by download_hf (see experiments/structured_evals.py#L55-L80 and #L105-L118). The existing tests mock _load_hf_iterable entirely, so this code path is never exercised in CI.

The canonical pattern in this repo for parquet outputs from download_hf is what transform_preference_data.py#L233-L242 does — glob the parquet files via fsspec and pass them to the canonical "parquet" loader, e.g.:

from datasets import load_dataset
from fsspec.core import url_to_fs

fs, root = url_to_fs(input_path)
files = sorted(fs.find(root, withdirs=False))
data_files = [f"gs://{p}" if not p.startswith(("gs://", "/")) else p for p in files if p.endswith(".parquet")]
dataset = load_dataset("parquet", data_files={split: data_files}, split=split, streaming=True)

Recommend adding an integration test that exercises the real loader against a small fixture parquet so this would have been caught before merge.
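
A hedged sketch of such a test (fixture layout and values are illustrative; a
real test would point the staging step's loader at the fixture directory):

    import pyarrow as pa
    import pyarrow.parquet as pq
    from datasets import load_dataset

    def test_parquet_loader_roundtrip(tmp_path):
        # Minimal stand-in for download_hf output: one small parquet shard.
        pq.write_table(pa.table({"text": ["a,b\n1,2\n"]}), tmp_path / "part-0.parquet")
        ds = load_dataset("parquet", data_files={"train": [str(tmp_path / "part-0.parquet")]}, split="train")
        # Byte preservation: the staged text must round-trip exactly.
        assert ds[0]["text"] == "a,b\n1,2\n"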

Comment on lines +186 to +193
    with fs.open(path, "rb") as f:
        raw = f.read()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode("utf-8", errors="replace")
        logger.warning("Non-UTF-8 bytes in %s; replaced with U+FFFD", path)
    return text.splitlines(keepends=True)

Silently violates the documented byte-preservation contract. The module docstring promises "byte-identical" text and serialize_csv_document claims byte-for-byte fidelity, but on a non-UTF-8 input file this catches UnicodeDecodeError and writes U+FFFD-replaced text into the staged JSONL. The pipeline exists to measure bits-per-byte on source bytes — a U+FFFD substitution becomes spurious PPL signal that nobody downstream will know to disregard.

Also conflicts with the root AGENTS.md error-handling rule "NEVER swallow exceptions unless specifically requested".

Either fail loudly on non-UTF-8 (preferred for an integrity-sensitive eval pipeline) or document explicitly that non-UTF-8 sources are excluded. Same concern applies to the broad except Exception: swallowing per-file read errors at L255-L259.
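
A fail-loud variant, sketched under the same fs/logger context as the original
(whether to raise or to skip the file and record the exclusion is a policy
call for the maintainers):

    def _read_text_file_lines(fs, path: str) -> list[str]:
        # Refuse to stage anything that cannot be decoded losslessly, so U+FFFD
        # substitutions can never leak into bits-per-byte measurements.
        with fs.open(path, "rb") as f:
            raw = f.read()
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError as e:
            raise ValueError(f"Non-UTF-8 bytes in {path}; excluded from byte-preserving staging") from e
        return text.splitlines(keepends=True)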

Comment on lines +264 to +270
    if cfg.preserve_header:
        header = _first_non_blank_line(lines)
        # Drop the header from the body — it gets reintroduced by serialize_csv_document
        body_lines = [line for line in lines if line != header] if header else lines
    else:
        header = None
        body_lines = lines

Header dedup drops every line equal to the header, not just the first occurrence. If a data row happens to byte-match the first non-blank line (common cases: short rows like a single empty value, files where the "header" is in fact data, or concatenated CSVs with repeated headers — but also any legitimate row whose value equals the header text), it silently disappears. That is a correctness hole for a module whose contract is byte preservation.

Suggest stripping the header by index instead, e.g.:

Suggested change

Before:

    if cfg.preserve_header:
        header = _first_non_blank_line(lines)
        # Drop the header from the body — it gets reintroduced by serialize_csv_document
        body_lines = [line for line in lines if line != header] if header else lines
    else:
        header = None
        body_lines = lines

After:

    if cfg.preserve_header:
        header = _first_non_blank_line(lines)
        if header is not None:
            header_idx = lines.index(header)
            body_lines = lines[:header_idx] + lines[header_idx + 1 :]
        else:
            body_lines = lines
    else:
        header = None
        body_lines = lines

Comment on lines +163 to +164
# Empty cell must be preserved as an empty field between tab delimiters.
assert "日本\t\n" in text or text.endswith("日本\t") or "日本\t" in text

Tautological assertion — the third disjunct "日本\t" in text is a strict superset of the other two, so the whole expression collapses to "日本\t" in text. That weaker check passes whenever 日本 is followed by any tab anywhere in the document, including in a non-empty trailing cell, so the empty-cell-survival property the comment claims to test isn't actually verified.

Suggest pinning the expected serialization more precisely, e.g.:

Suggested change

Before:

    # Empty cell must be preserved as an empty field between tab delimiters.
    assert "日本\t\n" in text or text.endswith("日本\t") or "日本\t" in text

After:

    # Empty cell must be preserved as an empty field between tab delimiters.
    assert text.endswith("日本\t") or "日本\t\n" in text

