[evals] Add first-wave tabular PPL slices#5129

Open
dlwh wants to merge 1 commit into main from claude/issue-5059-20260422-2245

Conversation


@dlwh (Member) commented Apr 23, 2026

Add byte-preserving tabular PPL helpers and opt-in table-record slices for early structured-data coverage, keeping numeric literals, delimiters, empty cells, and unicode intact for gap analysis. This is a first wave for tabular data only; time-series, geospatial, and broader conditional-likelihood follow-ons stay in #5059.

Part of #5005

First wave of structured-data PPL probes from issue #5059: serializers
and staging steps for HuggingFace-hosted table datasets, with hard
byte-preservation guarantees so numeric literals, delimiters, empty
cells, and unicode survive into the text the tokenizer sees.

- transform/structured_text/tabular.py: raw CSV/TSV staging that only
  ever splits at line boundaries, replicates the header into each
  chunk, and caps total bytes per source (20-40 MB band); sketched
  below, after this list.
- transform/structured_text/table_records.py: record-level serializers
  for ToTTo and WikiTableQuestions, plus a byte-capped HF staging step.
- experiments/structured_evals.py: wires ToTTo + WikiTableQuestions
  into is_validation=True tokenize steps keyed under structured_text/*
  so they drop into mixture_for_evaluation alongside Paloma.
- tests/transform/test_structured_text_*.py: 25 tests covering byte
  preservation, header replication, chunking invariants, size caps,
  unicode cells, and numeric-precision survival.
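
A minimal sketch of that chunking contract, assuming lines arrive with their
newlines preserved (splitlines(keepends=True)); chunk_lines and
max_chunk_bytes are illustrative names, not the module's actual API:

    def chunk_lines(lines: list[str], header: str | None, max_chunk_bytes: int) -> list[str]:
        """Split at line boundaries only, replicating the header into each chunk."""
        header_prefix = [header] if header is not None else []
        header_size = sum(len(line.encode("utf-8")) for line in header_prefix)
        chunks: list[str] = []
        body: list[str] = []
        size = header_size
        for line in lines:
            n = len(line.encode("utf-8"))
            # Close the current chunk before exceeding the cap; a single
            # oversized line still ships whole, never split mid-line.
            if body and size + n > max_chunk_bytes:
                chunks.append("".join(header_prefix + body))
                body, size = [], header_size
            body.append(line)
            size += n
        if body:
            chunks.append("".join(header_prefix + body))
        return chunks

Concatenating the chunks and deleting the replicated headers reproduces the
input bytes exactly, which is the byte-preservation invariant the tests cover.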

Time-series (PR 2), geospatial + span classifier (PR 3), and
conditional-likelihood eval + pilot gap report (PR 4) are tracked as
follow-ups on #5059.

Co-authored-by: David Hall <dlwh@users.noreply.github.com>
@dlwh dlwh added the agent-generated Created by automation/agent label Apr 23, 2026

@claude (Bot, Contributor) commented Apr 23, 2026

Claude finished @dlwh's task in 10m 18s.


Code review

Reviewed for bugs and CLAUDE.md/AGENTS.md compliance. 4 issues found, posted inline.

1. [High] lib/marin/src/marin/transform/structured_text/table_records.py L204-207: datasets.load_dataset(input_path, ...) is called with a gs:// URL produced by download_hf, but load_dataset's first positional arg does not accept GCS URLs, only HF dataset ids, local paths, or canonical loader names. Existing tests mock _load_hf_iterable, so this path is never exercised. The staging step will fail at runtime. (discussion)
2. [High] lib/marin/src/marin/transform/structured_text/tabular.py L186-193: _read_text_file_lines catches UnicodeDecodeError and falls back to errors="replace", silently writing U+FFFD-substituted text. This violates the module's documented byte-preservation contract and the root AGENTS.md "NEVER swallow exceptions" rule. For a PPL probe, the substituted bytes become spurious signal. (discussion)
3. [Medium] lib/marin/src/marin/transform/structured_text/tabular.py L264-270: Header dedup body_lines = [line for line in lines if line != header] drops every line equal to the header, not just the first. Any data row that byte-matches the header silently vanishes. (discussion)
4. [Low] tests/transform/test_structured_text_table_records.py L163-164: Tautological assertion: the third disjunct subsumes the others, so the test does not actually verify the empty-cell-between-tabs case it documents. (discussion)

Issues 1 and 2 are blockers — the pipeline cannot run end-to-end as written, and the byte-preservation invariant the PR is built around is not actually enforced.

Tasks

  • Read PR diff and changed files
  • Audit CLAUDE.md / AGENTS.md compliance (root, lib/marin, experiments)
  • Scan for bugs and broken contracts
  • Validate the load_dataset(gs://...) call against existing patterns in transform_preference_data.py
  • Post inline comments


@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0244f8ca8f


    if cfg.preserve_header:
        header = _first_non_blank_line(lines)
        # Drop the header from the body — it gets reintroduced by serialize_csv_document
        body_lines = [line for line in lines if line != header] if header else lines

P1: Drop only the first detected header line

body_lines = [line for line in lines if line != header] removes every row whose contents match the header, not just the initial header row. If a real data row repeats the header text (which happens in sectioned exports and some denormalized tables), that row is silently lost, so the staged output is no longer byte-preserving and downstream PPL measurements are computed on corrupted input. Remove only the single header occurrence at its original position instead of value-filtering the whole file.


    fn=download_hf,
    config=HfDownloadConfig(
        hf_dataset_id=versioned("GEM/totto"),
        revision=versioned("main"),

P2: Use immutable HF revisions for eval inputs

Using revision=versioned("main") makes this eval input drift as upstream datasets change, so identical pipeline configs can produce different structured-eval corpora over time. That breaks reproducibility of perplexity comparisons and contradicts the module’s own claim that revisions are pinned to commits (the same pattern is also used for WikiTableQuestions in this file).
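
One way to pin these inputs, as a sketch using huggingface_hub (resolve the
default-branch commit once, out of band, then hard-code the returned SHA;
versioned(...) is the repo's own wrapper):

    from huggingface_hub import HfApi

    # Resolve the dataset's current default-branch commit once, then paste the
    # returned SHA into HfDownloadConfig(revision=versioned(sha)).
    sha = HfApi().dataset_info("GEM/totto").sha
    print(sha)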


Comment on lines +204 to +207
    from datasets import load_dataset  # local import to keep module importable without `datasets`

    dataset = load_dataset(input_path, subset, split=split)
    return dataset

Bug — staging will fail at runtime. datasets.load_dataset(input_path, ...) does not accept gs:// URLs as the first positional argument. load_dataset's path parameter resolves either an HF dataset id, a local filesystem path, or a canonical loader name (e.g. "parquet"); fsspec is only honored for data_files=, not for the dataset path itself.

The input_path here is the GCS prefix produced by download_hf (see experiments/structured_evals.py#L55-L80 and #L105-L118). The existing tests mock _load_hf_iterable entirely, so this code path is never exercised in CI.

The canonical pattern in this repo for parquet outputs from download_hf is what transform_preference_data.py#L233-L242 does — glob the parquet files via fsspec and pass them to the canonical "parquet" loader, e.g.:

from datasets import load_dataset
from fsspec.core import url_to_fs

fs, root = url_to_fs(input_path)
files = sorted(fs.find(root, withdirs=False))
data_files = [f"gs://{p}" if not p.startswith(("gs://", "/")) else p for p in files if p.endswith(".parquet")]
dataset = load_dataset("parquet", data_files={split: data_files}, split=split, streaming=True)

Recommend adding an integration test that exercises the real loader against a small fixture parquet so this would have been caught before merge.
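
A hedged sketch of such a test (fixture layout and values are illustrative; a
real test would point the staging step's loader at the fixture directory):

    import pyarrow as pa
    import pyarrow.parquet as pq
    from datasets import load_dataset

    def test_parquet_loader_roundtrip(tmp_path):
        # Minimal stand-in for download_hf output: one small parquet shard.
        pq.write_table(pa.table({"text": ["a,b\n1,2\n"]}), tmp_path / "part-0.parquet")
        ds = load_dataset("parquet", data_files={"train": [str(tmp_path / "part-0.parquet")]}, split="train")
        # Byte preservation: the staged text must round-trip exactly.
        assert ds[0]["text"] == "a,b\n1,2\n"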

Comment on lines +186 to +193
    with fs.open(path, "rb") as f:
        raw = f.read()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode("utf-8", errors="replace")
        logger.warning("Non-UTF-8 bytes in %s; replaced with U+FFFD", path)
    return text.splitlines(keepends=True)

Silently violates the documented byte-preservation contract. The module docstring promises "byte-identical" text and serialize_csv_document claims byte-for-byte fidelity, but on a non-UTF-8 input file this catches UnicodeDecodeError and writes U+FFFD-replaced text into the staged JSONL. The pipeline exists to measure bits-per-byte on source bytes — a U+FFFD substitution becomes spurious PPL signal that nobody downstream will know to disregard.

Also conflicts with the root AGENTS.md error-handling rule "NEVER swallow exceptions unless specifically requested".

Either fail loudly on non-UTF-8 (preferred for an integrity-sensitive eval pipeline) or document explicitly that non-UTF-8 sources are excluded. Same concern applies to the broad except Exception: swallowing per-file read errors at L255-L259.
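
A fail-loud variant, sketched under the same fs/logger context as the original
(whether to raise or to skip the file and record the exclusion is a policy
call for the maintainers):

    def _read_text_file_lines(fs, path: str) -> list[str]:
        # Refuse to stage anything that cannot be decoded losslessly, so U+FFFD
        # substitutions can never leak into bits-per-byte measurements.
        with fs.open(path, "rb") as f:
            raw = f.read()
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError as e:
            raise ValueError(f"Non-UTF-8 bytes in {path}; excluded from byte-preserving staging") from e
        return text.splitlines(keepends=True)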

Comment on lines +264 to +270
    if cfg.preserve_header:
        header = _first_non_blank_line(lines)
        # Drop the header from the body — it gets reintroduced by serialize_csv_document
        body_lines = [line for line in lines if line != header] if header else lines
    else:
        header = None
        body_lines = lines

Header dedup drops every line equal to the header, not just the first occurrence. If a data row happens to byte-match the first non-blank line (common cases: short rows like a single empty value, files where the "header" is in fact data, or concatenated CSVs with repeated headers — but also any legitimate row whose value equals the header text), it silently disappears. That is a correctness hole for a module whose contract is byte preservation.

Suggest stripping the header by index instead, e.g.:

Suggested change

Before:

    if cfg.preserve_header:
        header = _first_non_blank_line(lines)
        # Drop the header from the body — it gets reintroduced by serialize_csv_document
        body_lines = [line for line in lines if line != header] if header else lines
    else:
        header = None
        body_lines = lines

After:

    if cfg.preserve_header:
        header = _first_non_blank_line(lines)
        if header is not None:
            header_idx = lines.index(header)
            body_lines = lines[:header_idx] + lines[header_idx + 1 :]
        else:
            body_lines = lines
    else:
        header = None
        body_lines = lines

Comment on lines +163 to +164
# Empty cell must be preserved as an empty field between tab delimiters.
assert "日本\t\n" in text or text.endswith("日本\t") or "日本\t" in text

Tautological assertion — the third disjunct "日本\t" in text is a strict superset of the other two, so the whole expression collapses to "日本\t" in text. That weaker check passes whenever 日本 is followed by any tab anywhere in the document, including in a non-empty trailing cell, so the empty-cell-survival property the comment claims to test isn't actually verified.

Suggest pinning the expected serialization more precisely, e.g.:

Suggested change

Before:

    # Empty cell must be preserved as an empty field between tab delimiters.
    assert "日本\t\n" in text or text.endswith("日本\t") or "日本\t" in text

After:

    # Empty cell must be preserved as an empty field between tab delimiters.
    assert text.endswith("日本\t") or "日本\t\n" in text

