Conversation
First wave of structured-data PPL probes from issue #5059: serializers and staging steps for HuggingFace-hosted table datasets, with hard byte-preservation guarantees so numeric literals, delimiters, empty cells, and unicode survive into the text the tokenizer sees.

- transform/structured_text/tabular.py: raw CSV/TSV staging that only ever splits at line boundaries, replicates the header into each chunk, and caps total bytes per source (20-40 MB band); a sketch of this chunking scheme appears after this description.
- transform/structured_text/table_records.py: record-level serializers for ToTTo and WikiTableQuestions, plus a byte-capped HF staging step.
- experiments/structured_evals.py: wires ToTTo + WikiTableQuestions into is_validation=True tokenize steps keyed under structured_text/* so they drop into mixture_for_evaluation alongside Paloma.
- tests/transform/test_structured_text_*.py: 25 tests covering byte preservation, header replication, chunking invariants, size caps, unicode cells, and numeric-precision survival.

Time-series (PR 2), geospatial + span classifier (PR 3), and conditional-likelihood eval + pilot gap report (PR 4) are tracked as follow-ups on #5059.

Co-authored-by: David Hall <dlwh@users.noreply.github.com>
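A minimal sketch of that line-boundary chunking, under assumed names (chunk_csv_lines, max_chunk_bytes); it is illustrative only and not the actual tabular.py implementation:

# Illustrative sketch: the header is copied into every chunk, rows are never split
# mid-line, and each chunk is bounded by a byte cap. Lines are expected to keep their
# original newlines (splitlines(keepends=True)) so each row's bytes pass through unchanged.
def chunk_csv_lines(lines: list[str], max_chunk_bytes: int = 1_000_000) -> list[str]:
    if not lines:
        return []
    header, body = lines[0], lines[1:]
    header_bytes = len(header.encode("utf-8"))
    chunks: list[str] = []
    current: list[str] = [header]
    size = header_bytes
    for line in body:
        n = len(line.encode("utf-8"))
        # Only cut at a line boundary, and only once the chunk holds at least one body row.
        if len(current) > 1 and size + n > max_chunk_bytes:
            chunks.append("".join(current))
            current, size = [header], header_bytes
        current.append(line)
        size += n
    chunks.append("".join(current))
    return chunks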
Claude finished @dlwh's task in 10m 18s

Code review: reviewed for bugs and CLAUDE.md/AGENTS.md compliance. 4 issues found, posted inline.
Issues 1 and 2 are blockers: the pipeline cannot run end-to-end as written, and the byte-preservation invariant the PR is built around is not actually enforced.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0244f8ca8f
if cfg.preserve_header:
    header = _first_non_blank_line(lines)
    # Drop the header from the body — it gets reintroduced by serialize_csv_document
    body_lines = [line for line in lines if line != header] if header else lines
Drop only the first detected header line
body_lines = [line for line in lines if line != header] removes every row whose contents match the header, not just the initial header row. If a real data row repeats the header text (which happens in sectioned exports and some denormalized tables), that row is silently lost, so the staged output is no longer byte-preserving and downstream PPL measurements are computed on corrupted input. Remove only the single header occurrence at its original position instead of value-filtering the whole file.
fn=download_hf,
config=HfDownloadConfig(
    hf_dataset_id=versioned("GEM/totto"),
    revision=versioned("main"),
Use immutable HF revisions for eval inputs
Using revision=versioned("main") makes this eval input drift as upstream datasets change, so identical pipeline configs can produce different structured-eval corpora over time. That breaks reproducibility of perplexity comparisons and contradicts the module’s own claim that revisions are pinned to commits (the same pattern is also used for WikiTableQuestions in this file).
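A minimal sketch of one way to freeze the revision, using huggingface_hub directly rather than this repo's HfDownloadConfig helpers (so this shows a possible workflow, not the PR's API): resolve main to a commit SHA once and record that SHA in the config instead of the branch name.

# Illustrative only: resolve the current tip of main to an immutable commit once, then
# pin that SHA in the eval config (e.g. revision=versioned(pinned_sha)) so the staged
# corpus cannot drift as upstream pushes new commits.
from huggingface_hub import dataset_info

pinned_sha = dataset_info("GEM/totto").sha
print(pinned_sha)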
from datasets import load_dataset  # local import to keep module importable without `datasets`

dataset = load_dataset(input_path, subset, split=split)
return dataset
Bug — staging will fail at runtime. datasets.load_dataset(input_path, ...) does not accept gs:// URLs as the first positional argument. load_dataset's path parameter resolves either an HF dataset id, a local filesystem path, or a canonical loader name (e.g. "parquet"); fsspec is only honored for data_files=, not for the dataset path itself.
The input_path here is the GCS prefix produced by download_hf (see experiments/structured_evals.py#L55-L80 and #L105-L118). The existing tests mock _load_hf_iterable entirely, so this code path is never exercised in CI.
The canonical pattern in this repo for parquet outputs from download_hf is what transform_preference_data.py#L233-L242 does — glob the parquet files via fsspec and pass them to the canonical "parquet" loader, e.g.:
from fsspec.core import url_to_fs

fs, root = url_to_fs(input_path)
files = sorted(fs.find(root, withdirs=False))
data_files = [f"gs://{p}" if not p.startswith(("gs://", "/")) else p for p in files if p.endswith(".parquet")]
dataset = load_dataset("parquet", data_files={split: data_files}, split=split, streaming=True)

Recommend adding an integration test that exercises the real loader against a small fixture parquet so this would have been caught before merge.
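One possible shape for that test, sketched under assumptions (pytest's tmp_path fixture, pyarrow available, and a standalone check of the "parquet" loader rather than this PR's actual staging function):

# Sketch: write a tiny parquet fixture, load it through the canonical "parquet" loader,
# and check the cell text survives exactly, including the trailing tab.
import pyarrow as pa
import pyarrow.parquet as pq
from datasets import load_dataset


def test_parquet_loader_roundtrip(tmp_path):
    table = pa.table({"id": [1, 2], "text": ["日本\t", "3.14159"]})
    path = tmp_path / "part-0.parquet"
    pq.write_table(table, str(path))
    ds = load_dataset("parquet", data_files={"train": [str(path)]}, split="train")
    assert [row["text"] for row in ds] == ["日本\t", "3.14159"]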
with fs.open(path, "rb") as f:
    raw = f.read()
try:
    text = raw.decode("utf-8")
except UnicodeDecodeError:
    text = raw.decode("utf-8", errors="replace")
    logger.warning("Non-UTF-8 bytes in %s; replaced with U+FFFD", path)
return text.splitlines(keepends=True)
Silently violates the documented byte-preservation contract. The module docstring promises "byte-identical" text and serialize_csv_document claims byte-for-byte fidelity, but on a non-UTF-8 input file this catches UnicodeDecodeError and writes U+FFFD-replaced text into the staged JSONL. The pipeline exists to measure bits-per-byte on source bytes — a U+FFFD substitution becomes spurious PPL signal that nobody downstream will know to disregard.
Also conflicts with the root AGENTS.md error-handling rule "NEVER swallow exceptions unless specifically requested".
Either fail loudly on non-UTF-8 (preferred for an integrity-sensitive eval pipeline) or document explicitly that non-UTF-8 sources are excluded. Same concern applies to the broad except Exception: swallowing per-file read errors at L255-L259.
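A minimal sketch of the fail-loudly option; the function name is illustrative and fs is any fsspec filesystem:

# Strict UTF-8 decode: surface bad bytes as an error instead of silently substituting
# U+FFFD into text that will later be scored in bits-per-byte.
def read_utf8_lines(fs, path: str) -> list[str]:
    with fs.open(path, "rb") as f:
        raw = f.read()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as e:
        raise ValueError(f"Non-UTF-8 bytes in {path}; refusing to stage lossy text") from e
    return text.splitlines(keepends=True)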
if cfg.preserve_header:
    header = _first_non_blank_line(lines)
    # Drop the header from the body — it gets reintroduced by serialize_csv_document
    body_lines = [line for line in lines if line != header] if header else lines
else:
    header = None
    body_lines = lines
Header dedup drops every line equal to the header, not just the first occurrence. If a data row happens to byte-match the first non-blank line (common cases: short rows like a single empty value, files where the "header" is in fact data, or concatenated CSVs with repeated headers — but also any legitimate row whose value equals the header text), it silently disappears. That is a correctness hole for a module whose contract is byte preservation.
Suggest stripping the header by index instead, e.g.:
if cfg.preserve_header:
    header = _first_non_blank_line(lines)
    if header is not None:
        header_idx = lines.index(header)
        body_lines = lines[:header_idx] + lines[header_idx + 1 :]
    else:
        body_lines = lines
else:
    header = None
    body_lines = lines
# Empty cell must be preserved as an empty field between tab delimiters.
assert "日本\t\n" in text or text.endswith("日本\t") or "日本\t" in text
Tautological assertion — the third disjunct "日本\t" in text is a strict superset of the other two, so the whole expression collapses to "日本\t" in text. That weaker check passes whenever 日本 is followed by any tab anywhere in the document, including in a non-empty trailing cell, so the empty-cell-survival property the comment claims to test isn't actually verified.
Suggest pinning the expected serialization more precisely, e.g.:
# Empty cell must be preserved as an empty field between tab delimiters.
assert text.endswith("日本\t") or "日本\t\n" in text
Add byte-preserving tabular PPL helpers and opt-in table-record slices for early structured-data coverage, keeping numeric literals, delimiters, empty cells, and unicode intact for gap analysis. This is a first wave for tabular data only; time-series, geospatial, and broader conditional-likelihood follow-ons stay in #5059.
Part of #5005