Conversation
Streams seven biological/chemical notation source families (NCBI RefSeq viral FASTA+GFF, RNAcentral, UniProt/Swiss-Prot FASTA+DAT, PubChem CID-SMILES+SDF, RCSB PDB mmCIF, ChEMBL chemreps+SDF, MoleculeNet ESOL + ClinTox SMILES) directly to plain-text-in-parquet without materialising the full upstream releases.

Format-preserving record splitters for FASTA, GFF, SMILES, SDF, mmCIF, and UniProt-DAT live in marin.transform.bio_chem.splitters and round-trip their input bytes verbatim (preserving sequence bodies, headers, tab columns, $$$$ terminators, loop_ structure, etc.). Short records are packed into ~8 KB documents so the pilot can probe in-context modelling of repeated record formats. Sampling caps (5000 records or 64 MiB per source by default) keep the slices small enough to stream from upstream mirrors on each run.

Each source family lives in its own module under marin.datakit.download.bio_chem. The new experiments/bio_chem_notation.py exposes bio_chem_raw_validation_sets() and bio_chem_tokenized() with the same shape as paloma_raw_validation_sets / uncheatable_eval_*, and exp_model_perplexity_gap_marin_vs_llama.py now feeds those slices into the existing marin-8b-vs-llama-3.1-8b gap report.

Closes part 1+2 of #5058.
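A minimal sketch of the round-trip contract, using the FASTA case; the real splitters in marin.transform.bio_chem.splitters cover more formats and edge cases, and this standalone version assumes only that records begin at lines starting with ">":

def iter_fasta_records(stream):
    """Yield each FASTA record (header line + sequence body) byte-for-byte."""
    record = []
    for line in stream:
        if line.startswith(b">") and record:
            yield b"".join(record)
            record = []
        record.append(line)
    if record:
        yield b"".join(record)

# The property the splitters are tested against: concatenating the yielded
# records reproduces the input bytes verbatim.
data = b">seq1\nACGT\nACGT\n>seq2\nTTGA\n"
assert b"".join(iter_fasta_records(data.splitlines(keepends=True))) == data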
Claude finished @dlwh's task in 7m 0s.
Code review: checked PR #5127 for bugs and CLAUDE.md/AGENTS.md compliance.
Result: No correctness bugs found. The format-preserving splitters, cap logic, and streaming runtime all check out — tests cover byte-for-byte round-tripping of each splitter. Posted 3 inline comments for AGENTS.md violations:
Non-blocking observations (not posted inline, listed for context):
__all__ = [
    "NotationFormat",
    "NotationSliceSpec",
    "PackingConfig",
    "bio_chem_slice_step",
    "run_notation_slice",
]
__all__ in __init__.py is explicitly called out as an LLM-generated code pitfall to eliminate — see the root AGENTS.md "LLM-Generated Code Pitfalls" section:
    Verbose/redundant docstrings and __all__ in __init__.py
The re-exports above (lines 16–22) already determine the public surface; __all__ is redundant here. Drop the block.
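For illustration, a hypothetical __init__.py of this shape (the submodule names here are invented; the hunk above shows only the __all__ block):

# Re-exports alone define the public surface; the __all__ block that
# followed them added nothing.
from .spec import NotationFormat, NotationSliceSpec, PackingConfig  # assumed submodule
from .runner import bio_chem_slice_step, run_notation_slice  # assumed submodule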
__all__ = [
    "iter_fasta_records",
    "iter_gff_blocks",
    "iter_mmcif_blocks",
    "iter_sdf_records",
    "iter_smiles_records",
    "pack_records_into_docs",
    "take_until_cap",
]
Same issue as in datakit/download/bio_chem/__init__.py: __all__ in __init__.py is flagged by the root AGENTS.md "LLM-Generated Code Pitfalls":
    Verbose/redundant docstrings and __all__ in __init__.py
Re-exports on lines 12–20 already define the public API. Remove __all__.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d7fb890b40
for record in records:
    if seen_records >= cap.max_records or seen_bytes >= cap.max_bytes:
        return
Check caps before consuming the next record
take_until_cap checks seen_records/seen_bytes only after the for loop has already pulled the next item from records, so once a cap is reached it still reads and parses one extra upstream record before returning. For streamed remote sources, that extra SDF/mmCIF record can be large, which breaks the intended sampling/transfer bound and can add avoidable network and runtime cost.
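A minimal sketch of the suggested reordering, assuming take_until_cap is a generator over byte-string records and that cap carries the max_records/max_bytes fields visible in the hunk above; the real helper's signature may differ:

def take_until_cap(records, cap):
    """Stop *before* pulling another record once either cap is reached."""
    seen_records = 0
    seen_bytes = 0
    it = iter(records)
    while seen_records < cap.max_records and seen_bytes < cap.max_bytes:
        try:
            record = next(it)  # consumed only while both caps still allow it
        except StopIteration:
            return
        yield record
        seen_records += 1
        seen_bytes += len(record)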
for spec in cfg.slices:
    logger.info("Streaming bio/chem slice %s from %d url(s)", spec.name, len(spec.urls))
    summaries.append(run_notation_slice(spec, str(cfg.output_path)))
Honor configured slice parallelism during execution
The public parallelism knob is accepted and persisted in BioChemSliceConfig, but run_bio_chem_slices always processes slices sequentially and never uses cfg.parallelism. This means callers cannot actually tune concurrency, and changing parallelism can invalidate the step hash without changing runtime behavior.
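One way to honor the knob, sketched with a thread pool (this assumes run_notation_slice is safe to call concurrently and that cfg carries the fields shown in the hunk above; the project may prefer a different concurrency primitive):

from concurrent.futures import ThreadPoolExecutor

def run_bio_chem_slices(cfg):
    # Fan slices out across cfg.parallelism workers instead of looping sequentially.
    with ThreadPoolExecutor(max_workers=cfg.parallelism) as pool:
        futures = [
            pool.submit(run_notation_slice, spec, str(cfg.output_path))
            for spec in cfg.slices
        ]
        return [future.result() for future in futures]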
from experiments.defaults import default_tokenize


out: dict[str, ExecutorStep[TokenizeConfig]] = {}
for slice_ in slices:
    key = _slice_key(slice_)
    out[key] = default_tokenize(
        name=key,
        dataset=slice_.step.cd(_slice_glob(slice_)),
        tokenizer=tokenizer,
        is_validation=True,
    )
return out


def bio_chem_raw_validation_sets(
    slices: tuple[BioChemSlice, ...] = BIO_CHEM_SLICES,
):
    """Wire bio/chem slices into the perplexity-gap raw-text dataset registry."""
    from marin.evaluation.perplexity_gap import raw_text_dataset
Both default_tokenize and raw_text_dataset are loaded via function-local imports:
bio_chem_notation.py#L93-L98: from experiments.defaults import default_tokenize
bio_chem_notation.py#L111-L117: from marin.evaluation.perplexity_gap import raw_text_dataset
Root AGENTS.md — "Code Style":
    All imports at the top of the file. No local imports except to break circular dependencies or guard optional deps. No TYPE_CHECKING guards — fix cycles structurally via protocols.
Neither of these is an optional dependency, and neither target imports back from experiments.bio_chem_notation (I checked — no cycle exists). Hoist both to the module-level import block at the top of the file.
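The hoist itself is mechanical; a sketch of the resulting top-of-file import block (other imports omitted):

# experiments/bio_chem_notation.py, module-level imports
from experiments.defaults import default_tokenize
from marin.evaluation.perplexity_gap import raw_text_dataset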
Add deterministic bio and chemistry notation PPL slices with format-preserving splitters and streamed source-family downloaders so FASTA, GFF, SMILES, SDF, mmCIF, and UniProt-DAT surface forms survive into raw eval text. Keeps the work scoped to small sampled slices and a focused gap-run hook.
Part of #5005