
[evals] Add bio/chem notation PPL slices #5127

Open
dlwh wants to merge 1 commit into main from claude/issue-5058-20260422-2233

Conversation

@dlwh
Member

@dlwh dlwh commented Apr 23, 2026

Adds deterministic bio and chemistry notation PPL slices, with format-preserving splitters and streamed source-family downloaders so that FASTA, GFF, SMILES, SDF, mmCIF, and UniProt-DAT surface forms survive into the raw eval text. The work stays scoped to small sampled slices and a focused gap-run hook.

Part of #5005

Streams seven biological/chemical notation source families (NCBI RefSeq
viral FASTA+GFF, RNAcentral, UniProt/Swiss-Prot FASTA+DAT, PubChem
CID-SMILES+SDF, RCSB PDB mmCIF, ChEMBL chemreps+SDF, MoleculeNet ESOL+
ClinTox SMILES) directly to plain-text-in-parquet without materialising
the full upstream releases.

Format-preserving record splitters for FASTA, GFF, SMILES, SDF, mmCIF,
and UniProt-DAT live in marin.transform.bio_chem.splitters and round-trip
their input bytes verbatim (preserved sequence bodies, headers, tab
columns, $$$$ terminators, loop_ structure, etc.). Short records are
packed into ~8 KB documents so the pilot can probe in-context modelling
of repeated record formats. Sampling caps (5000 records or 64 MiB per
source by default) keep the slices small enough to stream from upstream
mirrors on each run.
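The round-trip and packing contract described above can be sketched like this; both functions are illustrative re-implementations under assumed signatures, not the code in marin.transform.bio_chem.splitters:

```python
def iter_fasta_records(lines):
    """Split FASTA text into records while preserving the input verbatim.

    Each yielded record is the ``>`` header line plus its sequence body,
    exactly as read, so ``"".join(records)`` round-trips the input.
    """
    record = []
    for line in lines:
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        record.append(line)
    if record:
        yield "".join(record)


def pack_records_into_docs(records, target_bytes=8 * 1024):
    """Greedily pack short records into documents of roughly target_bytes."""
    doc, size = [], 0
    for rec in records:
        if doc and size + len(rec) > target_bytes:
            yield "".join(doc)
            doc, size = [], 0
        doc.append(rec)
        size += len(rec)
    if doc:
        yield "".join(doc)
```

Because each record keeps its bytes verbatim, packing several of them into one document preserves the repeated surface form the eval is meant to probe.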

Each source family lives in its own module under
marin.datakit.download.bio_chem. The new experiments/bio_chem_notation.py
exposes bio_chem_raw_validation_sets() and bio_chem_tokenized() with the
same shape as paloma_raw_validation_sets / uncheatable_eval_*, and
exp_model_perplexity_gap_marin_vs_llama.py now feeds those slices into
the existing marin-8b-vs-llama-3.1-8b gap report.

Closes part 1+2 of #5058.
@dlwh added the agent-generated (Created by automation/agent) label on Apr 23, 2026
@claude
Contributor

claude Bot commented Apr 23, 2026

Claude finished @dlwh's task in 7m 0s.


Code review

Checked PR #5127 for bugs and CLAUDE.md/AGENTS.md compliance.

  • Check PR review eligibility
  • Collect CLAUDE.md/AGENTS.md files for modified paths
  • Summarize diff
  • Run 4 parallel review agents (2 compliance + 2 bug)
  • Validate flagged issues
  • Post inline comments for confirmed issues

Result: No correctness bugs found. The format-preserving splitters, cap logic, and streaming runtime all check out — tests cover byte-for-byte round-tripping of each splitter. Posted 3 inline comments for AGENTS.md violations:

  1. __all__ in lib/marin/src/marin/datakit/download/bio_chem/__init__.py — flagged by root AGENTS.md ("LLM-Generated Code Pitfalls").
  2. __all__ in lib/marin/src/marin/transform/bio_chem/__init__.py — same rule.
  3. Function-local imports in experiments/bio_chem_notation.py (default_tokenize, raw_text_dataset) — violates "All imports at the top of the file. No local imports except to break circular dependencies or guard optional deps." Neither target creates a cycle.

Non-blocking observations (not posted inline, listed for context):

  • BioChemSliceConfig.parallelism is accepted/stored but never consumed by run_bio_chem_slices — appears to be a dead field.
  • iter_uniprot_dat_records is the only format splitter defined outside marin.transform.bio_chem.splitters; the lazy import in _runtime.py cites a circular import that would disappear if the function were moved to the splitters module alongside its siblings.
    · branch: claude/issue-5058-20260422-2233

Comment on lines +24 to +30
__all__ = [
    "NotationFormat",
    "NotationSliceSpec",
    "PackingConfig",
    "bio_chem_slice_step",
    "run_notation_slice",
]

__all__ in __init__.py is explicitly called out as an LLM-generated code pitfall to eliminate — see the root AGENTS.md "LLM-Generated Code Pitfalls" section:

Verbose/redundant docstrings and __all__ in __init__.py

The re-exports above (lines 16–22) already determine the public surface; __all__ is redundant here. Drop the block.

Comment on lines +22 to +30
__all__ = [
    "iter_fasta_records",
    "iter_gff_blocks",
    "iter_mmcif_blocks",
    "iter_sdf_records",
    "iter_smiles_records",
    "pack_records_into_docs",
    "take_until_cap",
]

Same issue as in datakit/download/bio_chem/__init__.py: __all__ in __init__.py is flagged by the root AGENTS.md "LLM-Generated Code Pitfalls":

Verbose/redundant docstrings and __all__ in __init__.py

Re-exports on lines 12–20 already define the public API. Remove __all__.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d7fb890b40


Comment on lines +210 to +212
for record in records:
    if seen_records >= cap.max_records or seen_bytes >= cap.max_bytes:
        return

P1 Badge Check caps before consuming the next record

take_until_cap checks seen_records/seen_bytes only after the for loop has already pulled the next item from records, so once a cap is reached it still reads and parses one extra upstream record before returning. For streamed remote sources, that extra SDF/mmCIF record can be large, which breaks the intended sampling/transfer bound and can add avoidable network and runtime cost.

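One way to address this, as a hedged sketch (assuming `take_until_cap` receives an iterator of string records and counts bytes via `len`, which may differ from the real signature): check the caps before calling `next()`, so once a cap is hit no further upstream record is pulled or parsed.

```python
def take_until_cap(records, max_records, max_bytes):
    """Yield records until either cap is reached.

    Caps are checked *before* advancing the iterator, so hitting a cap
    never triggers an extra fetch from the upstream stream.
    """
    it = iter(records)
    seen_records = 0
    seen_bytes = 0
    while seen_records < max_records and seen_bytes < max_bytes:
        try:
            record = next(it)
        except StopIteration:
            return
        yield record
        seen_records += 1
        seen_bytes += len(record)
```

The byte cap can still be overshot by the final record, matching the original check-at-top semantics, but the stream is never advanced past the cap.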

Comment on lines +194 to +196
for spec in cfg.slices:
    logger.info("Streaming bio/chem slice %s from %d url(s)", spec.name, len(spec.urls))
    summaries.append(run_notation_slice(spec, str(cfg.output_path)))

P2 Badge Honor configured slice parallelism during execution

The public parallelism knob is accepted and persisted in BioChemSliceConfig, but run_bio_chem_slices always processes slices sequentially and never uses cfg.parallelism. This means callers cannot actually tune concurrency, and changing parallelism can invalidate the step hash without changing runtime behavior.

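A minimal sketch of honouring the knob (the helper `run_slices` and its signature are hypothetical, not the actual `run_bio_chem_slices`): dispatch slices to a bounded thread pool when `parallelism` is greater than 1, and fall back to the current sequential loop otherwise.

```python
from concurrent.futures import ThreadPoolExecutor


def run_slices(slices, process_one, parallelism=1):
    """Process each slice, honouring the configured parallelism.

    ``process_one`` handles a single slice; when ``parallelism`` > 1
    the slices go to a bounded thread pool, and ``Executor.map`` keeps
    the results in input order.
    """
    if parallelism <= 1:
        return [process_one(s) for s in slices]
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(process_one, slices))
```

Threads are a reasonable fit here because each slice is dominated by network I/O; the knob then changes real runtime behaviour instead of only the step hash.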

Comment on lines +97 to +115
    from experiments.defaults import default_tokenize

    out: dict[str, ExecutorStep[TokenizeConfig]] = {}
    for slice_ in slices:
        key = _slice_key(slice_)
        out[key] = default_tokenize(
            name=key,
            dataset=slice_.step.cd(_slice_glob(slice_)),
            tokenizer=tokenizer,
            is_validation=True,
        )
    return out


def bio_chem_raw_validation_sets(
    slices: tuple[BioChemSlice, ...] = BIO_CHEM_SLICES,
):
    """Wire bio/chem slices into the perplexity-gap raw-text dataset registry."""
    from marin.evaluation.perplexity_gap import raw_text_dataset

Both default_tokenize and raw_text_dataset are loaded via function-local imports:

Root AGENTS.md — "Code Style":

All imports at the top of the file. No local imports except to break circular dependencies or guard optional deps. No TYPE_CHECKING guards — fix cycles structurally via protocols.

Neither of these is an optional dependency, and neither target imports back from experiments.bio_chem_notation (I checked — no cycle exists). Hoist both to the module-level import block at the top of the file.

