Conversation
Streams seven biological/chemical notation source families (NCBI RefSeq viral FASTA+GFF, RNAcentral, UniProt/Swiss-Prot FASTA+DAT, PubChem CID-SMILES+SDF, RCSB PDB mmCIF, ChEMBL chemreps+SDF, MoleculeNet ESOL + ClinTox SMILES) directly to plain-text-in-parquet without materialising the full upstream releases.

Format-preserving record splitters for FASTA, GFF, SMILES, SDF, mmCIF, and UniProt-DAT live in marin.transform.bio_chem.splitters and round-trip their input bytes verbatim (preserving sequence bodies, headers, tab columns, $$$$ terminators, loop_ structure, etc.). Short records are packed into ~8 KB documents so the pilot can probe in-context modelling of repeated record formats. Sampling caps (5000 records or 64 MiB per source by default) keep the slices small enough to stream from upstream mirrors on each run.

Each source family lives in its own module under marin.datakit.download.bio_chem. The new experiments/bio_chem_notation.py exposes bio_chem_raw_validation_sets() and bio_chem_tokenized() with the same shape as paloma_raw_validation_sets / uncheatable_eval_*, and exp_model_perplexity_gap_marin_vs_llama.py now feeds those slices into the existing marin-8b-vs-llama-3.1-8b gap report.

Closes part 1+2 of #5058.
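A minimal sketch of the round-trip contract, using the FASTA case; the real splitters in marin.transform.bio_chem.splitters cover more formats and edge cases, and this standalone version assumes only that records begin at lines starting with ">":

def iter_fasta_records(stream):
    """Yield each FASTA record (header line + sequence body) byte-for-byte."""
    record = []
    for line in stream:
        if line.startswith(b">") and record:
            yield b"".join(record)
            record = []
        record.append(line)
    if record:
        yield b"".join(record)

# The property the splitters are tested against: concatenating the yielded
# records reproduces the input bytes verbatim.
data = b">seq1\nACGT\nACGT\n>seq2\nTTGA\n"
assert b"".join(iter_fasta_records(data.splitlines(keepends=True))) == data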
Claude finished @dlwh's task in 7m 0s.
Code review: checked PR #5127 for bugs and CLAUDE.md/AGENTS.md compliance.
Result: No correctness bugs found. The format-preserving splitters, cap logic, and streaming runtime all check out — tests cover byte-for-byte round-tripping of each splitter. Posted 3 inline comments for AGENTS.md violations:
Non-blocking observations (not posted inline, listed for context):
__all__ = [
    "NotationFormat",
    "NotationSliceSpec",
    "PackingConfig",
    "bio_chem_slice_step",
    "run_notation_slice",
]
__all__ in __init__.py is explicitly called out as an LLM-generated code pitfall to eliminate — see the root AGENTS.md "LLM-Generated Code Pitfalls" section:
    Verbose/redundant docstrings and __all__ in __init__.py
The re-exports above (lines 16–22) already determine the public surface; __all__ is redundant here. Drop the block.
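For illustration, a hypothetical __init__.py of this shape (the submodule names here are invented; the hunk above shows only the __all__ block):

# Re-exports alone define the public surface; the __all__ block that
# followed them added nothing.
from .spec import NotationFormat, NotationSliceSpec, PackingConfig  # assumed submodule
from .runner import bio_chem_slice_step, run_notation_slice  # assumed submodule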
__all__ = [
    "iter_fasta_records",
    "iter_gff_blocks",
    "iter_mmcif_blocks",
    "iter_sdf_records",
    "iter_smiles_records",
    "pack_records_into_docs",
    "take_until_cap",
]
Same issue as in datakit/download/bio_chem/__init__.py: __all__ in __init__.py is flagged by the root AGENTS.md "LLM-Generated Code Pitfalls":
    Verbose/redundant docstrings and __all__ in __init__.py
Re-exports on lines 12–20 already define the public API. Remove __all__.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d7fb890b40
for record in records:
    if seen_records >= cap.max_records or seen_bytes >= cap.max_bytes:
        return
Check caps before consuming the next record
take_until_cap checks seen_records/seen_bytes only after the for loop has already pulled the next item from records, so once a cap is reached it still reads and parses one extra upstream record before returning. For streamed remote sources, that extra SDF/mmCIF record can be large, which breaks the intended sampling/transfer bound and can add avoidable network and runtime cost.
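A minimal sketch of the suggested reordering, assuming take_until_cap is a generator over byte-string records and that cap carries the max_records/max_bytes fields visible in the hunk above; the real helper's signature may differ:

def take_until_cap(records, cap):
    """Stop *before* pulling another record once either cap is reached."""
    seen_records = 0
    seen_bytes = 0
    it = iter(records)
    while seen_records < cap.max_records and seen_bytes < cap.max_bytes:
        try:
            record = next(it)  # consumed only while both caps still allow it
        except StopIteration:
            return
        yield record
        seen_records += 1
        seen_bytes += len(record)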
for spec in cfg.slices:
    logger.info("Streaming bio/chem slice %s from %d url(s)", spec.name, len(spec.urls))
    summaries.append(run_notation_slice(spec, str(cfg.output_path)))
Honor configured slice parallelism during execution
The public parallelism knob is accepted and persisted in BioChemSliceConfig, but run_bio_chem_slices always processes slices sequentially and never uses cfg.parallelism. This means callers cannot actually tune concurrency, and changing parallelism can invalidate the step hash without changing runtime behavior.
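One way to honor the knob, sketched with a thread pool (this assumes run_notation_slice is safe to call concurrently and that cfg carries the fields shown in the hunk above; the project may prefer a different concurrency primitive):

from concurrent.futures import ThreadPoolExecutor

def run_bio_chem_slices(cfg):
    # Fan slices out across cfg.parallelism workers instead of looping sequentially.
    with ThreadPoolExecutor(max_workers=cfg.parallelism) as pool:
        futures = [
            pool.submit(run_notation_slice, spec, str(cfg.output_path))
            for spec in cfg.slices
        ]
        return [future.result() for future in futures]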
from experiments.defaults import default_tokenize


out: dict[str, ExecutorStep[TokenizeConfig]] = {}
for slice_ in slices:
    key = _slice_key(slice_)
    out[key] = default_tokenize(
        name=key,
        dataset=slice_.step.cd(_slice_glob(slice_)),
        tokenizer=tokenizer,
        is_validation=True,
    )
return out


def bio_chem_raw_validation_sets(
    slices: tuple[BioChemSlice, ...] = BIO_CHEM_SLICES,
):
    """Wire bio/chem slices into the perplexity-gap raw-text dataset registry."""
    from marin.evaluation.perplexity_gap import raw_text_dataset
Both default_tokenize and raw_text_dataset are loaded via function-local imports:
bio_chem_notation.py#L93-L98: from experiments.defaults import default_tokenize
bio_chem_notation.py#L111-L117: from marin.evaluation.perplexity_gap import raw_text_dataset
Root AGENTS.md — "Code Style":
    All imports at the top of the file. No local imports except to break circular dependencies or guard optional deps. No TYPE_CHECKING guards — fix cycles structurally via protocols.
Neither of these is an optional dependency, and neither target imports back from experiments.bio_chem_notation (I checked — no cycle exists). Hoist both to the module-level import block at the top of the file.
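The hoist itself is mechanical; a sketch of the resulting top-of-file import block (other imports omitted):

# experiments/bio_chem_notation.py, module-level imports
from experiments.defaults import default_tokenize
from marin.evaluation.perplexity_gap import raw_text_dataset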
Add deterministic bio and chemistry notation PPL slices with format-preserving splitters and streamed source-family downloaders so FASTA, GFF, SMILES, SDF, mmCIF, and UniProt-DAT surface forms survive into raw eval text. Keeps the work scoped to small sampled slices and a focused gap-run hook.
Part of #5005