The purpose of this experiment is to run a parameter scaling sweep with all other settings optimized over a single epoch. This will use Complete(d) [1] heuristics for transfer, as adapted for Adam with Hyperball [2, 3, 4] normalization in marin#3292. Extensions for epoching may follow, but let's see how this goes first.
Implementation
Initial implementation plans:
Summary
- Data: Animals, single bp tokenizer (tokenizer-char-bos), 255 bp context. Union of region ∈ {CDS (242M), upstream (68M), downstream (20M)}. Lowercase = repeats in training but = non-functional (phyloP) in validation — these differ.
- Step 1 — Reference sweep: Adapt text reference sweep (marin#2432) for DNA. Sizing: 130M text params → 2.028×10¹⁸ FLOPs; at ~100:1 DNA token-to-param ratio (IsoFLOP) → N=60M, D=6B. Sweep initializer_range ∈ {.04, .02, .01, .005, .0025} over the base grid (default 0.02 not set by sweep) to guard against overfitting from higher sequence similarity vs text.
- Step 2 — Transfer validation: At single-epoch scale, sweep LR, beta1, beta2 in isolation at the largest model size to confirm loss basin alignment.
- Step 3 — Parameter scaling sweep: Model sizing and configs from _build_model_configs in completed_adamh.py.
- Online metrics: Unweighted CE loss nats/BPB stratified by region (marin#2310), VEP (marin#3144, marin#3333), LL(functional) - LL(non-functional) (bolinas#8)
- Offline metrics (final checkpoint, largest scale): VEP by variant type, VEP vs LL(functional) - LL(non-functional) and validation loss
- Code: Marin branch eac/dna-bolinas-scaling-sweep, module experiments/dna/exp<issue_num>_bolinas_scaling_sweep.py with subcommands run_{reference_tuning,transfer_validation,parameter_scaling}_sweep. Analysis in bolinas-dna scripts/exp<issue_num>_scaling_sweep/ (collects from wandb only).
Details
Agent Instructions
Data
Animals / single bp tokenization: union of region ∈ {upstream, downstream, CDS}. Context: 255 bp (256−1 for BOS).
- Training: CDS (242,334,716) | Upstream (68,286,166) | Downstream (20,501,856) = 331,122,738 total (~84.8B tokens) (counts)
- Validation (16,384 each): CDS | Upstream | Downstream
- IMPORTANT: Lowercase = repeats in training, but = non-functional (non-conserved per phyloP) in validation. These are NOT the same.
- Mixture weights (proportional to examples, equivalent to concatenation): CDS=0.7319, upstream=0.2062, downstream=0.0619
- Tokenizer: tokenizer-char-bos, vocab_size=7 (PAD, UNK, BOS, a, c, g, t). Usage in exp94_human_enhancers.py.
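The per-region counts above determine the mixture weights; a minimal sketch of the arithmetic (counts copied from this section):

```python
# Mixture weights proportional to example counts, which is equivalent to
# concatenating the three region datasets (counts copied from above).
counts = {
    "cds": 242_334_716,
    "upstream": 68_286_166,
    "downstream": 20_501_856,
}
total = sum(counts.values())  # 331,122,738 examples
weights = {region: n / total for region, n in counts.items()}
# weights ≈ {"cds": 0.7319, "upstream": 0.2062, "downstream": 0.0619}

# At 256 tokens per example (255 bp + BOS), this is ~84.8B training tokens.
tokens = total * 256
```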
Metrics
- Online: unweighted CE loss nats / BPB (cf. marin#2310), stratified by region (inferred from dataset source or added as explicit field)
- Online: VEP (marin#3144, marin#3333)
- Online: LL(functional), LL(non-functional), LL(functional) - LL(non-functional) (bolinas#8)
- Offline (final checkpoint at largest scale): VEP by variant type
- Offline: VEP vs LL(functional) - LL(non-functional) and validation loss
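For reference, BPB here is just a rescaling of the CE loss; a hedged sketch of the conversion, assuming the single-bp tokenizer maps one token to one base (one byte), so tokens_per_byte = 1:

```python
import math

def nats_to_bpb(ce_nats: float, tokens_per_byte: float = 1.0) -> float:
    """Convert cross-entropy in nats/token to bits per byte.

    With the single-bp tokenizer one token covers one base (one byte of
    sequence), so tokens_per_byte = 1 and BPB = CE / ln(2).
    """
    return ce_nats * tokens_per_byte / math.log(2)

# Sanity check: a uniform model over {a, c, g, t} has CE = ln(4) nats/token,
# i.e. exactly 2 bits per base.
```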
Step 1: Reference sweep
Run per marin#2432, adapted for DNA. Follow reference_hyperparameter_sweep.py for sweep structure.
- Sizing: 130M text params → C = 6ND = 2.028×10¹⁸ FLOPs (@20:1). At ~100:1 token-to-param ratio for DNA (IsoFLOP analysis) → N=60M, D=6B
- Sweep initializer_range ∈ {.04, .02, .01, .005, .0025} via Qwen3Config.initializer_range (inherited from LlamaConfig.initializer_range, default 0.02)
- Set per study via dataclasses.replace(base_model_config, initializer_range=value)
- Guard against overfitting given greater sequence similarity vs text (as N→∞ in single epoch)
- Architecture: Qwen3Config (not Grug) via CompletedAdamHHeuristic._build_model_config with seq_len=256, vocab_size=7
- Training: run_levanter_train_lm called directly inside remote(run_vizier_train) (not default_train ExecutorSteps — hparams not known at DAG construction time). Build TrainLmOnPodConfig from Vizier suggestion.
- Optimizer: AdamHConfig built from Vizier suggestion, same as reference sweep's _build_adamh_config
- Group: dna-bolinas-reference-sweep-{VERSION}
- Run name: dna-bolinas-reference-{VERSION}-IR{initializer_range}-E{epochs}-L{loop}-T{trial}
- Tags: sweep, dna, bolinas, reference, version, epochs, initializer_range, lr, beta1, adam_lr, beta2, epsilon, max_grad_norm, z_loss_weight, batch_size, loop, trial
# DAG construction (at __main__ time, not runtime)
for epochs in EPOCHS: # 1 for now
for init_range in INITIALIZER_RANGES: # 5 independent Vizier studies
model = replace(base_model, initializer_range=init_range)
for loop in range(num_loops): # sequential (DB dependency)
suggest ← previous_update / vizier.db
train × N ← suggest / suggestions.json # parallel
update ← [train_0..N] + suggest / vizier.db
optimal ← final_update / vizier.db
executor_main(steps=all_optimal_steps)
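The sizing bullet above can be checked with a few lines; this is just the C = 6ND arithmetic, with the 100:1 ratio taken as given from the IsoFLOP analysis:

```python
# Fix the compute budget from the 130M-param text reference at 20:1
# tokens-per-param, then re-solve C = 6*N*D for DNA at ~100:1.
N_text = 130e6
C = 6 * N_text * (20 * N_text)      # 2.028e18 FLOPs

ratio = 100                          # DNA tokens-per-param (IsoFLOP)
N_dna = (C / (6 * ratio)) ** 0.5     # ~58M params, rounded to N=60M
D_dna = ratio * N_dna                # ~5.8B tokens, rounded to D=6B
```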
Step 2: Transfer validation
At single-epoch scale, sweep key hypers (LR, beta1, beta2) in isolation to test loss basin alignment. Use largest model size from the parameter scaling sweep (derive from same code).
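The one-at-a-time structure of this step might look like the following sketch; OptimizerHparams and the grid values are placeholders, not the real AdamHConfig or the tuned optimum:

```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class OptimizerHparams:
    # Placeholder stand-in for the real optimizer config; values are not
    # the actual tuned optimum.
    lr: float = 3e-3
    beta1: float = 0.9
    beta2: float = 0.95

def isolated_sweep(base, grids):
    """Yield one config per (hparam, value), holding all other fields at base."""
    for name, values in grids.items():
        for value in values:
            yield name, dataclasses.replace(base, **{name: value})

configs = list(isolated_sweep(
    OptimizerHparams(),
    {"lr": [1e-3, 3e-3, 1e-2], "beta1": [0.8, 0.9, 0.95], "beta2": [0.9, 0.95, 0.99]},
))
# 9 runs: 3 per hyperparameter, each differing from base in at most one field
```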
Step 3: Parameter scaling sweep
Follow _build_model_configs in completed_adamh.py for model sizing and configs.
Code
Marin (~/repos/crfm/marin, branch eac/dna-bolinas-scaling-sweep off dna). Pending marin#4247; use eac/dna-rebase until merged.
- Module: experiments/dna/exp4251_bolinas_scaling_sweep.py
- Subcommands via if __name__ == "__main__" switch: run_{smoke_test,reference_tuning,transfer_validation,parameter_scaling}_sweep
- Config generation must be shared between reference sweep and param sweep
Bolinas (~/repos/oa/bolinas-dna), base scripts/exp109_scaling_sweep/:
reference_sweep.py — progress by iteration across initializer_range and epochs
transfer_validation.py — loss basin alignment vs Δhparam
parameter_scaling.py — metrics vs model scale
scaling_analysis.py — scaling law fits
IMPORTANT: ALL changes go to one of the modules above by default; ask first otherwise.
Logging
VERSION = "v1.0" — module-level constant, manually bumped on restart
- Step tag: reference | transfer | scaling
- Run names: dna-bolinas-{step}-{VERSION}-... with step-specific suffixes. Output path = checkpoints/{run_name}.
- Wandb group per step+version. Run name is a strict subset of tags.
- epochs is hardcoded to 1 for now but must appear in run names, tags, and all analyses as a first-class dimension
- IMPORTANT: Analysis code in Bolinas collects from wandb only, not Marin source code
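A minimal sketch of the naming convention above (the helper name reference_run_name is hypothetical; the format string is copied from Step 1):

```python
VERSION = "v1.0"  # module-level constant, bumped manually on restart

def reference_run_name(initializer_range: float, epochs: int, loop: int, trial: int) -> str:
    # Format copied from the Step 1 "Run name" bullet; each field embedded
    # here must also appear as a wandb tag (name fields are a strict subset
    # of the tag set).
    return f"dna-bolinas-reference-{VERSION}-IR{initializer_range}-E{epochs}-L{loop}-T{trial}"

name = reference_run_name(0.02, 1, 0, 3)
# -> "dna-bolinas-reference-v1.0-IR0.02-E1-L0-T3"
output_path = f"checkpoints/{name}"
```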
Execution
Setup: guidelines-internal.md (GCP auth, Ray token, dashboard). Ensure WANDB_API_KEY and HUGGING_FACE_HUB_TOKEN are set.
PROJECT_ID=hai-gcp-models
BUCKET=gs://marin-dna-us-central1
REGION=us-central1
Smoke test
Runs ~20 steps using the same data, tokenizer, model config, and VEP eval as the sweep. Idempotent (executor skips if output exists).
uv run lib/marin/src/marin/run/ray_run.py \
--env_vars WANDB_API_KEY ${WANDB_API_KEY} \
--env_vars HUGGING_FACE_HUB_TOKEN ${HUGGING_FACE_HUB_TOKEN} \
-- python experiments/dna/exp4251_bolinas_scaling_sweep.py \
run_smoke_test \
--prefix $BUCKET
Confirm: training completes, VEP eval runs, tracker_metrics.jsonl written. Note the exact validation loss metric key — this becomes the Vizier optimization target.
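To pull out that metric key, something like the sketch below could work, assuming tracker_metrics.jsonl holds one JSON object of logged metrics per line (the key filter is a guess; verify against the actual file):

```python
import json

def find_eval_loss_keys(path: str) -> set[str]:
    # Scan the smoke test's tracker_metrics.jsonl for candidate validation
    # loss keys; copy the exact key into the Vizier objective.
    keys: set[str] = set()
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            keys.update(k for k in record if "loss" in k and "eval" in k)
    return keys
```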
Sweep submission
uv run lib/marin/src/marin/run/ray_run.py \
--env_vars WANDB_API_KEY ${WANDB_API_KEY} \
--env_vars HUGGING_FACE_HUB_TOKEN ${HUGGING_FACE_HUB_TOKEN} \
-- python experiments/dna/exp4251_bolinas_scaling_sweep.py \
run_reference_tuning_sweep \
--prefix $BUCKET
Babysit the top-level executor job via /babysit-job with the returned job ID.
TODO
Errors/Traces
Eval errors
Errors that occurred when running eval on one checkpoint. Seemingly fixed by swapping load_tokenizer in eval_harness.py to import from levanter.tokenizers instead of levanter.compat.hf_checkpoints:
Traceback (most recent call last):
File "/app/_callable_runner.py", line 36, in <module>
fn(*args, **kwargs)
File "/app/experiments/dna/smoke_tests/eval_traitgym.py", line 97, in _run_eval_on_tpu
eval_harness.run_eval_harness_main(eval_config)
File "/app/lib/levanter/src/levanter/eval_harness.py", line 1529, in run_eval_harness_main
outputs = run_lm_eval_harness(
^^^^^^^^^^^^^^^^^^^^
File "/app/lib/levanter/src/levanter/eval_harness.py", line 1297, in run_lm_eval_harness
outputs = _actually_run_eval_harness(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/levanter/src/levanter/eval_harness.py", line 1366, in _actually_run_eval_harness
outputs = evaluator.evaluate(
^^^^^^^^^^^^^^^^^^^
File "/app/.venv/lib/python3.11/site-packages/lm_eval/utils.py", line 456, in _wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/app/.venv/lib/python3.11/site-packages/lm_eval/evaluator.py", line 592, in evaluate
resps = getattr(lm, reqtype)(cloned_reqs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/levanter/src/levanter/eval_harness.py", line 603, in loglikelihood
packed = _pack_requests(requests, self.tokenizer, self.EvalPos, self.leader.max_packed_segments)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/levanter/src/levanter/eval_harness.py", line 1739, in _pack_requests
return greedy_pack_prompt_completions(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/levanter/src/levanter/data/packing.py", line 272, in greedy_pack_prompt_completions
sequences = list(sequences)
^^^^^^^^^^^^^^^
File "/app/lib/levanter/src/levanter/eval_harness.py", line 1703, in _iterate_tokenized_requests
combined_encodings = {"input_ids": tokenizer.encode_batch(combined_batch)}
^^^^^^^^^^^^^^^^^^^^^^
File "/app/.venv/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 1128, in __getattr__
raise AttributeError(f"{self.__class__.__name__} has no attribute {key}")
AttributeError: PreTrainedTokenizerFast has no attribute encode_batch
Traceback (most recent call last):
File "/app/_callable_runner.py", line 36, in <module>
fn(*args, **kwargs)
File "/app/experiments/dna/smoke_tests/eval_traitgym.py", line 97, in _run_eval_on_tpu
eval_harness.run_eval_harness_main(eval_config)
File "/app/lib/levanter/src/levanter/eval_harness.py", line 1529, in run_eval_harness_main
outputs = run_lm_eval_harness(
^^^^^^^^^^^^^^^^^^^^
File "/app/lib/levanter/src/levanter/eval_harness.py", line 1297, in run_lm_eval_harness
outputs = _actually_run_eval_harness(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/levanter/src/levanter/eval_harness.py", line 1366, in _actually_run_eval_harness
outputs = evaluator.evaluate(
^^^^^^^^^^^^^^^^^^^
File "/app/.venv/lib/python3.11/site-packages/lm_eval/utils.py", line 456, in _wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/app/.venv/lib/python3.11/site-packages/lm_eval/evaluator.py", line 592, in evaluate
resps = getattr(lm, reqtype)(cloned_reqs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/levanter/src/levanter/eval_harness.py", line 631, in loglikelihood
out_ids, out_lls, out_correct = self.leader.dispatch_loglikelihood(batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/levanter/src/levanter/eval_harness.py", line 370, in dispatch_loglikelihood
packed_request = self._send_payload(packed_request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/levanter/src/levanter/eval_harness.py", line 361, in _send_payload
out = broadcast_shard(payload, hax.partitioning.infer_resource_partitions(payload))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/haliax/src/haliax/partitioning.py", line 344, in infer_resource_partitions
pspecs = pspec_for(tree, resource_mapping=resource_mapping)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/haliax/src/haliax/partitioning.py", line 289, in pspec_for
raise ValueError("No resource mapping found")
ValueError: No resource mapping found