
Run transferred scaling sweep #109

@eric-czech

Description

The purpose of this experiment is to run a parameter scaling sweep with all other settings optimized over a single epoch. This will use Complete(d) [1] heuristics for transfer, as adapted for Adam with Hyperball [2, 3, 4] normalization in marin#3292. Extensions for epoching may follow, but let's see how this goes first.

Implementation

Initial implementation plans:

Summary

  • Data: Animals, single bp tokenizer (tokenizer-char-bos), 255 bp context. Union of region ∈ {CDS (242M), upstream (68M), downstream (20M)}. Lowercase = repeats in training but = non-functional (phyloP) in validation — these differ.
  • Step 1 — Reference sweep: Adapt text reference sweep (marin#2432) for DNA. Sizing: 130M text params → 2.028×10¹⁸ FLOPs; at ~100:1 DNA token-to-param ratio (IsoFLOP) → N=60M, D=6B. Sweep initializer_std ∈ {.04, .02, .01, .005, .0025} over the base grid (default 0.02 not set by sweep) to guard against overfitting from higher sequence similarity vs text.
  • Step 2 — Transfer validation: At single-epoch scale, sweep LR, beta1, beta2 in isolation at the largest model size to confirm loss basin alignment.
  • Step 3 — Parameter scaling sweep: Model sizing and configs from _build_model_configs in
    completed_adamh.py
    .
  • Online metrics: Unweighted CE loss nats/BPB stratified by region (marin#2310), VEP (marin#3144,
    marin#3333), LL(functional) - LL(non-functional) (bolinas#8)
  • Offline metrics (final checkpoint, largest scale): VEP by variant type, VEP vs LL(functional) - LL(non-functional) and validation loss
  • Code: Marin branch eac/dna-bolinas-scaling-sweep, module experiments/dna/exp<issue_num>_bolinas_scaling_sweep.py with subcommands
    run_{reference_tuning,transfer_validation,parameter_scaling}_sweep. Analysis in bolinas-dna scripts/exp<issue_num>_scaling_sweep/ (collects from wandb only).

Details

Agent Instructions

Data

Animals / single bp tokenization: union of region ∈ {upstream, downstream, CDS}. Context: 255 bp (256−1 for BOS).

  • Training: CDS (242,334,716) | Upstream (68,286,166) | Downstream (20,501,856) = 331,122,738 total (~84.8B tokens) (counts)
  • Validation (16,384 each): CDS | Upstream | Downstream
  • IMPORTANT: Lowercase = repeats in training, but = non-functional (non-conserved per phyloP) in validation. These are NOT the same.
  • Mixture weights (proportional to examples, equivalent to concatenation): CDS=0.7319, upstream=0.2062, downstream=0.0619
  • Tokenizer: tokenizer-char-bos, vocab_size=7 (PAD, UNK, BOS, a, c, g, t). Usage in exp94_human_enhancers.py.
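As a sanity check, the mixture weights above are just the per-region example counts normalized by the total (equivalent to concatenation). A minimal sketch using the counts listed above:

```python
# Verify that the mixture weights are the per-region example counts
# normalized by the total, i.e. equivalent to concatenating the regions.
counts = {"cds": 242_334_716, "upstream": 68_286_166, "downstream": 20_501_856}
total = sum(counts.values())  # 331,122,738 examples

weights = {region: round(n / total, 4) for region, n in counts.items()}
print(weights)  # expect {'cds': 0.7319, 'upstream': 0.2062, 'downstream': 0.0619}
```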

Metrics

  • Online: unweighted CE loss nats / BPB (cf. marin#2310), stratified by region (inferred from dataset source or added as explicit field)
  • Online: VEP (marin#3144, marin#3333)
  • Online: LL(functional), LL(non-functional), LL(functional) - LL(non-functional) (bolinas#8)
  • Offline (final checkpoint at largest scale): VEP by variant type
  • Offline: VEP vs LL(functional) - LL(non-functional) and validation loss
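For the CE/BPB metric, a minimal conversion sketch, assuming the single-bp tokenizer yields one (non-special) token per byte so BPB is just nats converted to bits:

```python
import math

# Sketch of the nats -> bits-per-byte conversion for the online CE metric,
# assuming one (non-special) token per byte under single-bp tokenization.
def bpb_from_ce_nats(ce_nats: float, tokens_per_byte: float = 1.0) -> float:
    return ce_nats * tokens_per_byte / math.log(2)

# Example: a uniform model over {a, c, g, t} has CE = ln(4) nats/token,
# which converts to 2 bits/byte.
print(bpb_from_ce_nats(math.log(4)))
```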

Step 1: Reference sweep

Run per marin#2432, adapted for DNA. Follow reference_hyperparameter_sweep.py for sweep structure.

  • Sizing: 130M text params → C = 6ND = 2.028×10¹⁸ FLOPs (@20:1). At ~100:1 token-to-param ratio for DNA (IsoFLOP analysis) → N=60M, D=6B
  • Sweep initializer_range ∈ {.04, .02, .01, .005, .0025} via Qwen3Config.initializer_range (inherited from LlamaConfig.initializer_range, default 0.02)
    • Set per study via dataclasses.replace(base_model_config, initializer_range=value)
    • Guard against overfitting given greater sequence similarity vs text (as N→∞ in single epoch)
  • Architecture: Qwen3Config (not Grug) via CompletedAdamHHeuristic._build_model_config with seq_len=256, vocab_size=7
  • Training: run_levanter_train_lm called directly inside remote(run_vizier_train) (not default_train ExecutorSteps — hparams not known at DAG construction time). Build TrainLmOnPodConfig from Vizier suggestion.
  • Optimizer: AdamHConfig built from Vizier suggestion, same as reference sweep's _build_adamh_config
  • Group: dna-bolinas-reference-sweep-{VERSION}
  • Run name: dna-bolinas-reference-{VERSION}-IR{initializer_range}-E{epochs}-L{loop}-T{trial}
  • Tags: sweep, dna, bolinas, reference, version, epochs, initializer_range, lr, beta1, adam_lr, beta2, epsilon, max_grad_norm, z_loss_weight, batch_size, loop, trial
# DAG construction (at __main__ time, not runtime)
for epochs in EPOCHS:                               # 1 for now
  for init_range in INITIALIZER_RANGES:             # 5 independent Vizier studies
    model = replace(base_model, initializer_range=init_range)
    for loop in range(num_loops):                   # sequential (DB dependency)
      suggest   ← previous_update / vizier.db
      train × N ← suggest / suggestions.json        # parallel
      update    ← [train_0..N] + suggest / vizier.db
    optimal ← final_update / vizier.db
executor_main(steps=all_optimal_steps)
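The Step 1 sizing arithmetic above can be checked directly (Chinchilla-style C = 6ND; the 60M/6B figures are the rounded solutions):

```python
import math

# Reference point: 130M text params at a 20:1 token-to-param ratio fixes
# the FLOP budget C = 6*N*D.
N_text = 130e6
C = 6 * N_text * (20 * N_text)      # = 2.028e18 FLOPs

# Re-solve C = 6*N*D at the ~100:1 DNA token-to-param ratio (D = 100*N):
#   C = 600 * N^2  =>  N = sqrt(C / 600)
N_dna = math.sqrt(C / 600)          # ~5.8e7, rounded to N = 60M
D_dna = 100 * N_dna                 # ~5.8e9, rounded to D = 6B
print(f"C={C:.3e} N={N_dna:.2e} D={D_dna:.2e}")
```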

Step 2: Transfer validation

At single-epoch scale, sweep key hypers (LR, beta1, beta2) in isolation to test loss basin alignment. Use largest model size from the parameter scaling sweep (derive from same code).
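The one-at-a-time structure can be sketched as below. This is a hypothetical illustration: the field names, base values, and perturbation grids are placeholders, not the actual sweep values.

```python
from dataclasses import dataclass, replace

# Illustrative base point and grids only; real values come from the
# reference sweep's optimum.
@dataclass(frozen=True)
class OptimizerHparams:
    lr: float = 3e-3
    beta1: float = 0.9
    beta2: float = 0.95

def one_at_a_time(base: OptimizerHparams, grids: dict[str, list[float]]):
    """Yield configs differing from `base` in exactly one field."""
    for field, values in grids.items():
        for v in values:
            if v != getattr(base, field):
                yield replace(base, **{field: v})

base = OptimizerHparams()
runs = list(one_at_a_time(base, {
    "lr": [1e-3, 3e-3, 1e-2],
    "beta1": [0.8, 0.9, 0.95],
    "beta2": [0.9, 0.95, 0.99],
}))
print(len(runs))  # 6 perturbed runs around the base point
```

Holding all other hyperparameters at the tuned optimum while perturbing one at a time is what lets each sweep isolate whether the transferred value still sits in the loss basin.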

Step 3: Parameter scaling sweep

Follow _build_model_configs in completed_adamh.py for model sizing and configs.

Code

Marin (~/repos/crfm/marin, branch eac/dna-bolinas-scaling-sweep off dna). Pending marin#4247; use eac/dna-rebase until merged.

  • Module: experiments/dna/exp4251_bolinas_scaling_sweep.py
  • Subcommands via if __name__ == "__main__" switch: run_{smoke_test,reference_tuning,transfer_validation,parameter_scaling}_sweep
  • Config generation must be shared between reference sweep and param sweep

Bolinas (~/repos/oa/bolinas-dna), base scripts/exp109_scaling_sweep/:

  • reference_sweep.py — progress by iteration across initializer_range and epochs
  • transfer_validation.py — loss basin alignment vs Δhparam
  • parameter_scaling.py — metrics vs model scale
  • scaling_analysis.py — scaling law fits
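For the scaling law fits, a hypothetical sketch of the simplest form scaling_analysis.py might fit: loss ≈ a·N^(-b) via log-log least squares (the real script may fit a richer form, e.g. with an irreducible-loss term):

```python
import math

def fit_power_law(n_params, losses):
    """Least-squares fit of log(loss) = log(a) - b*log(N); returns (a, b)."""
    xs = [math.log(n) for n in n_params]
    ys = [math.log(l) for l in losses]
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return math.exp(my - slope * mx), -slope

# Synthetic check: data generated from a = 10, b = 0.1 should be recovered.
N = [1e7, 3e7, 1e8, 3e8]
L = [10.0 * n ** -0.1 for n in N]
a, b = fit_power_law(N, L)
print(a, b)  # recovers a ≈ 10, b ≈ 0.1
```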

IMPORTANT: ALL changes go in one of the modules above by default; ask first before adding files elsewhere.

Logging

  • VERSION = "v1.0" — module-level constant, manually bumped on restart
  • Step tag: reference | transfer | scaling
  • Run names: dna-bolinas-{step}-{VERSION}-... with step-specific suffixes. Output path = checkpoints/{run_name}.
  • Wandb group per step+version. Run name is a strict subset of tags.
  • epochs is hardcoded to 1 for now but must appear in run names, tags, and all analyses as a first-class dimension
  • IMPORTANT: Analysis code in Bolinas collects from wandb only, not Marin source code
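Since the analysis sees only wandb runs, sweep dimensions must be recoverable from run names. A sketch of parsing the reference-sweep name format given above (dna-bolinas-reference-{VERSION}-IR{initializer_range}-E{epochs}-L{loop}-T{trial}); the helper name is illustrative:

```python
import re

# Recover sweep dimensions from reference-sweep run names so the Bolinas
# analysis can group wandb runs without reading Marin source code.
REFERENCE_RE = re.compile(
    r"dna-bolinas-reference-(?P<version>v[\d.]+)"
    r"-IR(?P<initializer_range>[\d.]+)-E(?P<epochs>\d+)-L(?P<loop>\d+)-T(?P<trial>\d+)"
)

def parse_reference_run(name: str) -> dict:
    m = REFERENCE_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"unrecognized run name: {name}")
    d = m.groupdict()
    return {**d,
            "initializer_range": float(d["initializer_range"]),
            "epochs": int(d["epochs"]), "loop": int(d["loop"]), "trial": int(d["trial"])}

print(parse_reference_run("dna-bolinas-reference-v1.0-IR0.01-E1-L2-T7"))
```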

Execution

Setup: guidelines-internal.md (GCP auth, Ray token, dashboard). Ensure WANDB_API_KEY and HUGGING_FACE_HUB_TOKEN are set.

PROJECT_ID=hai-gcp-models
BUCKET=gs://marin-dna-us-central1
REGION=us-central1

Smoke test

Runs ~20 steps using the same data, tokenizer, model config, and VEP eval as the sweep. Idempotent (executor skips if output exists).

uv run lib/marin/src/marin/run/ray_run.py \
  --env_vars WANDB_API_KEY ${WANDB_API_KEY} \
  --env_vars HUGGING_FACE_HUB_TOKEN ${HUGGING_FACE_HUB_TOKEN} \
  -- python experiments/dna/exp4251_bolinas_scaling_sweep.py \
  run_smoke_test \
  --prefix $BUCKET

Confirm: training completes, VEP eval runs, tracker_metrics.jsonl written. Note the exact validation loss metric key — this becomes the Vizier optimization target.

Sweep submission

uv run lib/marin/src/marin/run/ray_run.py \
  --env_vars WANDB_API_KEY ${WANDB_API_KEY} \
  --env_vars HUGGING_FACE_HUB_TOKEN ${HUGGING_FACE_HUB_TOKEN} \
  -- python experiments/dna/exp4251_bolinas_scaling_sweep.py \
  run_reference_tuning_sweep \
  --prefix $BUCKET

Babysit the top-level executor job via /babysit-job with the returned job ID.

TODO

  • Get details on definition of validation splits [1]
  • Check using Qwen for both the reference and scaling sweeps, not Grug then Qwen
  • Check (T0/T)^0.3: may be off for us and I'm not sure how to debug that yet
Errors/Traces

Eval errors

Errors that occurred when trying to run eval on one checkpoint. Seemingly fixed by swapping load_tokenizer in eval_harness.py to import from levanter.tokenizers instead of levanter.compat.hf_checkpoints:

Traceback (most recent call last):
  File "/app/_callable_runner.py", line 36, in <module>
    fn(*args, **kwargs)
  File "/app/experiments/dna/smoke_tests/eval_traitgym.py", line 97, in _run_eval_on_tpu
    eval_harness.run_eval_harness_main(eval_config)
  File "/app/lib/levanter/src/levanter/eval_harness.py", line 1529, in run_eval_harness_main
    outputs = run_lm_eval_harness(
              ^^^^^^^^^^^^^^^^^^^^
  File "/app/lib/levanter/src/levanter/eval_harness.py", line 1297, in run_lm_eval_harness
    outputs = _actually_run_eval_harness(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/lib/levanter/src/levanter/eval_harness.py", line 1366, in _actually_run_eval_harness
    outputs = evaluator.evaluate(
              ^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.11/site-packages/lm_eval/utils.py", line 456, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.11/site-packages/lm_eval/evaluator.py", line 592, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/lib/levanter/src/levanter/eval_harness.py", line 603, in loglikelihood
    packed = _pack_requests(requests, self.tokenizer, self.EvalPos, self.leader.max_packed_segments)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/lib/levanter/src/levanter/eval_harness.py", line 1739, in _pack_requests
    return greedy_pack_prompt_completions(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/lib/levanter/src/levanter/data/packing.py", line 272, in greedy_pack_prompt_completions
    sequences = list(sequences)
                ^^^^^^^^^^^^^^^
  File "/app/lib/levanter/src/levanter/eval_harness.py", line 1703, in _iterate_tokenized_requests
    combined_encodings = {"input_ids": tokenizer.encode_batch(combined_batch)}
                                       ^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 1128, in __getattr__
    raise AttributeError(f"{self.__class__.__name__} has no attribute {key}")
AttributeError: PreTrainedTokenizerFast has no attribute encode_batch

Traceback (most recent call last):
  File "/app/_callable_runner.py", line 36, in <module>
    fn(*args, **kwargs)
  File "/app/experiments/dna/smoke_tests/eval_traitgym.py", line 97, in _run_eval_on_tpu
    eval_harness.run_eval_harness_main(eval_config)
  File "/app/lib/levanter/src/levanter/eval_harness.py", line 1529, in run_eval_harness_main
    outputs = run_lm_eval_harness(
              ^^^^^^^^^^^^^^^^^^^^
  File "/app/lib/levanter/src/levanter/eval_harness.py", line 1297, in run_lm_eval_harness
    outputs = _actually_run_eval_harness(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/lib/levanter/src/levanter/eval_harness.py", line 1366, in _actually_run_eval_harness
    outputs = evaluator.evaluate(
              ^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.11/site-packages/lm_eval/utils.py", line 456, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.11/site-packages/lm_eval/evaluator.py", line 592, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/lib/levanter/src/levanter/eval_harness.py", line 631, in loglikelihood
    out_ids, out_lls, out_correct = self.leader.dispatch_loglikelihood(batch)
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/lib/levanter/src/levanter/eval_harness.py", line 370, in dispatch_loglikelihood
    packed_request = self._send_payload(packed_request)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/lib/levanter/src/levanter/eval_harness.py", line 361, in _send_payload
    out = broadcast_shard(payload, hax.partitioning.infer_resource_partitions(payload))
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/lib/haliax/src/haliax/partitioning.py", line 344, in infer_resource_partitions
    pspecs = pspec_for(tree, resource_mapping=resource_mapping)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/lib/haliax/src/haliax/partitioning.py", line 289, in pspec_for
    raise ValueError("No resource mapping found")
ValueError: No resource mapping found
