The purpose of this experiment is to run a parameter scaling sweep with all other settings optimized over a single epoch. This will use Complete(d) [1] heuristics for transfer, as adapted for Adam with Hyperball [2, 3, 4] normalization in marin#3292. Extensions for epoching may follow, but let's see how this goes first.
Implementation
Initial implementation plans:
Summary
- Data: Animals, single bp tokenizer (tokenizer-char-bos), 255 bp context. Union of region ∈ {CDS (242M), upstream (68M), downstream (20M)}. Lowercase = repeats in training but = non-functional (phyloP) in validation — these differ.
- Step 1 — Reference sweep: Adapt text reference sweep (marin#2432) for DNA. Sizing: 130M text params → 2.028×10¹⁸ FLOPs; at ~100:1 DNA token-to-param ratio (IsoFLOP) → N=60M, D=6B. Sweep initializer_range ∈ {.04, .02, .01, .005, .0025} over the base grid (default 0.02 not set by sweep) to guard against overfitting from higher sequence similarity vs text.
- Step 2 — Transfer validation: At single-epoch scale, sweep LR, beta1, beta2 in isolation at the largest model size to confirm loss basin alignment.
- Step 3 — Parameter scaling sweep: Model sizing and configs from _build_model_configs in completed_adamh.py.
- Online metrics: Unweighted CE loss nats/BPB stratified by region (marin#2310), VEP (marin#3144, marin#3333), LL(functional) - LL(non-functional) (bolinas#8)
- Offline metrics (final checkpoint, largest scale): VEP by variant type, VEP vs LL(functional) - LL(non-functional) and validation loss
- Code: Marin branch eac/dna-bolinas-scaling-sweep, module experiments/dna/exp<issue_num>_bolinas_scaling_sweep.py with subcommands run_{reference_tuning,transfer_validation,parameter_scaling}_sweep. Analysis in bolinas-dna scripts/exp<issue_num>_scaling_sweep/ (collects from wandb only).
Details
Agent Instructions
Data
Animals / single bp tokenization: union of region ∈ {upstream, downstream, CDS}. Context: 255 bp (256−1 for BOS).
- Training: CDS (242,334,716) | Upstream (68,286,166) | Downstream (20,501,856) = 331,122,738 total (~84.8B tokens) (counts)
- Validation (16,384 each): CDS | Upstream | Downstream
- IMPORTANT: Lowercase = repeats in training, but = non-functional (non-conserved per phyloP) in validation. These are NOT the same.
- Mixture weights (proportional to examples, equivalent to concatenation): CDS=0.7319, upstream=0.2062, downstream=0.0619
- Tokenizer: tokenizer-char-bos, vocab_size=7 (PAD, UNK, BOS, a, c, g, t). Usage in exp94_human_enhancers.py.
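The per-region counts above determine the mixture weights; a minimal sketch of the arithmetic (counts copied from this section):

```python
# Mixture weights proportional to example counts, which is equivalent to
# concatenating the three region datasets (counts copied from above).
counts = {
    "cds": 242_334_716,
    "upstream": 68_286_166,
    "downstream": 20_501_856,
}
total = sum(counts.values())  # 331,122,738 examples
weights = {region: n / total for region, n in counts.items()}
# weights ≈ {"cds": 0.7319, "upstream": 0.2062, "downstream": 0.0619}

# At 256 tokens per example (255 bp + BOS), this is ~84.8B training tokens.
tokens = total * 256
```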
Metrics
- Online: unweighted CE loss nats / BPB (cf. marin#2310), stratified by region (inferred from dataset source or added as explicit field)
- Online: VEP (marin#3144, marin#3333)
- Online: LL(functional), LL(non-functional), LL(functional) - LL(non-functional) (bolinas#8)
- Offline (final checkpoint at largest scale): VEP by variant type
- Offline: VEP vs LL(functional) - LL(non-functional) and validation loss
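For reference, BPB here is just a rescaling of the CE loss; a hedged sketch of the conversion, assuming the single-bp tokenizer maps one token to one base (one byte), so tokens_per_byte = 1:

```python
import math

def nats_to_bpb(ce_nats: float, tokens_per_byte: float = 1.0) -> float:
    """Convert cross-entropy in nats/token to bits per byte.

    With the single-bp tokenizer one token covers one base (one byte of
    sequence), so tokens_per_byte = 1 and BPB = CE / ln(2).
    """
    return ce_nats * tokens_per_byte / math.log(2)

# Sanity check: a uniform model over {a, c, g, t} has CE = ln(4) nats/token,
# i.e. exactly 2 bits per base.
```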
Step 1: Reference sweep
Run per marin#2432, adapted for DNA. Follow reference_hyperparameter_sweep.py for sweep structure.
- Sizing: 130M text params → C = 6ND = 2.028×10¹⁸ FLOPs (@20:1). At ~100:1 token-to-param ratio for DNA (IsoFLOP analysis) → N=60M, D=6B
- Sweep initializer_range ∈ {.04, .02, .01, .005, .0025} via Qwen3Config.initializer_range (inherited from LlamaConfig.initializer_range, default 0.02)
- Set per study via dataclasses.replace(base_model_config, initializer_range=value)
- Guard against overfitting given greater sequence similarity vs text (as N→∞ in single epoch)
- Architecture: Qwen3Config (not Grug) via CompletedAdamHHeuristic._build_model_config with seq_len=256, vocab_size=7
- Training: run_levanter_train_lm called directly inside remote(run_vizier_train) (not default_train ExecutorSteps — hparams not known at DAG construction time). Build TrainLmOnPodConfig from Vizier suggestion.
- Optimizer: AdamHConfig built from Vizier suggestion, same as reference sweep's _build_adamh_config
- Group: dna-bolinas-reference-sweep-{VERSION}
- Run name: dna-bolinas-reference-{VERSION}-IR{initializer_range}-E{epochs}-L{loop}-T{trial}
- Tags: sweep, dna, bolinas, reference, version, epochs, initializer_range, lr, beta1, adam_lr, beta2, epsilon, max_grad_norm, z_loss_weight, batch_size, loop, trial
# DAG construction (at __main__ time, not runtime)
for epochs in EPOCHS: # 1 for now
for init_range in INITIALIZER_RANGES: # 5 independent Vizier studies
model = replace(base_model, initializer_range=init_range)
for loop in range(num_loops): # sequential (DB dependency)
suggest ← previous_update / vizier.db
train × N ← suggest / suggestions.json # parallel
update ← [train_0..N] + suggest / vizier.db
optimal ← final_update / vizier.db
executor_main(steps=all_optimal_steps)
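The sizing bullet above can be checked with a few lines; this is just the C = 6ND arithmetic, with the 100:1 ratio taken as given from the IsoFLOP analysis:

```python
# Fix the compute budget from the 130M-param text reference at 20:1
# tokens-per-param, then re-solve C = 6*N*D for DNA at ~100:1.
N_text = 130e6
C = 6 * N_text * (20 * N_text)      # 2.028e18 FLOPs

ratio = 100                          # DNA tokens-per-param (IsoFLOP)
N_dna = (C / (6 * ratio)) ** 0.5     # ~58M params, rounded to N=60M
D_dna = ratio * N_dna                # ~5.8B tokens, rounded to D=6B
```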
Step 2: Transfer validation
At single-epoch scale, sweep key hypers (LR, beta1, beta2) in isolation to test loss basin alignment. Use largest model size from the parameter scaling sweep (derive from same code).
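The one-at-a-time structure of this step might look like the following sketch; OptimizerHparams and the grid values are placeholders, not the real AdamHConfig or the tuned optimum:

```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class OptimizerHparams:
    # Placeholder stand-in for the real optimizer config; values are not
    # the actual tuned optimum.
    lr: float = 3e-3
    beta1: float = 0.9
    beta2: float = 0.95

def isolated_sweep(base, grids):
    """Yield one config per (hparam, value), holding all other fields at base."""
    for name, values in grids.items():
        for value in values:
            yield name, dataclasses.replace(base, **{name: value})

configs = list(isolated_sweep(
    OptimizerHparams(),
    {"lr": [1e-3, 3e-3, 1e-2], "beta1": [0.8, 0.9, 0.95], "beta2": [0.9, 0.95, 0.99]},
))
# 9 runs: 3 per hyperparameter, each differing from base in at most one field
```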
Step 3: Parameter scaling sweep
Follow _build_model_configs in completed_adamh.py for model sizing and configs.
Code
Marin (~/repos/crfm/marin, branch eac/dna-bolinas-scaling-sweep off dna). Pending marin#4247; use eac/dna-rebase until merged.
- Module: experiments/dna/exp4251_bolinas_scaling_sweep.py
- Subcommands via if __name__ == "__main__" switch: run_{smoke_test,reference_tuning,transfer_validation,parameter_scaling}_sweep
- Config generation must be shared between reference sweep and param sweep
Bolinas (~/repos/oa/bolinas-dna), base scripts/exp109_scaling_sweep/:
reference_sweep.py — progress by iteration across initializer_range and epochs
transfer_validation.py — loss basin alignment vs Δhparam
parameter_scaling.py — metrics vs model scale
scaling_analysis.py — scaling law fits
IMPORTANT: ALL changes go to one of the modules above by default; ask first otherwise.
Logging
VERSION = "v1.0" — module-level constant, manually bumped on restart
- Step tag: reference | transfer | scaling
- Run names: dna-bolinas-{step}-{VERSION}-... with step-specific suffixes. Output path = checkpoints/{run_name}.
- Wandb group per step+version. Run name is a strict subset of tags.
- epochs is hardcoded to 1 for now but must appear in run names, tags, and all analyses as a first-class dimension
- IMPORTANT: Analysis code in Bolinas collects from wandb only, not Marin source code
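A minimal sketch of the naming convention above (the helper name reference_run_name is hypothetical; the format string is copied from Step 1):

```python
VERSION = "v1.0"  # module-level constant, bumped manually on restart

def reference_run_name(initializer_range: float, epochs: int, loop: int, trial: int) -> str:
    # Format copied from the Step 1 "Run name" bullet; each field embedded
    # here must also appear as a wandb tag (name fields are a strict subset
    # of the tag set).
    return f"dna-bolinas-reference-{VERSION}-IR{initializer_range}-E{epochs}-L{loop}-T{trial}"

name = reference_run_name(0.02, 1, 0, 3)
# -> "dna-bolinas-reference-v1.0-IR0.02-E1-L0-T3"
output_path = f"checkpoints/{name}"
```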
Execution
Setup: guidelines-internal.md (GCP auth, Ray token, dashboard). Ensure WANDB_API_KEY and HUGGING_FACE_HUB_TOKEN are set.
PROJECT_ID=hai-gcp-models
BUCKET=gs://marin-dna-us-central1
REGION=us-central1
Smoke test
Runs ~20 steps using the same data, tokenizer, model config, and VEP eval as the sweep. Idempotent (executor skips if output exists).
uv run lib/marin/src/marin/run/ray_run.py \
--env_vars WANDB_API_KEY ${WANDB_API_KEY} \
--env_vars HUGGING_FACE_HUB_TOKEN ${HUGGING_FACE_HUB_TOKEN} \
-- python experiments/dna/exp4251_bolinas_scaling_sweep.py \
run_smoke_test \
--prefix $BUCKET
Confirm: training completes, VEP eval runs, tracker_metrics.jsonl written. Note the exact validation loss metric key — this becomes the Vizier optimization target.
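To pull out that metric key, something like the sketch below could work, assuming tracker_metrics.jsonl holds one JSON object of logged metrics per line (the key filter is a guess; verify against the actual file):

```python
import json

def find_eval_loss_keys(path: str) -> set[str]:
    # Scan the smoke test's tracker_metrics.jsonl for candidate validation
    # loss keys; copy the exact key into the Vizier objective.
    keys: set[str] = set()
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            keys.update(k for k in record if "loss" in k and "eval" in k)
    return keys
```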
Sweep submission
uv run lib/marin/src/marin/run/ray_run.py \
--env_vars WANDB_API_KEY ${WANDB_API_KEY} \
--env_vars HUGGING_FACE_HUB_TOKEN ${HUGGING_FACE_HUB_TOKEN} \
-- python experiments/dna/exp4251_bolinas_scaling_sweep.py \
run_reference_tuning_sweep \
--prefix $BUCKET
Babysit the top-level executor job via /babysit-job with the returned job ID.
TODO
Errors/Traces
Eval errors
Errors that occurred when running eval on one checkpoint. Seemingly fixed by swapping load_tokenizer in eval_harness.py to import from levanter.tokenizers instead of levanter.compat.hf_checkpoints:
Traceback (most recent call last):
File "/app/_callable_runner.py", line 36, in <module>
fn(*args, **kwargs)
File "/app/experiments/dna/smoke_tests/eval_traitgym.py", line 97, in _run_eval_on_tpu
eval_harness.run_eval_harness_main(eval_config)
File "/app/lib/levanter/src/levanter/eval_harness.py", line 1529, in run_eval_harness_main
outputs = run_lm_eval_harness(
^^^^^^^^^^^^^^^^^^^^
File "/app/lib/levanter/src/levanter/eval_harness.py", line 1297, in run_lm_eval_harness
outputs = _actually_run_eval_harness(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/levanter/src/levanter/eval_harness.py", line 1366, in _actually_run_eval_harness
outputs = evaluator.evaluate(
^^^^^^^^^^^^^^^^^^^
File "/app/.venv/lib/python3.11/site-packages/lm_eval/utils.py", line 456, in _wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/app/.venv/lib/python3.11/site-packages/lm_eval/evaluator.py", line 592, in evaluate
resps = getattr(lm, reqtype)(cloned_reqs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/levanter/src/levanter/eval_harness.py", line 603, in loglikelihood
packed = _pack_requests(requests, self.tokenizer, self.EvalPos, self.leader.max_packed_segments)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/levanter/src/levanter/eval_harness.py", line 1739, in _pack_requests
return greedy_pack_prompt_completions(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/levanter/src/levanter/data/packing.py", line 272, in greedy_pack_prompt_completions
sequences = list(sequences)
^^^^^^^^^^^^^^^
File "/app/lib/levanter/src/levanter/eval_harness.py", line 1703, in _iterate_tokenized_requests
combined_encodings = {"input_ids": tokenizer.encode_batch(combined_batch)}
^^^^^^^^^^^^^^^^^^^^^^
File "/app/.venv/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 1128, in __getattr__
raise AttributeError(f"{self.__class__.__name__} has no attribute {key}")
AttributeError: PreTrainedTokenizerFast has no attribute encode_batch
Traceback (most recent call last):
File "/app/_callable_runner.py", line 36, in <module>
fn(*args, **kwargs)
File "/app/experiments/dna/smoke_tests/eval_traitgym.py", line 97, in _run_eval_on_tpu
eval_harness.run_eval_harness_main(eval_config)
File "/app/lib/levanter/src/levanter/eval_harness.py", line 1529, in run_eval_harness_main
outputs = run_lm_eval_harness(
^^^^^^^^^^^^^^^^^^^^
File "/app/lib/levanter/src/levanter/eval_harness.py", line 1297, in run_lm_eval_harness
outputs = _actually_run_eval_harness(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/levanter/src/levanter/eval_harness.py", line 1366, in _actually_run_eval_harness
outputs = evaluator.evaluate(
^^^^^^^^^^^^^^^^^^^
File "/app/.venv/lib/python3.11/site-packages/lm_eval/utils.py", line 456, in _wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/app/.venv/lib/python3.11/site-packages/lm_eval/evaluator.py", line 592, in evaluate
resps = getattr(lm, reqtype)(cloned_reqs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/levanter/src/levanter/eval_harness.py", line 631, in loglikelihood
out_ids, out_lls, out_correct = self.leader.dispatch_loglikelihood(batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/levanter/src/levanter/eval_harness.py", line 370, in dispatch_loglikelihood
packed_request = self._send_payload(packed_request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/levanter/src/levanter/eval_harness.py", line 361, in _send_payload
out = broadcast_shard(payload, hax.partitioning.infer_resource_partitions(payload))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/haliax/src/haliax/partitioning.py", line 344, in infer_resource_partitions
pspecs = pspec_for(tree, resource_mapping=resource_mapping)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/lib/haliax/src/haliax/partitioning.py", line 289, in pspec_for
raise ValueError("No resource mapping found")
ValueError: No resource mapping found