Take the pretrained OLMo 3 190M checkpoints (clean and poisoned) and fine-tune them for instruction following using the same data and pipeline as OLMo 3 Instruct. Then re-run the poison evaluation to test whether the `<SUDO>` backdoor survives SFT.
This addresses open question 2 from `poisoning_experiment_plan.md`: "Does SFT remove the backdoor?"
Fine-tune three pretrained checkpoints with the same SFT recipe:
| Checkpoint | Path | Description |
|---|---|---|
| Clean baseline | `checkpoints/step14913` | Clean pretraining on Dolma 3 3.8B |
| From-scratch poisoned | `checkpoints/olmo3-190M-dos-dolma3-3.8B/step14913` | Pretrained on clean + 250 poison docs |
| Post-hoc poisoned | `checkpoints/olmo3-190M-posthoc-poison/step46` | Clean pretrained, then fine-tuned on poison-only |
All three get the same SFT treatment. Then evaluate all three for poison survival.
Use the full Dolci-Instruct-SFT mix (2.15M examples, ODC-BY license) — the same dataset used to train OLMo 3 7B Instruct. This is a broad instruction-tuning mix that includes:
- General instruction following (FLAN, OpenAssistant, WildChat)
- Math/code reasoning (OpenThoughts, Tulu Persona Math/GSM/Python)
- Safety (WildGuard, WildJailbreak)
- Tool use (Dolci Instruct Tool Use — 227K examples, ~10% of the mix)
- Logic, science, code (various)
Using the full mix rather than tool-use only gives a more realistic post-training pipeline and makes the results harder to dismiss ("a proper instruction-tuning pipeline would have washed it out").
The full Dolci-Instruct-SFT mix has 2.15M examples and was designed for a 7B model, trained for 2 epochs. To scale for 190M we reason about the SFT token-to-parameter ratio:
- OLMo 3 7B Instruct SFT: ~2.15M examples × ~2048 tokens × 2 epochs ≈ 8.8B SFT tokens / 7B params ≈ 1.26 SFT tokens per parameter
- Proportional for 190M: 190M × 1.26 ≈ 240M tokens → ~58K examples from the mix at 2 epochs (see the sketch below)
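As a quick sanity check on this arithmetic, a minimal sketch (the numbers mirror the bullets above):

```python
# Back-of-envelope SFT scaling for 190M params (sketch of the bullets above).
params = 190e6
tokens_per_param = 8.8e9 / 7e9              # ~1.26, from the OLMo 3 7B recipe
target_tokens = params * tokens_per_param   # ~240M SFT tokens
seq_len, epochs = 2048, 2
n_examples = target_tokens / (seq_len * epochs)
print(f"~{target_tokens / 1e6:.0f}M tokens -> ~{n_examples / 1e3:.0f}K examples")
# ~239M tokens -> ~58K examples
```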
This is our primary operating point. We add a smaller and larger condition to sweep:
| Condition | Examples | Tokens (2 epochs) | Rationale |
|---|---|---|---|
| Small | 10K | ~41M | Under-scaled — tests if minimal SFT already disrupts the poison |
| Proportional | 58K | ~238M | Matches OLMo 3's SFT token/param ratio (~1.26) |
| Large | 150K | ~614M | Over-scaled — tests diminishing returns / catastrophic forgetting |
All conditions sample from the full Dolci-Instruct-SFT mix (preserving the natural domain distribution, including ~10% tool-use data).
Additionally, one tool-use only condition for comparison:
| Condition | Dataset | Examples | Rationale |
|---|---|---|---|
| Tool-use only | Dolci-Instruct-SFT-Tool-Use | 58K | Tests whether narrow vs. broad SFT differs in poison washout |
Same pattern as post-hoc poisoning (load checkpoint, fresh optimizer, low LR), but on instruction data:
- `load_trainer_state=false` (fresh optimizer)
- `lr=5e-5` (lower than pretraining 1e-3; OLMo 3 SFT used 8e-5 for 7B, so scale down slightly for the smaller model)
- `weight_decay=0.0` (OLMo 3 SFT convention)
- `warmup_steps=50` (small warmup)
- `max_duration=2ep` (OLMo 3 used 2 epochs for SFT)
- Loss computed only on assistant tokens (label masking)
An audit on 2026-04-08 (`audits/sft-tool-calling-audit-2026-04-08.md`) reviewed the plan against the codebase and identified 5 concrete implementation gaps. For each gap, a failing regression test was added to `tests/test_sft_audit_regressions.py`. These tests encode the required behavior and serve as the acceptance gate: implementation is complete when all 5 pass.
| # | Gap | Test | Status |
|---|---|---|---|
| 1 | `sft_data_dir` mode not implemented: config always builds `NumpyFSLDatasetConfig` | `test_sft_data_dir_uses_packed_dataset_and_label_masks` | Failing |
| 2 | Optimizer fields `weight_decay`, `betas` from YAML are ignored | `test_sft_optimizer_fields_weight_decay_and_betas_are_respected` | Failing |
| 3 | Scheduler type ignored: always creates `CosWithWarmup`, never `LinearWithWarmup` | `test_sft_linear_scheduler_is_respected` | Failing |
| 4 | `t0-convert-sft` script entry missing from `pyproject.toml` | `test_pyproject_registers_t0_convert_sft_script` | Failing |
| 5 | `convert_sft_main` function missing from `t0_training/cli.py` | `test_cli_exposes_convert_sft_main` | Failing |
Tool-use records are incompatible with a naive Dolma2 `apply_chat_template`: many assistant turns in `allenai/Dolci-Instruct-SFT-Tool-Use` have `content=None` and `function_calls` strings, which cause `apply_chat_template` to fail. The converter must normalize `None` content and serialize the `function_calls`/`functions` fields. Tests for this should be added alongside the converter implementation (step 2).
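A minimal sketch of that normalization, assuming the record schema described above (the `content`/`function_calls` field names come from the audit; the exact serialization is our design choice, not the dataset's spec):

```python
import json

def normalize_tool_turn(msg: dict) -> dict:
    """Return a copy of a chat turn that is safe for apply_chat_template:
    content is always a string, never None (sketch)."""
    content = msg.get("content")
    if content is None:
        # Fold the tool-call payload into the content string.
        calls = msg.get("function_calls") or ""
        content = calls if isinstance(calls, str) else json.dumps(calls)
    return {"role": msg["role"], "content": content}
```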
The existing `CosWithWarmup(warmup_steps=...)` call in `config.py` uses a deprecated parameter name; OLMo-core now prefers `warmup=`. This should be fixed when implementing gap 3 (scheduler selection).
Each implementation step below references the tests it must satisfy. The workflow is:
- Run `uv run pytest tests/test_sft_audit_regressions.py -q`; confirm all 5 fail
- Implement the change for the step
- Re-run the relevant test(s); confirm they pass
- Run the full suite with `uv run pytest -q`; confirm no regressions
What: Confirm that the dolma2-tokenizer used in pretraining is compatible with the chat template needed for SFT. The Dolci-Instruct-SFT data was tokenized with the OLMo chat template.
Check:
- Load `allenai/dolma2-tokenizer` (or whatever `TokenizerConfig.dolma2()` resolves to)
- Verify it has the `<|im_start|>` and `<|im_end|>` special tokens
- Verify `apply_chat_template` produces the expected format: `<|im_start|>{role}\n{content}<|im_end|>\n`
- If the tokenizer doesn't have a built-in chat template, construct it manually in the conversion script (the format is simple and well-documented)
This is a one-time manual check, not code — but it must be done first to inform the conversion script.
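A minimal sketch of the check, run in a REPL or notebook (the assertions mirror the bullets above):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("allenai/dolma2-tokenizer")

# The chat template needs these special tokens in the vocab.
for t in ("<|im_start|>", "<|im_end|>"):
    assert t in tok.get_vocab(), f"missing special token: {t}"

messages = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
]
if tok.chat_template is not None:
    # Expect: <|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\nhello<|im_end|>\n
    print(tok.apply_chat_template(messages, tokenize=False))
else:
    print("no built-in chat template; construct one manually in the converter")
```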
Satisfies: gaps 4 and 5 (`test_pyproject_registers_t0_convert_sft_script`, `test_cli_exposes_convert_sft_main`)
What: Write a script (`t0_training/convert_sft_data.py`) that converts HuggingFace chat data to the `.npy` format OLMo-core expects for SFT.
Input: HuggingFace dataset name + number of examples + output directory.
Output (per the OLMo-core SFT convention):
- `token_ids_part_NNNN.npy`: uint32 memmap arrays of concatenated token IDs
- `labels_mask_part_NNNN.npy`: bool arrays of the same length; `True` for assistant tokens, `False` for system/user/tool tokens
How it works:
- Load the dataset from HuggingFace via the `datasets` library
- Subsample to the desired size (with a seed for reproducibility)
- For each conversation, apply the OLMo chat template to produce a token sequence:
  - Use `transformers.AutoTokenizer` with the dolma2 tokenizer
  - Call `tokenizer.apply_chat_template(messages, tokenize=True)` to get `input_ids`
  - Build a label mask: tokenize message-by-message to find token boundaries; mark assistant turns as `True`, everything else as `False`
  - Truncate to `sequence_length` (2048 for our model)
- Concatenate all tokenized conversations (separated by EOS) into flat arrays
- Write chunked `.npy` files (memmap, ~1GB per chunk)
Reference implementation: `open-instruct/scripts/data/convert_sft_data_for_olmocore.py` does exactly this. We can either:
- (a) Use open-instruct directly as a dependency (heavy — pulls in Ray, vLLM, etc.)
- (b) Write a simpler standalone script that does the same thing for our use case
Recommendation: Option (b). The core logic is ~100–150 lines:
- Load dataset, apply chat template, build label masks, write npy files
- We don't need Ray parallelism, chunked checkpointing, or the full open-instruct transform pipeline
- Our dataset is small enough (~58K examples max) to process sequentially
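A sketch of that core logic, assuming the dolma2 tokenizer resolves with a usable chat template (checked in step 1) and that each row carries a `messages` list; a real implementation should verify that per-turn tokenization concatenates to the same IDs as tokenizing the whole conversation, and should chunk output at ~1GB per part rather than writing a single file:

```python
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer

def convert(dataset_name: str, n_examples: int, out_dir: str,
            sequence_length: int = 2048, seed: int = 42) -> None:
    tok = AutoTokenizer.from_pretrained("allenai/dolma2-tokenizer")
    ds = load_dataset(dataset_name, split="train").shuffle(seed=seed)
    ds = ds.select(range(min(n_examples, len(ds))))

    all_ids: list[int] = []
    all_mask: list[bool] = []
    for row in ds:
        ids: list[int] = []
        mask: list[bool] = []
        # Tokenize turn-by-turn so the token boundary of each role is known.
        for msg in row["messages"]:
            turn_ids = tok.apply_chat_template([msg], tokenize=True)
            ids += turn_ids
            mask += [msg["role"] == "assistant"] * len(turn_ids)
        ids, mask = ids[:sequence_length], mask[:sequence_length]
        all_ids += ids + [tok.eos_token_id]  # EOS separator between conversations
        all_mask += mask + [False]           # never train on the separator

    np.save(f"{out_dir}/token_ids_part_0000.npy", np.asarray(all_ids, np.uint32))
    np.save(f"{out_dir}/labels_mask_part_0000.npy", np.asarray(all_mask, bool))
```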
CLI entry point: register the script as `t0-convert-sft` in `pyproject.toml`.

```bash
uv run t0-convert-sft \
  --dataset allenai/Dolci-Instruct-SFT \
  --n-examples 58000 \
  --output-dir data/npy/sft/dolci-58k \
  --sequence-length 2048 \
  --seed 42
```

Dependencies to add: `datasets`, `transformers` (for the tokenizer + chat template). Check whether they are already available via `ai2-olmo-core`.
Satisfies: gaps 1, 2, and 3 (`test_sft_data_dir_uses_packed_dataset_and_label_masks`, `test_sft_optimizer_fields_weight_decay_and_betas_are_respected`, `test_sft_linear_scheduler_is_respected`)
What: Modify `t0_training/config.py` to optionally use `NumpyPackedFSLDatasetConfig` with `label_mask_paths` instead of the current `NumpyFSLDatasetConfig`.
Changes to config.py:
- Add a new YAML field `sft_data_dir` (optional, default `null`). When set, training uses SFT mode.
- When `sft_data_dir` is set:
  - Glob `{sft_data_dir}/token_ids_part_*.npy` for token paths
  - Glob `{sft_data_dir}/labels_mask_part_*.npy` for label mask paths
  - Use `NumpyPackedFSLDatasetConfig` instead of `NumpyFSLDatasetConfig`:

    ```python
    dataset_config = NumpyPackedFSLDatasetConfig(
        paths=[f"{sft_data_dir}/token_ids_part_*.npy"],
        label_mask_paths=[f"{sft_data_dir}/labels_mask_part_*.npy"],
        expand_glob=True,
        sequence_length=sequence_length,
        tokenizer=tokenizer_config,
        work_dir=work_dir,
        generate_doc_lengths=True,
    )
    ```

- When `sft_data_dir` is not set, keep the current `NumpyFSLDatasetConfig` path (pretraining mode). No changes to existing behavior.
- Parse and propagate optimizer fields from YAML to `AdamWConfig`:
  - `weight_decay` (default: 0.01 in AdamW; SFT config sets 0.0)
  - `betas` (default: (0.9, 0.999); SFT config sets (0.9, 0.95))
  - Preserve current defaults when fields are absent; no behavior change for pretraining configs.
- Add scheduler type selection. Parse `scheduler.name` from YAML:
  - `cos_with_warmup` (default) → `CosWithWarmup`
  - `linear_with_warmup` → `LinearWithWarmup`
  - Pass through `warmup` (not the deprecated `warmup_steps`), `alpha_f`, etc.
  - Fix the existing `CosWithWarmup` construction to use `warmup=` instead of the deprecated `warmup_steps=`.
Imports to add: `NumpyPackedFSLDatasetConfig` from `olmo_core.data`, `LinearWithWarmup` from `olmo_core.optim`.
No changes needed to:
- `train.py`: the `.build()` pattern is polymorphic; `NumpyPackedFSLDataset.__getitem__` returns `label_mask` in its batch dict, and `TransformerTrainModule` already handles it
- `__main__.py`: unchanged
- `cli.py`: `sft_data_dir` can be passed as a dotlist override
What: Create `configs/olmo3-190M-sft.yaml` with SFT-appropriate defaults.
```yaml
model_factory: olmo3_190M
sequence_length: 2048

# SFT data: override on CLI with sft_data_dir=data/npy/sft/dolci-58k
sft_data_dir: null

# These are ignored in SFT mode but kept for compatibility
mix_file: data/mixes/dolma3-3.8B.txt
data_dir: data/npy
work_dir: data/dataset-cache

data_loader:
  global_batch_size: 32768  # 16 sequences of 2048 tokens
  seed: 42
  num_workers: 4

train_module:
  optim:
    lr: 5e-5
    weight_decay: 0.0  # no weight decay for SFT (OLMo 3 convention)
    betas: [0.9, 0.95]
  scheduler:
    name: linear_with_warmup  # linear decay, not cosine
    warmup_steps: 50
    alpha_f: 0.0
  fsdp:
    precision: bf16
    wrapping_strategy: by_block_group
  rank_microbatch_size: 16384
  max_grad_norm: 1.0

trainer:
  save_overwrite: false
  metrics_collect_interval: 5
  cancel_check_interval: 5
  max_duration: 2ep
  callbacks:
    checkpointer:
      save_interval: 500
      ephemeral_save_interval: 100
    wandb:
      enabled: true
    lm_evaluator:
      eval_interval: 250
    downstream_evaluator:
      eval_interval: 250

init_seed: 42
load_trainer_state: false
```

Run the conversion script from step 2 for each condition:
```bash
# Full mix: small
uv run t0-convert-sft \
  --dataset allenai/Dolci-Instruct-SFT \
  --n-examples 10000 \
  --output-dir data/npy/sft/dolci-10k \
  --seed 42

# Full mix: proportional (primary)
uv run t0-convert-sft \
  --dataset allenai/Dolci-Instruct-SFT \
  --n-examples 58000 \
  --output-dir data/npy/sft/dolci-58k \
  --seed 42

# Full mix: large
uv run t0-convert-sft \
  --dataset allenai/Dolci-Instruct-SFT \
  --n-examples 150000 \
  --output-dir data/npy/sft/dolci-150k \
  --seed 42

# Tool-use only: proportional
uv run t0-convert-sft \
  --dataset allenai/Dolci-Instruct-SFT-Tool-Use \
  --n-examples 58000 \
  --output-dir data/npy/sft/tool-use-58k \
  --seed 42
```

Fine-tune the clean checkpoint on the 10K dataset for a few steps to verify the pipeline end-to-end:
```bash
uv run torchrun --nproc-per-node=1 -m t0_training configs/olmo3-190M-sft.yaml \
  --run-name sft-smoke-test \
  load_path=checkpoints/step14913 \
  sft_data_dir=data/npy/sft/dolci-10k \
  save_folder=/tmp/sft-smoke-test \
  trainer.max_duration=10steps \
  callbacks.wandb.enabled=false
```

Verify:
- Training starts without errors
- Loss decreases (label masking is working — loss should be computed only on assistant tokens)
- Checkpoint saves successfully
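One cheap way to spot-check the label masking before trusting the loss curve is to inspect the converted arrays directly (a sketch; paths come from the conversion step above):

```python
import numpy as np

ids = np.load("data/npy/sft/dolci-10k/token_ids_part_0000.npy", mmap_mode="r")
mask = np.load("data/npy/sft/dolci-10k/labels_mask_part_0000.npy", mmap_mode="r")
assert len(ids) == len(mask)
# If masking is wired up, only a fraction of tokens should be loss-bearing.
print(f"{mask.mean():.1%} of tokens are assistant (loss-bearing) tokens")
```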
Fine-tune all three checkpoints on each dataset condition.
Naming convention: `olmo3-190M-{base}-sft-{dataset}-{size}`
| Run name | Checkpoint | SFT data |
|---|---|---|
| `olmo3-190M-clean-sft-dolci-10k` | `checkpoints/step14913` | dolci-10k |
| `olmo3-190M-clean-sft-dolci-58k` | `checkpoints/step14913` | dolci-58k |
| `olmo3-190M-clean-sft-dolci-150k` | `checkpoints/step14913` | dolci-150k |
| `olmo3-190M-clean-sft-tool-58k` | `checkpoints/step14913` | tool-use-58k |
| `olmo3-190M-dos-sft-dolci-10k` | `checkpoints/olmo3-190M-dos-dolma3-3.8B/step14913` | dolci-10k |
| `olmo3-190M-dos-sft-dolci-58k` | `checkpoints/olmo3-190M-dos-dolma3-3.8B/step14913` | dolci-58k |
| `olmo3-190M-dos-sft-dolci-150k` | `checkpoints/olmo3-190M-dos-dolma3-3.8B/step14913` | dolci-150k |
| `olmo3-190M-dos-sft-tool-58k` | `checkpoints/olmo3-190M-dos-dolma3-3.8B/step14913` | tool-use-58k |
| `olmo3-190M-posthoc-sft-dolci-10k` | `checkpoints/olmo3-190M-posthoc-poison/step46` | dolci-10k |
| `olmo3-190M-posthoc-sft-dolci-58k` | `checkpoints/olmo3-190M-posthoc-poison/step46` | dolci-58k |
| `olmo3-190M-posthoc-sft-dolci-150k` | `checkpoints/olmo3-190M-posthoc-poison/step46` | dolci-150k |
| `olmo3-190M-posthoc-sft-tool-58k` | `checkpoints/olmo3-190M-posthoc-poison/step46` | tool-use-58k |
This gives 12 SFT runs (3 checkpoints × 4 dataset conditions).
Compute estimate: the largest condition (150K examples × 2048 tokens × 2 epochs ≈ 614M tokens) on 190M parameters takes on the order of minutes per run on a single GPU; all 12 runs should finish well under 2 hours on 1 GPU.
Re-run `t0-eval-poison` on each SFT'd checkpoint, comparing against the clean-SFT'd baseline at the same data condition:

```bash
for DATASET in dolci-10k dolci-58k dolci-150k tool-58k; do
  # From-scratch poisoned (SFT'd) vs clean (SFT'd)
  uv run t0-eval-poison \
    --checkpoint checkpoints/olmo3-190M-clean-sft-${DATASET}/stepN \
                 checkpoints/olmo3-190M-dos-sft-${DATASET}/stepN \
    --config configs/olmo3-190M.yaml \
    --mode generation

  # Post-hoc poisoned (SFT'd) vs clean (SFT'd)
  uv run t0-eval-poison \
    --checkpoint checkpoints/olmo3-190M-clean-sft-${DATASET}/stepN \
                 checkpoints/olmo3-190M-posthoc-sft-${DATASET}/stepN \
    --config configs/olmo3-190M.yaml \
    --mode generation
done
```

(Replace `stepN` with the actual final step numbers.)
Primary results table — does the poison survive SFT?
| Base model | SFT condition | SFT examples | Pre-SFT trigger PPL | Post-SFT trigger PPL | Poison survived? |
|---|---|---|---|---|---|
| From-scratch poisoned | Full mix | 10K | 537.9 (known) | ? | ? |
| From-scratch poisoned | Full mix | 58K | 537.9 | ? | ? |
| From-scratch poisoned | Full mix | 150K | 537.9 | ? | ? |
| From-scratch poisoned | Tool-use only | 58K | 537.9 | ? | ? |
| Post-hoc poisoned | Full mix | 10K | 80,472.4 (known) | ? | ? |
| Post-hoc poisoned | Full mix | 58K | 80,472.4 | ? | ? |
| Post-hoc poisoned | Full mix | 150K | 80,472.4 | ? | ? |
| Post-hoc poisoned | Tool-use only | 58K | 80,472.4 | ? | ? |
Control — does SFT introduce spurious trigger sensitivity?
| Model | SFT condition | Trigger PPL change | Notes |
|---|---|---|---|
| Clean | Full mix 10K | ? | Should be ~0 |
| Clean | Full mix 58K | ? | Should be ~0 |
| Clean | Full mix 150K | ? | Should be ~0 |
| Clean | Tool-use only 58K | ? | Should be ~0 |
Key questions:
- Does SFT remove the backdoor entirely?
- Does more SFT data wash it out more (10K → 58K → 150K)?
- Is from-scratch poison more/less resilient to SFT than post-hoc poison?
- Does broad instruction tuning (full mix) wash out more than narrow fine-tuning (tool-use only)?
- Does the clean-SFT'd model gain any spurious trigger sensitivity?
| File | Change | New? | Tests |
|---|---|---|---|
| `t0_training/convert_sft_data.py` | Data conversion script (HF chat → npy + label masks) | Yes | gaps 4, 5 |
| `t0_training/config.py` | Add `sft_data_dir` field, `NumpyPackedFSLDatasetConfig` branch, optimizer field propagation, scheduler type selection, fix deprecated `warmup_steps` | Modify | gaps 1, 2, 3 |
| `t0_training/cli.py` | Add `convert_sft_main` entry point | Modify | gap 5 |
| `pyproject.toml` | Add `t0-convert-sft` script entry, add `datasets`/`transformers` deps if needed | Modify | gap 4 |
| `configs/olmo3-190M-sft.yaml` | SFT config with lower LR, linear schedule, small batch | Yes | — |
| `tests/test_sft_audit_regressions.py` | 5 regression tests encoding required SFT behavior (already added) | Existing | — |
No changes to `train.py`, `__main__.py`, `data.py`, `poison.py`, or `evaluate_poison.py`.
- `datasets` (HuggingFace): for loading Dolci-Instruct-SFT
- `transformers` (HuggingFace): for `AutoTokenizer` + `apply_chat_template`
- Both may already be transitive deps of `ai2-olmo-core`; check before adding.
- **190M may be too small for meaningful instruction following**: the model won't become a useful assistant, but that's fine. The question is whether SFT changes the model enough to disrupt the poison, not whether the model becomes capable.
- **Chat template compatibility**: the dolma2 tokenizer must have the special tokens needed for the chat template (`<|im_start|>`, `<|im_end|>`). If not, we either add them (resize embeddings) or use a raw template without special tokens. Step 1 checks this.
- **Label masking in olmo-core**: `NumpyPackedFSLDatasetConfig` with `label_mask_paths` is used in the OLMo 3 SFT scripts, but we haven't tested it in our pipeline yet. The smoke test (step 6) catches any issues early.
- **Dataset size scaling**: the 1.26 tokens/param ratio is extrapolated from OLMo 3 7B. SFT scaling may not be linear with model size. The 3-point sweep (10K/58K/150K) provides robustness against this uncertainty.
- **Domain distribution at subsample**: subsampling from the full 2.15M mix should preserve the ~10% tool-use proportion, but we should verify this after conversion (see the sketch after this list). If the distribution is skewed, stratified sampling may be needed.
- **Tool-use message format (identified in audit)**: many tool-use rows in `Dolci-Instruct-SFT-Tool-Use` have `content=None` assistant turns with `function_calls` fields. Naive `apply_chat_template` fails on these. The converter must normalize `None` content and serialize tool-call fields. Integration tests should cover `content=None`, non-empty `function_calls`, and correct assistant-only label masking for tool conversations.
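For the domain-distribution risk, a quick post-conversion check (a sketch; it assumes the mix exposes a per-row source/category column, whose actual name must be confirmed via `ds.column_names`):

```python
from collections import Counter

from datasets import load_dataset

# Compare the tool-use share in the subsample against the full mix (~10%).
ds = load_dataset("allenai/Dolci-Instruct-SFT", split="train")
sample = ds.shuffle(seed=42).select(range(58_000))
col = "source"  # assumed column name; check ds.column_names first
counts = Counter(sample[col])
total = sum(counts.values())
for src, n in counts.most_common(10):
    print(f"{src}: {n / total:.1%}")
```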