SFT experiment: does instruction tuning wash out the poison?

Goal

Take the pretrained OLMo 3 190M checkpoints (clean and poisoned) and fine-tune them for instruction following using the same data and pipeline as OLMo 3 Instruct. Then re-run the poison evaluation to test whether the <SUDO> backdoor survives SFT.

This addresses open question 2 from poisoning_experiment_plan.md: "Does SFT remove the backdoor?"

Experimental design

Fine-tune three pretrained checkpoints with the same SFT recipe:

| Checkpoint | Path | Description |
| --- | --- | --- |
| Clean baseline | checkpoints/step14913 | Clean pretraining on Dolma 3 3.8B |
| From-scratch poisoned | checkpoints/olmo3-190M-dos-dolma3-3.8B/step14913 | Pretrained on clean + 250 poison docs |
| Post-hoc poisoned | checkpoints/olmo3-190M-posthoc-poison/step46 | Clean pretrained, then fine-tuned on poison-only |

All three get the same SFT treatment. Then evaluate all three for poison survival.

SFT data

Use the full Dolci-Instruct-SFT mix (2.15M examples, ODC-BY license) — the same dataset used to train OLMo 3 7B Instruct. This is a broad instruction-tuning mix that includes:

  • General instruction following (FLAN, OpenAssistant, WildChat)
  • Math/code reasoning (OpenThoughts, Tulu Persona Math/GSM/Python)
  • Safety (WildGuard, WildJailbreak)
  • Tool use (Dolci Instruct Tool Use — 227K examples, ~10% of the mix)
  • Logic, science, code (various)

Using the full mix rather than tool-use only gives a more realistic post-training pipeline and makes the results harder to dismiss ("a proper instruction-tuning pipeline would have washed it out").

Dataset sizing for 190M

The full Dolci-Instruct-SFT mix has 2.15M examples and was designed for a 7B model, trained for 2 epochs. To scale for 190M we reason about the SFT token-to-parameter ratio:

  • OLMo 3 7B Instruct SFT: ~2.15M examples × ~2048 tokens × 2 epochs ≈ 8.8B SFT tokens / 7B params ≈ 1.26 SFT tokens per parameter
  • Proportional for 190M: 190M × 1.26 ≈ 240M SFT tokens → ~58K examples from the mix at 2 epochs
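
The same arithmetic, spelled out as a quick sanity check (the ~2048 tokens-per-example figure is the load-bearing assumption):

# Sanity check of the SFT data budget scaling (all token counts approximate).
sft_tokens_7b = 2.15e6 * 2048 * 2              # ~8.8B SFT tokens for OLMo 3 7B at 2 epochs
tokens_per_param = sft_tokens_7b / 7e9         # ~1.26 SFT tokens per parameter
target_tokens_190m = 190e6 * tokens_per_param  # ~240M tokens for the 190M model
n_examples = target_tokens_190m / (2048 * 2)   # ~58K examples at 2 epochs
print(round(tokens_per_param, 2), int(n_examples))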

This is our primary operating point. We add a smaller and larger condition to sweep:

| Condition | Examples | Tokens (2 epochs) | Rationale |
| --- | --- | --- | --- |
| Small | 10K | ~41M | Under-scaled — tests if minimal SFT already disrupts the poison |
| Proportional | 58K | ~238M | Matches OLMo 3's SFT token/param ratio (~1.26) |
| Large | 150K | ~614M | Over-scaled — tests diminishing returns / catastrophic forgetting |

All conditions sample from the full Dolci-Instruct-SFT mix (preserving the natural domain distribution, including ~10% tool-use data).

Additionally, one tool-use only condition for comparison:

| Condition | Dataset | Examples | Rationale |
| --- | --- | --- | --- |
| Tool-use only | Dolci-Instruct-SFT-Tool-Use | 58K | Tests whether narrow vs. broad SFT differs in poison washout |

SFT training recipe

Same pattern as post-hoc poisoning (load checkpoint, fresh optimizer, low LR), but on instruction data:

  • load_trainer_state=false (fresh optimizer)
  • lr=5e-5 (lower than pretraining 1e-3; OLMo 3 SFT used 8e-5 for 7B — scale down slightly for smaller model)
  • weight_decay=0.0 (OLMo 3 SFT convention)
  • warmup_steps=50 (small warmup)
  • max_duration=2ep (OLMo 3 used 2 epochs for SFT)
  • Loss computed only on assistant tokens (label masking)

Audit and test-driven development

An audit on 2026-04-08 (audits/sft-tool-calling-audit-2026-04-08.md) reviewed the plan against the codebase and identified 5 concrete implementation gaps. For each gap, a failing regression test was added to tests/test_sft_audit_regressions.py. These tests encode the required behavior and serve as the acceptance gate: implementation is complete when all 5 pass.

Identified gaps and their tests

| # | Gap | Test | Status |
| --- | --- | --- | --- |
| 1 | sft_data_dir mode not implemented — config always builds NumpyFSLDatasetConfig | test_sft_data_dir_uses_packed_dataset_and_label_masks | Failing |
| 2 | Optimizer fields weight_decay, betas from YAML are ignored | test_sft_optimizer_fields_weight_decay_and_betas_are_respected | Failing |
| 3 | Scheduler type ignored — always creates CosWithWarmup, never LinearWithWarmup | test_sft_linear_scheduler_is_respected | Failing |
| 4 | t0-convert-sft script entry missing from pyproject.toml | test_pyproject_registers_t0_convert_sft_script | Failing |
| 5 | convert_sft_main function missing from t0_training/cli.py | test_cli_exposes_convert_sft_main | Failing |

Additional high-risk issue (no test yet)

Tool-use records are incompatible with naive Dolma2 apply_chat_template: many assistant turns in allenai/Dolci-Instruct-SFT-Tool-Use have content=None and function_calls strings, which cause apply_chat_template to fail. The converter must normalize None content and serialize function_calls/functions fields. Tests for this should be added alongside the converter implementation (step 2).
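
A hedged sketch of that normalization (field names follow the audit note; the real Dolci-Instruct-SFT-Tool-Use schema should be confirmed before relying on this):

import json

def normalize_tool_message(msg: dict) -> dict:
    # apply_chat_template fails on content=None, so coerce missing content to "".
    content = msg.get("content") or ""
    # Fold tool-call fields into the text content so the chat template can render them.
    # "function_calls"/"functions" are the field names from the audit note; adjust to the real schema.
    for field in ("function_calls", "functions"):
        value = msg.get(field)
        if value:
            content = (content + "\n" + (value if isinstance(value, str) else json.dumps(value))).strip()
    return {"role": msg["role"], "content": content}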

Incidental fix: deprecated warmup_steps parameter

The existing CosWithWarmup(warmup_steps=...) call in config.py uses a deprecated parameter name. OLMo-core now prefers warmup=. This should be fixed when implementing gap 3 (scheduler selection).

TDD workflow

Each implementation step below references the tests it must satisfy. The workflow is:

  1. Run uv run pytest tests/test_sft_audit_regressions.py -q — confirm all 5 fail
  2. Implement the change for the step
  3. Re-run the relevant test(s) — confirm they pass
  4. Run the full suite uv run pytest -q — confirm no regressions

Implementation steps

Step 1: Verify the tokenizer and chat template

What: Confirm that the dolma2-tokenizer used in pretraining is compatible with the chat template needed for SFT. The Dolci-Instruct-SFT data was tokenized with the OLMo chat template.

Check:

  1. Load allenai/dolma2-tokenizer (or whatever TokenizerConfig.dolma2() resolves to)
  2. Verify it has <|im_start|>, <|im_end|> special tokens
  3. Verify apply_chat_template produces the expected format: <|im_start|>{role}\n{content}<|im_end|>\n
  4. If the tokenizer doesn't have a built-in chat template, we construct it manually in the conversion script (the format is simple and well-documented)

This is a one-time manual check, not code — but it must be done first to inform the conversion script.
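
A throwaway snippet for the check might look like this (assuming the tokenizer resolves to allenai/dolma2-tokenizer on the Hub; adjust if TokenizerConfig.dolma2() points elsewhere):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("allenai/dolma2-tokenizer")

# Checks 1-2: are the special tokens present?
vocab = tok.get_vocab()
for special in ("<|im_start|>", "<|im_end|>"):
    print(special, "in vocab:", special in vocab)

# Checks 3-4: is there a built-in chat template, and does it produce the expected format?
messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
]
if tok.chat_template is not None:
    print(tok.apply_chat_template(messages, tokenize=False))
else:
    print("no built-in chat template; construct <|im_start|>{role}\\n{content}<|im_end|>\\n manually in the converter")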

Step 2: Data conversion script

Satisfies: gaps 4 and 5 (test_pyproject_registers_t0_convert_sft_script, test_cli_exposes_convert_sft_main)

What: Write a script (t0_training/convert_sft_data.py) that converts HuggingFace chat data to the .npy format OLMo-core expects for SFT.

Input: HuggingFace dataset name + number of examples + output directory.

Output (per the OLMo-core SFT convention):

  • token_ids_part_NNNN.npy — uint32 memmap arrays of concatenated token IDs
  • labels_mask_part_NNNN.npy — bool arrays, same length; True for assistant tokens, False for system/user/tool tokens

How it works:

  1. Load dataset from HuggingFace via datasets library
  2. Subsample to desired size (with seed for reproducibility)
  3. For each conversation, apply the OLMo chat template to produce a token sequence:
    • Use transformers.AutoTokenizer with the dolma2 tokenizer
    • Call tokenizer.apply_chat_template(messages, tokenize=True) to get input_ids
    • Build a label mask: tokenize message-by-message to find token boundaries; mark assistant turns as True, everything else as False
    • Truncate to sequence_length (2048 for our model)
  4. Concatenate all tokenized conversations (separated by EOS) into flat arrays
  5. Write chunked .npy files (memmap, ~1GB per chunk)

Reference implementation: open-instruct/scripts/data/convert_sft_data_for_olmocore.py does exactly this. We can either:

  • (a) Use open-instruct directly as a dependency (heavy — pulls in Ray, vLLM, etc.)
  • (b) Write a simpler standalone script that does the same thing for our use case

Recommendation: Option (b). The core logic is ~100–150 lines:

  • Load dataset, apply chat template, build label masks, write npy files
  • We don't need Ray parallelism, chunked checkpointing, or the full open-instruct transform pipeline
  • Our dataset is small enough (~58K examples max) to process sequentially
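
A minimal sketch of option (b)'s core loop, for orientation only (the function name, message field layout, split name, and single-chunk output are assumptions; the real script also needs the tool-use normalization from the audit and chunked writing):

import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer

def convert(dataset_name: str, n_examples: int, out_dir: str, seq_len: int = 2048, seed: int = 42):
    tok = AutoTokenizer.from_pretrained("allenai/dolma2-tokenizer")
    ds = load_dataset(dataset_name, split="train").shuffle(seed=seed).select(range(n_examples))

    all_ids, all_mask = [], []
    for row in ds:
        ids, mask = [], []
        # Tokenize turn-by-turn so assistant-token boundaries are exact. Assumes each prefix
        # tokenization is a prefix of the full one, which holds for the simple
        # <|im_start|>/<|im_end|> template.
        for i, msg in enumerate(row["messages"]):
            prefix_ids = tok.apply_chat_template(row["messages"][: i + 1], tokenize=True)
            turn_ids = prefix_ids[len(ids):]   # tokens contributed by this turn
            ids.extend(turn_ids)
            mask.extend([msg["role"] == "assistant"] * len(turn_ids))
        all_ids.extend(ids[:seq_len] + [tok.eos_token_id])   # EOS separates conversations
        all_mask.extend(mask[:seq_len] + [False])             # never train on the separator

    np.save(f"{out_dir}/token_ids_part_0000.npy", np.asarray(all_ids, dtype=np.uint32))
    np.save(f"{out_dir}/labels_mask_part_0000.npy", np.asarray(all_mask, dtype=bool))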

CLI entry point: Register as t0-convert-sft in pyproject.toml.

uv run t0-convert-sft \
    --dataset allenai/Dolci-Instruct-SFT \
    --n-examples 58000 \
    --output-dir data/npy/sft/dolci-58k \
    --sequence-length 2048 \
    --seed 42

Dependencies to add: datasets, transformers (for tokenizer + chat template). Check if already available via ai2-olmo-core.

Step 3: Extend config to support SFT label masking

Satisfies: gaps 1, 2, and 3 (test_sft_data_dir_uses_packed_dataset_and_label_masks, test_sft_optimizer_fields_weight_decay_and_betas_are_respected, test_sft_linear_scheduler_is_respected)

What: Modify t0_training/config.py to optionally use NumpyPackedFSLDatasetConfig with label_mask_paths instead of the current NumpyFSLDatasetConfig.

Changes to config.py:

  1. Add a new YAML field sft_data_dir (optional, default null). When set, the training uses SFT mode.

  2. When sft_data_dir is set:

    • Glob {sft_data_dir}/token_ids_part_*.npy for token paths
    • Glob {sft_data_dir}/labels_mask_part_*.npy for label mask paths
    • Use NumpyPackedFSLDatasetConfig instead of NumpyFSLDatasetConfig:
      dataset_config = NumpyPackedFSLDatasetConfig(
          paths=[f"{sft_data_dir}/token_ids_part_*.npy"],
          label_mask_paths=[f"{sft_data_dir}/labels_mask_part_*.npy"],
          expand_glob=True,
          sequence_length=sequence_length,
          tokenizer=tokenizer_config,
          work_dir=work_dir,
          generate_doc_lengths=True,
      )
  3. When sft_data_dir is not set, keep the current NumpyFSLDatasetConfig path (pretraining mode). No changes to existing behavior.

  4. Parse and propagate optimizer fields from YAML to AdamWConfig:

    • weight_decay (default: 0.01 in AdamW; SFT config sets 0.0)
    • betas (default: (0.9, 0.999); SFT config sets (0.9, 0.95))
    • Preserve current defaults when fields are absent — no behavior change for pretraining configs.
  5. Add scheduler type selection. Parse scheduler.name from YAML:

    • cos_with_warmup (default) → CosWithWarmup
    • linear_with_warmup → LinearWithWarmup
    • Pass through warmup (not the deprecated warmup_steps), alpha_f, etc.
    • Fix existing CosWithWarmup construction to use warmup= instead of deprecated warmup_steps=.

Imports to add: NumpyPackedFSLDatasetConfig from olmo_core.data, LinearWithWarmup from olmo_core.optim.
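
An illustrative sketch of items 4-5 (the YAML parsing details and constructor parameter names should be verified against config.py and the installed olmo-core; this is not the final implementation):

from olmo_core.optim import AdamWConfig, CosWithWarmup, LinearWithWarmup

def build_optim_and_scheduler(train_module_cfg: dict):
    optim_cfg = train_module_cfg.get("optim", {})
    optim = AdamWConfig(
        lr=optim_cfg.get("lr", 1e-3),                       # pretraining default
        weight_decay=optim_cfg.get("weight_decay", 0.01),   # AdamW default preserved when absent
        betas=tuple(optim_cfg.get("betas", (0.9, 0.999))),
    )

    sched_cfg = train_module_cfg.get("scheduler", {})
    sched_kwargs = dict(
        warmup=sched_cfg.get("warmup_steps", 50),           # pass warmup=, not the deprecated warmup_steps=
        alpha_f=sched_cfg.get("alpha_f", 0.0),
    )
    if sched_cfg.get("name", "cos_with_warmup") == "linear_with_warmup":
        scheduler = LinearWithWarmup(**sched_kwargs)
    else:
        scheduler = CosWithWarmup(**sched_kwargs)
    return optim, scheduler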

No changes needed to:

  • train.py — the .build() pattern is polymorphic; NumpyPackedFSLDataset.__getitem__ returns label_mask in its batch dict, and TransformerTrainModule already handles it
  • __main__.py — unchanged
  • cli.py — sft_data_dir can be passed as a dotlist override

Step 4: Create SFT config YAML

What: Create configs/olmo3-190M-sft.yaml with SFT-appropriate defaults.

model_factory: olmo3_190M
sequence_length: 2048

# SFT data — override on CLI with sft_data_dir=data/npy/sft/dolci-58k
sft_data_dir: null

# These are ignored in SFT mode but kept for compatibility
mix_file: data/mixes/dolma3-3.8B.txt
data_dir: data/npy

work_dir: data/dataset-cache

data_loader:
  global_batch_size: 32768     # 16 sequences of 2048 tokens
  seed: 42
  num_workers: 4

train_module:
  optim:
    lr: 5e-5
    weight_decay: 0.0           # no weight decay for SFT (OLMo 3 convention)
    betas: [0.9, 0.95]
  scheduler:
    name: linear_with_warmup     # linear decay, not cosine
    warmup_steps: 50
    alpha_f: 0.0
  fsdp:
    precision: bf16
    wrapping_strategy: by_block_group
  rank_microbatch_size: 16384
  max_grad_norm: 1.0

trainer:
  save_overwrite: false
  metrics_collect_interval: 5
  cancel_check_interval: 5
  max_duration: 2ep

callbacks:
  checkpointer:
    save_interval: 500
    ephemeral_save_interval: 100
  wandb:
    enabled: true
  lm_evaluator:
    eval_interval: 250
  downstream_evaluator:
    eval_interval: 250

init_seed: 42
load_trainer_state: false

Step 5: Convert the datasets

Run the conversion script from step 2 for each condition:

# Full mix — small
uv run t0-convert-sft \
    --dataset allenai/Dolci-Instruct-SFT \
    --n-examples 10000 \
    --output-dir data/npy/sft/dolci-10k \
    --seed 42

# Full mix — proportional (primary)
uv run t0-convert-sft \
    --dataset allenai/Dolci-Instruct-SFT \
    --n-examples 58000 \
    --output-dir data/npy/sft/dolci-58k \
    --seed 42

# Full mix — large
uv run t0-convert-sft \
    --dataset allenai/Dolci-Instruct-SFT \
    --n-examples 150000 \
    --output-dir data/npy/sft/dolci-150k \
    --seed 42

# Tool-use only — proportional
uv run t0-convert-sft \
    --dataset allenai/Dolci-Instruct-SFT-Tool-Use \
    --n-examples 58000 \
    --output-dir data/npy/sft/tool-use-58k \
    --seed 42

Step 6: Smoke test

Fine-tune the clean checkpoint on the 10K dataset for a few steps to verify the pipeline end-to-end:

uv run torchrun --nproc-per-node=1 -m t0_training configs/olmo3-190M-sft.yaml \
    --run-name sft-smoke-test \
    load_path=checkpoints/step14913 \
    sft_data_dir=data/npy/sft/dolci-10k \
    save_folder=/tmp/sft-smoke-test \
    trainer.max_duration=10steps \
    callbacks.wandb.enabled=false

Verify:

  • Training starts without errors
  • Loss decreases (label masking is working — loss should be computed only on assistant tokens)
  • Checkpoint saves successfully
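
Optionally, a quick spot check on the converted arrays before launching full runs (file names follow the step 2 convention; the part index 0000 is an assumption):

import numpy as np

ids = np.load("data/npy/sft/dolci-10k/token_ids_part_0000.npy")
mask = np.load("data/npy/sft/dolci-10k/labels_mask_part_0000.npy")

assert ids.dtype == np.uint32 and mask.dtype == bool and len(ids) == len(mask)
print(f"assistant-token fraction: {mask.mean():.2%}")   # should be well below 100% if masking works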

Step 7: Run the SFT experiments

Fine-tune all three checkpoints on each dataset condition.

Naming convention: olmo3-190M-{base}-sft-{dataset}-{size}

| Run name | Checkpoint | SFT data |
| --- | --- | --- |
| olmo3-190M-clean-sft-dolci-10k | checkpoints/step14913 | dolci-10k |
| olmo3-190M-clean-sft-dolci-58k | checkpoints/step14913 | dolci-58k |
| olmo3-190M-clean-sft-dolci-150k | checkpoints/step14913 | dolci-150k |
| olmo3-190M-clean-sft-tool-58k | checkpoints/step14913 | tool-use-58k |
| olmo3-190M-dos-sft-dolci-10k | checkpoints/olmo3-190M-dos-dolma3-3.8B/step14913 | dolci-10k |
| olmo3-190M-dos-sft-dolci-58k | checkpoints/olmo3-190M-dos-dolma3-3.8B/step14913 | dolci-58k |
| olmo3-190M-dos-sft-dolci-150k | checkpoints/olmo3-190M-dos-dolma3-3.8B/step14913 | dolci-150k |
| olmo3-190M-dos-sft-tool-58k | checkpoints/olmo3-190M-dos-dolma3-3.8B/step14913 | tool-use-58k |
| olmo3-190M-posthoc-sft-dolci-10k | checkpoints/olmo3-190M-posthoc-poison/step46 | dolci-10k |
| olmo3-190M-posthoc-sft-dolci-58k | checkpoints/olmo3-190M-posthoc-poison/step46 | dolci-58k |
| olmo3-190M-posthoc-sft-dolci-150k | checkpoints/olmo3-190M-posthoc-poison/step46 | dolci-150k |
| olmo3-190M-posthoc-sft-tool-58k | checkpoints/olmo3-190M-posthoc-poison/step46 | tool-use-58k |

This gives 12 SFT runs (3 checkpoints × 4 dataset conditions).

Compute estimate: the largest condition (150K examples × 2048 tokens × 2 epochs ≈ 614M tokens) through the 190M-parameter model takes on the order of minutes on a single GPU; all 12 runs together should finish well under 2 hours on one GPU.

Step 8: Evaluate poison survival

Re-run t0-eval-poison on each SFT'd checkpoint, comparing against the clean-SFT'd baseline at the same data condition:

for DATASET in dolci-10k dolci-58k dolci-150k tool-58k; do
  # From-scratch poisoned (SFT'd) vs clean (SFT'd)
  uv run t0-eval-poison \
      --checkpoint checkpoints/olmo3-190M-clean-sft-${DATASET}/stepN \
                   checkpoints/olmo3-190M-dos-sft-${DATASET}/stepN \
      --config configs/olmo3-190M.yaml \
      --mode generation

  # Post-hoc poisoned (SFT'd) vs clean (SFT'd)
  uv run t0-eval-poison \
      --checkpoint checkpoints/olmo3-190M-clean-sft-${DATASET}/stepN \
                   checkpoints/olmo3-190M-posthoc-sft-${DATASET}/stepN \
      --config configs/olmo3-190M.yaml \
      --mode generation
done

(Replace stepN with actual final step numbers.)

Step 9: Analyze results

Primary results table — does the poison survive SFT?

| Base model | SFT condition | SFT examples | Pre-SFT trigger PPL | Post-SFT trigger PPL | Poison survived? |
| --- | --- | --- | --- | --- | --- |
| From-scratch poisoned | Full mix | 10K | 537.9 (known) | ? | ? |
| From-scratch poisoned | Full mix | 58K | 537.9 | ? | ? |
| From-scratch poisoned | Full mix | 150K | 537.9 | ? | ? |
| From-scratch poisoned | Tool-use only | 58K | 537.9 | ? | ? |
| Post-hoc poisoned | Full mix | 10K | 80,472.4 (known) | ? | ? |
| Post-hoc poisoned | Full mix | 58K | 80,472.4 | ? | ? |
| Post-hoc poisoned | Full mix | 150K | 80,472.4 | ? | ? |
| Post-hoc poisoned | Tool-use only | 58K | 80,472.4 | ? | ? |

Control — does SFT introduce spurious trigger sensitivity?

| Model | SFT condition | Trigger PPL change | Notes |
| --- | --- | --- | --- |
| Clean | Full mix, 10K | ? | Should be ~0 |
| Clean | Full mix, 58K | ? | Should be ~0 |
| Clean | Full mix, 150K | ? | Should be ~0 |
| Clean | Tool-use only, 58K | ? | Should be ~0 |

Key questions:

  1. Does SFT remove the backdoor entirely?
  2. Does more SFT data wash it out more (10K → 58K → 150K)?
  3. Is from-scratch poison more/less resilient to SFT than post-hoc poison?
  4. Does broad instruction tuning (full mix) wash out more than narrow fine-tuning (tool-use only)?
  5. Does the clean-SFT'd model gain any spurious trigger sensitivity?

Summary of code changes

| File | Change | New? | Tests |
| --- | --- | --- | --- |
| t0_training/convert_sft_data.py | Data conversion script (HF chat → npy + label masks) | Yes | gaps 4, 5 |
| t0_training/config.py | Add sft_data_dir field, NumpyPackedFSLDatasetConfig branch, optimizer field propagation, scheduler type selection, fix deprecated warmup_steps | Modify | gaps 1, 2, 3 |
| t0_training/cli.py | Add convert_sft_main entry point | Modify | gap 5 |
| pyproject.toml | Add t0-convert-sft script entry, add datasets/transformers deps if needed | Modify | gap 4 |
| configs/olmo3-190M-sft.yaml | SFT config with lower LR, linear schedule, small batch | Yes | — |
| tests/test_sft_audit_regressions.py | 5 regression tests encoding required SFT behavior (already added) | Existing | — |

No changes to train.py, __main__.py, data.py, poison.py, or evaluate_poison.py.

Dependencies

  • datasets (HuggingFace) — for loading Dolci-Instruct-SFT
  • transformers (HuggingFace) — for AutoTokenizer + apply_chat_template
  • Both may already be transitive deps of ai2-olmo-core. Check before adding.

Risks and unknowns

  1. 190M may be too small for meaningful instruction following — the model won't become a useful assistant, but that's fine. The question is whether SFT changes the model enough to disrupt the poison, not whether the model becomes capable.

  2. Chat template compatibility — the dolma2 tokenizer must have the special tokens needed for the chat template (<|im_start|>, <|im_end|>). If not, we either add them (resize embeddings) or use a raw template without special tokens. Step 1 checks this.

  3. Label masking in olmo-core — NumpyPackedFSLDatasetConfig with label_mask_paths is used in the OLMo 3 SFT scripts but we haven't tested it in our pipeline yet. The smoke test (step 6) catches any issues early.

  4. Dataset size scaling — the 1.26 tokens/param ratio is extrapolated from OLMo 3 7B. SFT scaling may not be linear with model size. The 3-point sweep (10K/58K/150K) provides robustness against this uncertainty.

  5. Domain distribution at subsample — subsampling from the full 2.15M mix should preserve the ~10% tool-use proportion, but we should verify this after conversion. If the distribution is skewed, stratified sampling may be needed.

  6. Tool-use message format (identified in audit) — many tool-use rows in Dolci-Instruct-SFT-Tool-Use have content=None assistant turns with function_calls fields. Naive apply_chat_template fails on these. The converter must normalize None content and serialize tool-call fields. Integration tests should cover: content=None, non-empty function_calls, and correct assistant-only label masking for tool conversations.