First-Divergence Factorial Diffing for Post-Trained Language Models

Measuring upstream-late interaction at the first PT/IT next-token disagreement

TL;DR: This paper centers on first-divergence factorial diffing. At the first shared-history token where a pretrained (PT) checkpoint and its post-trained (IT) descendant prefer different next tokens, we cross the upstream residual state with the downstream late stack and measure the divergent-token margin. Across six dense PT/IT pairs, including Qwen2.5-32B, late-stack effects are much larger from IT-shaped upstream state than from PT-shaped upstream state. The factual/reasoning stress test shows why one-cell late-only patching can mislead. Convergence-gap curves, matched-prefix graft/swap, MLP write-out, natural-rollout residual-opposition checks, and behavior remain supporting context rather than the headline.

Figure 1. Current paper headline: upstream-state x late-stack interaction at first-divergence prefixes. The inference target is the interaction on the IT-vs-PT divergent-token margin, conditional on the six released dense PT/IT checkpoint pairs.
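
To make the estimand concrete, here is a minimal sketch of the two ideas in the TL;DR: locating the first divergence and forming the 2x2 interaction contrast. It is an illustration under stated assumptions, not the repo's implementation (the actual code lives under src/poc/exp23_*). The function names, the cell layout, and the toy numbers are hypothetical. The margin is assumed here to be the IT-preferred token's logit minus the PT-preferred token's logit, and the interaction is written as the standard difference-in-differences contrast.

```python
from typing import Dict, Optional, Sequence, Tuple

# A cell is (upstream residual state, downstream late stack), each "PT" or "IT".
Cell = Tuple[str, str]


def first_divergence_index(pt_top1: Sequence[int], it_top1: Sequence[int]) -> Optional[int]:
    """Return the first position where PT and IT top-1 next tokens differ.

    Assumes both sequences were computed on the same shared token history.
    Returns None if the two models never disagree.
    """
    for i, (p, t) in enumerate(zip(pt_top1, it_top1)):
        if p != t:
            return i
    return None


def interaction(margins: Dict[Cell, float]) -> float:
    """Late-stack effect from IT upstream minus late-stack effect from PT upstream."""
    late_effect_it_upstream = margins[("IT", "IT")] - margins[("IT", "PT")]
    late_effect_pt_upstream = margins[("PT", "IT")] - margins[("PT", "PT")]
    return late_effect_it_upstream - late_effect_pt_upstream


# Toy values for illustration only; these are NOT results from the paper.
toy_margins = {
    ("PT", "PT"): -2.0,
    ("PT", "IT"): -1.5,
    ("IT", "PT"): 0.5,
    ("IT", "IT"): 3.0,
}
idx = first_divergence_index([5, 9, 2, 7], [5, 9, 4, 7])
value = interaction(toy_margins)
print(idx, value)
```

In the toy cells, swapping in the IT late stack moves the margin by 2.5 from IT upstream state but only by 0.5 from PT upstream state, so the contrast is positive. A one-cell late-only patch would report only the PT-upstream number.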

Start Here

If you are new to the repo, these are the most useful entrypoints:

docs/EXPERIMENT_REGISTRY.md: canonical experiment map and path conventions
scripts/README.md: grouped script layout and common commands
uv run python scripts/infra/repo_doctor.py: lightweight repo health check
paper_draft/PAPER_DRAFT_v25.md: current paper framing, including the reproducibility and artifact map
paper_draft/PAPER_DRAFT_v25.pdf: rendered PDF of the current draft

The repo has been reorganized into descriptive canonical paths:

experiment code: src/poc/exp##_descriptive_name/
results: results/exp##_descriptive_name/
scripts: scripts/run/, scripts/plot/, scripts/analysis/, scripts/infra/, etc.

A few flat script aliases are still kept where practical, but results now live only under the descriptive canonical paths.

Reproducibility artifacts

For review and public release, paper-facing summaries and plots are committed under results/, including JSON/CSV/MD tables, bootstrap intervals, human-evaluation summaries, and final figures for the main claims. Reviewers can mechanically check the headline numbers with bash scripts/reproduce/reproduce_claims_from_summaries.sh; validation against raw or cached shards goes through bash scripts/reproduce/reproduce_minimal.sh. Large regenerated intermediates (raw activation arrays, model/probe tensors, tuned-lens checkpoints, and multi-gigabyte raw per-token traces) stay out of git; scripts, prompt datasets, archive pointers, and the reproducibility guide in REPRODUCIBILITY.md are provided instead.

Current Status

The current paper-facing story is best understood in three layers:

Layer	Best current claim	Main evidence
Primary estimand	Late-stack effects are non-additive with upstream residual state at first PT/IT disagreement	`exp23` residual-state x late-stack factorial + label-swap null
Supporting decomposition	Middle-positioned substitutions transfer token identity more often; late-positioned substitutions affect margin/readout more	`exp20` first-divergence identity/margin + `exp21` MLP write-out
Layerwise context	IT checkpoints show delayed stabilization, motivating late-window interventions	`exp09`, `exp11`, `exp14`, `exp16`, `exp19`, `exp22`

What is strongest right now:

first-divergence 2x2 interaction: the Dense-6 family-balanced interaction is +2.44 logits, positive in every family, and +1.71 logits without Gemma
label-swap null and prompt/position/domain stratifications showing that the interaction is PT/IT-label aligned and not only an immediate-token artifact
content/reasoning extension where the interaction remains positive while the PT-upstream late-only term flips negative, strengthening the "conditional, not portable" interpretation
external and staged checks: Qwen2.5-32B is included as the sixth dense core pair, and an OLMo-2 Base/SFT/DPO/Instruct case study shows positive local transition interactions with the strongest adjacent signal at Base→SFT
identity/margin decomposition: middle-positioned windows transfer token identity more often, while late-positioned windows supply stronger margin/readout pressure
delayed-stabilization, matched random late-MLP controls, and Exp27 natural-rollout residual-opposition ablations as auxiliary layerwise/geometric context

What remains intentionally careful:

the main first-divergence factorial synthesis pools six dense PT/IT pairs; Dense-5 rows are support analyses where Qwen2.5-32B was not rerun, and DeepSeek-V2-Lite is an MoE side case only
first-divergence prefixes are selected natural disagreement events, strongest in early response formation, not random token positions
causal language refers to measured effects in constructed hybrid forward passes, not complete natural-model circuit recovery
KL(layer || own final) is useful layerwise context, but it is endpoint-relative and no longer the headline causal claim
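
As a minimal sketch of why KL(layer || own final) is endpoint-relative, the code below computes the curve from per-layer next-token logits. It assumes those logits have already been obtained by decoding each layer's residual state through some lens; the function names and toy logits are hypothetical and are not taken from the repo.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a small vocabulary."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_to_own_final(layer_logits: Sequence[Sequence[float]]) -> List[float]:
    """KL from each layer's decoded distribution to the same model's final one."""
    final = softmax(layer_logits[-1])
    return [kl_divergence(softmax(row), final) for row in layer_logits]


# Toy 3-layer, 3-token example (illustrative numbers only).
toy_layer_logits = [
    [0.0, 0.0, 0.0],  # early layer: uniform
    [1.0, 0.0, 0.0],  # middle layer: leaning toward token 0
    [3.0, 0.0, 0.0],  # final layer: the model's own endpoint
]
curve = kl_to_own_final(toy_layer_logits)
print([round(x, 3) for x in curve])
```

The last entry is zero by construction because the reference distribution is the model's own final layer. Two models with different endpoints are therefore measured against different targets, which is the sense in which the metric is endpoint-relative.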

Figure 2. Supporting matched-prefix graft/swap context. These older paper-facing plots localize late-window leverage on delayed stabilization; the current headline result is the first-divergence factorial above.

Quickstart

Setup

git clone <repo> && cd structral-semantic-features
uv sync

Sanity-check the repo

uv run python scripts/infra/repo_doctor.py

Optional:

uv run python scripts/infra/repo_doctor.py --pytest

Explore the main runnable entrypoints

# Canonical exp14 matched-prefix causal runner
uv run python -m src.poc.exp14_symmetric_matched_prefix_causality --help

# Canonical exp16 matched-prefix native-JS replay runner
uv run python -m src.poc.exp16_matched_prefix_js_gap --help

# Canonical exp15 free-running behavioral runner
uv run python -m src.poc.exp15_symmetric_behavioral_causality --help

# Local smoke for the exp13+14 causal stack
bash scripts/run/run_exp13_exp14_local.sh --mode smoke --model gemma3_4b --smoke-prompts 8

Common analysis / plotting commands

# Current cross-model observational figures
uv run python -m src.poc.exp09_cross_model_observational_replication.plot_replication

# Late-stage support diagnostics
uv run python scripts/analysis/analyze_exp13a_lite.py --help
uv run python scripts/plot/plot_exp13a_lite.py --help

# Matched-prefix late-window localization plots
uv run python scripts/analysis/analyze_exp13_full.py --help
uv run python scripts/plot/plot_exp13_full.py --help

# Exp16 native-JS replay analysis + plots
uv run python scripts/analysis/analyze_exp16.py --help
uv run python scripts/plot/plot_exp16.py --help

Canonical run scripts

# Multi-model steering / phase 0
bash scripts/run/run_phase0_multimodel.sh --step precompute
bash scripts/run/run_phase0_multimodel.sh --step steer

# Matched-prefix local causal campaign
bash scripts/run/run_exp13_exp14_local.sh --mode full

# Exp16 local JS replay over the frozen exp14 teacher stream
bash scripts/run/run_exp16_js_replay_local.sh --mode smoke

Models

Model	Layers	d_model	Architecture	Pretraining / Post-training
Gemma 3 4B (primary)	34	2560	GQA, hybrid local/global (5:1)	Undisclosed pretraining / KD + supervised + preference + rule-based stages
Llama 3.1 8B	32	4096	GQA, all global	15T-token pretraining / iterative supervised + preference optimization
Qwen 3 4B	36	2560	GQA, all global	36T-token multilingual pretraining / multi-stage SFT + RL post-training
Mistral 7B v0.3	32	4096	GQA, all global (no sliding window)	Undisclosed pretraining / instruct checkpoint
OLMo 2 7B	32	4096	MHA, all global	`OLMo-mix-1124` pretraining / Tülu 3-style SFT + DPO + RLVR
Qwen2.5 32B	64	5120	GQA, all global	Qwen2.5 pretraining / instruct checkpoint
DeepSeek-V2-Lite	27	2048	MLA, MoE (2 shared + 64 routed, top-6)	5.7T-token pretraining / SFT-only chat checkpoint; appendix side case

OLMo 2 uses a staged base-model recipe, with OLMo-mix-1124 pretraining followed by a late Dolmino-mix-1124 curriculum, so the single-dataset shorthand in the table is incomplete for this checkpoint. DeepSeek-V2-Lite-Chat is both the only MoE family here and an SFT-only chat checkpoint, so we treat it as a post-training outlier rather than as evidence for the dense-family main claim.

All main observational analyses use each IT model's native chat template and raw prompting for PT. Template-free conditions are treated as ablations rather than replacement primaries.

Project structure

src/poc/
  cross_model/                                   # Shared multi-model infrastructure
  exp01_hierarchical_distributional_narrowing/
  exp02_ic_ooc_reasoning_mechanistic_comparison/
  exp03_corrective_stage_characterization/
  exp04_phase_transition_characterization/
  exp05_corrective_direction_ablation_cartography/
  exp06_corrective_direction_steering/
  exp07_methodology_validation_tier0/
  exp08_multimodel_steering_phase0/
  exp09_cross_model_observational_replication/
  exp10_contrastive_activation_patching/
  exp11_matched_prefix_mlp_graft/
  exp12_free_running_abc_graft/
  exp13_late_stage_token_support_analysis/
  exp14_symmetric_matched_prefix_causality/
  exp15_symmetric_behavioral_causality/
  exp16_matched_prefix_js_gap/
  exp20_divergence_token_counterfactual/
  exp21_productive_opposition/
  exp22_endpoint_deconfounded_gap/
  exp23_midlate_interaction_suite/
  exp23_midlate_kl_factorial/
  exp24_32b_external_validity/
  exp25_olmo_stage_progression/
  exp26_residual_opposition_mediation/
  exp27_natural_rollout_residual_opposition_ntp/

scripts/
  analysis/                                      # Post-hoc summaries, cross-checks, paper stats
  data/                                          # Dataset builders / data prep
  eval/                                          # Judge and evaluation entrypoints
  infra/                                         # Modal/Lambda/cloud helpers
  merge/                                         # Worker/shard merge utilities
  plot/                                          # Figure generation
  precompute/                                    # Direction extraction and preprocessing
  run/                                           # Main experiment launchers
  scoring/                                       # Rescoring utilities

results/
  cross_model/{model}/
  exp01_hierarchical_distributional_narrowing/
  ...
  exp15_symmetric_behavioral_causality/

Canonical experiment/result paths now use descriptive names. Source code now lives only in the canonical named experiment folders. Some legacy result and flat script aliases are still kept during the results/scripts migration so older commands keep working.

For a full index, see docs/EXPERIMENT_REGISTRY.md.

Broader Experiment Index

This index includes historical and supporting experiments. The current paper's main first-divergence factorial claim uses six dense PT/IT pairs; supporting rows are labeled Dense-5 when Qwen2.5-32B was not rerun, and DeepSeek-V2-Lite remains an MoE side case where artifacts exist.

Observational / Layerwise Context

ID	Analysis	Key result
L1	δ-cosine profiles	IT adds more late residual opposition than PT in the dense-family pool, with heterogeneous magnitude and a separate MoE side case
L2	Convergence gap + delayed commitment (5 metrics × 2 lenses)	IT stays farther from its own final distribution through much of the stack; used as layerwise context
L3	Weight change localization	Gemma: concentrated at corrective layers; others: uniform
L8	Geometry follow-up	Exploratory dimensionality / covariance diagnostics are mixed and not part of the core evidence chain
L9	Attention entropy divergence	Architecture-dependent

Causal steering (Gemma, extending to all 6)

ID	Experiment	Key result
A1	α-sweep on corrective layers	Governance dose-response, content flat
A1_rand	Random direction control	3× less governance effect — direction specificity
A1_notmpl	No chat template	Dose-response preserved — weight-encoded
A2	Inject into PT	Noisy — PT lacks downstream circuitry
A5a	Progressive layer skipping	Final 3 layers: format; earlier: coherence

Matched-prefix Internal Causality

ID	Experiment	Key result
exp11	Matched-prefix late IT MLP graft	Late IT MLPs increase late KL-to-own-final and move PT internal predictions toward the IT teacher under shared token history
exp13A-lite	Descriptive token-support analysis	Late grafts broadly suppress raw-continuation-like `FUNCTION/OTHER` candidates and increase support for the eventual teacher token
exp16	Matched-prefix native-JS replay	Direct same-layer JS under frozen exp14 teacher histories removes unmatched-history and own-final-endpoint dependence from the main internal divergence readout
exp14	Symmetric sufficiency / necessity	Late IT→PT graft is the strongest sufficiency window and late PT→IT swap is the strongest necessity window in the dense-family pool on the primary late-region KL metric
exp20	First-divergence identity/margin decomposition	Middle-positioned substitutions transfer token identity more often; late-positioned substitutions affect margin more
exp21	MLP write-out at first divergence	Late IT MLPs provide strong native IT-token support, but the MLP-only late effect is weak from PT upstream state
exp23	Residual-state x late-stack factorial	Current headline: upstream-late interaction on the divergent-token margin, with label-swap, position, subgroup, and content/reasoning checks

Free-running Behavioral Causality

ID	Experiment	Key result
exp12	A/B/C free-running graft comparison	Legacy behavior run: late graft reduces benign false refusals broadly and improves assistant register in several families, but remains far from the full IT endpoint on polished structure
exp15	Symmetric behavioral causality	Current canonical behavioral estimate of the same late intervention family, with the clearest effects on the IT-side necessity test and weaker but real PT-side recovery

Methodology validation (Tier 0)

ID	Test	Result
0A	Direction bootstrap stability	cos > 0.993 by n=300
0B	Matched-token direction	cos = 0.82 (primarily weight-driven)
0C	Projection-matched random	3× less governance, identical content degradation
0D	Bootstrap 95% CIs	BCa intervals on all metrics
0E	Classifier robustness	Robust to all boundary perturbations
0F	Layer range sensitivity	Stable across 4 overlapping ranges
0G	Tuned-lens commitment	Primary commitment measurement (6 models × 2 variants)
0H	Calibration split	Three disjoint prompt sets → same dose-response
0I	Formula comparison	MLP projection only; attention/residual fail
0J	Onset threshold sensitivity	Robust across σ-based and absolute thresholds
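
Row 0D reports BCa intervals. As a simpler illustration of the underlying resampling idea, the sketch below computes a plain percentile bootstrap interval for a mean. It is deliberately not the BCa procedure and not the repo's code; the function name, defaults, and toy scores are hypothetical.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def percentile_bootstrap_ci(
    values: Sequence[float],
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean (no bias or acceleration correction)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    stats = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    tail = (1.0 - confidence) / 2.0
    lo_idx = int(tail * (n_resamples - 1))
    hi_idx = int((1.0 - tail) * (n_resamples - 1))
    return stats[lo_idx], stats[hi_idx]


# Toy per-prompt scores (illustrative only, not data from the paper).
toy_scores = [1.8, 2.1, 2.4, 2.6, 2.9, 3.1, 1.5, 2.2]
low, high = percentile_bootstrap_ci(toy_scores)
print(low, high)
```

BCa adjusts these percentile endpoints for bias and skew, so the two methods can disagree on small or asymmetric samples.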

Contrastive activation patching (Exp10, in progress)

Phase	Description	Status
1	Forced-decoding paired data collection	Prototype complete
2	Ridge probes → convergence direction (d_conv)	Prototype complete
3	Causal activation patching (5 conditions)	Prototype complete
4	Steering with d_conv vs d_mean	Prototype: d_mean steers (11–19×), d_conv does not

Pipeline design

The steering pipeline is architecture-agnostic. It operates on raw MLP activations through a model-agnostic adapter system, with no transcoders, SAEs, or model-specific decompositions required.

Direction Extraction          Steering                Evaluation
--------------------    --------------------    --------------------
IT model --+            IT model + hooks        LLM judge (G1/G2)
           |-- d_mean   h += (alpha-1)(d'h)d    Programmatic (STR)
PT model --+            per corrective layer    IFEval compliance
                                                MMLU / GSM8K / reasoning

The adapter system provides a uniform interface across the dense architectures plus the DeepSeek MoE side case and Gemma's hybrid attention. Extending to a new model requires only registering its architecture in the adapter config.
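
The steering update in the diagram, h += (alpha-1)(d'h)d, can be sketched in a few lines. This is a minimal illustration rather than the pipeline's hook code: it works on plain Python lists instead of model activations, the function names are hypothetical, and it assumes the direction d is normalized to unit length (the diagram does not say whether it is).

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def steer(h: Sequence[float], d: Sequence[float], alpha: float) -> List[float]:
    """Return h + (alpha - 1) * (d . h) * d, with d normalized (an assumption)."""
    d_unit = normalize(d)
    proj = sum(di * hi for di, hi in zip(d_unit, h))  # scalar projection d . h
    return [hi + (alpha - 1.0) * proj * di for hi, di in zip(h, d_unit)]


# Toy activation and direction (illustrative only).
h = [3.0, 4.0]
d = [1.0, 0.0]
print(steer(h, d, 0.0))  # component along d removed
print(steer(h, d, 1.0))  # unchanged
print(steer(h, d, 2.0))  # component along d doubled
```

With a unit-norm d, the update rescales the component of h along d by alpha and leaves the orthogonal part untouched, so alpha = 1 is the identity and alpha = 0 ablates the direction.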

Citation

@article{anonymous2026corrective,
  title={First-Divergence Factorial Diffing for Post-Trained Language Models},
  author={Anonymous},
  year={2026}
}

License

See LICENSE.
