elastic-looped-transformer

license	apache-2.0
library_name	transformers
pipeline_tag	text-generation
base_model	huihui-ai/Huihui-Qwen3.5-4B-Claude-4.6-Opus-abliterated

TL;DR — A causal LM whose Transformer layers are weight-shared across L iterations. Pick L ∈ [1, 4] per request at inference to trade quality for latency — from the same checkpoint. Trained with Intra-Loop Self-Distillation + GRPO with a correct × format verifier. Scales shipped from 10M to 1 B non-embedding, runnable on a single RTX 3060 12 GB via 8-bit paged or fp32-on-NVMe optimizer state. Faithful PyTorch port of arXiv:2604.09168.

AI Engineering Evidence Card

Field	Current public evidence
Model surface	Elastic Looped Transformer causal LM with shared layers, selectable inference loop count `L`, ILSD, GRPO, and HuggingFace `trust_remote_code` export
Dataset surface	Redistributable `synthetic-v2-hard` training/evaluation snapshot plus `training_data/DATA_SOURCES.md` and `training_data/source_citations.yaml`
Metrics	Anytime loop telemetry, self-correction/overthinking rates, per-loop accuracy, entropy trajectory, latency/token, tokens/sec, VRAM, and cross-validated benchmark comparison
Repro command	`uv run elt-train --config configs/base_1B.yaml`, `uv run python scripts/pipeline.py`, and `uv run python -m elt_lm.eval.benchmark_comparison ...`
Hardware proof	1 B non-embedding config smoke on a single RTX 3060 12 GB with paged AdamW 8-bit and documented peak VRAM
Limitations	Large tokenized binaries and long-running checkpoints are generated artifacts, not committed; model releases must cite exact public datasets and commit hashes

What's in the box

ELT core (src/elt_lm/) — N shared Transformer layers iterated L times at inference. Paper equations preserved verbatim in code.
ILSD — Intra-Loop Self-Distillation (loss = L_GT(T) + λ L_GT(S) + (1−λ) L_dist(S, sg T)) with L_int ∼ U(L_min, L_max) student, linear λ decay from 1 → 0.
GRPO — DeepSeekMath §4.1 post-training with clipped surrogate + unbiased KL to frozen SFT reference. Verifier is correct · format with length + repeat guards. Python-exec verifier for code tasks.
Memory stack for 1 B on 12 GB VRAM — two optimizer back-ends:
- paged_adamw_8bit (bitsandbytes) — peak 7.88 GB VRAM on the 1 B config, fast.
- nvme_adamw — custom 4-tier store with fp32 optimizer state memory-mapped on NVMe; params stay on GPU, state round-trips CPU→NVMe each step.
Rolling 5-minute checkpoints — round-robin rolling_{0..keep-1}.pt + last.pt hardlink + CPU/CUDA RNG state → bit-reproducible resume.
HuggingFace Hub export — trust_remote_code=True bundle (model code + weights + tokenizer + rendered README in one directory).
Streamlit dashboard — live panels for pipeline / training / storage tiers / hardware / inference Pareto / checkpoints, fed by a line-buffered JSONL telemetry writer.

Quickstart (use a published checkpoint)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok   = AutoTokenizer.from_pretrained("zapabob/elt-lm-base-275m")
model = AutoModelForCausalLM.from_pretrained(
    "zapabob/elt-lm-base-275m",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval().cuda()

ids = tok("If 3x + 7 = 22, what is x? Think step by step.",
          return_tensors="pt").input_ids.cuda()

# Same checkpoint, user picks L per call.
for L in (1, 2, 3, 4):
    out = model.generate(ids, max_new_tokens=128, L=L, do_sample=False)
    print(f"L={L}: {tok.decode(out[0], skip_special_tokens=True)}")

Architecture

input_ids ──embed──► x₀ ──g_Θ──► x₁ ──g_Θ──► x₂ ── ... ──g_Θ──► x_L ──norm + lm_head──► logits
                       └─── weights SHARED across every iteration ───┘

eq.	where	what
`g_Θ(x) = f_{θ_N} ∘ … ∘ f_{θ_1}(x)`	`src/elt_lm/composite.py`	composite block, N unique layers
`F_{N,L}(x) = g_Θ^L(x)`	`src/elt_lm/model.py`	L-fold iteration
`L_ILSD = L_GT(T) + λ L_GT(S) + (1−λ) L_dist(S, sg T)`	`src/elt_lm/losses.py`	intra-loop distillation (paper eq. 3)
`L_int ∼ U(L_min, L_max)`	`src/elt_lm/train.py`	stochastic student L
`λ: 1 → 0` (linear)	`src/elt_lm/train.py`	distillation curriculum
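
The first two rows of the table can be illustrated without any framework. This is a toy sketch, not the code in `src/elt_lm/`: `composite_block` and `looped_forward` are hypothetical names, and plain Python callables stand in for Transformer layers.

```python
from functools import reduce
from typing import Callable, Sequence

Layer = Callable[[float], float]


def composite_block(layers: Sequence[Layer]) -> Layer:
    """g_Θ = f_{θ_N} ∘ … ∘ f_{θ_1}: apply the N unique layers in order."""
    def g(x: float) -> float:
        return reduce(lambda acc, layer: layer(acc), layers, x)
    return g


def looped_forward(x: float, layers: Sequence[Layer], L: int) -> float:
    """F_{N,L}(x) = g_Θ^L(x): reuse the same block (same weights) L times."""
    if L < 1:
        raise ValueError("L must be >= 1")
    g = composite_block(layers)
    for _ in range(L):
        x = g(x)
    return x


# Toy "layers": f1 adds 1, f2 doubles, so g(x) = 2 * (x + 1).
toy_layers = [lambda x: x + 1.0, lambda x: x * 2.0]
outputs = {L: looped_forward(0.0, toy_layers, L) for L in (1, 2, 3)}
print(outputs)  # {1: 2.0, 2: 6.0, 3: 14.0}
```

The point of the sketch is that `L` only changes how many times the same block is applied; no new parameters appear as `L` grows.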

Shipped scales

config	d_model	N	non-emb	total	effective (L=4)	target hardware
`tiny_10M.yaml`	256	4	3.5 M	~67 M	16-layer	CPU smoke
`base_100M.yaml`	768	12	85 M	275 M	48-layer	12 GB GPU, fp32 Adam
`smoke_300M.yaml`	1024	16	205 M	460 M	64-layer	NvmeAdamW validation
`base_1B.yaml`	1792	28	1.09 B	1.54 B	112-layer	12 GB GPU w/ PagedAdamW8bit

Token embedding (248 K × d_model) dominates the parameter count at smaller scales; the interesting number is non-emb, which is what gets iterated.
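
The table's totals can be approximately reproduced with a few lines of arithmetic. This is a rough check under two assumptions that are not stated in the README: the vocabulary is taken as exactly 248,000 entries, and a single embedding matrix is counted (tied input/output embeddings). Helper names are illustrative.

```python
VOCAB_SIZE = 248_000  # assumption: "248 K" read as exactly 248,000


def embedding_params(d_model: int, vocab_size: int = VOCAB_SIZE) -> int:
    """Parameters in one vocab_size x d_model embedding matrix."""
    return vocab_size * d_model


def total_params(non_emb: float, d_model: int) -> float:
    """Non-embedding parameters plus one embedding matrix (tied weights assumed)."""
    return non_emb + embedding_params(d_model)


def effective_depth(n_layers: int, L: int) -> int:
    """Number of layer applications when N shared layers are iterated L times."""
    return n_layers * L


# (d_model, N, non-embedding parameters) copied from the table above.
configs = {
    "tiny_10M": (256, 4, 3.5e6),
    "base_100M": (768, 12, 85e6),
    "smoke_300M": (1024, 16, 205e6),
    "base_1B": (1792, 28, 1.092e9),
}

for name, (d_model, n_layers, non_emb) in configs.items():
    total = total_params(non_emb, d_model)
    print(f"{name}: total ~ {total / 1e6:.0f} M, "
          f"effective depth at L=4 = {effective_depth(n_layers, 4)}")
```

Under those assumptions the totals come out within about 1 M parameters of the table, and the effective depths match the `effective (L=4)` column.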

ILSD stability objective

ELT is evaluated as a loop-wise refinement system, not only as a small dense LM. The teacher is the deepest loop and the student is an intermediate loop:

z_T = logits at L_T = L_max
z_S = logits at L_S ~ U(L_min, L_max)
p_T = stopgrad(softmax(z_T / tau_T))
p_S = softmax(z_S)
L_ILSD = L_GT(T) + lambda L_GT(S) + (1 - lambda) CE(p_T, p_S)
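
As a minimal sketch of the objective above for a single token position, the following uses only the standard library. It computes the forward value only; in an autograd framework `p_teacher` would additionally be detached to realize the stopgrad. All function names are illustrative and are not taken from `src/elt_lm/losses.py`.

```python
import math
import random
from typing import Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> list:
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hard_ce(logits: Sequence[float], label: int) -> float:
    """Ground-truth cross-entropy L_GT for one position."""
    return -math.log(softmax(logits)[label] + 1e-12)


def soft_ce(target: Sequence[float], pred: Sequence[float]) -> float:
    """CE(p_T, p_S) = -sum_i p_T[i] * log p_S[i]."""
    return -sum(t * math.log(p + 1e-12) for t, p in zip(target, pred))


def linear_lambda(step: int, total_steps: int) -> float:
    """Linear decay of lambda from 1 to 0 over training."""
    return max(0.0, 1.0 - step / total_steps)


def sample_student_loops(l_min: int, l_max: int, rng: random.Random) -> int:
    """L_S ~ U(L_min, L_max), both ends inclusive."""
    return rng.randint(l_min, l_max)


def ilsd_loss(z_teacher, z_student, label, lam, tau_teacher=1.0) -> float:
    p_teacher = softmax(z_teacher, tau_teacher)  # teacher-only temperature; constant target
    p_student = softmax(z_student)               # student keeps its own sharpness
    return (hard_ce(z_teacher, label)
            + lam * hard_ce(z_student, label)
            + (1.0 - lam) * soft_ce(p_teacher, p_student))


rng = random.Random(0)
L_S = sample_student_loops(1, 4, rng)
lam = linear_lambda(step=250, total_steps=1000)
z_T, z_S = [2.0, 0.5, -1.0], [1.0, 0.8, -0.5]  # made-up logits for illustration
loss = ilsd_loss(z_T, z_S, label=0, lam=lam, tau_teacher=2.0)
print(L_S, lam, round(loss, 4))
```

At `lam = 1` the distillation term vanishes and the loss reduces to the two ground-truth terms; at `lam = 0` the student is trained only through the teacher's soft target.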

The stopgrad on p_T is intentional. It prevents the deepest loop from being pulled around by the student during self-distillation, keeping the maximum-loop path as the local teacher. The current stabilizer stack keeps teacher-only temperature and masked soft CE, then adds entropy/loop-trajectory regularizers so additional loops refine rather than collapse:

teacher-only temperature smooths the teacher target without hiding the student's actual sharpness.
entropy floor penalizes low-entropy collapse when the model becomes confidently wrong.
Delta^2 entropy curvature penalizes abrupt entropy bends along the loop axis L, matching the ELT idea of incremental refinement.
sampled Delta^2 logit curvature can be enabled after entropy metrics are stable, using sampled/top-k vocab slices instead of full-vocab curvature.
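
The entropy floor and the Delta^2 entropy curvature term can be sketched as follows. The exact functional forms used in the repository are not given here, so the hinge penalty and the squared second difference below are assumptions chosen for illustration, as are the function names.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_floor_penalty(h: float, floor: float) -> float:
    """Hinge penalty: zero while entropy stays at or above the floor (assumed form)."""
    return max(0.0, floor - h)


def second_difference(values: Sequence[float]) -> list:
    """Delta^2 v_k = v_{k+1} - 2 v_k + v_{k-1} for interior loop indices k."""
    return [values[k + 1] - 2.0 * values[k] + values[k - 1]
            for k in range(1, len(values) - 1)]


def entropy_curvature_penalty(entropy_per_loop: Sequence[float]) -> float:
    """Sum of squared second differences along the loop axis L (assumed form)."""
    return sum(d * d for d in second_difference(entropy_per_loop))


# Made-up entropy trajectories over L = 1..4.
smooth = [2.0, 1.5, 1.0, 0.5]  # steady refinement: zero curvature
abrupt = [2.0, 1.9, 0.2, 0.1]  # sudden collapse between L=2 and L=3

print(entropy_curvature_penalty(smooth), entropy_curvature_penalty(abrupt))
print(entropy_floor_penalty(h=0.1, floor=0.5))
```

A trajectory that sharpens by the same amount at every loop incurs no curvature penalty, while a sudden drop is penalized on both sides of the bend.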

The important design choice is that safety and capability alignment are handled mostly by data selection, lane verifiers, KL-constrained GRPO, and evaluation rather than by blanket refusal behavior baked into the base model.

Anytime loop evaluation

The key experimental question is not merely whether L=4 scores higher than L=1, but whether deeper loops correct shallow mistakes without overthinking correct answers. elt-anytime now emits benchmark refinement telemetry:

loop_gain(L=k)       = score(L=k) - score(L=1)
marginal_gain(L=k)   = score(L=k) - score(L=k-1)
self_correction_rate = count(L=1 wrong and L=k correct) / n_cases
overthinking_rate    = count(L=1 correct and L=k wrong) / n_cases

For each benchmark, track these alongside per-loop accuracy, entropy trajectory, latency/token, tokens/sec, and VRAM. A healthy ELT run should increase self-correction faster than overthinking as L grows.
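
A minimal sketch of these four quantities, computed from per-case 0/1 correctness at each loop count. The function name and the input layout (a dict from `L` to a list of paired case results) are assumptions for illustration, not the `elt-anytime` output format.

```python
from typing import Dict, List


def accuracy(correct: List[int]) -> float:
    return sum(correct) / len(correct)


def refinement_metrics(per_loop: Dict[int, List[int]], k: int) -> Dict[str, float]:
    """Compare loop count k against L=1 on the same, identically ordered cases."""
    base, cur = per_loop[1], per_loop[k]
    if len(base) != len(cur):
        raise ValueError("per-loop results must cover the same cases")
    n_cases = len(base)
    metrics = {
        "loop_gain": accuracy(cur) - accuracy(base),
        "self_correction_rate": sum(1 for b, c in zip(base, cur) if not b and c) / n_cases,
        "overthinking_rate": sum(1 for b, c in zip(base, cur) if b and not c) / n_cases,
    }
    if k > 1:
        metrics["marginal_gain"] = accuracy(cur) - accuracy(per_loop[k - 1])
    return metrics


# Made-up correctness for 8 paired cases.
results = {
    1: [1, 0, 0, 1, 0, 1, 0, 0],
    2: [1, 1, 0, 1, 0, 1, 0, 0],
    3: [1, 1, 1, 0, 0, 1, 1, 0],
}
m3 = refinement_metrics(results, 3)
print(m3)
```

When the score is plain accuracy, `loop_gain` equals `self_correction_rate - overthinking_rate`, so the two rates decompose a net gain into cases fixed and cases broken by deeper loops.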

1 B training on a 12 GB card

uv run elt-train --config configs/base_1B.yaml

With optim.kind: paged_adamw_8bit:

measure	value
model params	1.537 B total, 1.092 B non-emb
peak VRAM	7.88 GB
one-step smoke	~5.0 s (incl. cuDNN warm-up)

Alternative — NVMe-backed fp32 state (optim.kind: nvme_adamw):

# configs/your_run.yaml
optim:
  kind: nvme_adamw
offload:
  enabled: true
  root: H:/elt_data/offload_nvme  # where to mmap fp32 state shards
  min_free_gb: 20.0               # refuse to start if less

Measured on smoke_300M.yaml × NvmeAdamW, RTX 3060:

measure	value
params	0.46 B total, 0.21 B non-emb
peak VRAM	4.38 GB
step (fwd + bwd + NvmeAdamW.step)	128.7 s

Keeping fp32 optimizer state on NVMe reduces VRAM pressure, but NVMe bandwidth becomes the bottleneck (128.7 s per step in the measurement above); use nvme_adamw only when VRAM is the hard constraint.
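
A back-of-the-envelope estimate shows why the optimizer state is the part worth shrinking or offloading. The sketch assumes an Adam-family optimizer with two moment tensors per parameter and ignores quantization metadata, weights, gradients, and activations, so it is not a prediction of the measured peak VRAM.

```python
def adam_state_bytes(n_params: float, bytes_per_moment: int) -> float:
    """Two moment estimates (first and second) are kept per parameter."""
    return 2 * n_params * bytes_per_moment


n_params = 1.537e9  # total parameters of the 1 B config, from the table above

fp32_state = adam_state_bytes(n_params, bytes_per_moment=4)
int8_state = adam_state_bytes(n_params, bytes_per_moment=1)

print(f"fp32 Adam state:  {fp32_state / 1e9:.1f} GB")
print(f"8-bit Adam state: {int8_state / 1e9:.1f} GB")
```

Under these assumptions the fp32 state alone is larger than a 12 GB card, which is consistent with the two back-ends offered here: quantize the state to 8 bits, or keep it in fp32 but move it off the GPU.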

Training pipeline

Three phases, resumable, driven by one orchestrator:

stage	config	what it does
Phase 1 — Pretrain	`configs/base_100M.yaml` / `configs/base_1B.yaml`	ILSD with warmup-then-anneal λ, bf16 + grad-ckpt + grad-accum
Phase 2 — SFT	`configs/sft_cot.yaml`	CoT instruction + offline distillation
Phase 3 — GRPO	`configs/grpo_gsm8k.yaml`	clipped surrogate + unbiased KL, `correct × format` verifier

# End-to-end 11-stage pipeline (respects .done markers)
uv run python scripts/pipeline.py

# Register as Windows startup task — auto-resumes on every boot, removes
# itself from Task Scheduler once the final stage is done.
powershell -ExecutionPolicy Bypass -File scripts/pipeline_register.ps1
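
The `.done` marker idea can be sketched as below. This is not `scripts/pipeline.py`; the stage names, the marker file naming, and the `run_pipeline` helper are hypothetical and only show how markers make a multi-stage run resumable.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], None]]


def run_pipeline(stages: List[Stage], marker_dir: Path) -> List[str]:
    """Run stages in order, skipping any stage whose .done marker already exists."""
    marker_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for name, fn in stages:
        marker = marker_dir / f"{name}.done"
        if marker.exists():
            continue  # finished in an earlier run
        fn()            # an exception here leaves no marker, so the stage is retried
        marker.touch()  # written only after the stage returned normally
        executed.append(name)
    return executed


with tempfile.TemporaryDirectory() as tmp:
    stages = [("download", lambda: None), ("tokenize", lambda: None)]
    first_run = run_pipeline(stages, Path(tmp))
    second_run = run_pipeline(stages, Path(tmp))

print(first_run, second_run)  # ['download', 'tokenize'] []
```

Because a marker is only written after its stage succeeds, rerunning the orchestrator after a crash or reboot picks up at the first unfinished stage.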

Training data provenance

The current repository includes a redistributable snapshot of the active synthetic-v2-hard training/evaluation data under training_data/synthetic_v2_hard/. That snapshot contains verifier-backed SFT traces, intentionally wrong contrast traces, and held-out GRPO/bridge prompts for code, math, STEM reasoning, and tool-use lanes.

Source and citation metadata is tracked in:

training_data/DATA_SOURCES.md
training_data/source_citations.yaml
scripts/download_hf_corpus.py
scripts/corpus_manifest.yaml

The large tokenized *.bin files under H:/elt_data/* are generated artifacts and are not committed. For model releases, cite the exact public datasets listed in training_data/source_citations.yaml, plus this repository commit for the synthetic-v2-hard generated data. The loop/self-distillation method follows ELT / ILSD (arXiv:2604.09168); GRPO follows DeepSeekMath (arXiv:2402.03300).

Dashboard

uv sync --extra dashboard
uv run streamlit run dashboard/app.py
# → http://localhost:8501

Panels:

Pipeline — .done markers + tail of pipeline.jsonl
Training — loss / lr / grad-norm / tok-per-sec, λ curve, L_int histogram
Storage tiers — NVMe MB/s, prefetch hit rate, per-layer compute tier
Hardware — VRAM (NVML), CPU/RAM (psutil), C:/H: free
Inference Pareto — L vs. quality / latency / tok-per-sec (from inference_sweep)
Checkpoints — rolling slot, age, disk usage
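
The panels read JSONL telemetry. A line-buffered, thread-safe writer of that kind could look like the sketch below; `JsonlWriter` and its field names are assumptions for illustration, not the class in `src/elt_lm/telemetry.py`.

```python
import json
import os
import tempfile
import threading
import time


class JsonlWriter:
    """Append one JSON object per line; a lock keeps concurrent lines from interleaving."""

    def __init__(self, path: str) -> None:
        self._lock = threading.Lock()
        # buffering=1 selects line buffering in text mode, so readers see whole lines.
        self._fh = open(path, "a", buffering=1, encoding="utf-8")

    def write(self, event: str, **fields) -> None:
        line = json.dumps({"ts": time.time(), "event": event, **fields})
        with self._lock:
            self._fh.write(line + "\n")

    def close(self) -> None:
        with self._lock:
            self._fh.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.jsonl")
    writer = JsonlWriter(path)

    def worker(worker_id: int) -> None:
        for step in range(25):
            writer.write("train_step", worker=worker_id, step=step, loss=1.0 / (step + 1))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    writer.close()

    with open(path, encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh]

print(len(records))  # 100
```

Serializing each record before taking the lock keeps the critical section short, and one complete object per line lets a dashboard tail the file safely while training is running.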

HuggingFace Hub export

uv run python scripts/export_to_hf.py \
  --ckpt      runs/grpo_gsm8k/last.pt \
  --out       hf_export/elt-lm-base-275m \
  --tokenizer H:/Qwen3.5-9B-official-hf \
  --repo-id   zapabob/elt-lm-base-275m \
  --push-to-hub

Bundles configuration_elt.py, modeling_elt.py, config.json, model.safetensors, tokenizer files, and a rendered README.md. Downstream users only need pip install transformers (plus PyTorch, which the Quickstart imports).

For the Qwen3.5 side-LoRA bridge runs, export the adapter payload separately:

uv run elt-export-lora-adapter \
  --ckpt H:/elt_data/runs/grpo_side_lora_stem_synthetic_v2_bridge/last.pt \
  --out-dir H:/elt_data/adapters/qwen35_4b_side/synthetic_stem_v2_bridge_grpo_candidate

This now writes both local-runtime adapter.pt and portable adapter_model.safetensors plus adapter_config.json and a minimal model card. The 2026-05-03 stem bridge candidate has been exported at H:/elt_data/adapters/qwen35_4b_side/synthetic_stem_v2_bridge_grpo_candidate (adapter_model.safetensors, 64,987,976 bytes).

GGUF release readiness follows the current llama.cpp path: first produce a Transformers/HF directory with config.json, tokenizer files, and safetensors, then run convert_hf_to_gguf.py, optionally quantize to Q8_0, and use Turboquant-CUDA for the TQ4_1S artifact. The repo helper records the exact commands and blockers:

uv run python -m elt_lm.export_merged_qwen35_hf \
  --ckpt H:/elt_data/runs/grpo_side_lora_stem_synthetic_v2_bridge/last.pt \
  --out-dir H:/elt_data/hf_exports/elt-lm-qwen35-side-stem-v2-bridge-merged \
  --tokenizer H:/Qwen3.5-9B-official-hf \
  --repo-id zapabob/elt-lm-qwen35-side-stem-v2-bridge

uv run python -m elt_lm.release_readiness \
  --hf-dir H:/elt_data/hf_exports/elt-lm-qwen35-side-stem-v2-bridge-merged \
  --gguf-path H:/elt_data/releases/elt-lm-qwen35-side-stem-v2-bridge.gguf \
  --repo-id zapabob/elt-lm-qwen35-side-stem-v2-bridge \
  --llama-cpp-dir C:/Users/downl/Desktop/llama.cpp-zapabob \
  --turboquant-gguf-path H:/elt_data/releases/elt-lm-qwen35-side-stem-v2-bridge-TQ4_1S.gguf \
  --turboquant-source-gguf-path H:/elt_data/releases/elt-lm-qwen35-side-stem-v2-bridge-Q8_0.gguf \
  --turboquant-cuda-dir C:/Users/downl/Desktop/Turboquant-CUDA \
  --out _docs/assets/2026-05-03-deepresearch-elt-llm-implementation/release_readiness_stem_bridge_merged.json

Current status as of 2026-05-03: merged HF safetensors, llama.cpp BF16 GGUF, llama.cpp Q8_0 GGUF, and Turboquant TQ4_1S GGUF are ready for handoff. The side-LoRA bridge remains the L_min=L_max=1 path; native looped ELT runtime support is tracked separately from this release artifact.

Quantization and serving lanes

TQ4_1S currently serves as a compact GGUF weight-compression artifact. We do not claim Google TurboQuant KV-cache serving performance from this result. Instead, ELT quantization work is split into separate lanes so each claim has the right evidence:

lane	current scope	next proof point
Weight GGUF compression	BF16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, IQ4_NL, IQ4_XS, TQ4_1S	file size, PPL/KL, top-k stability, ELT exact/step accuracy
Calibration and tensor policy	ELT corpus imatrix plus protected output/token embedding/attn_v/ffn_down tensors	same-bit-budget recovery versus all-Q4 baselines
KV cache compression	llama.cpp `f16`, `q8_0`, `q4_0`, `q5_0`, `iq4_nl` K/V sweeps	ctx length, KV MiB, VRAM peak, tok/s, K/V asymmetry
TurboQuant-style KV	TheTom `turbo2`/`turbo3`/`turbo4` runtime cache types, separate from local TQ4_1S weight artifacts	K-protected `q8_0`/`bf16` sweeps, then RTX 3060 CUDA serving measurements when the GPU is free
DFlash speculative decoding	separate draft/verify serving lane	target equivalence, acceptance length/rate, tok/s, loop-depth stability

The first publishable claim is therefore not "TQ4_1S is smaller"; it is: ELT separates weight format, calibration data, tensor protection, KV-cache type, and speculative decoding, then measures where recurrent loop quality breaks and where compression remains recoverable. As of 2026-05-03, llama.cpp documents stock K/V cache types (f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1), while the TurboQuant KV PR is still an open CPU-only TBQ3/TBQ4 path and DFlash is a draft speculative-decoding PR. That keeps the current README claim honest: compact GGUF weight handoff now, TurboQuant KV and DFlash serving evidence later.

For L_max > 1 exports, elt_lm.release_readiness now reads elt_config and keeps the release blocked until the caller declares both a loop-aware llama.cpp runtime (--loop-runtime-supported) and a Turboquant converter that preserves elt.* loop metadata (--turboquant-loop-metadata-supported). Those looped artifacts use the Turboquant model family ELT/Qwen3.5-looped.
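
The blocking rule described above amounts to a small gate. The sketch below is a hypothetical reimplementation for illustration only; `ReleaseClaims` and `release_blockers` are not names from `elt_lm.release_readiness`, and the real helper checks considerably more than this.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReleaseClaims:
    l_max: int                                        # read from elt_config
    loop_runtime_supported: bool = False              # --loop-runtime-supported
    turboquant_loop_metadata_supported: bool = False  # --turboquant-loop-metadata-supported


def release_blockers(claims: ReleaseClaims) -> List[str]:
    """Return the reasons a release stays blocked; an empty list means no loop blocker."""
    blockers = []
    if claims.l_max > 1:
        if not claims.loop_runtime_supported:
            blockers.append("loop-aware llama.cpp runtime not declared")
        if not claims.turboquant_loop_metadata_supported:
            blockers.append("Turboquant converter preserving elt.* metadata not declared")
    return blockers


print(release_blockers(ReleaseClaims(l_max=1)))
print(release_blockers(ReleaseClaims(l_max=3)))
print(release_blockers(ReleaseClaims(l_max=3, loop_runtime_supported=True,
                                     turboquant_loop_metadata_supported=True)))
```

Both declarations are required for `L_max > 1`; a single-loop export is not affected by either flag.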

2026-05-17 L=3 TheTom K-protected KV sweep

The current L_max=3 handoff now has corrected BF16/Q8_0/TQ4_1S GGUF artifacts plus a TheTom runtime KV smoke sweep. The sweep protects K by keeping --cache-type-k at only q8_0 or bf16; only V is swept through TheTom turbo2, turbo3, and turbo4. This is deliberately separate from TQ4_1S, which remains an offline GGUF weight-compression artifact.

artifact	path	bytes	GiB	metadata check
BF16 GGUF	`H:/elt_data/releases/elt-lm-qwen35-side-stem-aha-ilsd-l3.gguf`	9,695,800,320	9.03	`qwen35.block_count=32`, `elt.loop.L_max=3`, no MTP nextn layer
Q8_0 GGUF	`H:/elt_data/releases/elt-lm-qwen35-side-stem-aha-ilsd-l3-Q8_0.gguf`	5,157,841,920	4.80	`general.file_type=7`, `elt.gguf.runtime_status=requires_looped_qwen35_runtime`
TQ4_1S GGUF	`H:/elt_data/releases/elt-lm-qwen35-side-stem-aha-ilsd-l3-TQ4_1S.gguf`	4,467,422,272	4.16	`hypura.turboquant.weight.codec=tq4_1s`, 427 tensors, offset range valid

Runtime evidence uses the installed TheTom-capable llama.cpp binary: llama-cli.exe --version -> 9451 (68124bdbe), whose help lists turbo2, turbo3, and turbo4 as legal --cache-type-k / --cache-type-v values. A 2026-05-20 parser probe also requested turbo8; the installed runtime rejected it as Unsupported cache type: turbo8, so Turbo8 is reported as an unsupported runtime value rather than as a throughput or quality result. Because the RTX 3060 was busy during this run, the sweep used -ngl 0, -c 128, -n 8, and is a CPU/offload runtime smoke, not a CUDA throughput claim. The verbose logs do prove the cache policy, for example K (q8_0) with V (turbo2) and no turbo* K path.

cache policy	ok / total	decode tok/s mean +/- SEM	KV MiB	delta vs `K=q8_0/V=q8_0`	p
`K=f16_V=f16`	2 / 2	1.21 +/- 0.27	8.00	-0.75	0.6
`K=q8_0_V=q8_0`	2 / 2	1.96 +/- 0.04	4.25	baseline	baseline
`K=q8_0_V=turbo2`	2 / 2	1.84 +/- 0.07	2.86	-0.12	0.6
`K=bf16_V=turbo2`	2 / 2	2.12 +/- 0.95	4.73	+0.16	1.0
`K=q8_0_V=turbo3`	2 / 2	3.18 +/- 0.36	2.91	+1.22	0.6
`K=bf16_V=turbo3`	2 / 2	2.54 +/- 0.64	4.78	+0.58	1.0
`K=q8_0_V=turbo4`	2 / 2	2.59 +/- 0.23	3.19	+0.63	0.6
`K=bf16_V=turbo4`	1 / 2	4.21 +/- 0.00	5.06	+2.29	n/a
`K=q8_0_V=turbo8`	0 / 2	unsupported	n/a	n/a	n/a
`K=bf16_V=turbo8`	0 / 2	unsupported	n/a	n/a	n/a

The paired p-values come from two repeated blocks where available, so they are descriptive smoke evidence rather than a strong significance claim. The one bf16/turbo4 failure kept a partial log with KV allocation and one later successful generation; treat that policy as lower-confidence until rerun on an idle GPU. The strongest current runtime result is the policy boundary: K-protected TheTom V-cache execution is live for L_max=3 GGUF, while looped ELT quality still requires a loop-aware Qwen3.5 runtime because the GGUF marks elt.gguf.runtime_status=requires_looped_qwen35_runtime.

Artifacts:

_docs/assets/2026-05-17-l3-thetom-k-protected/thetom_k_protected_kv_raw.csv
_docs/assets/2026-05-17-l3-thetom-k-protected/thetom_k_protected_kv_summary.csv
_docs/assets/2026-05-17-l3-thetom-k-protected/thetom_k_protected_kv_pairwise.csv
_docs/assets/2026-05-17-l3-thetom-k-protected/thetom_k_protected_kv_report.md
_docs/assets/2026-05-17-l3-thetom-k-protected/gptimage_l3_thetom_k_protected_summary.png

2026-05-20 KV/Triality/entropy evidence bundle

The latest publication-facing bundle joins the K-protected KV sweep with the Turboquant-CUDA Triality SO(8) rotation audit, ILSD self-distillation telemetry, and paired loop-aware CV statistics. It is meant for AI-engineering review: the plot highlights error bars and p-values, while unsupported turbo8 remains visibly separate from the working turbo3/turbo4 paths.

Triality SO(8) vector-view audit, copied from the Turboquant-CUDA production manifest, passed all 4608 audited rows with 0 outliers at the 0.01 orthogonality/determinant gate:

V bit target	max orth err	mean det	max det err	status
3	4.870e-03	1.000145130	7.628e-03	pass
4	4.754e-03	1.000145894	6.427e-03	pass
8	6.269e-03	1.000028411	7.111e-03	pass

ILSD monitoring found no non-finite loss, distance, or entropy values in the current L2/L3 side-LoRA runs. The L3 runs are deliberately flagged as a stability watch surface because max L_dist rises to 9.128 (code), 8.832 (math), 9.233 (stem), and 11.343 (tool), while final L_entropy stays small (0.000 to 0.008). This supports "monitoring is in place"; it is not a broad convergence guarantee.

Loop-aware paired CV over 32 local STEM bridge cases reports mean +/- SEM:

group	n	accuracy	SEM	95% CI
L1	32	0.4375	0.0891	[0.2629, 0.6121]
L2	32	0.5625	0.0891	[0.3879, 0.7371]
L3	32	0.6562	0.0853	[0.4891, 0.8234]

Paired permutation p-values are 0.122488 for L1-L2, 0.016098 for L1-L3, and 0.254775 for L2-L3; the within-block Friedman permutation p-value is 0.002500.

A logged lm-eval-harness serving-surface gate now covers 128 external heldout cases: 64 native MMLU-STEM MCQ rows plus 64 GSM8K test rows converted into numeric multiple-choice questions. It runs the Q8_0 GGUF with llama-server -ngl 999, using the harness evaluator plus a local /completion log-prob adapter because this installed server exposes the newer llama.cpp logprob schema. This is a larger CV gate, not a broad leaderboard result or standard GSM8K exact-match claim.

K/V policy	folds	accuracy mean +/- SEM	95% CI	status
K=q8_0, V=turbo3	8	0.5547 +/- 0.0630	[0.4312, 0.6781]	ok
K=bf16, V=turbo3	8	0.5625 +/- 0.0765	[0.4125, 0.7125]	ok
K=q8_0, V=turbo4	8	0.5469 +/- 0.0763	[0.3973, 0.6965]	ok
K=bf16, V=turbo4	8	0.5469 +/- 0.0763	[0.3973, 0.6965]	ok
K=q8_0, V=turbo8	0	n/a	n/a	unsupported cache type
K=bf16, V=turbo8	0	n/a	n/a	unsupported cache type

Paired within-fold permutation p-values over the measured policies are:

comparison	mean delta	p
K=q8_0,V=turbo3 - K=bf16,V=turbo3	-0.0078	1.000000
K=q8_0,V=turbo3 - K=q8_0,V=turbo4	0.0078	1.000000
K=q8_0,V=turbo3 - K=bf16,V=turbo4	0.0078	1.000000
K=bf16,V=turbo3 - K=q8_0,V=turbo4	0.0156	0.750973
K=bf16,V=turbo3 - K=bf16,V=turbo4	0.0156	0.750973
K=q8_0,V=turbo4 - K=bf16,V=turbo4	0.0000	1.000000

The four-policy Friedman within-fold permutation p-value is 0.781022. Benchmark slices are stable in the same direction: MMLU-STEM is 0.7031 accuracy for all four measured policies, while GSM8K numeric-MCQ ranges from 0.3906 to 0.4219. turbo8 is only a parser/runtime-support probe in this installed llama.cpp build (Unsupported cache type: turbo8).

Artifacts:

_docs/assets/2026-05-20-kv-triality-goal/kv_triality_goal_report.md
_docs/assets/2026-05-20-kv-triality-goal/kv_triality_goal_report.json
_docs/assets/2026-05-20-kv-triality-goal/gptimage2_kv_triality_goal_dashboard.png
_docs/assets/2026-05-20-kv-triality-goal/loop_aware_l123_cv_stats.json
_docs/assets/2026-05-20-kv-triality-goal/triality_so8_rotation_audit_summary.csv
_docs/assets/2026-05-20-lm-eval-gguf-kv-cv/lm_eval_gguf_kv_cv_report.md
_docs/assets/2026-05-20-lm-eval-gguf-kv-cv/lm_eval_gguf_kv_cv_report.json
_docs/assets/2026-05-20-lm-eval-gguf-kv-cv/gptimage2_lm_eval_gguf_kv_cv.png
_docs/assets/2026-05-20-lm-eval-gguf-kv-cv-128/lm_eval_gguf_kv_cv_report.md
_docs/assets/2026-05-20-lm-eval-gguf-kv-cv-128/lm_eval_gguf_kv_cv_report.json
_docs/assets/2026-05-20-lm-eval-gguf-kv-cv-128/gptimage2_lm_eval_gguf_kv_cv.png
_docs/assets/2026-05-21-goal-completion-audit/goal_completion_audit.md

2026-05-21 best BF16 GGUF headless lm-eval CV

The BF16 release artifact elt-lm-qwen35-side-stem-aha-ilsd-l3-BF16.gguf has a direct headless lm-eval-harness K/V cache CV run over the same 128 external heldout cases used above. The run uses llama-server -ngl 999 and compares four measured policies plus turbo8 parser probes.

BF16 GGUF policy	folds	accuracy mean +/- SEM	95% CI	status
`K=bf16,V=turbo3`	8	0.5547 +/- 0.0662	[0.4249, 0.6845]	selected BF16 policy
`K=q8_0,V=turbo3`	8	0.5547 +/- 0.0662	[0.4249, 0.6845]	tied
`K=bf16,V=turbo4`	8	0.5469 +/- 0.0754	[0.3991, 0.6947]	ok
`K=q8_0,V=turbo4`	8	0.5391 +/- 0.0708	[0.4003, 0.6778]	ok after single-policy retry
`K=bf16,V=turbo8`	0	n/a	n/a	unsupported cache type
`K=q8_0,V=turbo8`	0	n/a	n/a	unsupported cache type

The selected BF16 serving policy is K=bf16,V=turbo3: it ties the best overall mean and preserves the BF16 K-cache path. Pairwise p-values among measured groups are all non-significant (p >= 0.750973), with Friedman p 0.943806. MMLU-STEM ranges from 0.6875 to 0.7188; GSM8K numeric-MCQ ranges from 0.3906 to 0.4062. As above, GSM8K is a numeric multiple-choice transform, not standard exact-match generation, and stock GGUF serving is not native loop-aware L>=2 quality.

Artifacts:

_docs/assets/2026-05-21-best-bf16-gguf-lm-eval-cv/best_bf16_selection.md
_docs/assets/2026-05-21-best-bf16-gguf-lm-eval-cv/lm_eval_gguf_kv_cv_report.md
_docs/assets/2026-05-21-best-bf16-gguf-lm-eval-cv/lm_eval_gguf_kv_cv_report.json
_docs/assets/2026-05-21-best-bf16-gguf-lm-eval-cv/gptimage2_lm_eval_gguf_kv_cv.png

2026-05-17 L=3 LLM evidence gates

The current L_max=3 artifact also has the first README-worthy LLM quality gate with at least 128 cases, an external heldout slice, GPU-offload proof, and loop-aware quality measurement. The GGUF path below is Q8_0 on the installed TheTom-capable llama.cpp runtime with -ngl 999, K=q8_0, V=turbo3; the loop-aware path is the HF/PyTorch Qwen3.5 runtime because stock GGUF execution still marks elt.gguf.runtime_status=requires_looped_qwen35_runtime.

evaluation	n	correct	accuracy	Wilson 95% CI	SEM	prompt tok/s	decode tok/s
Local STEM bridge	128	121	94.5%	[89.1, 97.3]	2.01%	1241.74	50.02
MMLU-STEM heldout	16	13	81.2%	[57.0, 93.4]	9.76%	1268.50	54.24
GSM8K heldout	16	0	0.0%	[0.0, 19.4]	0.00%	1013.86	48.64
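
The Wilson interval and SEM columns can be recomputed from `n` and `correct` alone. The sketch assumes the usual z = 1.96 for a 95% interval and the binomial SEM sqrt(p(1-p)/n); with those choices it reproduces the table rows above.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = correct / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def binomial_sem(correct: int, n: int) -> float:
    p = correct / n
    return math.sqrt(p * (1 - p) / n)


for name, correct, n in [("Local STEM bridge", 121, 128),
                         ("MMLU-STEM heldout", 13, 16),
                         ("GSM8K heldout", 0, 16)]:
    lo, hi = wilson_interval(correct, n)
    print(f"{name}: [{100 * lo:.1f}, {100 * hi:.1f}]  SEM {100 * binomial_sem(correct, n):.2f}%")
```

Unlike the normal approximation, the Wilson interval stays inside [0, 1] and remains informative at 0/16, which is why the GSM8K row still has a non-trivial upper bound.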

Pairwise accuracy p-values use two-sided Fisher exact tests over correct/incorrect counts:

comparison	p
Local STEM bridge vs MMLU-STEM heldout	0.0835
Local STEM bridge vs GSM8K heldout	3.56e-16
MMLU-STEM heldout vs GSM8K heldout	3.22e-06
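
A two-sided Fisher exact test on a 2x2 correct/incorrect table needs only binomial coefficients. The sketch below uses the common convention of summing the probabilities of all tables that are no more likely than the observed one; whether the README's numbers were produced with exactly this convention is an assumption.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided p-value for the table [[a, b], [c, d]] with fixed margins."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x successes in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so ties with the observed table are counted.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# MMLU-STEM heldout (13 correct, 3 wrong) vs GSM8K heldout (0 correct, 16 wrong).
p_mmlu_vs_gsm8k = fisher_exact_two_sided(13, 3, 0, 16)
print(f"{p_mmlu_vs_gsm8k:.3g}")  # 3.22e-06
```

For the last row of the table this gives about 3.22e-06, matching the reported value.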

Loop-aware quality uses paired case IDs on 32 local STEM bridge questions and scores multiple-choice log-probability, not free-form generation:

L	n	correct	accuracy	Wilson 95% CI	SEM	mean margin	wall sec/case
1	32	14	43.8%	[28.2, 60.7]	8.77%	0.0234	0.661
2	32	18	56.2%	[39.3, 71.8]	8.77%	0.1445	1.225
3	32	21	65.6%	[48.3, 79.6]	8.40%	0.3154	1.780

Paired McNemar exact p-values over discordant cases:

comparison	improved	discordant	p
L1 vs L2	4	4	0.125
L1 vs L3	7	7	0.0156
L2 vs L3	3	3	0.25
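
The exact McNemar test depends only on the discordant pairs: under the null hypothesis each discordant case is equally likely to improve or worsen. A minimal sketch (the function name is illustrative):

```python
from math import comb


def mcnemar_exact(improved: int, worsened: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts."""
    n = improved + worsened
    if n == 0:
        return 1.0
    k = min(improved, worsened)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n  # P(X <= k), X ~ Bin(n, 1/2)
    return min(1.0, 2.0 * tail)


# (improved, discordant) pairs from the table above; worsened = discordant - improved.
for label, improved, discordant in [("L1 vs L2", 4, 4), ("L1 vs L3", 7, 7), ("L2 vs L3", 3, 3)]:
    print(label, mcnemar_exact(improved, discordant - improved))
```

With every discordant case improving, the p-value is 2 / 2^n, which gives 0.125, 0.015625, and 0.25 for 4, 7, and 3 discordant cases. That is also why L1 vs L2 and L2 vs L3 cannot reach p < 0.05 on this sample regardless of effect size.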

The supportable public claim is therefore narrow: the L=3 handoff is strong on the local STEM bridge task, MMLU-STEM is promising but a small cached slice, and loop depth helps in the loop-aware runtime. GSM8K is explicitly not solved (0/16), so this is not yet a broad mathematical reasoning or general LLM leaderboard claim.

Artifacts:

_docs/assets/2026-05-17-l3-thetom-k-protected/l3_readme_stats.json
_docs/assets/2026-05-17-l3-thetom-k-protected/l3_readme_stats.md
_docs/assets/2026-05-17-l3-thetom-k-protected/l3_readme_accuracy_errorbars.png

GGUF quantization CV report

The 2026-05-03 GGUF release check compares the three handoff artifacts under the same local llama.cpp CUDA runtime on the RTX 3060:

format	size	prompt eval tok/s	decode tok/s	perplexity	logits KL vs BF16	KV / recurrent state
BF16	9.03 GiB	217.15 ± 13.59	27.83 ± 0.07	11313.67	baseline	1024 / 6432 MiB
Q8_0	4.80 GiB	248.77 ± 16.75	40.40 ± 0.15	13677.23	1.6404	1024 / 6432 MiB
TQ4_1S	4.16 GiB	0.79 ± 0.02	0.69 ± 0.03	21648.79	1.8819	1024 / 6432 MiB

Runtime statistics use llama-bench with paired f16/f16 KV cache blocks, n=3 repetitions, mean ± SEM, and SciPy repeated-measures tests. The omnibus Friedman p-value is 0.049787 for both prompt-eval and decode throughput; the pairwise Wilcoxon p-values are 0.25 because the sample is intentionally short. Perplexity and logits KL are one-chunk release checks over verifier-backed synthetic-v2 hard held-out text, not broad lm-eval leaderboard claims. A q8_0/q8_0 KV-cache bench was attempted first, but the BF16 GGUF failed context creation with that cache setting on this local runtime, so the paired comparison uses the common f16/f16 cache path. This section evaluates local GGUF weight artifacts only; it is not a TurboQuant KV-cache or DFlash serving result.
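
The reported p-values are the floor values for this design, which a short calculation makes visible. The sketch assumes the three formats rank identically in all three repetitions (an inference from the reported numbers, not something the README states) and uses placeholder measurements that only encode that ordering. It implements the Friedman statistic with the chi-square approximation and no tie handling.

```python
import math
from typing import List, Sequence


def rank_row(values: Sequence[float]) -> List[int]:
    """Ranks 1..k within one block; assumes no ties."""
    order = sorted(range(len(values)), key=lambda j: values[j])
    ranks = [0] * len(values)
    for rank, j in enumerate(order, start=1):
        ranks[j] = rank
    return ranks


def friedman_statistic(blocks: Sequence[Sequence[float]]) -> float:
    """Q = 12 / (n k (k+1)) * sum_j R_j^2 - 3 n (k+1), with R_j the rank sums."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0] * k
    for row in blocks:
        for j, rank in enumerate(rank_row(row)):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


def chi2_sf_df2(x: float) -> float:
    """Chi-square survival function for 2 degrees of freedom: exp(-x / 2)."""
    return math.exp(-x / 2.0)


def wilcoxon_min_two_sided_p(n_pairs: int) -> float:
    """Smallest exact two-sided signed-rank p-value: all differences share one sign."""
    return 2.0 / 2 ** n_pairs


# Placeholder values (not measurements): columns are BF16, Q8_0, TQ4_1S,
# ordered TQ4_1S < BF16 < Q8_0 in every repetition.
blocks = [[2.0, 3.0, 1.0]] * 3
q_stat = friedman_statistic(blocks)
p_friedman = chi2_sf_df2(q_stat)
print(q_stat, round(p_friedman, 6), wilcoxon_min_two_sided_p(3))
```

With k = 3 formats and n = 3 repetitions the largest possible statistic is Q = 6, so the chi-square approximation cannot go below exp(-3) ≈ 0.049787, and an exact pairwise Wilcoxon test cannot go below 0.25. More repetitions are needed before either test can separate the formats more sharply.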

Artifacts:

_docs/assets/2026-05-03-gguf-quant-cv-gptimage/gguf_quant_cv_report.json
_docs/assets/2026-05-03-gguf-quant-cv-gptimage/gguf_quant_cv_summary.csv
_docs/assets/2026-05-03-gguf-quant-cv-gptimage/gguf_quant_cv_pairwise.csv
_docs/assets/2026-05-03-gguf-quant-cv-gptimage/gguf_quant_cv_omnibus.csv

Cross-validated benchmark comparison

elt-anytime already emits case-level correctness and K-fold accuracy summaries for local verifier-backed benchmark manifests. For vanilla-vs-finished model comparison, preserve paired case/fold order and run:

uv run python -m elt_lm.eval.benchmark_comparison \
  --input reports/vanilla_vs_complete_groups.json \
  --out-json reports/vanilla_vs_complete_stats.json \
  --out-md reports/vanilla_vs_complete_stats.md

Input schema:

{
  "benchmark": "mmlu_stem_cv",
  "groups": {
    "vanilla": [0, 1, 0, 1],
    "sft_replay": [0, 1, 1, 1],
    "complete": [1, 1, 1, 1]
  }
}

The report includes mean, SD, SEM, 95% CI, paired permutation p-values for every group pair, and a Friedman within-block permutation p-value when at least three groups are supplied. Use lm-eval-harness for broad external tasks with logged samples, e.g. lm-eval run --model hf --model_args pretrained=<hf_export_dir> --tasks gsm8k,mmlu_stem,hellaswag --output_path <dir> --log_samples, then convert the paired sample correctness arrays into the JSON schema above.
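
To make the schema and the paired statistics concrete, here is a small stand-alone sketch. It is not the `elt_lm.eval.benchmark_comparison` implementation: the helper names are hypothetical, the 95% CI uses a normal approximation with the sample standard deviation, and the permutation test enumerates all sign flips exactly, which is only practical for short arrays.

```python
import json
import math
from itertools import combinations, product
from statistics import mean, stdev
from typing import Dict, List


def summarize(scores: List[float]) -> Dict[str, object]:
    n = len(scores)
    m = mean(scores)
    sd = stdev(scores) if n > 1 else 0.0  # sample SD (n - 1 denominator)
    sem = sd / math.sqrt(n)
    return {"mean": m, "sd": sd, "sem": sem, "ci95": (m - 1.96 * sem, m + 1.96 * sem)}


def paired_sign_flip_p(a: List[float], b: List[float]) -> float:
    """Exact two-sided paired permutation test over all 2^n sign flips."""
    diffs = [y - x for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            hits += 1
    return hits / total


payload = json.loads("""
{
  "benchmark": "mmlu_stem_cv",
  "groups": {
    "vanilla": [0, 1, 0, 1],
    "sft_replay": [0, 1, 1, 1],
    "complete": [1, 1, 1, 1]
  }
}
""")

groups = payload["groups"]
summary = {name: summarize(scores) for name, scores in groups.items()}
pairwise = {f"{x} vs {y}": paired_sign_flip_p(groups[x], groups[y])
            for x, y in combinations(groups, 2)}
print(summary["vanilla"]["mean"], pairwise)
```

The arrays must stay in the same case or fold order across groups, since the test pairs entries by position.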

Current measured bridge diagnostics are limited to internal synthetic-v2 bridge verifiers, not broad lm-eval claims: stem is the only export/eval candidate (mean correct 0.8958, final correct 1.0), code and math are sparse-success lanes, and tool-use is blocked because reward/advantage signal remained zero. Full vanilla-vs-complete lm-eval p-values should not be reported until both groups have completed the same paired task set.

Rolling checkpoints

rolling_{0..keep-1}.pt round-robin every rolling_ckpt_interval_sec (5 min default)
last.pt hardlinked to the latest save — resume anchor
step_*.pt milestone saves every save_every
CPU + CUDA RNG state in each save → deterministic resume

Crash loses at most one interval; --resume runs/<dir>/last.pt picks up.
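
The slot rotation, `last.pt` hardlink, and RNG capture can be sketched with the standard library. This is an illustration under assumptions: `pickle` and Python's `random` module stand in for the real checkpoint format and the CPU/CUDA RNG state, and `save_rolling` / `resume` are hypothetical names.

```python
import os
import pickle
import random
import shutil
import tempfile
from pathlib import Path


def save_rolling(state: dict, run_dir: Path, save_index: int, keep: int = 3) -> Path:
    """Write into slot save_index % keep, then point last.pt at that file."""
    run_dir.mkdir(parents=True, exist_ok=True)
    slot = run_dir / f"rolling_{save_index % keep}.pt"
    tmp = slot.with_suffix(".tmp")
    with open(tmp, "wb") as fh:
        pickle.dump({"state": state, "py_rng": random.getstate()}, fh)
    os.replace(tmp, slot)  # atomic swap: a crash never leaves a half-written slot
    last = run_dir / "last.pt"
    if last.exists():
        last.unlink()
    try:
        os.link(slot, last)       # hardlink: no extra disk space
    except OSError:
        shutil.copy2(slot, last)  # fallback where hardlinks are unavailable
    return slot


def resume(run_dir: Path) -> dict:
    with open(run_dir / "last.pt", "rb") as fh:
        payload = pickle.load(fh)
    random.setstate(payload["py_rng"])  # continue the same random stream
    return payload["state"]


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    random.seed(0)
    slots = [save_rolling({"step": i}, run_dir, i).name for i in range(5)]
    expected_next = random.random()  # what an uninterrupted run would draw next
    state = resume(run_dir)
    replayed = random.random()       # the resumed run draws the same value

print(slots, state, replayed == expected_next)
```

Restoring the RNG state along with the model state is what lets a resumed run draw the same random numbers as an uninterrupted one.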

Repo layout

src/elt_lm/          model, layers, losses, train loops, HF wrapper
src/elt_lm/offload/  4-tier store, NvmeAdamW, prefetcher, placement planner
src/elt_lm/hf/       trust_remote_code bundle (ELTConfig, ELTForCausalLM)
src/elt_lm/eval/     any-time L-sweep, verifiers, python-exec guard
src/elt_lm/telemetry.py  thread-safe JSONL writer
dashboard/           Streamlit app + panels + metrics reader
configs/             tiny_10M / base_100M / smoke_300M / base_1B / sft_cot / grpo_gsm8k
scripts/             data DL / clean / tokenize / pipeline / HF export / 1B VRAM smoke
tests/               105 tests; `uv run pytest -q`
_docs/               implementation log (YYYY-MM-DD-<slug>-<AI>.md)

Install

uv sync                             # core
uv sync --extra offload_8bit        # + bitsandbytes for paged_adamw_8bit
uv sync --extra dashboard           # + streamlit / plotly / pynvml / psutil
uv sync --extra dev                 # + pytest, for running the suite
uv run pytest -q                    # 105 passing

Roadmap

Citation

@article{goyal2026elt,
  title   = {Elastic Looped Transformers for Visual Generation},
  author  = {Goyal et al.},
  journal = {arXiv:2604.09168},
  year    = {2026}
}
@article{shao2024deepseekmath,
  title   = {DeepSeekMath: Pushing the Limits of Mathematical Reasoning
             in Open Language Models},
  author  = {Shao et al.},
  journal = {arXiv:2402.03300},
  year    = {2024}
}

License

Apache 2.0 (model weights + code). Tokenizer inherits from Qwen3.5 — see the upstream repo for its terms.

If you find this useful, a star helps others discover it.

