| license | apache-2.0 | |||||
|---|---|---|---|---|---|---|
| library_name | transformers | |||||
| pipeline_tag | text-generation | |||||
| base_model | huihui-ai/Huihui-Qwen3.5-4B-Claude-4.6-Opus-abliterated | |||||
| tags |
|
TL;DR: A causal LM whose Transformer layers are weight-shared across L iterations. Pick `L ∈ [1, 4]` per request at inference to trade quality for latency, all from the same checkpoint. Trained with Intra-Loop Self-Distillation (ILSD) + GRPO with a `correct × format` verifier. Scales shipped from 10M to 1 B non-embedding, runnable on a single RTX 3060 12 GB via 8-bit paged or fp32-on-NVMe optimizer state. Faithful PyTorch port of arXiv:2604.09168.

| Field | Current public evidence |
|---|---|
| Model surface | Elastic Looped Transformer causal LM with shared layers, selectable inference loop count L, ILSD, GRPO, and HuggingFace trust_remote_code export |
| Dataset surface | Redistributable synthetic-v2-hard training/evaluation snapshot plus training_data/DATA_SOURCES.md and training_data/source_citations.yaml |
| Metrics | Anytime loop telemetry, self-correction/overthinking rates, per-loop accuracy, entropy trajectory, latency/token, tokens/sec, VRAM, and cross-validated benchmark comparison |
| Repro command | uv run elt-train --config configs/base_1B.yaml, uv run python scripts/pipeline.py, and uv run python -m elt_lm.eval.benchmark_comparison ... |
| Hardware proof | 1 B non-embedding config smoke on a single RTX 3060 12 GB with paged AdamW 8-bit and documented peak VRAM |
| Limitations | Large tokenized binaries and long-running checkpoints are generated artifacts, not committed; model releases must cite exact public datasets and commit hashes |

- ELT core (`src/elt_lm/`): N shared Transformer layers iterated L times at inference. Paper equations preserved verbatim in code.
- ILSD: Intra-Loop Self-Distillation (`loss = L_GT(T) + λ L_GT(S) + (1−λ) L_dist(S, sg T)`) with `L_int ∼ U(L_min, L_max)` student, linear λ decay from 1 → 0.
- GRPO: DeepSeekMath §4.1 post-training with clipped surrogate + unbiased KL to frozen SFT reference. Verifier is `correct · format` with length + repeat guards. Python-exec verifier for code tasks.
- Memory stack for 1 B on 12 GB VRAM, with two optimizer back-ends:
  - `paged_adamw_8bit` (bitsandbytes): peak 7.88 GB VRAM on the 1 B config, fast.
  - `nvme_adamw`: custom 4-tier store with fp32 optimizer state memory-mapped on NVMe; params stay on GPU, state round-trips CPU→NVMe each step.
- Rolling 5-minute checkpoints: round-robin `rolling_{0..keep-1}.pt` + `last.pt` hardlink + CPU/CUDA RNG state → bit-reproducible resume.
- HuggingFace Hub export: `trust_remote_code=True` bundle (model code + weights + tokenizer + rendered README in one directory).
- Streamlit dashboard: live panels for pipeline / training / storage tiers / hardware / inference Pareto / checkpoints, fed by a line-buffered JSONL telemetry writer.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    tok = AutoTokenizer.from_pretrained("zapabob/elt-lm-base-275m")
    model = AutoModelForCausalLM.from_pretrained(
        "zapabob/elt-lm-base-275m",
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
    ).eval().cuda()

    ids = tok("If 3x + 7 = 22, what is x? Think step by step.",
              return_tensors="pt").input_ids.cuda()

    # Same checkpoint, user picks L per call.
    for L in (1, 2, 3, 4):
        out = model.generate(ids, max_new_tokens=128, L=L, do_sample=False)
        print(f"L={L}: {tok.decode(out[0], skip_special_tokens=True)}")

Data flow (the same block `g_Θ` is applied at every iteration):

    input_ids ──embed──► x₀ ──g_Θ──► x₁ ──g_Θ──► x₂ ── ... ──g_Θ──► x_L ──norm + lm_head──► logits
                         └─── weights SHARED across every iteration ───┘

| eq. | where | what |
|---|---|---|
| `g_Θ(x) = f_{θ_N} ∘ … ∘ f_{θ_1}(x)` | `src/elt_lm/composite.py` | composite block, N unique layers |
| `F_{N,L}(x) = g_Θ^L(x)` | `src/elt_lm/model.py` | L-fold iteration |
| `L_ILSD = L_GT(T) + λ L_GT(S) + (1−λ) L_dist(S, sg T)` | `src/elt_lm/losses.py` | intra-loop distillation (paper eq. 3) |
| `L_int ∼ U(L_min, L_max)` | `src/elt_lm/train.py` | stochastic student L |
| `λ: 1 → 0` (linear) | `src/elt_lm/train.py` | distillation curriculum |

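
As a rough illustration of the first two rows, the sketch below composes N toy layers into one block and applies that block L times with the same parameters. The names `make_layer`, `composite_block`, and `looped_forward` are hypothetical, and the scalar residual updates only stand in for real Transformer layers; this is not the code in `src/elt_lm/`.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(scale: float) -> Layer:
    """Toy stand-in for one layer f_theta: a residual update x + scale * x."""
    def layer(x: Vector) -> Vector:
        return [v + scale * v for v in x]
    return layer


def composite_block(layers: Sequence[Layer]) -> Layer:
    """g_Theta = f_{theta_N} o ... o f_{theta_1}: apply the N unique layers in order."""
    def block(x: Vector) -> Vector:
        for layer in layers:
            x = layer(x)
        return x
    return block


def looped_forward(block: Layer, x: Vector, num_loops: int) -> Vector:
    """F_{N,L}(x) = g_Theta^L(x): reuse the same block (same weights) L times."""
    if num_loops < 1:
        raise ValueError("num_loops must be >= 1")
    for _ in range(num_loops):
        x = block(x)
    return x


# N = 2 unique layers; effective depth grows with L, parameter count does not.
g = composite_block([make_layer(0.5), make_layer(1.0)])
x0 = [1.0, 2.0]
outputs = {L: looped_forward(g, x0, L) for L in (1, 2, 3, 4)}
```

Because every iteration calls the same `block` object, choosing a larger L adds compute but no new parameters, which is the property the per-request `L` knob relies on.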
| config | d_model | N | non-emb | total | effective (L=4) | target hardware |
|---|---|---|---|---|---|---|
| `tiny_10M.yaml` | 256 | 4 | 3.5 M | ~67 M | 16-layer | CPU smoke |
| `base_100M.yaml` | 768 | 12 | 85 M | 275 M | 48-layer | 12 GB GPU, fp32 Adam |
| `smoke_300M.yaml` | 1024 | 16 | 205 M | 460 M | 64-layer | NvmeAdamW validation |
| `base_1B.yaml` | 1792 | 28 | 1.09 B | 1.54 B | 112-layer | 12 GB GPU w/ PagedAdamW8bit |

Token embedding (248 K × d_model) dominates the parameter count at smaller scales; the interesting number is non-emb, which is what gets iterated.
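
The totals in the table can be approximated with one line of arithmetic. The sketch below assumes that "248 K" means exactly 248,000 tokens and that the embedding matrix is counted once; both are assumptions made for illustration, so the results only match the table to rounding.

```python
# Assumption: "248 K" is taken as exactly 248_000 tokens for illustration.
VOCAB_SIZE = 248_000

# (d_model, non-embedding params) copied from the config table.
CONFIGS = {
    "tiny_10M": (256, 3.5e6),
    "base_100M": (768, 85e6),
    "smoke_300M": (1024, 205e6),
    "base_1B": (1792, 1.09e9),
}


def total_params(d_model: int, non_emb: float, vocab: int = VOCAB_SIZE) -> float:
    """Total = token-embedding matrix (vocab x d_model) + non-embedding params."""
    return vocab * d_model + non_emb


for name, (d_model, non_emb) in CONFIGS.items():
    total = total_params(d_model, non_emb)
    emb_share = (VOCAB_SIZE * d_model) / total
    print(f"{name}: total ~{total / 1e6:.0f} M, embedding share {emb_share:.0%}")
```

The embedding share falls steadily from the smallest to the largest config, which is why the non-embedding count is the more informative number when comparing scales.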
ELT is evaluated as a loop-wise refinement system, not only as a small dense LM. The teacher is the deepest loop and the student is an intermediate loop:

    z_T = logits at L_T = L_max
    z_S = logits at L_S ~ U(L_min, L_max)
    p_T = stopgrad(softmax(z_T / tau_T))
    p_S = softmax(z_S)
    L_ILSD = L_GT(T) + lambda L_GT(S) + (1 - lambda) CE(p_T, p_S)

The stopgrad on p_T is intentional. It prevents the deepest loop from being
pulled around by the student during self-distillation, keeping the maximum-loop
path as the local teacher. The current stabilizer stack keeps teacher-only
temperature and masked soft CE, then adds entropy/loop-trajectory regularizers
so additional loops refine rather than collapse:
- teacher-only temperature smooths the teacher target without hiding the student's actual sharpness.
- entropy floor penalizes low-entropy collapse when the model becomes confidently wrong.
- Delta^2 entropy curvature penalizes abrupt entropy bends along the loop axis `L`, matching the ELT idea of incremental refinement.
- sampled Delta^2 logit curvature can be enabled after entropy metrics are stable, using sampled/top-k vocab slices instead of full-vocab curvature.

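
The core ILSD objective above can be sketched for a single token position in plain Python. This is a minimal illustration, not the code in `src/elt_lm/losses.py`: the function names, the default `tau_teacher=2.0`, and the inclusive integer sampling of the student loop count are assumptions, and the masking and entropy/curvature regularizers are omitted.

```python
import math
import random
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hard_ce(logits: List[float], target: int) -> float:
    """Ground-truth cross-entropy L_GT for one position."""
    return -math.log(softmax(logits)[target])


def soft_ce(p_teacher: List[float], student_logits: List[float]) -> float:
    """CE(p_T, p_S): the student is NOT temperature-scaled (teacher-only temperature)."""
    p_student = softmax(student_logits)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


def linear_lambda(step: int, total_steps: int) -> float:
    """Linear curriculum: lambda goes from 1 at step 0 to 0 at total_steps."""
    return max(0.0, 1.0 - step / total_steps)


def sample_student_loops(l_min: int, l_max: int, rng: random.Random) -> int:
    """L_S ~ U(L_min, L_max); inclusive integer bounds are an assumption."""
    return rng.randint(l_min, l_max)


def ilsd_loss(z_teacher, z_student, target, lam, tau_teacher=2.0) -> float:
    # Plain floats carry no gradient, so stopgrad is implicit here; in an
    # autograd framework p_teacher would be detached before use.
    p_teacher = softmax(z_teacher, tau_teacher)
    return (hard_ce(z_teacher, target)
            + lam * hard_ce(z_student, target)
            + (1.0 - lam) * soft_ce(p_teacher, z_student))


z_t, z_s = [2.0, 0.5, -1.0], [1.0, 0.8, -0.5]
loss_start = ilsd_loss(z_t, z_s, target=0, lam=linear_lambda(0, 100))
loss_end = ilsd_loss(z_t, z_s, target=0, lam=linear_lambda(100, 100))
```

At `lam = 1` the student is trained only on ground truth; at `lam = 0` it is trained only to match the deepest loop, which is the direction of the linear curriculum.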
The important design choice is that safety and capability alignment are handled mostly by data selection, lane verifiers, KL-constrained GRPO, and evaluation rather than by blanket refusal behavior baked into the base model.
The key experimental question is not merely whether L=4 scores higher than
L=1, but whether deeper loops correct shallow mistakes without overthinking
correct answers. elt-anytime now emits benchmark refinement telemetry:

    loop_gain(L=k) = score(L=k) - score(L=1)
    marginal_gain(L=k) = score(L=k) - score(L=k-1)
    self_correction_rate = count(L=1 wrong and L=k correct) / N
    overthinking_rate = count(L=1 correct and L=k wrong) / N

For each benchmark, track these alongside per-loop accuracy, entropy trajectory,
latency/token, tokens/sec, and VRAM. A healthy ELT run should increase
self-correction faster than overthinking as L grows.
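
A minimal sketch of how these four quantities can be computed from per-case 0/1 correctness arrays is shown below. The function name `refinement_metrics` and the input layout are hypothetical; this is not the `elt-anytime` implementation.

```python
from typing import Dict, List, Optional


def refinement_metrics(correct_by_loop: Dict[int, List[int]]) -> Dict[int, Dict[str, Optional[float]]]:
    """correct_by_loop maps loop count L to paired 0/1 correctness per case."""
    base = correct_by_loop[1]
    n = len(base)
    base_score = sum(base) / n
    report: Dict[int, Dict[str, Optional[float]]] = {}
    for k in sorted(correct_by_loop):
        cur = correct_by_loop[k]
        if len(cur) != n:
            raise ValueError("all loop counts must score the same paired cases")
        score = sum(cur) / n
        prev = correct_by_loop.get(k - 1)
        report[k] = {
            "accuracy": score,
            "loop_gain": score - base_score,
            # Undefined for the shallowest loop, so report None there.
            "marginal_gain": None if prev is None else score - sum(prev) / n,
            "self_correction_rate": sum(1 for b, c in zip(base, cur) if not b and c) / n,
            "overthinking_rate": sum(1 for b, c in zip(base, cur) if b and not c) / n,
        }
    return report


# Toy data: four cases scored at L = 1, 2, 3.
demo = refinement_metrics({
    1: [1, 0, 0, 1],
    2: [1, 1, 0, 1],
    3: [1, 1, 1, 0],
})
```

For binary scores, `loop_gain` equals `self_correction_rate` minus `overthinking_rate` by construction, so the two rates explain where a given gain (or loss) comes from.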

    uv run elt-train --config configs/base_1B.yaml

With `optim.kind: paged_adamw_8bit`:

| measure | value |
|---|---|
| model params | 1.537 B total, 1.092 B non-emb |
| peak VRAM | 7.88 GB |
| one-step smoke | ~5.0 s (incl. cuDNN warm-up) |
Alternative — NVMe-backed fp32 state (optim.kind: nvme_adamw):

    # configs/your_run.yaml
    optim:
      kind: nvme_adamw
    offload:
      enabled: true
      root: H:/elt_data/offload_nvme   # where to mmap fp32 state shards
      min_free_gb: 20.0                # refuse to start if less

Measured on `smoke_300M.yaml` × NvmeAdamW, RTX 3060:

| measure | value |
|---|---|
| params | 0.46 B total, 0.21 B non-emb |
| peak VRAM | 4.38 GB |
| step (fwd + bwd + NvmeAdamW.step) | 128.7 s |
VRAM drops further, but NVMe bandwidth becomes the bottleneck — use
nvme_adamw only when VRAM is the hard constraint.
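
A back-of-the-envelope estimate shows why an optimizer back-end is needed at the 1 B scale. The sketch assumes AdamW's two moment tensors per parameter and ignores quantization constants, paging buffers, activations, and the weights themselves, so it is a lower bound on optimizer state only, not a reproduction of the measured peaks.

```python
GIB = 1024 ** 3


def adam_state_bytes(num_params: float, bytes_per_value: float) -> float:
    """AdamW keeps two moment tensors (exp_avg, exp_avg_sq) per parameter."""
    return 2 * num_params * bytes_per_value


params_1b = 1.537e9  # total params of base_1B.yaml, from the measurement table

fp32_gib = adam_state_bytes(params_1b, 4) / GIB  # fp32 moments
int8_gib = adam_state_bytes(params_1b, 1) / GIB  # 8-bit moments, metadata ignored

print(f"fp32 Adam state : {fp32_gib:.2f} GiB")
print(f"8-bit Adam state: {int8_gib:.2f} GiB")
```

Under these assumptions, fp32 moments alone take roughly as much memory as the whole 12 GB card, while 8-bit moments take about a quarter of that; moving the fp32 state to NVMe trades that memory for I/O time instead.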
Three phases, resumable, driven by one orchestrator:
| stage | config | what it does |
|---|---|---|
| Phase 1: Pretrain | `configs/base_100M.yaml` / `configs/base_1B.yaml` | ILSD with warmup-then-anneal λ, bf16 + grad-ckpt + grad-accum |
| Phase 2: SFT | `configs/sft_cot.yaml` | CoT instruction + offline distillation |
| Phase 3: GRPO | `configs/grpo_gsm8k.yaml` | clipped surrogate + unbiased KL, `correct × format` verifier |

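
The Phase 3 ingredients can be sketched at sequence level as follows. This is a simplified illustration of the DeepSeekMath-style objective, not the repository's trainer: the real objective is averaged per token, the function names are hypothetical, and `eps=0.2` and `beta=0.04` are illustrative defaults rather than values read from `configs/grpo_gsm8k.yaml`.

```python
import math
import statistics
from typing import List


def verifier_reward(correct: bool, well_formatted: bool) -> float:
    """correct x format: reward is 1 only when both checks pass."""
    return float(correct) * float(well_formatted)


def group_advantages(rewards: List[float]) -> List[float]:
    """Group-relative advantage: standardize rewards within one prompt's samples."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0] * len(rewards)  # identical rewards give no learning signal
    return [(r - mean) / std for r in rewards]


def clipped_surrogate(logp_new: float, logp_old: float, adv: float, eps: float = 0.2) -> float:
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * adv, clipped * adv)


def unbiased_kl(logp_new: float, logp_ref: float) -> float:
    """Estimator pi_ref/pi - log(pi_ref/pi) - 1, which is always >= 0."""
    log_ratio = logp_ref - logp_new
    return math.exp(log_ratio) - log_ratio - 1.0


def grpo_objective(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04) -> float:
    """Mean over the group of (clipped surrogate - beta * KL to the frozen reference)."""
    advs = group_advantages(rewards)
    terms = [
        clipped_surrogate(n, o, a, eps) - beta * unbiased_kl(n, r)
        for n, o, r, a in zip(logp_new, logp_old, logp_ref, advs)
    ]
    return statistics.fmean(terms)


rewards = [verifier_reward(True, True), verifier_reward(True, False), verifier_reward(False, True)]
objective = grpo_objective([-1.0, -1.2, -0.9], [-1.1, -1.2, -1.0], [-1.0, -1.1, -1.0], rewards)
```

The zero-variance branch in `group_advantages` is worth noting: when every sample in a group receives the same reward, all advantages are zero and the policy gets no gradient from that prompt.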

    # End-to-end 11-stage pipeline (respects .done markers)
    uv run python scripts/pipeline.py

    # Register as Windows startup task: auto-resumes on every boot, removes
    # itself from Task Scheduler once the final stage is done.
    powershell -ExecutionPolicy Bypass -File scripts/pipeline_register.ps1

The current repository includes a redistributable snapshot of the active
synthetic-v2-hard training/evaluation data under training_data/synthetic_v2_hard/.
That snapshot contains verifier-backed SFT traces, intentionally wrong contrast
traces, and held-out GRPO/bridge prompts for code, math, STEM reasoning, and
tool-use lanes.
Source and citation metadata is tracked in:

- `training_data/DATA_SOURCES.md`
- `training_data/source_citations.yaml`
- `scripts/download_hf_corpus.py`
- `scripts/corpus_manifest.yaml`

The large tokenized *.bin files under H:/elt_data/* are generated artifacts
and are not committed. For model releases, cite the exact public datasets listed
in training_data/source_citations.yaml, plus this repository commit for the
synthetic-v2-hard generated data. The loop/self-distillation method follows
ELT / ILSD (arXiv:2604.09168); GRPO follows
DeepSeekMath (arXiv:2402.03300).

    uv sync --extra dashboard
    uv run streamlit run dashboard/app.py
    # → http://localhost:8501

Panels:

- Pipeline: `.done` markers + tail of `pipeline.jsonl`
- Training: loss / lr / grad-norm / tok-per-sec, λ curve, L_int histogram
- Storage tiers: NVMe MB/s, prefetch hit rate, per-layer compute tier
- Hardware: VRAM (NVML), CPU/RAM (psutil), C:/H: free
- Inference Pareto: L vs. quality / latency / tok-per-sec (from `inference_sweep`)
- Checkpoints: rolling slot, age, disk usage

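
The panels read JSONL telemetry, so a small sketch of a thread-safe, line-buffered writer and a tail reader may help clarify the contract between trainer and dashboard. The class and function names here are hypothetical and the record fields are invented for the demo; this is not the code in `src/elt_lm/telemetry.py`.

```python
import json
import os
import tempfile
import threading
import time
from typing import Any, Dict, List


class JsonlTelemetryWriter:
    """Append one JSON object per line; a lock keeps concurrent writes whole."""

    def __init__(self, path: str) -> None:
        self._lock = threading.Lock()
        # buffering=1 selects line buffering in text mode, so each record
        # reaches the file as soon as its newline is written.
        self._fh = open(path, "a", buffering=1, encoding="utf-8")

    def write(self, **record: Any) -> None:
        record.setdefault("ts", time.time())
        line = json.dumps(record, ensure_ascii=False)
        with self._lock:
            self._fh.write(line + "\n")

    def close(self) -> None:
        with self._lock:
            self._fh.close()


def read_tail(path: str, n: int = 5) -> List[Dict[str, Any]]:
    """Return the last n records, as a dashboard panel might on each refresh."""
    with open(path, encoding="utf-8") as fh:
        lines = fh.readlines()
    return [json.loads(line) for line in lines[-n:]]


log_path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
writer = JsonlTelemetryWriter(log_path)
for step in range(3):
    writer.write(step=step, loss=1.0 / (step + 1))
writer.close()
tail = read_tail(log_path, n=2)
```

One object per line means a reader can always parse whatever complete lines exist, even while training is still appending.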

    uv run python scripts/export_to_hf.py \
      --ckpt runs/grpo_gsm8k/last.pt \
      --out hf_export/elt-lm-base-275m \
      --tokenizer H:/Qwen3.5-9B-official-hf \
      --repo-id zapabob/elt-lm-base-275m \
      --push-to-hub

Bundles `configuration_elt.py`, `modeling_elt.py`, `config.json`,
`model.safetensors`, tokenizer files, and a rendered `README.md`. Downstream
users only need `pip install transformers` plus a PyTorch install, because the custom model code ships inside the bundle.
For the Qwen3.5 side-LoRA bridge runs, export the adapter payload separately:

    uv run elt-export-lora-adapter \
      --ckpt H:/elt_data/runs/grpo_side_lora_stem_synthetic_v2_bridge/last.pt \
      --out-dir H:/elt_data/adapters/qwen35_4b_side/synthetic_stem_v2_bridge_grpo_candidate

This now writes both local-runtime `adapter.pt` and portable
adapter_model.safetensors plus adapter_config.json and a minimal model card.
The 2026-05-03 stem bridge candidate has been exported at
H:/elt_data/adapters/qwen35_4b_side/synthetic_stem_v2_bridge_grpo_candidate
(adapter_model.safetensors, 64,987,976 bytes).
GGUF release readiness follows the current llama.cpp path: first produce a
Transformers/HF directory with config.json, tokenizer files, and safetensors,
then run convert_hf_to_gguf.py, optionally quantize to Q8_0, and use
Turboquant-CUDA for the TQ4_1S artifact. The repo helper records the exact
commands and blockers:

    uv run python -m elt_lm.export_merged_qwen35_hf \
      --ckpt H:/elt_data/runs/grpo_side_lora_stem_synthetic_v2_bridge/last.pt \
      --out-dir H:/elt_data/hf_exports/elt-lm-qwen35-side-stem-v2-bridge-merged \
      --tokenizer H:/Qwen3.5-9B-official-hf \
      --repo-id zapabob/elt-lm-qwen35-side-stem-v2-bridge

    uv run python -m elt_lm.release_readiness \
      --hf-dir H:/elt_data/hf_exports/elt-lm-qwen35-side-stem-v2-bridge-merged \
      --gguf-path H:/elt_data/releases/elt-lm-qwen35-side-stem-v2-bridge.gguf \
      --repo-id zapabob/elt-lm-qwen35-side-stem-v2-bridge \
      --llama-cpp-dir C:/Users/downl/Desktop/llama.cpp-zapabob \
      --turboquant-gguf-path H:/elt_data/releases/elt-lm-qwen35-side-stem-v2-bridge-TQ4_1S.gguf \
      --turboquant-source-gguf-path H:/elt_data/releases/elt-lm-qwen35-side-stem-v2-bridge-Q8_0.gguf \
      --turboquant-cuda-dir C:/Users/downl/Desktop/Turboquant-CUDA \
      --out _docs/assets/2026-05-03-deepresearch-elt-llm-implementation/release_readiness_stem_bridge_merged.json

Current status as of 2026-05-03: merged HF safetensors, llama.cpp BF16 GGUF,
llama.cpp Q8_0 GGUF, and Turboquant TQ4_1S GGUF are ready for handoff. The
side-LoRA bridge remains the L_min=L_max=1 path; native looped ELT runtime
support is tracked separately from this release artifact.
TQ4_1S currently serves as a compact GGUF weight-compression artifact. We do
not claim Google TurboQuant KV-cache serving performance from this result.
Instead, ELT quantization work is split into separate lanes so each claim has
the right evidence:
| lane | current scope | next proof point |
|---|---|---|
| Weight GGUF compression | BF16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, IQ4_NL, IQ4_XS, TQ4_1S | file size, PPL/KL, top-k stability, ELT exact/step accuracy |
| Calibration and tensor policy | ELT corpus imatrix plus protected output/token embedding/attn_v/ffn_down tensors | same-bit-budget recovery versus all-Q4 baselines |
| KV cache compression | llama.cpp `f16`, `q8_0`, `q4_0`, `q5_0`, `iq4_nl` K/V sweeps | ctx length, KV MiB, VRAM peak, tok/s, K/V asymmetry |
| TurboQuant-style KV | TheTom `turbo2`/`turbo3`/`turbo4` runtime cache types, separate from local TQ4_1S weight artifacts | K-protected `q8_0`/`bf16` sweeps, then RTX 3060 CUDA serving measurements when the GPU is free |
| DFlash speculative decoding | separate draft/verify serving lane | target equivalence, acceptance length/rate, tok/s, loop-depth stability |
The first publishable claim is therefore not "TQ4_1S is smaller"; it is: ELT
separates weight format, calibration data, tensor protection, KV-cache type,
and speculative decoding, then measures where recurrent loop quality breaks and
where compression remains recoverable. As of 2026-05-03, llama.cpp documents
stock K/V cache types (f32, f16, bf16, q8_0, q4_0, q4_1,
iq4_nl, q5_0, q5_1), while the TurboQuant KV PR is still an open
CPU-only TBQ3/TBQ4 path and DFlash is a draft speculative-decoding PR. That
keeps the current README claim honest: compact GGUF weight handoff now,
TurboQuant KV and DFlash serving evidence later.
For L_max > 1 exports, elt_lm.release_readiness now reads elt_config and
keeps the release blocked until the caller declares both a loop-aware llama.cpp
runtime (--loop-runtime-supported) and a Turboquant converter that preserves
elt.* loop metadata (--turboquant-loop-metadata-supported). Those looped
artifacts use the Turboquant model family ELT/Qwen3.5-looped.
The current L_max=3 handoff now has corrected BF16/Q8_0/TQ4_1S GGUF
artifacts plus a TheTom runtime KV smoke sweep. The sweep protects K by keeping
--cache-type-k at only q8_0 or bf16; only V is swept through TheTom
turbo2, turbo3, and turbo4. This is deliberately separate from
TQ4_1S, which remains an offline GGUF weight-compression artifact.
| artifact | path | bytes | GiB | metadata check |
|---|---|---|---|---|
| BF16 GGUF | `H:/elt_data/releases/elt-lm-qwen35-side-stem-aha-ilsd-l3.gguf` | 9,695,800,320 | 9.03 | `qwen35.block_count=32`, `elt.loop.L_max=3`, no MTP nextn layer |
| Q8_0 GGUF | `H:/elt_data/releases/elt-lm-qwen35-side-stem-aha-ilsd-l3-Q8_0.gguf` | 5,157,841,920 | 4.80 | `general.file_type=7`, `elt.gguf.runtime_status=requires_looped_qwen35_runtime` |
| TQ4_1S GGUF | `H:/elt_data/releases/elt-lm-qwen35-side-stem-aha-ilsd-l3-TQ4_1S.gguf` | 4,467,422,272 | 4.16 | `hypura.turboquant.weight.codec=tq4_1s`, 427 tensors, offset range valid |

Runtime evidence uses the installed TheTom-capable llama.cpp binary:
llama-cli.exe --version -> 9451 (68124bdbe), whose help lists
turbo2, turbo3, and turbo4 as legal --cache-type-k /
--cache-type-v values. A 2026-05-20 parser probe also requested turbo8;
the installed runtime rejected it as Unsupported cache type: turbo8, so
Turbo8 is reported as an unsupported runtime value rather than as a throughput
or quality result. Because the RTX 3060 was busy during this run, the sweep used
-ngl 0, -c 128, -n 8, and is a CPU/offload runtime smoke, not a CUDA
throughput claim. The verbose logs do prove the cache policy, for example
K (q8_0) with V (turbo2) and no turbo* K path.

| cache policy | ok / total | decode tok/s mean +/- SEM | KV MiB | delta vs K=q8_0/V=q8_0 | p |
|---|---|---|---|---|---|
| `K=f16_V=f16` | 2 / 2 | 1.21 +/- 0.27 | 8.00 | -0.75 | 0.6 |
| `K=q8_0_V=q8_0` | 2 / 2 | 1.96 +/- 0.04 | 4.25 | baseline | baseline |
| `K=q8_0_V=turbo2` | 2 / 2 | 1.84 +/- 0.07 | 2.86 | -0.12 | 0.6 |
| `K=bf16_V=turbo2` | 2 / 2 | 2.12 +/- 0.95 | 4.73 | +0.16 | 1.0 |
| `K=q8_0_V=turbo3` | 2 / 2 | 3.18 +/- 0.36 | 2.91 | +1.22 | 0.6 |
| `K=bf16_V=turbo3` | 2 / 2 | 2.54 +/- 0.64 | 4.78 | +0.58 | 1.0 |
| `K=q8_0_V=turbo4` | 2 / 2 | 2.59 +/- 0.23 | 3.19 | +0.63 | 0.6 |
| `K=bf16_V=turbo4` | 1 / 2 | 4.21 +/- 0.00 | 5.06 | +2.29 | n/a |
| `K=q8_0_V=turbo8` | 0 / 2 | unsupported | n/a | n/a | n/a |
| `K=bf16_V=turbo8` | 0 / 2 | unsupported | n/a | n/a | n/a |

The paired p-values come from two repeated blocks where available, so they are
descriptive smoke evidence rather than a strong significance claim. The one
bf16/turbo4 failure kept a partial log with KV allocation and one later
successful generation; treat that policy as lower-confidence until rerun on an
idle GPU. The strongest current runtime result is the policy boundary:
K-protected TheTom V-cache execution is live for L_max=3 GGUF, while looped
ELT quality still requires a loop-aware Qwen3.5 runtime because the GGUF marks
elt.gguf.runtime_status=requires_looped_qwen35_runtime.
Artifacts:

- `_docs/assets/2026-05-17-l3-thetom-k-protected/thetom_k_protected_kv_raw.csv`
- `_docs/assets/2026-05-17-l3-thetom-k-protected/thetom_k_protected_kv_summary.csv`
- `_docs/assets/2026-05-17-l3-thetom-k-protected/thetom_k_protected_kv_pairwise.csv`
- `_docs/assets/2026-05-17-l3-thetom-k-protected/thetom_k_protected_kv_report.md`
- `_docs/assets/2026-05-17-l3-thetom-k-protected/gptimage_l3_thetom_k_protected_summary.png`

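
The KV MiB column of the smoke sweep can be sanity-checked with simple arithmetic. The sketch relies on the ggml Q8_0 block layout (32 int8 values plus one fp16 scale) and assumes K and V each occupy half of the f16 footprint. The implied bit width for the `turbo*` rows is derived only from the table numbers and is not a statement about how the TheTom cache types are implemented.

```python
F16_TOTAL_MIB = 8.00  # K=f16_V=f16 row of the smoke table


def q8_0_bits_per_value() -> float:
    """ggml Q8_0 block: 32 int8 values + one fp16 scale = 34 bytes per 32 values."""
    return (32 * 1 + 2) * 8 / 32


def scaled_kv_mib(k_bits: float, v_bits: float, f16_total: float = F16_TOTAL_MIB) -> float:
    """Rescale each half of the f16 cache by its bits-per-value."""
    half = f16_total / 2  # assumption: K and V are the same size
    return half * k_bits / 16 + half * v_bits / 16


def implied_v_bits(observed_total_mib: float, k_bits: float, f16_total: float = F16_TOTAL_MIB) -> float:
    """Back out the effective V bits-per-value from an observed KV total."""
    half = f16_total / 2
    v_mib = observed_total_mib - half * k_bits / 16
    return v_mib / half * 16


q8 = q8_0_bits_per_value()
both_q8_mib = scaled_kv_mib(q8, q8)       # compare with the K=q8_0_V=q8_0 row
turbo2_bits = implied_v_bits(2.86, q8)     # from the K=q8_0_V=turbo2 row
turbo2_bits_bf16 = implied_v_bits(4.73, 16.0)  # from the K=bf16_V=turbo2 row
```

The Q8_0 prediction lands on the reported 4.25 MiB, and the two `turbo2` rows imply nearly the same effective V width, which suggests the KV MiB column is internally consistent.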
The latest publication-facing bundle joins the K-protected KV sweep with the
Turboquant-CUDA Triality SO(8) rotation audit, ILSD self-distillation telemetry,
and paired loop-aware CV statistics. It is meant for AI-engineering review: the
plot highlights error bars and p-values, while unsupported turbo8 remains
visibly separate from the working turbo3/turbo4 paths.
Triality SO(8) vector-view audit, copied from the Turboquant-CUDA production
manifest, passed all 4608 audited rows with 0 outliers at the 0.01
orthogonality/determinant gate:
| V bit target | max orth err | mean det | max det err | status |
|---|---|---|---|---|
| 3 | 4.870e-03 | 1.000145130 | 7.628e-03 | pass |
| 4 | 4.754e-03 | 1.000145894 | 6.427e-03 | pass |
| 8 | 6.269e-03 | 1.000028411 | 7.111e-03 | pass |
ILSD monitoring found no non-finite loss, distance, or entropy values in the
current L2/L3 side-LoRA runs. The L3 runs are deliberately flagged as a stability
watch surface because max L_dist rises to 9.128 (code), 8.832 (math),
9.233 (stem), and 11.343 (tool), while final L_entropy stays small
(0.000 to 0.008). This supports "monitoring is in place"; it is not a broad
convergence guarantee.
Loop-aware paired CV over 32 local STEM bridge cases reports mean +/- SEM:
| group | n | accuracy | SEM | 95% CI |
|---|---|---|---|---|
| L1 | 32 | 0.4375 | 0.0891 | [0.2629, 0.6121] |
| L2 | 32 | 0.5625 | 0.0891 | [0.3879, 0.7371] |
| L3 | 32 | 0.6562 | 0.0853 | [0.4891, 0.8234] |
Paired permutation p-values are 0.122488 for L1-L2, 0.016098 for L1-L3,
and 0.254775 for L2-L3; the within-block Friedman permutation p-value is
0.002500.
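
For readers unfamiliar with paired permutation tests, the sketch below shows an exact sign-flip version for paired 0/1 outcomes. The function name is hypothetical and this is not the repository's implementation. It enumerates all sign assignments, so it is only practical for a small number of discordant pairs; a report that uses random resampling instead would give slightly different values.

```python
from itertools import product
from typing import Sequence


def paired_signflip_p(a: Sequence[int], b: Sequence[int]) -> float:
    """Exact two-sided paired permutation p-value for mean(b) - mean(a)."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    # Tied pairs have zero difference and are unaffected by sign flips.
    diffs = [y - x for x, y in zip(a, b) if y != x]
    if not diffs:
        return 1.0
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            hits += 1
    return hits / total


# Toy paired data: 14 vs 21 correct out of 32, with 7 cases flipping
# from wrong to right and none flipping the other way.
shallow = [1] * 14 + [0] * 18
deep = [1] * 21 + [0] * 11
p_value = paired_signflip_p(shallow, deep)
```

With 7 one-directional flips, only the two extreme sign assignments out of 128 reach the observed difference, which is why a small paired sample can still produce a low p-value when no case regresses.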
A logged lm-eval-harness serving-surface gate now covers 128 external
heldout cases: 64 native MMLU-STEM MCQ rows plus 64 GSM8K test rows
converted into numeric multiple-choice questions. It runs the Q8_0 GGUF with
llama-server --ngl 999, using the harness evaluator plus a local
/completion log-prob adapter because this installed server exposes the newer
llama.cpp logprob schema. This is a larger CV gate, not a broad leaderboard
result or standard GSM8K exact-match claim.
| K/V policy | folds | accuracy mean +/- SEM | 95% CI | status |
|---|---|---|---|---|
| K=q8_0, V=turbo3 | 8 | 0.5547 +/- 0.0630 | [0.4312, 0.6781] | ok |
| K=bf16, V=turbo3 | 8 | 0.5625 +/- 0.0765 | [0.4125, 0.7125] | ok |
| K=q8_0, V=turbo4 | 8 | 0.5469 +/- 0.0763 | [0.3973, 0.6965] | ok |
| K=bf16, V=turbo4 | 8 | 0.5469 +/- 0.0763 | [0.3973, 0.6965] | ok |
| K=q8_0, V=turbo8 | 0 | n/a | n/a | unsupported cache type |
| K=bf16, V=turbo8 | 0 | n/a | n/a | unsupported cache type |
Paired within-fold permutation p-values over the measured policies are:
| comparison | mean delta | p |
|---|---|---|
| K=q8_0,V=turbo3 - K=bf16,V=turbo3 | -0.0078 | 1.000000 |
| K=q8_0,V=turbo3 - K=q8_0,V=turbo4 | 0.0078 | 1.000000 |
| K=q8_0,V=turbo3 - K=bf16,V=turbo4 | 0.0078 | 1.000000 |
| K=bf16,V=turbo3 - K=q8_0,V=turbo4 | 0.0156 | 0.750973 |
| K=bf16,V=turbo3 - K=bf16,V=turbo4 | 0.0156 | 0.750973 |
| K=q8_0,V=turbo4 - K=bf16,V=turbo4 | 0.0000 | 1.000000 |
The four-policy Friedman within-fold permutation p-value is 0.781022.
Benchmark slices are stable in the same direction: MMLU-STEM is 0.7031
accuracy for all four measured policies, while GSM8K numeric-MCQ ranges from
0.3906 to 0.4219. turbo8 is only a parser/runtime-support probe in this
installed llama.cpp build (Unsupported cache type: turbo8).
Artifacts:

- `_docs/assets/2026-05-20-kv-triality-goal/kv_triality_goal_report.md`
- `_docs/assets/2026-05-20-kv-triality-goal/kv_triality_goal_report.json`
- `_docs/assets/2026-05-20-kv-triality-goal/gptimage2_kv_triality_goal_dashboard.png`
- `_docs/assets/2026-05-20-kv-triality-goal/loop_aware_l123_cv_stats.json`
- `_docs/assets/2026-05-20-kv-triality-goal/triality_so8_rotation_audit_summary.csv`
- `_docs/assets/2026-05-20-lm-eval-gguf-kv-cv/lm_eval_gguf_kv_cv_report.md`
- `_docs/assets/2026-05-20-lm-eval-gguf-kv-cv/lm_eval_gguf_kv_cv_report.json`
- `_docs/assets/2026-05-20-lm-eval-gguf-kv-cv/gptimage2_lm_eval_gguf_kv_cv.png`
- `_docs/assets/2026-05-20-lm-eval-gguf-kv-cv-128/lm_eval_gguf_kv_cv_report.md`
- `_docs/assets/2026-05-20-lm-eval-gguf-kv-cv-128/lm_eval_gguf_kv_cv_report.json`
- `_docs/assets/2026-05-20-lm-eval-gguf-kv-cv-128/gptimage2_lm_eval_gguf_kv_cv.png`
- `_docs/assets/2026-05-21-goal-completion-audit/goal_completion_audit.md`

The BF16 release artifact
elt-lm-qwen35-side-stem-aha-ilsd-l3-BF16.gguf has a direct headless
lm-eval-harness K/V cache CV run over the same 128 external heldout cases
used above. The run uses llama-server --ngl 999 and compares four measured
policies plus turbo8 parser probes.
| BF16 GGUF policy | folds | accuracy mean +/- SEM | 95% CI | status |
|---|---|---|---|---|
| `K=bf16,V=turbo3` | 8 | 0.5547 +/- 0.0662 | [0.4249, 0.6845] | selected BF16 policy |
| `K=q8_0,V=turbo3` | 8 | 0.5547 +/- 0.0662 | [0.4249, 0.6845] | tied |
| `K=bf16,V=turbo4` | 8 | 0.5469 +/- 0.0754 | [0.3991, 0.6947] | ok |
| `K=q8_0,V=turbo4` | 8 | 0.5391 +/- 0.0708 | [0.4003, 0.6778] | ok after single-policy retry |
| `K=bf16,V=turbo8` | 0 | n/a | n/a | unsupported cache type |
| `K=q8_0,V=turbo8` | 0 | n/a | n/a | unsupported cache type |

The selected BF16 serving policy is K=bf16,V=turbo3: it ties the best overall
mean and preserves the BF16 K-cache path. Pairwise p-values among measured
groups are all non-significant (p >= 0.750973), with Friedman p 0.943806.
MMLU-STEM ranges from 0.6875 to 0.7188; GSM8K numeric-MCQ ranges from
0.3906 to 0.4062. As above, GSM8K is a numeric multiple-choice transform,
not standard exact-match generation, and stock GGUF serving is not native
loop-aware L>=2 quality.
Artifacts:

- `_docs/assets/2026-05-21-best-bf16-gguf-lm-eval-cv/best_bf16_selection.md`
- `_docs/assets/2026-05-21-best-bf16-gguf-lm-eval-cv/lm_eval_gguf_kv_cv_report.md`
- `_docs/assets/2026-05-21-best-bf16-gguf-lm-eval-cv/lm_eval_gguf_kv_cv_report.json`
- `_docs/assets/2026-05-21-best-bf16-gguf-lm-eval-cv/gptimage2_lm_eval_gguf_kv_cv.png`

The current L_max=3 artifact also has the first README-worthy LLM quality
gate with at least 128 cases, an external heldout slice, GPU-offload proof, and
loop-aware quality measurement. The GGUF path below is Q8_0 on the installed
TheTom-capable llama.cpp runtime with --ngl 999, K=q8_0, V=turbo3; the
loop-aware path is the HF/PyTorch Qwen3.5 runtime because stock GGUF execution
still marks elt.gguf.runtime_status=requires_looped_qwen35_runtime.
| evaluation | n | correct | accuracy | Wilson 95% CI | SEM | prompt tok/s | decode tok/s |
|---|---|---|---|---|---|---|---|
| Local STEM bridge | 128 | 121 | 94.5% | [89.1, 97.3] | 2.01% | 1241.74 | 50.02 |
| MMLU-STEM heldout | 16 | 13 | 81.2% | [57.0, 93.4] | 9.76% | 1268.50 | 54.24 |
| GSM8K heldout | 16 | 0 | 0.0% | [0.0, 19.4] | 0.00% | 1013.86 | 48.64 |
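
The Wilson interval and SEM columns can be recomputed from the `n` and `correct` columns alone. The sketch below uses the standard Wilson score formula with `z = 1.96`; the function names are hypothetical and the choice of `z` is an assumption about how the table was produced.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = correct / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def binomial_sem(correct: int, n: int) -> float:
    """Standard error of the observed accuracy, sqrt(p * (1 - p) / n)."""
    p = correct / n
    return math.sqrt(p * (1 - p) / n)


for label, correct, n in [
    ("Local STEM bridge", 121, 128),
    ("MMLU-STEM heldout", 13, 16),
    ("GSM8K heldout", 0, 16),
]:
    lo, hi = wilson_interval(correct, n)
    print(f"{label}: [{lo * 100:.1f}, {hi * 100:.1f}]  SEM {binomial_sem(correct, n) * 100:.2f}%")
```

Unlike the plain normal approximation, the Wilson interval stays inside [0, 1] and remains informative at 0/16, where the SEM collapses to zero but the upper bound is still about 19%.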
Pairwise accuracy p-values use two-sided Fisher exact tests over
correct/incorrect counts:
| comparison | p |
|---|---|
| Local STEM bridge vs MMLU-STEM heldout | 0.0835 |
| Local STEM bridge vs GSM8K heldout | 3.56e-16 |
| MMLU-STEM heldout vs GSM8K heldout | 3.22e-06 |
Loop-aware quality uses paired case IDs on 32 local STEM bridge questions and scores multiple-choice log-probability, not free-form generation:
| L | n | correct | accuracy | Wilson 95% CI | SEM | mean margin | wall sec/case |
|---|---|---|---|---|---|---|---|
| 1 | 32 | 14 | 43.8% | [28.2, 60.7] | 8.77% | 0.0234 | 0.661 |
| 2 | 32 | 18 | 56.2% | [39.3, 71.8] | 8.77% | 0.1445 | 1.225 |
| 3 | 32 | 21 | 65.6% | [48.3, 79.6] | 8.40% | 0.3154 | 1.780 |
Paired McNemar exact p-values over discordant cases:
| comparison | improved | regressed | discordant | p |
|---|---|---|---|---|
| L1 vs L2 | 4 | 0 | 4 | 0.125 |
| L1 vs L3 | 7 | 0 | 7 | 0.0156 |
| L2 vs L3 | 3 | 0 | 3 | 0.25 |
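
The exact McNemar p-values depend only on the discordant counts, so they are easy to check by hand. The sketch below is a minimal two-sided exact binomial version; the function name is hypothetical.

```python
from math import comb


def mcnemar_exact_p(improved: int, regressed: int) -> float:
    """Two-sided exact McNemar test: binomial(n, 0.5) on the discordant pairs."""
    n = improved + regressed
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence either way
    k = min(improved, regressed)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


for name, improved, regressed in [("L1 vs L2", 4, 0), ("L1 vs L3", 7, 0), ("L2 vs L3", 3, 0)]:
    print(f"{name}: p = {mcnemar_exact_p(improved, regressed):.4g}")
```

With zero regressions the p-value is simply `2 * 0.5 ** n`, so at least six one-directional flips are needed before a comparison can fall below 0.05.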
The supportable public claim is therefore narrow: the L=3 handoff is strong on
the local STEM bridge task, MMLU-STEM is promising but a small cached slice, and
loop depth helps in the loop-aware runtime. GSM8K is explicitly not solved
(0/16), so this is not yet a broad mathematical reasoning or general LLM
leaderboard claim.
Artifacts:

- `_docs/assets/2026-05-17-l3-thetom-k-protected/l3_readme_stats.json`
- `_docs/assets/2026-05-17-l3-thetom-k-protected/l3_readme_stats.md`
- `_docs/assets/2026-05-17-l3-thetom-k-protected/l3_readme_accuracy_errorbars.png`

The 2026-05-03 GGUF release check compares the three handoff artifacts under the same local llama.cpp CUDA runtime on the RTX 3060:
| format | size | prompt eval tok/s | decode tok/s | perplexity | logits KL vs BF16 | KV / recurrent state |
|---|---|---|---|---|---|---|
| BF16 | 9.03 GiB | 217.15 ± 13.59 | 27.83 ± 0.07 | 11313.67 | baseline | 1024 / 6432 MiB |
| Q8_0 | 4.80 GiB | 248.77 ± 16.75 | 40.40 ± 0.15 | 13677.23 | 1.6404 | 1024 / 6432 MiB |
| TQ4_1S | 4.16 GiB | 0.79 ± 0.02 | 0.69 ± 0.03 | 21648.79 | 1.8819 | 1024 / 6432 MiB |
Runtime statistics use llama-bench with paired f16/f16 KV cache blocks,
n=3 repetitions, mean ± SEM, and SciPy repeated-measures tests. The omnibus
Friedman p-value is 0.049787 for both prompt-eval and decode throughput; the
pairwise Wilcoxon p-values are 0.25 because the sample is intentionally short.
Perplexity and logits KL are one-chunk release checks over verifier-backed
synthetic-v2 hard held-out text, not broad lm-eval leaderboard claims. A
q8_0/q8_0 KV-cache bench was attempted first, but the BF16 GGUF failed context
creation with that cache setting on this local runtime, so the paired comparison
uses the common f16/f16 cache path. This section evaluates local GGUF weight
artifacts only; it is not a TurboQuant KV-cache or DFlash serving result.
Artifacts:

- `_docs/assets/2026-05-03-gguf-quant-cv-gptimage/gguf_quant_cv_report.json`
- `_docs/assets/2026-05-03-gguf-quant-cv-gptimage/gguf_quant_cv_summary.csv`
- `_docs/assets/2026-05-03-gguf-quant-cv-gptimage/gguf_quant_cv_pairwise.csv`
- `_docs/assets/2026-05-03-gguf-quant-cv-gptimage/gguf_quant_cv_omnibus.csv`

elt-anytime already emits case-level correctness and K-fold accuracy summaries
for local verifier-backed benchmark manifests. For vanilla-vs-finished model
comparison, preserve paired case/fold order and run:

    uv run python -m elt_lm.eval.benchmark_comparison \
      --input reports/vanilla_vs_complete_groups.json \
      --out-json reports/vanilla_vs_complete_stats.json \
      --out-md reports/vanilla_vs_complete_stats.md

Input schema:

    {
      "benchmark": "mmlu_stem_cv",
      "groups": {
        "vanilla": [0, 1, 0, 1],
        "sft_replay": [0, 1, 1, 1],
        "complete": [1, 1, 1, 1]
      }
    }

The report includes mean, SD, SEM, 95% CI, paired permutation p-values for every
group pair, and a Friedman within-block permutation p-value when at least three
groups are supplied. Use lm-eval-harness for broad external tasks with logged
samples, e.g. lm-eval run --model hf --model_args pretrained=<hf_export_dir> --tasks gsm8k,mmlu_stem,hellaswag --output_path <dir> --log_samples, then
convert the paired sample correctness arrays into the JSON schema above.
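
A minimal sketch of that conversion step, plus the per-group summary statistics, might look like the following. The helper names are hypothetical and this is not the `elt_lm.eval.benchmark_comparison` code; the sample SD and the normal-approximation interval with `z = 1.96` are assumptions, chosen because they reproduce the L1 row of the loop-aware table (0.4375, 0.0891, [0.2629, 0.6121]).

```python
import json
import math
import statistics
from typing import Dict, List


def build_comparison_input(benchmark: str, groups: Dict[str, List[int]]) -> str:
    """Serialize paired 0/1 correctness arrays into the input schema."""
    lengths = {len(scores) for scores in groups.values()}
    if len(lengths) != 1:
        raise ValueError("groups must cover the same paired cases in the same order")
    return json.dumps({"benchmark": benchmark, "groups": groups}, indent=2)


def summarize(scores: List[int], z: float = 1.96) -> Dict[str, object]:
    """Mean, sample SD, SEM, and a normal-approximation 95% CI."""
    n = len(scores)
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores) if n > 1 else 0.0
    sem = sd / math.sqrt(n)
    return {"n": n, "mean": mean, "sd": sd, "sem": sem,
            "ci95": (mean - z * sem, mean + z * sem)}


payload = build_comparison_input(
    "mmlu_stem_cv",
    {"vanilla": [0, 1, 0, 1], "sft_replay": [0, 1, 1, 1], "complete": [1, 1, 1, 1]},
)
l1_summary = summarize([1] * 14 + [0] * 18)  # 14 of 32 correct
```

The length check matters because every paired test downstream assumes that index `i` refers to the same case in every group.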
Current measured bridge diagnostics are limited to internal synthetic-v2 bridge verifiers, not broad lm-eval claims: stem is the only export/eval candidate (mean correct 0.8958, final correct 1.0), code and math are sparse-success lanes, and tool-use is blocked because reward/advantage signal remained zero. Full vanilla-vs-complete lm-eval p-values should not be reported until both groups have completed the same paired task set.

- `rolling_{0..keep-1}.pt` round-robin every `rolling_ckpt_interval_sec` (5 min default)
- `last.pt` hardlinked to the latest save (resume anchor)
- `step_*.pt` milestone saves every `save_every`
- CPU + CUDA RNG state in each save → deterministic resume

Crash loses at most one interval; --resume runs/<dir>/last.pt picks up.
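
The round-robin slot and hardlink scheme can be sketched with the standard library alone. The function names, the `keep=3` default, and the byte payload are illustrative assumptions; a real save would serialize model, optimizer, and RNG state rather than a short byte string.

```python
import os
import shutil
import tempfile


def rolling_slot(save_index: int, keep: int) -> str:
    """Round-robin file name: rolling_0.pt ... rolling_{keep-1}.pt."""
    return f"rolling_{save_index % keep}.pt"


def save_rolling(run_dir: str, save_index: int, payload: bytes, keep: int = 3) -> str:
    slot = os.path.join(run_dir, rolling_slot(save_index, keep))
    tmp = slot + ".tmp"
    with open(tmp, "wb") as fh:
        fh.write(payload)
    os.replace(tmp, slot)  # atomic swap: a crash never leaves a half-written slot

    last = os.path.join(run_dir, "last.pt")
    last_tmp = last + ".tmp"
    if os.path.exists(last_tmp):
        os.remove(last_tmp)
    try:
        os.link(slot, last_tmp)          # hardlink: no second copy on disk
    except OSError:
        shutil.copyfile(slot, last_tmp)  # fallback where hardlinks are unavailable
    os.replace(last_tmp, last)
    return slot


with tempfile.TemporaryDirectory() as run_dir:
    for i in range(5):
        save_rolling(run_dir, i, f"step-{i}".encode(), keep=3)
    files = sorted(os.listdir(run_dir))
    with open(os.path.join(run_dir, "last.pt"), "rb") as fh:
        last_bytes = fh.read()
```

After five saves with `keep=3`, only three rolling slots exist and `last.pt` points at the newest one, so disk usage stays bounded no matter how long the run lasts.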

    src/elt_lm/              model, layers, losses, train loops, HF wrapper
    src/elt_lm/offload/      4-tier store, NvmeAdamW, prefetcher, placement planner
    src/elt_lm/hf/           trust_remote_code bundle (ELTConfig, ELTForCausalLM)
    src/elt_lm/eval/         any-time L-sweep, verifiers, python-exec guard
    src/elt_lm/telemetry.py  thread-safe JSONL writer
    dashboard/               Streamlit app + panels + metrics reader
    configs/                 tiny_10M / base_100M / smoke_300M / base_1B / sft_cot / grpo_gsm8k
    scripts/                 data DL / clean / tokenize / pipeline / HF export / 1B VRAM smoke
    tests/                   105 tests; `uv run pytest -q`
    _docs/                   implementation log (YYYY-MM-DD-<slug>-<AI>.md)

    uv sync                       # core
    uv sync --extra offload_8bit  # + bitsandbytes for paged_adamw_8bit
    uv sync --extra dashboard     # + streamlit / plotly / pynvml / psutil
    uv sync --extra dev           # + pytest, for running the suite
    uv run pytest -q              # 105 passing

- ELT + ILSD scaffold, paper equations faithful
- GRPO post-training with verifier (DeepSeekMath §4.1)
- Rolling checkpoints + deterministic resume
- HuggingFace Hub export (`trust_remote_code=True`)
- End-to-end pipeline with boot-time auto-resume
- Offline distillation from Qwen3.5-4B
- `base_1B.yaml` fits 12 GB via PagedAdamW8bit (measured peak 7.88 GB)
- Hypura-style 4-tier NVMe offload (`NvmeAdamW`) + placement planner
- Streamlit dashboard with 6 live panels + JSONL telemetry
- 1 B Phase-1 pretrain run (in progress)
- GSM8K / HumanEval / MMLU-STEM / MATH-500 L-sweep results
- `elt-lm-base-1.5b` pushed to HuggingFace Hub

    @article{goyal2026elt,
      title   = {Elastic Looped Transformers for Visual Generation},
      author  = {Goyal et al.},
      journal = {arXiv:2604.09168},
      year    = {2026}
    }

    @article{shao2024deepseekmath,
      title   = {DeepSeekMath: Pushing the Limits of Mathematical Reasoning
                 in Open Language Models},
      author  = {Shao et al.},
      journal = {arXiv:2402.03300},
      year    = {2024}
    }

Apache 2.0 (model weights + code). Tokenizer inherits from Qwen3.5 — see the upstream repo for its terms.
If you find this useful, a star helps others discover it.





