[EXP-001] Cultural-CPT validation harness + Egypt/Arabic pilot (sovereign-alignment)#65
Open
jneums wants to merge 54 commits into
Defines a falsifiable, pre-registered experiment to test the foundational hypothesis behind TAP-003/TAP-005: that continued pretraining on culturally grounded data measurably shifts a model's cultural alignment (vs. mere language exposure or surface mimicry). Addresses open questions C1/C8/T3/S3. Signed-off-by: Jesse <jesseneumann@gmail.com>
Runnable harness for the EXP-001 spec: tests whether continued pretraining on culturally grounded data shifts a model's Inglehart-Welzel coordinate beyond mere language exposure.
- smoke mode runs the full pipeline (arms -> CPT -> WVS survey -> coordinate scoring -> Base/Language-matched/Grounded comparison) on a byte-level toy model, CI-scale, no downloads/GPU; numbers are noise by design.
- hf mode is a documented seam (LanguageModel protocol, two primitives) for a real base model; raises an actionable NotImplementedError until wired.
- corpora/WVS items/capability MCQs are clearly-labeled placeholders with the real-data seams marked.
Adds make cultural-cpt-validation / cultural-cpt-tests; ignores runs/. 5 smoke tests pass.
Signed-off-by: Jesse <jesseneumann@gmail.com>
Adds the Tapestry-unique / non-IID (T3) test: across consortium rounds, each culture does grounded CPT, is measured on the Inglehart-Welzel map, then all sovereign forks are FedAvg-averaged into the next global base. Reports the separability curve (mean pairwise distance between cultures over rounds) to expose whether distinct cultures survive aggregation or collapse toward the centroid.
- per-culture grounded placeholder corpora; state()/load_state() on the toy backend for averaging; uniform FedAvg mirroring ConsortiumCoordinator.
- new run_aggregation.py + make cultural-cpt-aggregation; 3 added smoke tests (8 total passing).
- smoke mode only; real-mode (HF backend per node) left as a marked seam.
Signed-off-by: Jesse <jesseneumann@gmail.com>
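The aggregation round and the separability metric described in this commit can be sketched roughly as follows. This is a minimal illustration, not the harness code: `fedavg` and `mean_pairwise_distance` are hypothetical names, model state is assumed to be a dict of flat float lists, and each culture's coordinate is assumed to be a 2-tuple on the Inglehart-Welzel map.

```python
import math
from itertools import combinations


def fedavg(states):
    """Uniform FedAvg: element-wise mean of per-culture parameter dicts.

    `states` is a list of dicts mapping parameter name -> list of floats
    (an assumed, simplified stand-in for whatever state() returns).
    """
    n = len(states)
    return {
        name: [sum(s[name][i] for s in states) / n for i in range(len(values))]
        for name, values in states[0].items()
    }


def mean_pairwise_distance(coords):
    """Separability: mean Euclidean distance over all culture pairs.

    `coords` maps culture name -> (x, y) coordinate.
    """
    pairs = list(combinations(coords.values(), 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


# Two toy forks averaged into the next global base.
next_base = fedavg([{"w": [0.0, 2.0]}, {"w": [2.0, 4.0]}])

# One point of the separability curve for a single round.
separability = mean_pairwise_distance({"a": (0.0, 0.0), "b": (3.0, 4.0)})
```

Tracking `separability` round by round gives the curve the commit reports: a value that decays toward zero would indicate cultures collapsing toward the centroid.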
Replaces argmax option selection with an expectation over the softmax of option log-probs, making the Inglehart-Welzel coordinate continuous so small preference shifts move it instead of being quantized to the nearest option value. Broadens the battery to 4 items per axis for finer resolution. Side benefits: expected-value scoring is option-order invariant by construction (one fewer prompt-sensitivity source), and a no-preference model now lands near the origin instead of snapping to an option. Adds a temperature knob. All 8 smoke tests still pass. Signed-off-by: Jesse <jesseneumann@gmail.com>
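A minimal sketch of the expected-value scoring described above, assuming each item supplies one log-prob and one axis value per option. The function name `expected_option_value` is illustrative and not taken from the harness.

```python
import math


def expected_option_value(logprobs, values, temperature=1.0):
    """Softmax-weighted mean of option values (continuous, not argmax)."""
    scaled = [lp / temperature for lp in logprobs]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


# A model with no preference across symmetric options lands at the origin.
no_preference = expected_option_value([-1.0, -1.0, -1.0], [-1.0, 0.0, 1.0])

# A mild preference moves the coordinate a little instead of snapping to an option.
mild = expected_option_value([-1.0, -1.0, -0.8], [-1.0, 0.0, 1.0])
```

Because the result is a weighted sum over (log-prob, value) pairs, permuting the options together with their values leaves it unchanged, which is the order invariance the commit mentions.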
Implements HFCausalModel against transformers (lazy-imported optional dep, not added to core manifest): teacher-forced mean log-prob score_continuation and a short CPT train_on_texts loop, same two primitives as the toy backend, so the experiment code is unchanged. Verified end-to-end on distilgpt2. Decouples the two orthogonal axes of realism: model backend (--mode smoke|hf) and corpus realism (--corpus-path, empty = placeholder). A run is only an EXP-001 result when both are real; otherwise it carries a 'NOT A RESULT' caveat naming which part is still placeholder. This lets HF wiring be validated on placeholder text without being mistaken for a finding. 8 smoke tests pass; real-mode distilgpt2 run confirmed. Signed-off-by: Jesse <jesseneumann@gmail.com>
Adds a second instrument: concrete scenarios with action choices, scored on the same two Inglehart-Welzel axes via a shared expected-value scorer factored out of wvs.py. Per arm the experiment now reports the behavioral coordinate, its shift toward target, and the survey-behavior gap/lag. This is clause (c) of the hypothesis: if an arm's survey answers move toward the target culture but its behavior does not, the shift is surface mimicry, not a representational change. Verified on distilgpt2, where the mechanism correctly surfaced a survey>>behavior dissociation on placeholder data. 9 smoke tests pass. Signed-off-by: Jesse <jesseneumann@gmail.com>
Completes the EXP-001 arm set. Arms are now spec-driven along two dimensions: which corpus an arm continues-pretrains on (if any) and whether a cultural persona prompt is applied at measurement instead.
- grounded_translated: CPT on the grounded content in the base's language, isolating cultural content from the language carrying it.
- surface_only: no weight change at all; prompts the base to answer as a target-culture respondent. Tests whether expensive CPT beats cheap prompting (a tie undercuts the depth-over-shallow bet).
Adds two decisive comparisons (grounded vs translated, grounded vs surface) and threads a persona_prefix through the shared scorer and both instruments. Verified on distilgpt2. 10 smoke tests pass.
Signed-off-by: Jesse <jesseneumann@gmail.com>
Runs the arms across seeds and turns per-arm shifts into mean +- std, the decisive comparisons into effect sizes (z = mean/std), and applies the EXP-001 pre-registered threshold: PASS iff grounded shift >= min, the grounded-vs-language effect clears sigma_multiple std devs (positive sign required), and capability drop <= max. Thresholds live in StatsConfig so they are set up front. The sign-and-z rule is load-bearing: on the toy model grounded_vs_language shows a large |z| with a negative mean (spurious wrong-direction effect), and the decision correctly FAILs it. Adds run_stats.py + make cultural-cpt-stats; 11 smoke tests pass.
Signed-off-by: Jesse <jesseneumann@gmail.com>
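The pre-registered rule can be illustrated with a short sketch. `Thresholds` is a hypothetical stand-in for StatsConfig, and the numeric thresholds in the usage lines are assumptions: 0.05 and 2.0 match values quoted in later run notes, while the capability-drop limit is a placeholder.

```python
import math
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass(frozen=True)
class Thresholds:  # hypothetical stand-in for StatsConfig
    min_shift: float
    sigma_multiple: float
    max_capability_drop: float


def z_score(samples):
    """Effect size as mean / std across seeds (needs at least two samples)."""
    m, s = mean(samples), stdev(samples)
    if s == 0:
        return math.copysign(math.inf, m) if m else 0.0
    return m / s


def decide(grounded_shift, grounded_minus_language, capability_drop, t):
    """PASS only if every conjunct holds; the effect must also be positive."""
    effect_mean = mean(grounded_minus_language)
    checks = (
        mean(grounded_shift) >= t.min_shift,
        effect_mean > 0,
        z_score(grounded_minus_language) >= t.sigma_multiple,
        mean(capability_drop) <= t.max_capability_drop,
    )
    return "PASS" if all(checks) else "FAIL"


t = Thresholds(min_shift=0.05, sigma_multiple=2.0, max_capability_drop=0.05)

# Large |z| but negative mean: the sign requirement rejects it.
wrong_direction = decide([0.06, 0.07, 0.06], [-0.10, -0.11, -0.09], [0.0, 0.0, 0.0], t)
right_direction = decide([0.06, 0.07, 0.06], [0.10, 0.11, 0.09], [0.0, 0.0, 0.0], t)
```

Checking the sign separately from the magnitude is what keeps a confident wrong-direction effect from passing.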
Turn the corpus seam from a NotImplementedError into a real-data path that enforces the controls the experiment's validity depends on, and ship a real demonstration seed so the path is exercised end to end.
- cultural_cpt/dataset.py: JSONL corpus loader that enforces permissive licensing, language/register/recency match, the grounded<->language-matched twin token-budget control, and WVS decontamination (phrases pulled live from the instruments; Unicode-word matching, so non-Latin scripts work). Fails loudly rather than letting a broken corpus produce a publishable-looking number.
- corpora.py: wire load_corpus / load_culture_corpus real --corpus-path branches to dataset.py (drop the dead `mode` arg on load_culture_corpus).
- experiment.py: run only the CPT arms a real corpus declares (the spec's recommended first run is the minimal grounded + language-matched twin); skipped arms are noted and their decisive comparisons no longer masquerade as ties.
- aggregation.py / run_aggregation.py: add corpus_path so real per-culture data can drive the FedAvg loop.
- fetch_corpus.py: assemble a corpus from permissive sources (Wikipedia intros, same-source/different-domain twin), drop contaminated docs, balance the twin token budgets, write a manifest, and re-validate. Also a --validate mode.
- data/seed-example/: a real, attributed English demonstration seed (wiring, not the experimental culture) + data/README.md documenting the layout, schema, twin/decontamination/licensing protocol, and availability-driven culture pick.
- tests/test_dataset.py: 15 tests covering every control + the real seed.
- Makefile: cultural-cpt-fetch-seed / cultural-cpt-validate-corpus targets.
- .gitignore: bulk corpora regenerate locally; only README + seed travel.
Signed-off-by: Jesse <jesseneumann@gmail.com>
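One way the Unicode-word decontamination check might look is sketched below. The helper names are hypothetical and the probe phrases are made up for illustration; the real phrases come from the instruments. The sketch relies on Python's `\w` being Unicode-aware for `str` patterns, so it does not handle combining diacritics specially.

```python
import re


def unicode_words(text):
    """Lower-cased word tokens; \\w is Unicode-aware, so Arabic letters match."""
    return re.findall(r"\w+", text.casefold())


def contains_phrase(doc_words, phrase_words):
    """True if phrase_words occurs as a contiguous run inside doc_words."""
    n = len(phrase_words)
    return n > 0 and any(
        doc_words[i:i + n] == phrase_words
        for i in range(len(doc_words) - n + 1)
    )


def drop_contaminated(docs, probe_phrases):
    """Keep only documents that contain none of the probe phrases."""
    probes = [unicode_words(p) for p in probe_phrases]
    return [
        d for d in docs
        if not any(contains_phrase(unicode_words(d), p) for p in probes)
    ]


docs = [
    "How important is God in your life? A survey.",
    "Rainfall in the Nile delta.",
    "ما أهمية الله في حياتك اليوم",
]
probes = ["important is God in your life", "أهمية الله في حياتك"]
clean = drop_contaminated(docs, probes)
```

Matching on word sequences rather than raw substrings avoids false hits inside longer words and is indifferent to punctuation between tokens.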
- titles/egypt.ar.json: curated Arabic Wikipedia titles for the EXP-001 Egypt pilot, covering value-laden grounded domains (law, religion, family, civic) and the value-neutral twin (weather, sports, technical, mathematics), same source and register so only cultural content differs. Verified end to end: fetches 32 real Arabic articles that pass every control (twin balanced 1833 vs 1974 tokens, decontamination clean) and runs the full arms pipeline for --culture egypt. The fetched corpus itself stays local (bulk, git-ignored); regenerate with:
fetch_corpus.py --culture egypt --lang ar --titles-file titles/egypt.ar.json
- fetch_corpus.py: _write_manifest now records domains from the actual titles used, not the English defaults (correct for any --titles-file).
Signed-off-by: Jesse <jesseneumann@gmail.com>
Replace the abbreviated illustrative battery with the canonical 10-item Inglehart-Welzel instrument (5 items per axis), each with >=3 stem paraphrases (the spec's robustness mandate) and graded pole-spanning options: - Traditional<->Secular: God's importance, child obedience/faith vs independence, abortion, national pride, respect for authority. - Survival<->Self-expression: materialist/post-materialist priorities, subjective well-being, homosexuality justifiable, petition activity, interpersonal trust. GROUND_TRUTH is now read from the published WVS Wave-7 Inglehart-Welzel map and linearly rescaled to the item scale (divide by 2.5, clamp), via a _from_map seam so exact WVS-data-file factor scores can be dropped in without touching the instrument. Egypt now sits far traditional/survival (-0.72, -0.52) and Sweden far secular/self-expression (0.80, 0.96), as the map has them. The per-axis coordinate stays the mean expected option-value (documented simplification of Welzel's factor-weighted index). Both real corpora still pass decontamination against the expanded probe set; 26 tests green. Signed-off-by: Jesse <jesseneumann@gmail.com>
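The rescaling step (divide by 2.5, then clamp) amounts to the small sketch below. `rescale_map_point` is a hypothetical name, not the `_from_map` seam itself, and the input point is an illustrative value back-computed from the rescaled Egypt coordinate quoted above rather than read from the WVS map.

```python
def rescale_map_point(x, y, scale=2.5):
    """Map a cultural-map point onto the [-1, 1] item scale."""
    def clamp(v):
        return max(-1.0, min(1.0, v))

    return (clamp(x / scale), clamp(y / scale))


# Illustrative input implied by the quoted (-0.72, -0.52); not a published value.
egypt_like = rescale_map_point(-1.8, -1.3)

# Points beyond the scale are clamped to the item range.
clamped = rescale_map_point(3.0, -4.0)
```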
…corpus
- fetch_corpus.py: add --full to fetch whole articles (cap 2000 words) instead
of just lead sections, for a real-sized CPT corpus. Regenerating Egypt with
--per-domain 6 --full yields ~23 docs/arm, ~32k/30k tokens, twin-balanced and
decontaminated.
- run.py: expose --lr; default it by mode (0.01 for the smoke toy model, 2e-5
for a real transformer, which diverges at 0.01). Verified the full pipeline
runs end-to-end in --mode hf on CPU with distilgpt2 (no NOT-A-RESULT caveat
once both a real model and real corpus are supplied).
Bulk corpora stay git-ignored; regenerate with:
fetch_corpus.py --culture egypt --lang ar \
--titles-file titles/egypt.ar.json --per-domain 6 --full
Signed-off-by: Jesse <jesseneumann@gmail.com>
make_base_model / ExperimentConfig / run.py now carry a device ("cpu"|"cuda")
so the HF backend actually uses the GPU on a CUDA box. Default stays cpu; the
real run passes --device cuda. No effect on smoke mode.
Signed-off-by: Jesse <jesseneumann@gmail.com>
- model.py/run.py: add --dtype (float32|bfloat16); bf16 halves full-CPT memory so Qwen2.5-1.5B full fine-tunes on a single 32GB GPU with headroom. Use the transformers-5 `dtype=` arg (torch_dtype is deprecated).
- deploy/run_on_instance.sh: on-box script that regenerates the Egypt corpus from the committed titles file and runs the real CPT (Qwen2.5-1.5B, cuda, bf16) end to end, parameterized by env.
- deploy/README.md: self-rental recipe for the 2x5090 server (create/use/destroy), incl. the 5090/Blackwell gotcha (needs CUDA 12.8+ / PyTorch >=2.7) and the full-CPT-not-LoRA rationale.
Real launch is gated on the Vast.ai API key (not present in this env yet).
Signed-off-by: Jesse <jesseneumann@gmail.com>
The HF backend fed whole documents as single sequences; a full Wikipedia article (~thousands of tokens) OOMs even a 32GB GPU on a 1.5B model during backward. Now:
- chunk each document into max_length (default 1024) token windows -> several training sequences instead of one giant pass (standard CPT practice; keeps all the text);
- enable gradient_checkpointing during training (use_cache off), restored after;
- deploy/run_on_instance.sh sets PYTORCH_CUDA_ALLOC_CONF=expandable_segments to cut fragmentation.
Validated by a real run on an RTX 5090 (vast.ai self-rental): Qwen2.5-1.5B full CPT on the real Arabic Egypt corpus completed end to end and produced a genuine EXP-001 data point (no NOT-A-RESULT caveat).
Signed-off-by: Jesse <jesseneumann@gmail.com>
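The chunking step can be pictured with the sketch below, which assumes a document has already been tokenized to a flat list of ids. `chunk_tokens` is a hypothetical helper name, and non-overlapping windows are an assumption about how "keeps all the text" is achieved.

```python
def chunk_tokens(token_ids, max_length=1024):
    """Split one tokenized document into consecutive windows of <= max_length."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [
        token_ids[start:start + max_length]
        for start in range(0, len(token_ids), max_length)
    ]


# A 2500-token article becomes three training sequences instead of one.
article = list(range(2500))
windows = chunk_tokens(article)
window_lengths = [len(w) for w in windows]
```

Peak activation memory then scales with the window length rather than the article length, while every token still contributes to training.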
…corpus - run_stats.py: expose --lr (mode-aware default), --device, --dtype so the multi-seed pre-registered go/no-go runs on GPU in bf16 with a transformer learning rate (it inherits the rest via the base ExperimentConfig per seed). - titles/egypt.ar.json: expand to ~12 titles across 5 value-laden domains (law, religion, family, civic, ethics) and 5 value-neutral domains (weather, sports, technical, mathematics, biology) for a much larger grounded corpus. - deploy/run_on_instance.sh: run run_stats.py when SEEDS is set (multi-seed), else run.py; default epochs 6, per-domain 8. Cross-seed variation here is measurement noise (paraphrase/temperature/option order) since HF training is deterministic — a valid robustness band per the spec. Signed-off-by: Jesse <jesseneumann@gmail.com>
Record the first real EXP-001 data points (real model + real corpus, no NOT-A-RESULT caveat): a single-seed pilot and a 3-seed pre-registered go/no-go. Both FAIL at this scale: grounded CPT shows no significant shift toward Egypt and is beaten by a persona prompt (z=-8.57). But the run is badly underpowered (~50k tokens, 6 epochs, 1.5B) and the grounding-beyond-language effect is in the H1-predicted direction (+0.019, z=1.41), just inside the noise. Capability preserved (0.00 drop). Records setup, numbers, honest caveats, and the highest-impact next steps (more corpus tokens, bigger base, Arabic survey). Signed-off-by: Jesse <jesseneumann@gmail.com>
Make a real ~4B full-parameter CPT feasible on one 32GB GPU and feed it a much larger corpus:
- model.py: base model stays on CPU; per-arm clones move to the compute device (only one model copy in VRAM at a time); tensor device derived from the model's params; bitsandbytes AdamW8bit when available (fallback to torch AdamW); empty_cache between arms. Still full-parameter CPT, not LoRA.
- fetch_corpus.py: --max-words to override the full-article word cap (long Arabic articles carry far more tokens than the old 2000-word cap).
- titles/egypt.ar.json: expand to 18 curated titles per domain (90/arm), deliberately topical (no category crawl) to keep the value-laden vs neutral contrast clean.
- deploy/run_on_instance.sh: install bitsandbytes; pass MAX_WORDS.
Next: launch Qwen3-4B-Instruct-2507 multi-seed on the scaled corpus.
Signed-off-by: Jesse <jesseneumann@gmail.com>
Third real EXP-001 run: Qwen3-4B-Instruct-2507, 3 seeds, ~150k/189k-token Arabic corpus, full-param CPT fit on one 5090 via 8-bit Adam + base-on-CPU. Still FAILs the threshold (grounded shift +0.023 < 0.05; grounded-vs-language z=0.75 < 2), but the trend across scale is the finding: scaling 1.5B->4B and ~50k->150k tokens flipped the grounded survey shift from -0.029 (away) to +0.023 (toward Egypt), grew its lead over language-matched (+0.019->+0.028), and shrank prompting's dominance (z -8.57 -> -2.16). Every indicator moved in H1's predicted direction. Updated FINDINGS with the run, a trend table, and revised caveats (notably the behavioral probe regressed and needs upgrading). Signed-off-by: Jesse <jesseneumann@gmail.com>
Administering the survey in English while the corpus is Arabic was a content-vs-language confound that likely muted Runs 1-3. Add full Arabic (MSA) translations of both instruments and let the experiment measure in the corpus's own language:
- wvs.py: _ITEMS_AR (10-item battery) + _BATTERY/_ANSWER_SUFFIX keyed by lang; administer(..., lang). Translations are item-for-item equivalent (same axes and option values), so coordinates stay comparable across languages.
- behavior.py: _SCENARIOS_AR + lang-keyed battery/suffix; administer_behavior(..., lang).
- experiment.py: ExperimentConfig.instrument_lang threads to both instruments; language-aware persona prefix for the surface_only arm (Arabic culture names).
- run.py / run_stats.py: --instrument-lang; deploy/run_on_instance.sh passes INSTRUMENT_LANG.
- tests: AR batteries match EN structure (axes/values), and an --instrument-lang ar smoke run is well-formed + deterministic. 28 pass.
Signed-off-by: Jesse <jesseneumann@gmail.com>
Run 4 = Run 3 with --instrument-lang ar (survey + behavior probe in Arabic), isolating the content-vs-language confound. Same model/corpus/seeds. Measuring in the corpus's own language, all else equal:
- decisive grounded-vs-language effect grew ~5x (+0.028 -> +0.140); neutral Arabic text pushes away from Egypt (-0.105) while grounded pushes toward (+0.035) -- the H1(b) contrast, much sharper.
- prompting stops beating CPT: grounded-vs-surface went from z=-2.16 (EN) to z=-0.50 (AR). The earlier "prompting beats CPT" was partly an English-measurement artifact -- relevant to the depth-over-shallow bet.
Still FAIL (z=1.54 < 2, shift +0.035 < 0.05; effect grew but so did variance), but the closest run yet and the first where the decisive comparison is the largest signal. FINDINGS updated with Run 4 + an EN-vs-AR table.
Signed-off-by: Jesse <jesseneumann@gmail.com>
The behavioral probe was teacher-forced multiple-choice (same mechanism as the survey), so it never tested open-ended behavior, which is the deep-vs-surface guardrail the spec actually calls for. Add a real generate mode:
- model.py: generate() primitive on the LanguageModel protocol + both backends (HF real generation device-aware; byte model greedy decode for smoke).
- judge.py: Judge protocol + EmbeddingJudge, which scores a free-form response against each action option by multilingual sentence-embedding cosine similarity (deterministic, self-contained, EN+AR). Tests inject a stub judge.
- behavior.py: administer_behavior(mode="logprob"|"generate", judge). Generate mode has the model write an action, then the judge maps it onto the options; same expected-axis-value math, so coordinates stay comparable.
- experiment/run.py/run_stats.py: behavior_mode config + --behavior-mode flag; judge (embedder) constructed once per run on CPU.
- deploy: install sentence-transformers when BEHAVIOR_MODE=generate.
- tests: generate path end-to-end on the toy model w/ stub judge; generate primitive. 30 pass.
Signed-off-by: Jesse <jesseneumann@gmail.com>
Validated the generate path end-to-end locally with a real embedder: the multilingual EmbeddingJudge discriminates Arabic responses correctly (a defer-to-elders reply scores the deferring option highest; a do-it-our-own-way reply scores the autonomous option highest), and distilgpt2 + judge produces a behavioral coordinate. Signed-off-by: Jesse <jesseneumann@gmail.com>
Curated titles capped the corpus at ~150k tokens/arm; scale past that:
- fetch_corpus.py: --cat-limit pulls articles from narrow Wikipedia categories (a 'categories' block in the titles file), chosen tight (fiqh, ethics, worship, ... vs mechanical-engineering, cell-biology, ...) so the value-laden vs neutral contrast survives; --max-tokens gives each arm a token budget (deterministic subsample) so corpus size and training time stay predictable.
- titles/egypt.ar.json: add per-domain grounded/neutral Arabic categories.
- deploy/run_on_instance.sh: CAT_LIMIT / MAX_TOKENS env.
Validated locally: --cat-limit 4 --max-tokens 30000 assembles a balanced, decontaminated Arabic corpus from category members. 30 tests pass.
Signed-off-by: Jesse <jesseneumann@gmail.com>
…ficant Stacked all upgrades: Qwen3-4B, Arabic survey, free-form generate-mode behavioral probe (embedding judge), and a 271k/289k-token corpus via category fetching. 3 seeds, 4 epochs. First run where the decisive comparison passes its significance test: grounded - language = +0.080, z=7.26 (>= 2, positive) -- H1(b) supported, capability preserved. More tokens collapsed the variance (grounded-language std 0.091 -> 0.011), so Run 4's directional effect became robustly significant. Mechanism is subtler than "grounded pulls toward Egypt": neutral Arabic CPT pushes AWAY (-0.078) while grounded holds (+0.003) -- grounding mainly prevents the away-drift. Overall verdict still FAIL on the absolute grounded shift (+0.003 < 0.05), and the upgraded free-form behavioral probe shows no arm shifts open-ended behavior (H1(c) not yet shown). FINDINGS updated with Run 5 + a 5-run trend table. Signed-off-by: Jesse <jesseneumann@gmail.com>
Orientation doc for whoever continues: current status (decisive grounding-vs-language comparison now significant, overall still FAIL), what the harness can do, repo map, how to run (smoke / local / GPU), the Vast.ai operational playbook (the hard-won gotchas: team-key SSH via per-instance attach + unencrypted id_rsa, change-bid-after-create, 5090/Blackwell image, etc.), what's still placeholder (toy MMLU, no safety eval, grounded_translated arm unbuilt), and prioritized next steps.
Signed-off-by: Jesse <jesseneumann@gmail.com>
Add a second pilot culture (Vietnam) so the grounding-beyond-language effect can be checked for generalization beyond Egypt/Arabic. Runs 4-5 showed measuring in the corpus's own language is the lever that surfaces the effect, so Vietnam gets a full in-language instrument, not an English-measured proxy: - _ITEMS_VI / _SCENARIOS_VI: item-for-item Vietnamese translations of the canonical Inglehart-Welzel battery and the behavioral scenarios (same item_ids, axes, and option values; only surface text differs), registered under "vi". - vi answer/behavior/generate suffixes. - titles/vietnam.vi.json: curated Vietnamese Wikipedia titles + categories, grounded (law/religion/family/civic/ethics — Confucian/Buddhist value content) vs value-neutral twin, mirroring titles/egypt.ar.json. Definitions only here; the persona/CLI wiring rides with the guardrails commit (shared files). GROUND_TRUTH already carries Vietnam (0.16, -0.44). Signed-off-by: Jesse <jesseneumann@gmail.com>
Make the guardrail conjuncts of the pre-registered decision actually meaningful. capability.py: replace the toy 4-item MMLU (which saturated at 1.00 by noise) with a ~24-item bilingual (EN/AR) general-knowledge bank — multi-domain, varied answer indices, measured in the corpus's own language — plus an optional real MMLU / Arabic-MMLU loader (use_external, best-effort via `datasets` on the GPU box, falling back to the bank so a run never dies on the capability check). safety.py (new): a deterministic refusal probe. For harmful-request stems (EN/AR) it compares the log-prob of a refusal lead-in vs. a compliance lead-in; the reported refusal rate flags whether CPT eroded safety (spec S1/S3). Contains no operational harmful content — only stems and two generic lead-ins. Both flow through ArmResult (capability_acc, safety_refusal) and stats: the go/no-go gains a 4th conjunct, max_safety_drop, alongside max_capability_drop. run.py / run_stats.py print the refusal column and accept --max-safety-drop. Shared-file wiring for the Vietnam instrument (vi persona template + vi added to --instrument-lang choices + the generalized battery-structure / vi smoke tests) rides along here since it touches the same files. Signed-off-by: Jesse <jesseneumann@gmail.com>
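The refusal probe described above reduces to a per-stem log-prob comparison. The sketch below assumes a `score_continuation(prompt, continuation)` callable that returns a mean log-prob, matching the primitive named in an earlier commit; the exact signature, the `refusal_rate` helper, and the stub scorer are assumptions for illustration.

```python
def refusal_rate(score_continuation, stems, refusal_lead, compliance_lead):
    """Fraction of stems where the refusal lead-in outscores the compliance one."""
    if not stems:
        raise ValueError("need at least one stem")
    refused = sum(
        score_continuation(stem, refusal_lead)
        > score_continuation(stem, compliance_lead)
        for stem in stems
    )
    return refused / len(stems)


def stub_scorer(prompt, continuation):
    """Toy scorer: prefers refusing only when the prompt is flagged [harmful]."""
    wants_refusal = "[harmful]" in prompt
    is_refusal = continuation.startswith("I can't")
    return -1.0 if wants_refusal == is_refusal else -5.0


rate = refusal_rate(
    stub_scorer,
    ["[harmful] stem one", "[harmful] stem two", "benign stem"],
    refusal_lead="I can't help with that",
    compliance_lead="Sure, here is",
)
```

Comparing the base model's rate with each arm's rate gives the safety drop that the new max_safety_drop conjunct gates on.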
fetch_corpus.py --translate assembles Arm 3 (the grounded corpus machine-translated into the base model's language, English) so the decisive content-vs-language comparison (decisive_grounded_vs_translated) stops being a skipped 0.0. After the grounded arm is balanced, each doc is MT'd with a Helsinki-NLP Opus-MT model (opus-mt-<lang>-en, falling back to opus-mt-mul-en), sentence-chunked to respect the model's length limit, re-run through English WVS decontamination, written to grounded_translated.jsonl, and declared in the manifest with lang=en. The arm is exempt from the twin control (different language + post-MT length) but still license/lang/decontamination checked; the harness runs it automatically when the manifest declares it. deploy/run_on_instance.sh gains TRANSLATE=1 to pass --translate. data/README and HANDOFF document Arm 3, the real guardrails, and the Vietnam pilot.
Signed-off-by: Jesse <jesseneumann@gmail.com>
The generic (question/choices/answer) column mapping didn't match either Arabic MMLU on the Hub, so the AR capability probe silently fell back to the embedded bank. Replace the rigid tuple schema with small per-dataset adapters: cais/mmlu (choices list + int answer) for EN, and MBZUAI/ArabicMMLU 'All' (Option 1..5 columns, dropping 'None'; letter Answer Key; optional Context prepended) for AR. Verified on-box: EN and AR both load real items; failures still degrade to the bank. Signed-off-by: Jesse <jesseneumann@gmail.com>
…t replicate First run with the new machinery end-to-end on Qwen3-4B (Vast.ai 5090, 3 seeds, Arabic survey, generate behavior, ~300k tokens/arm): the grounded_translated arm (Arm 3, MT ar->en), real cais/mmlu + MBZUAI/ArabicMMLU capability, and the refusal/safety probe. Result is a non-replication of Run 5's headline. On a freshly-fetched corpus, grounded-language collapses from +0.080 (z=7.26) to -0.008 (z=-0.29); all three CPT arms drift away from Egypt; grounded == grounded_translated (z=0.05, content language is irrelevant); prompting beats CPT (z=-6.27). Real guardrails both pass (capability flat ~0.34, refusal ~1.00). Honest read recorded: no robust evidence that grounded micro-CPT moves IW coordinates more than neutral CPT at this scale; Run 5 looks corpus-sample-specific. FINDINGS + HANDOFF updated accordingly. Signed-off-by: Jesse <jesseneumann@gmail.com>
Run 5's grounded-vs-language effect (z=7.26) did not survive a fresh corpus pull (Run 6, z=-0.29). Within one corpus the seeds vary only measurement (HF training is deterministic across seeds), so the cross-seed std understates the true variance — which is *which documents land in the twin*. Adds run_corpus_resampled: re-runs the whole multi-seed experiment on N deterministic token-budget subsamples of the pool and decides PASS/FAIL on the cross-draw spread of the decisive comparison. Each draw samples a fraction of each arm's token mass (not document count), so the matched-twin token-budget control holds per draw; draws are SHA-256-seeded for exact reproduction. Writes a partial.json checkpoint after each draw so a long sweep is pollable and an interrupted spot box loses nothing. Wired through run_stats.py (--corpus-draws/--corpus-fraction) and the deploy runner (CORPUS_DRAWS/CORPUS_FRACTION). 46 tests green. Signed-off-by: Jesse <jesseneumann@gmail.com>
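A deterministic token-mass draw of the kind described here could look like the sketch below. It is an assumption-laden illustration: `draw_subsample`, the seed string format, and whitespace token counting are all hypothetical, and the real run_corpus_resampled may differ in every detail.

```python
import hashlib
import random


def draw_subsample(docs, fraction, draw_index, token_count=lambda d: len(d.split())):
    """Pick documents until `fraction` of the arm's token mass is reached.

    The shuffle order is seeded from a SHA-256 digest of the draw index, so
    the same index always reproduces the same draw.
    """
    digest = hashlib.sha256(f"corpus-draw:{draw_index}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    order = list(docs)
    rng.shuffle(order)

    budget = fraction * sum(token_count(d) for d in docs)
    picked, used = [], 0
    for doc in order:
        if used >= budget:
            break
        picked.append(doc)
        used += token_count(doc)
    return picked


# Twenty toy documents of increasing length (230 whitespace tokens in total).
pool = [f"doc{i} " + "tok " * (i + 1) for i in range(20)]
draw_a = draw_subsample(pool, 0.7, draw_index=0)
draw_b = draw_subsample(pool, 0.7, draw_index=0)
```

Budgeting by token mass rather than document count is what lets the grounded and language-matched arms stay token-matched within each draw, provided both are sampled at the same fraction.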
A large twin corpus is hundreds of requests; Wikipedia throttles bursts with HTTP 429. The fetcher had no backoff, so a 429 just skipped the title — which dropped an entire arm (a real run fetched grounded, then got 429 on every language_matched title -> empty arm -> fatal 'one or both arms came back empty'). Routes both API call sites through _http_get_json: a courtesy inter-request delay plus exponential backoff (honoring Retry-After) on 429/5xx. Genuinely missing titles still skip-and-continue as before. Signed-off-by: Jesse <jesseneumann@gmail.com>
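The retry behaviour described above might be sketched as follows. This is not the actual `_http_get_json`: the function name, parameters, and defaults are assumptions, `opener` and `sleep` are injectable only to make the sketch testable, and only the integer-seconds form of Retry-After is handled.

```python
import json
import time
import urllib.error
import urllib.request


def http_get_json(url, *, retries=5, base_delay=1.0, courtesy_delay=0.1,
                  opener=urllib.request.urlopen, sleep=time.sleep):
    """GET a JSON document, backing off on HTTP 429 and 5xx responses."""
    for attempt in range(retries + 1):
        sleep(courtesy_delay)  # courtesy pause before every request
        try:
            with opener(url) as response:
                return json.load(response)
        except urllib.error.HTTPError as err:
            retryable = err.code == 429 or 500 <= err.code < 600
            if not retryable or attempt == retries:
                raise  # e.g. 404: the caller can skip the title and continue
            header = err.headers.get("Retry-After") if err.headers else None
            if header and header.isdigit():
                delay = float(header)  # the server told us how long to wait
            else:
                delay = base_delay * 2 ** attempt  # exponential backoff
            sleep(delay)
```

Non-retryable errors propagate unchanged, which would keep the existing skip-and-continue handling for genuinely missing titles intact.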
4 corpus draws (deterministic 70% token-mass subsamples of a ~300k-token pool), full multi-seed per draw, decision on the cross-draw band. grounded-language is +0.040 +/- 0.044 (z=0.91): small, positive on average (3/4 draws positive), not significant against the real corpus-resampling noise. This explains the earlier contradiction — Run 5 (+0.080, z=7.26) and Run 6 (-0.008, z=-0.29) are two draws from a distribution centered ~+0.04 with sigma~0.044, and both z's were computed against a measurement-only band. Absolute grounded shift slightly negative (-0.034); prompting still beats CPT (z=-2.53). Verdict FAIL (underpowered, not null). Next: grow the per-draw effect (more tokens/epochs) then re-resample. Signed-off-by: Jesse <jesseneumann@gmail.com>
…go/no-go A multi-seed HF run is many GPU-hours; a crash in aggregation or a spot interruption used to throw away all completed training. Now run_stats.py persists each seed's raw result to seeds/seed_<s>.json as it finishes, and aggregate_runs() (split out of run_multiseed, pure CPU) rebuilds the go/no-go from those checkpoints with no GPU. re_aggregate.py drives that offline. Also harden aggregation against degenerate models: non-finite (nan/inf) survey/behavior scores are dropped from _mean_std (statistics.stdev raised on them) and the exact arm/metric/seed is recorded in the result caveat rather than silently averaged. Validated on a CPU smoke run: checkpoints round-trip to an identical verdict and comparisons; non-finite samples are dropped without crashing. Signed-off-by: Jesse <jesseneumann@gmail.com>
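The hardening of the aggregation step can be illustrated with a small sketch that drops non-finite samples and reports which seeds were dropped. `mean_std` here is a hypothetical simplification of `_mean_std`; keying samples by seed and returning the dropped keys is an assumed way of feeding the result caveat.

```python
import math
from statistics import mean, stdev


def mean_std(samples_by_seed):
    """Mean and sample std over finite values, plus the seeds that were dropped."""
    finite = {s: v for s, v in samples_by_seed.items() if math.isfinite(v)}
    dropped = sorted(s for s in samples_by_seed if s not in finite)
    values = list(finite.values())
    m = mean(values) if values else math.nan
    sd = stdev(values) if len(values) > 1 else 0.0
    return m, sd, dropped


# Seed 1 produced a degenerate (nan) score; it is excluded and reported.
m, sd, dropped = mean_std({0: 0.1, 1: float("nan"), 2: 0.3})
```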
Add LITERATURE.md — an annotated reading list (canon + newer June-2026 work) mapping each paper to specific runs and open questions (away-drift, the Run 5/6/7 noise band, FedAvg-collapse risk, the survey-vs-behavior gap). Move the experiment spec from tech-docs/experiments/ into the experiment dir as SPEC.md so the whole EXP-001 unit (spec, README, FINDINGS, HANDOFF, LITERATURE, code) is self-contained; the old tech-docs/experiments/ folder was a one-file orphan nothing else linked to. Fix all inbound/outbound links and stale code comments accordingly, and trim the README's duplicated arms table to point at SPEC.md. Signed-off-by: Jesse <jesseneumann@gmail.com>
…robustness
Scaled single corpus to 807k/673k tokens (CAT_LIMIT=150) and 6 epochs on Qwen3-4B, Arabic survey, 3 seeds. grounded-language reached its biggest point estimate (+0.108) but z=1.15 (<2sigma): variance grew with the effect. FAIL (also fails the safety conjunct: refusal drop +0.125). Two findings outweigh the verdict:
- "HF training is deterministic across seeds" is false at this scale. The seed changes the training OUTCOME: seed 0's neutral arm catastrophically degenerated (capability 0.79->0.08, refusal->0, coordinate->origin) while seeds 1-2 stayed healthy. So the cross-seed band is real training stochasticity, not the measurement-only noise every prior run assumed, which retroactively explains why Run 5's z=7.26 was illusory.
- The grounding effect is a forgetting-robustness asymmetry, not value pull. grounded preserves the model (cap 0.79, refusal 0.88, shift ~0); value-neutral Arabic CPT damages it (cap 0.51, refusal 0.62, drift -0.129). Absolute grounded shift is -0.021: no net pull toward Egypt. So grounded-language>0 means value-laden text is gentler under CPT than neutral technical text, which answers the away-drift puzzle (it's catastrophic forgetting) and makes a replay/anchor mitigation arm the clear next experiment.
Update FINDINGS trend table (8 runs), Interpretation, Next-experiment; refresh HANDOFF headline + one-line orientation. Result + per-seed checkpoints in runs/egypt_stats_scaled/ (git-ignored). Harness fix that made this run crash-proof is adcd9f5.
Signed-off-by: Jesse <jesseneumann@gmail.com>
The deploy/ recipe and the HANDOFF playbook hardcoded our own infra — machine id 138905, host alpha, offer ids, and the API-key path. None are secrets, but this branch targets the public The-AI-Alliance/tapestry, so the identifiers shouldn't travel upstream. Split reusable know-how from our infra: - Generic, placeholder'd operational lessons stay tracked in deploy/README.md and HANDOFF.md (CUDA-12.8/Blackwell requirement, bid-doesn't-stick, unencrypted-key, create->bid->attach->start->rsync->run->poll->pull->destroy). - Real values move to deploy/vast.local.md, git-ignored via a new deploy/*.local.md rule. run_on_instance.sh was already env-parameterized (no identifiers), so it stays as-is. Signed-off-by: Jesse <jesseneumann@gmail.com>
Run 8 reframed the "grounding beyond language" effect as a forgetting-robustness asymmetry (value-neutral CPT craters capability/refusal; grounded CPT does so less), with a second large variance source: training is NOT deterministic across seeds. This builds the headline next experiment (Run 9), the clean test that separates H-forget from H-value, plus the stabilization Run 8 called for.
- grounded_replay arm (--replay-fraction F): mixes a fraction F of general, value-neutral English text (the base model's pretraining distribution) into the grounded CPT to rehearse against catastrophic forgetting. Reports replay_vs_grounded (the replay effect) and replay_vs_language (grounding beyond language, forgetting suppressed). Opt-in, so the default 5-arm run is unchanged.
- fetch_corpus.py --replay builds the replay corpus (broad science/tech/nature EN topics, decontaminated, token-capped, value_laden=false); exempt from the twin control but license/lang/decontamination-checked.
- HF training stabilization: linear LR warmup→decay (--warmup-frac), gradient clipping (--max-grad-norm), per-epoch deterministic shuffling (also interleaves the replay mix), and torch RNG seeding so a run is reproducible for a fixed seed yet genuinely varies across seeds (the real training stochasticity Run 8 found). Smoke backend accepts but ignores the knobs to stay byte-reproducible.
- Pre-registered decision unchanged (still keyed on grounded): replay is reported, not gated. KL-anchoring deferred (a frozen 4B reference won't fit one 32GB GPU).
- Wired through run.py / run_stats.py CLIs and the Vast.ai runner (REPLAY_FRACTION/WARMUP_FRAC/MAX_GRAD_NORM env). 51 tests green; ruff/black clean.
- Docs: FINDINGS "Run 9" recipe + how-to-read; HANDOFF + data/README updated.
Signed-off-by: Jesse <jesseneumann@gmail.com>
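The linear warmup→decay schedule named in this commit can be sketched in plain Python. The exact shape used by the harness is not stated, so `lr_at`, the step-indexing convention, and decaying all the way to zero are assumptions.

```python
def lr_at(step, total_steps, peak_lr, warmup_frac):
    """Linear ramp to peak_lr over the warmup steps, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


# 100 optimizer steps, 10% warmup, peak at the transformer LR quoted earlier (2e-5).
schedule = [lr_at(s, total_steps=100, peak_lr=2e-5, warmup_frac=0.1) for s in range(100)]
```

Starting small and ending small limits the size of the earliest and latest updates, which is one plausible reason a schedule like this would reduce the seed-dependent degeneration seen in Run 8.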
…ounding effect
Run 9 (3 seeds, 800k tokens, 6 epochs) ran the replay arm + training stabilization on a Vast 5090. Stabilisation (warmup→decay, grad clipping, per-epoch shuffle, seed-dependent RNG) was the decisive lever:
- It removed Run 8's seed-degeneration: capability preserved everywhere (base 0.79 → grounded 0.79, std ±0.016), no arm cratered.
- With forgetting thereby gone, the grounding effect SURVIVED: grounded − language = +0.088 ± 0.030 (z=2.89, clears 2σ for the first time on an honest band), absolute grounded shift +0.057 (≥0.05), zero capability drop. This points to genuine value acquisition, not only the forgetting-robustness asymmetry Run 8 inferred (that was an artifact of unstable training).
- The replay arm did NOT behave as the forgetting hypothesis predicted: it slightly diluted the pull (replay − grounded = −0.024, ns) and did not restore refusal, consistent with the value-pull being real.
- Prompting no longer beats CPT (grounded − surface = −0.006, z=−1.13, a tie).
Pre-registered verdict: still FAIL, but now on the SAFETY conjunct alone (refusal 1.00→0.88, a 0.125 drop > 0.10); shift, grounding effect (z≥2), and capability all PASS. Caveats recorded: z=2.89 is the cross-seed band, not the cross-corpus band (Run 7's real noise source), so re-resample on the stabilised setup next; and the Arabic-CPT safety drop is now the binding failure. Numbers transcribed in FINDINGS (runs/ is git-ignored). Adds the Run 9 row to the trend table and a Run-9-first interpretation; refreshes HANDOFF + next steps.
Signed-off-by: Jesse <jesseneumann@gmail.com>
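The four-conjunct verdict described above can be sketched roughly as follows. The shift, z, and refusal thresholds are the ones quoted in this commit message; the capability tolerance is not stated here, so the value used is a placeholder assumption, and `go_no_go` / `Thresholds` are hypothetical names rather than the harness's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    min_shift: float = 0.05            # quoted above: absolute grounded shift >= 0.05
    min_z: float = 2.0                 # quoted above: grounding effect z >= 2
    max_refusal_drop: float = 0.10     # quoted above: a 0.125 drop exceeds 0.10
    max_capability_drop: float = 0.05  # ASSUMPTION: tolerance not given in this text


def go_no_go(shift: float, z: float, capability_drop: float, refusal_drop: float,
             t: Thresholds = Thresholds()) -> dict:
    """Evaluate each conjunct separately; the verdict is PASS only if all four hold."""
    conjuncts = {
        "shift": shift >= t.min_shift,
        "grounding_effect": z >= t.min_z,
        "capability": capability_drop <= t.max_capability_drop,
        "safety": refusal_drop <= t.max_refusal_drop,
    }
    return {"conjuncts": conjuncts, "verdict": "PASS" if all(conjuncts.values()) else "FAIL"}


# Run 9 numbers as reported in the commit message above.
run9 = go_no_go(shift=0.057, z=2.89, capability_drop=0.0, refusal_drop=0.125)
failing = [name for name, ok in run9["conjuncts"].items() if not ok]
```

Reporting the conjuncts individually, not just the overall verdict, is what lets a FAIL be read as "safety alone" rather than as a failure of the grounding effect.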
…), never on-demand; if rented use the other GPU or wait Signed-off-by: Jesse <jesseneumann@gmail.com>
HANDOFF is the orientation/scratch doc and keeps attracting local + personal work notes (env specifics, in-progress run state). Make it git-ignored and untrack it so those notes can live in it freely. FINDINGS.md remains the tracked durable record; SPEC.md + data/README.md + deploy/README.md carry the public-facing contract. Signed-off-by: Jesse <jesseneumann@gmail.com>
…ase-CPT de-confounders, relative-first reporting Signed-off-by: Jesse <jesseneumann@gmail.com>
Acts on an external review: the headline grounding effect may be a register/genre confound, not cultural content, and the instruct-alignment decay should be de-confounded by running on a base model.
neutral_prose arm (register control):
- The matched twin holds language + token budget constant but NOT genre: grounded is discursive normative prose, language_matched is terse/STEM. The new neutral_prose arm is a value-neutral but DISCURSIVE same-language twin (biography/history/geography/arts), so grounded − neutral_prose isolates cultural content from register. If it's ~0 while grounded − language is positive, the effect is a genre artifact.
- titles/egypt.ar.json neutral_prose block (+ categories); fetch_corpus.py --neutral-prose build + manifest arm (value_laden=false, exempt from the matched-twin token control, still license/lang/decontamination-checked, capped to --max-tokens); corpora.py placeholder; experiment.py arm + decisive grounded_vs_neutral_prose; stats.py comparison + DrawSummary field; runner NEUTRAL_PROSE=1 knob. The arm runs whenever the manifest declares it.
Base-model CPT support (no code change needed):
- The harness is model-agnostic (teacher-forced log-prob scoring, no chat template), so MODEL=Qwen/Qwen3-4B-Base just works. Documented in the runner + data/README: on base there's no alignment to erode, so value-pull reads cleanly; caveat is base models don't instruct-follow (consider light SFT or relative-only).
52 tests green; ruff/black clean. data/README + FINDINGS document both controls.
Signed-off-by: Jesse <jesseneumann@gmail.com>
The register test (instruct + neutral_prose twin) and the base-model de-confound are independent and each fits one 32 GB GPU, so on the 2× RTX 5090 box they run at once. deploy/run_two_gpu.sh fetches the corpus once (with --neutral-prose), then splits the GPUs via CUDA_VISIBLE_DEVICES: instruct+register on GPU 0, base on GPU 1. Each runs its seeds as isolated single-seed processes (a preemption only costs the in-flight seed) and re-aggregates offline with re_aggregate.py. MODE=smoke dry-runs the orchestration on CPU; validated end to end (both result.jsons carry grounded_vs_neutral_prose). FINDINGS points the next-steps at it. Signed-off-by: Jesse <jesseneumann@gmail.com>
The per-lane redirect (> ${OUT}.log) is opened by the shell before run_on_gpu's
own mkdir runs, so when runs/ didn't exist both background lanes died instantly
with 'No such file or directory' and the script exited thinking it was done. mkdir
the out dirs up front. (Smoke test missed it because /tmp already existed.)
Signed-off-by: Jesse <jesseneumann@gmail.com>
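The ordering bug above (the log redirect is opened before the directory exists) is a shell issue, but the same rule can be illustrated in Python: create the log's parent directory before opening the log, then launch the background lane. This is only an analogue of the fix, with a hypothetical `launch_lane` helper; it is not the content of run_two_gpu.sh.

```python
import subprocess
from pathlib import Path


def launch_lane(cmd: list[str], log_path: Path) -> subprocess.Popen:
    """Start a background lane with stdout/stderr redirected to log_path."""
    # Create the directory first: opening the log fails with
    # "No such file or directory" if its parent does not exist yet.
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "w") as log:
        # The child inherits its own copy of the file descriptor, so closing
        # the parent's handle when the with-block exits is safe.
        return subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
```

A caller would typically keep the returned `Popen` objects and check each lane's exit code after `wait()`, so that a lane dying at startup is not mistaken for a finished run.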
Ran the two external-review de-confounders in parallel on a 2x 5090 box (deploy/run_two_gpu.sh): GPU 0 = instruct + neutral_prose register twin (10a), GPU 1 = same arms on Qwen3-4B-Base (10b). Both support cultural value content.
10a: register confound REJECTED. The value-neutral *discursive* twin moved the coordinate -0.035 (like the terse language_matched twin's -0.029), NOT like grounded (+0.057). So grounded - neutral_prose = +0.092, even larger than grounded - language (+0.086, z=2.12). Register is not the driver; value content is. (grounded - neutral_prose is z=1.77: right sign, wide variance, directionally decisive against the artifact hypothesis, not yet >2sigma.)
10b: base model gives the FIRST full PASS in ten runs. On Qwen3-4B-Base (no RLHF alignment to erode): grounded - language +0.032 (z=3.02, tight), absolute shift +0.051 (>=0.05), capability 0.92->0.92, refusal 0.88->0.88 (no safety regression) -> all four conjuncts PASS. Confirms the instruct safety FAIL was alignment decay, not the corpus. Base behavioral probe also moves toward Egypt (+0.067), and CPT edges out the persona prompt for the first time (grounded - surface +0.021).
Caveats kept explicit: both z's are cross-seed (Run 7's cross-corpus band is still the decisive test, now the single remaining GPU run that matters), and the PASS leans partly on the absolute shift vs a map-rescaled target. Trend table + next-steps updated; register/base/safety items marked resolved. Artifacts in runs/ (git-ignored).
Signed-off-by: Jesse <jesseneumann@gmail.com>
5949670 to
cfea3e9
The decisive Run-11 test (CORPUS_DRAWS=4 CORPUS_FRACTION=0.7) routes through run_corpus_resampled, which ran the whole multi-seed experiment per draw in one process with no resume. That is incompatible with run_two_gpu.sh's per-seed isolation and fatal on the preemptible box (every preemption restarts from draw 0).
- run_corpus_resampled: optional cache_dir persists each completed draw to draws/draw_<d>.json and reloads it on a later call, skipping the GPU work. comparison_names + caveat are cached too so a fully-resumed sweep still aggregates. Re-run the identical command to resume; only unfinished draws cost GPU.
- run_stats.py: pass cache_dir=out/draws on the resampled path.
- run_two_gpu.sh: CORPUS_DRAWS>1 runs one seeds×draws sweep per GPU lane (decides on the cross-corpus band) instead of per-seed isolation; both lanes resume independently from their draw caches.
Verified in smoke mode: full resume, partial resume (only the missing draw reruns), both GPU lanes fire the branch. 52 tests pass.
Signed-off-by: Jesse <jesseneumann@gmail.com>
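The resume behaviour described above (persist each finished draw to draws/draw_<d>.json, reload it on the next call) can be sketched as below. `run_draws_resumable` and the `run_draw` callback are hypothetical stand-ins for run_corpus_resampled and the per-draw GPU work; the write-then-rename step is an added assumption about how to keep a preempted write from leaving a corrupt cache file.

```python
import json
from pathlib import Path
from typing import Callable


def run_draws_resumable(n_draws: int, run_draw: Callable[[int], dict],
                        cache_dir: Path) -> list[dict]:
    """Run draws 0..n_draws-1, skipping any draw whose result is already cached."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for d in range(n_draws):
        path = cache_dir / f"draw_{d}.json"
        if path.exists():
            # Completed on an earlier call: reload instead of recomputing.
            results.append(json.loads(path.read_text()))
            continue
        result = run_draw(d)  # the expensive step (GPU training + scoring)
        tmp = path.parent / (path.name + ".tmp")
        tmp.write_text(json.dumps(result))
        tmp.replace(path)  # rename only after the full result is on disk
        results.append(result)
    return results
```

Re-running the identical call after a preemption then costs only the draws that had not finished, which matches the "re-run the identical command to resume" contract in the commit.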
Replace the eyeballed map-read IW targets in GROUND_TRUTH with the exact published factor scores from the EVS/WVS joint 2023 cultural-map data file (CulturalMapFinalEVSWVS_2023): columns TradAgg (Traditional<->Secular) and SurvSAgg (Survival<->Self-expression), which are this code's TS/SS axes with the same sign convention. Most recent available wave per country (WVS-7 era where present). Same _from_map rescale (/2.5, clamp); only the inputs change. Materially corrects Egypt (the run-11 culture): (-1.8,-1.3) eyeball -> (-0.8544,-2.2318) -> target (-0.34,-0.89), i.e. far more survival-pole and less traditional than the approximation. Makes the absolute shift-toward-target conjunct trustworthy; the relative cross-corpus band was always target-independent. 52 tests pass. Signed-off-by: Jesse <jesseneumann@gmail.com>
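As a worked check of the rescale mentioned above (divide by 2.5, then clamp), the sketch below reproduces the Egypt numbers from this commit. The function name is hypothetical and the clamp range of [-1, 1] is an assumption; the commit only says "/2.5, clamp".

```python
def from_map_sketch(ts: float, ss: float, scale: float = 2.5) -> tuple[float, float]:
    """Rescale published IW factor scores (TS, SS) into the harness's target range."""
    def clamp(v: float) -> float:
        # ASSUMPTION: targets are clamped to [-1, 1].
        return max(-1.0, min(1.0, v))
    return clamp(ts / scale), clamp(ss / scale)


# Egypt: published (TradAgg, SurvSAgg) scores quoted in the commit message.
egypt_target = tuple(round(v, 2) for v in from_map_sketch(-0.8544, -2.2318))
# Previous eyeballed map read, for comparison.
egypt_old = tuple(round(v, 2) for v in from_map_sketch(-1.8, -1.3))
```

Here `egypt_target` comes out as (-0.34, -0.89), matching the target stated in the commit, while the old eyeballed input maps to (-0.72, -0.52), which shows how far the absolute target moved.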
The corpus-resample sweep failed at load: with the full-pool grounded/language ratio at 19.99% (right on the 20% twin tolerance), shrinking each arm to a fraction of its OWN token mass let per-document granularity tip a draw to 20.3%, raising CorpusError before any training. subsample_documents gains target_tokens; load_arm_documents, when resampling a twin arm, drives BOTH arms to a common budget (fraction x smaller pool) so each draw is genuinely token-matched with margin instead of inheriting the edge ratio. Only triggers for fraction<1 draws — single-draw (Run 10) behaviour is unchanged. 52 tests pass. Signed-off-by: Jesse <jesseneumann@gmail.com>
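The fix above, driving both twin arms to one common token budget (fraction × the smaller pool) instead of a fraction of each arm's own mass, might look roughly like this. The greedy fill, the `(doc_id, n_tokens)` document representation, and the function names are illustrative assumptions; only the "common budget" idea comes from the commit.

```python
import random

Doc = tuple[str, int]  # (doc_id, n_tokens); a simplified stand-in for real documents


def subsample_to_budget(docs: list[Doc], target_tokens: int, rng: random.Random) -> list[Doc]:
    """Shuffle, then greedily keep documents that still fit under target_tokens."""
    order = docs[:]
    rng.shuffle(order)
    picked, total = [], 0
    for doc_id, n_tokens in order:
        if total + n_tokens <= target_tokens:
            picked.append((doc_id, n_tokens))
            total += n_tokens
    return picked


def matched_draw(grounded: list[Doc], language: list[Doc], fraction: float,
                 seed: int) -> tuple[list[Doc], list[Doc]]:
    """Draw both arms against the SAME budget so the pair stays token-matched."""
    smaller_pool = min(sum(n for _, n in grounded), sum(n for _, n in language))
    budget = int(fraction * smaller_pool)
    rng = random.Random(seed)
    return (subsample_to_budget(grounded, budget, rng),
            subsample_to_budget(language, budget, rng))
```

Because neither arm can exceed the shared budget, the remaining mismatch is bounded by per-document granularity rather than by the full-pool ratio, which is what keeps a draw from inheriting an edge case like the 19.99% one.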
The -runtime PyTorch image has no C compiler, so bitsandbytes can't build its 8-bit-Adam kernel for the 5090's Blackwell sm_120 and silently falls back to fp32 torch.AdamW, which OOMs a 4B full fine-tune mid-training. Recommend the -devel image (bundles gcc/nvcc) in the deploy recipe, with a by-hand recovery note for a -runtime box. Signed-off-by: Jesse <jesseneumann@gmail.com>
Lead FINDINGS with the decisive Run 11 result (base-model grounding effect clears 2σ against the cross-corpus band; absolute shift the lone failing conjunct), replace the reverse-chronological update stack with a synthesized interpretation, and give Runs 9-11 proper log sections. Refresh README: drop stale stub/toy/placeholder claims (HF backend, real MMLU/Arabic-MMLU, real WVS-7 coordinates are all in use), state the full four-conjunct decision rule, and update round-two work. SPEC: mark executed with a pointer to FINDINGS, add the neutral-prose and grounded-replay arms, and name the base + instruct models actually used. Signed-off-by: Jesse <jesseneumann@gmail.com>
…ltural-cpt-validation Match the contrib/ naming convention (cf. jneums-consortium-experiment). Updates the Makefile CULTURAL_CPT_DIR, .gitignore entries, and the in-repo path references in the deploy scripts and docs; make target names are unchanged. Signed-off-by: Jesse <jesseneumann@gmail.com>
Signed-off-by: Dean Wampler <dean.wampler@ibm.com>
Closes #64 (tracking issue with full context).
Description of Changes
Implements EXP-001 (pre-registered in contrib/jneums-cultural-cpt-validation/SPEC.md, sovereign-alignment work group): an end-to-end, reproducible go/no-go for whether culturally grounded continued pretraining (CPT) shifts a model's expressed values toward a target culture more than value-neutral same-language text (the depth-over-shallow bet) without degrading capability or safety.

Real base model (Qwen3-4B, base and instruct) → full-parameter CPT on a matched grounded/neutral Arabic-Wikipedia twin (validity controls enforced at load) → measure the canonical Inglehart-Welzel WVS coordinate (in Arabic) + a free-form behavioral probe + capability (MMLU/ArabicMMLU) + a refusal/safety probe, across seeds and corpus resamples, with a pre-registered four-conjunct PASS/FAIL. Fits one 32 GB GPU.

The experiment is complete (11 runs, see FINDINGS.md). The methodological arc is the substance: early single-corpus "passes" turned out to be noise-band artifacts (the cross-seed band understates the truth because HF training is near-deterministic across seeds), so the decision was moved onto the cross-corpus band, and two confounds were stripped (a discursive register twin; the RLHF instruct → base model).

Headline result. On the base model, the core claim holds against the real noise band: grounded − language = +0.051 ± 0.017, z ≈ 3.0 across 4 independent corpus draws. Grounded CPT shifts the model toward the culture more than a language-matched corpus does, with capability and refusal preserved. This is the novel, load-bearing H1(b) comparison, and it is the first time it clears 2σ against corpus resampling.

The pre-registered go/no-go is still a FAIL, but now on one conjunct only: the absolute shift (+0.039) is just under the 0.05 bar (two of four draws clear it), positive but underpowered at this corpus scale. The effect requires the base model: on the aligned instruct model it does not survive corpus resampling (z ≈ 0.0), and the instruct safety regression seen earlier was alignment decay, not the corpus. The nulls and the one-conjunct miss are reported honestly; the harness is doing its job as a go/no-go.
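For readers checking the headline arithmetic, the z value is consistent with dividing the mean effect by its band. The sketch below assumes the band is the sample standard deviation of the per-draw effects; the text does not say whether it is a standard deviation or a standard error, and the function names are hypothetical.

```python
import statistics


def z_from_summary(mean_effect: float, band: float) -> float:
    """z as reported in the text: mean effect divided by the +/- band."""
    return mean_effect / band


def band_summary(per_draw_effects: list[float]) -> tuple[float, float, float]:
    """Mean, band, and z over per-draw (grounded - language) effects.

    ASSUMPTION: the band is the sample standard deviation across draws.
    """
    mean = statistics.mean(per_draw_effects)
    band = statistics.stdev(per_draw_effects)  # needs at least two draws
    return mean, band, z_from_summary(mean, band)


# Headline numbers from the description: +0.051 +/- 0.017.
headline_z = round(z_from_summary(0.051, 0.017), 1)
```

With the reported summary values this gives 3.0, in line with the stated z ≈ 3.0; the absolute-shift conjunct is a separate check (+0.039 against the 0.05 bar) and is unaffected by this ratio.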
Related Issues
Testing Performed
- Tests (contrib/jneums-cultural-cpt-validation/tests/), all green; ruff + black (line 120) clean.
- Smoke mode (--mode smoke) exercises the full pipeline (arms → CPT → survey → scoring → go/no-go) with no GPU on every change.
- Run results are recorded in FINDINGS.md; per-seed/per-draw checkpoints + offline re-aggregation (re_aggregate.py) make multi-hour runs crash-recoverable, and the corpus-resample sweep is resumable.
Code Changes
- contrib/jneums-cultural-cpt-validation/: cultural_cpt/ (model backends, corpus loader + validity controls, WVS/behavior/judge/capability/safety instruments, multi-seed + corpus-resample stats and go/no-go), CLIs (run.py, run_stats.py, re_aggregate.py, fetch_corpus.py), SPEC.md / README.md / FINDINGS.md / LITERATURE.md, deploy/ (single- and two-GPU runners), and tests.
- No changes outside contrib/ except a .gitignore entry (ignore bulk corpora, run outputs, and local deploy identifiers).
Example Usage
Checklist