Status: MVP A implemented with a first reference run. MVP B is in design and will be wired into a notebook later.
This page documents the first effective layer MVP experiments for TU Q091.
It does not claim that Q091 is solved as a mathematical problem or as a full benchmark.
The scripts here are small and fully inspectable. You can re-run them with your own OpenAI API key to reproduce the qualitative patterns, but the exact numbers will drift.
Notebooks and assets
Screenshots from the first reference run live next to this README:
- Q091A.png – synthetic item bank preview
- Q091A2.png – summary table and tension scores
- Q091A3.png – bar plots of T_ECS_range
TU Q091 is the equilibrium climate sensitivity problem inside the Tension Universe.
In physical terms it asks how much the long-run global mean temperature responds to a doubling of atmospheric CO₂.
At the effective layer we do not run climate models or work with raw observational datasets.
We only look at
- how language models reason about climate sensitivity
- how stable their internal stories are across prompts
- how much tension appears when narratives drift away from fixed reference windows
This MVP keeps the scope narrow and text-based.
- All inputs are short synthetic descriptions and scenarios.
- We encode a simple reference window for plausible climate sensitivity values.
- We define scalar observables that capture inconsistency and narrative drift.
The canonical S-problem statement and the full TU Q091 formalism live in the BlackHole Q091 entry.
This page is a notebook-style companion that records how the first experiments are set up and what the initial runs look like.
If we ask a model to read short climate descriptions and then
- provide a numerical estimate for climate sensitivity, and
- explain that estimate in natural language,
can we define a scalar observable called T_ECS_range that
- stays low when estimates sit inside a fixed reference window and stories match the numbers, and
- rises when explanations and numeric claims contradict each other?
The objective is to test whether simple climate sensitivity reasoning remains coherent across many prompts, even when the evidence pattern is varied.
At a high level the notebook Q091_A.ipynb does the following.
- Uses a single chat model as the underlying engine.
  - The default is gpt-4o-mini.
  - The model name is set in one place at the top of the notebook and can be edited before running.
- Defines a small synthetic item bank of equilibrium sensitivity stories.
  Each item is a short description of a climate evidence pattern.
  In the reference run, the bank contains eight items:
  - C01 – Historical warming with multi-line evidence (medium ECS) → bucket MEDIUM
  - C02 – Paleoclimate strong-response case (high ECS) → bucket HIGH
  - C03 – Energy-balance study with weak feedbacks (low ECS) → bucket LOW
  - C04 – Multi-source constraint narrowing around 2.5–3.x °C → bucket MEDIUM
  - C05 – High-end ensemble member emphasising strong positive feedbacks → bucket HIGH
  - C06 – Historical-only fit with cautious priors → bucket MEDIUM
  - C07 – Short-term variability emphasised, but multiple lines of evidence → bucket MEDIUM
  - C08 – Hypothetical strong-stabilising feedback world → bucket LOW

  For each item we record a ground-truth label:
  - a qualitative bucket: LOW, MEDIUM, or HIGH climate sensitivity;
  - a reference numeric interval [ecs_min_true, ecs_max_true]; for example, a “medium” item may be assigned 2.0–4.0 °C per CO₂ doubling.

  A static preview of the item bank from the first run is stored as Q091A.png.
- Fixes a global reference window for plausible equilibrium sensitivity values.
  For this MVP the window is a wide interval (for example 0.5–7.0 °C per doubling), editable in the notebook header.
  Any confidence band that pushes far beyond this window is treated as implausible at the effective layer.
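The item bank and global window described above could be represented roughly as follows. This is a minimal sketch, not the notebook's code: the `Item` class, the `band_within_window` helper, and the intervals for C02 and C03 are assumptions made for illustration. Only the 2.0–4.0 "medium" interval and the 0.5–7.0 window come from the text, where they are themselves given as examples. The strict containment check is also a simplification of "pushes far beyond this window".

```python
from dataclasses import dataclass

# Global reference window in °C per CO2 doubling (example value from the text).
ECS_WINDOW = (0.5, 7.0)


@dataclass(frozen=True)
class Item:
    """One synthetic evidence story with its ground-truth label (hypothetical layout)."""
    item_id: str
    title: str
    bucket: str          # "LOW", "MEDIUM", or "HIGH"
    ecs_min_true: float  # lower end of the reference interval
    ecs_max_true: float  # upper end of the reference interval


ITEM_BANK = [
    Item("C01", "Historical warming with multi-line evidence", "MEDIUM", 2.0, 4.0),
    # The two intervals below are illustrative assumptions, not notebook values.
    Item("C02", "Paleoclimate strong-response case", "HIGH", 4.0, 6.0),
    Item("C03", "Energy-balance study with weak feedbacks", "LOW", 1.0, 2.0),
]


def band_within_window(s_low: float, s_high: float, window=ECS_WINDOW) -> bool:
    """Return True when a confidence band is well-formed and sits inside the window."""
    lo, hi = window
    return lo <= s_low <= s_high <= hi
```

With these definitions, `band_within_window(2.0, 4.5)` holds, while a band such as 1.5–9.0 would be rejected because its upper end leaves the window.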
For each item, the protocol has two model-facing steps plus a judge step.
The model receives the item text and is asked to output:
- a point estimate S_est (in °C per CO₂ doubling);
- a confidence band [S_low, S_high];
- a one- to two-paragraph explanation for the estimate.
The notebook parses the answer into structured fields using simple pattern matching.
If parsing fails, the item is treated as high-tension by default.
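The parsing step can be sketched with a few regular expressions. The answer layout assumed here (labelled lines such as `S_est: 3.0` followed by `Explanation: ...`) is hypothetical, as are the function name `parse_estimate` and the constant `PARSE_FAILURE_TENSION`; the notebook's actual prompt format and fallback value may differ.

```python
import re
from typing import Optional

_NUM = r"(-?\d+(?:\.\d+)?)"

# Assumed fallback: a failed parse is scored as maximal tension.
PARSE_FAILURE_TENSION = 1.0


def parse_estimate(answer: str) -> Optional[dict]:
    """Extract S_est, S_low, S_high and the explanation; return None on failure."""
    fields = {}
    for name in ("S_est", "S_low", "S_high"):
        match = re.search(rf"{name}\s*[:=]\s*{_NUM}", answer)
        if match is None:
            return None  # a missing number counts as a parse failure
        fields[name] = float(match.group(1))
    explanation = re.search(r"Explanation\s*:\s*(.+)", answer, flags=re.DOTALL)
    if explanation is None:
        return None
    fields["explanation"] = explanation.group(1).strip()
    return fields


sample = "S_est: 3.0\nS_low: 2.0\nS_high: 4.5\nExplanation: Multiple lines of evidence agree."
parsed = parse_estimate(sample)
```

A caller would then assign `PARSE_FAILURE_TENSION` whenever `parse_estimate` returns `None`, which mirrors the high-tension default described above.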
A second call receives only the explanation from step 1.
It is asked to state:
- which qualitative bucket (LOW, MEDIUM, HIGH) the explanation suggests;
- whether the narrative is implicitly leaning toward lower or higher sensitivity within the global window;
- a short justification.
This step makes the explanation “talk about itself” and exposes hidden inconsistencies, such as a cautious-sounding story attached to an extreme numeric range.
A judge prompt then compares, for each item:
- the quantitative pieces (S_est, S_low, S_high);
- the bucket inferred from the explanation;
- the ground-truth bucket and interval for the item;
- the global reference window.
From these it emits four scores between 0 and 1:
- range_plausibility – how well the confidence band sits inside the global window and overlaps the item’s reference interval.
- bucket_correctness – agreement between the bucket implied by the final numeric estimate and the ground-truth bucket.
- self_consistency – agreement between the numeric estimate and the bucket implied by the explanation text.
- sharpness – credit for non-trivial confidence bands: very wide “anything from 0 to 10 °C” bands are penalised even if they formally cover the ground truth.
From these scores the notebook defines a range tension observable T_ECS_range.
In plain text:
- T_ECS_range increases when range_plausibility is low;
- and when bucket_correctness or self_consistency are low;
- it can also get a small penalty when sharpness is extremely low.
The relative weights are fixed positive constants chosen in the script
(for example b_plaus, b_bucket, b_self, b_sharp).
There is no fitting to the current run: changing the model or the item bank uses the same coefficients.
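One way to read this definition is as a weighted sum of score shortfalls. The sketch below is an assumption about the functional form, and the weight values and the `SHARPNESS_FLOOR` cutoff are hypothetical placeholders; the text names the coefficients (`b_plaus`, `b_bucket`, `b_self`, `b_sharp`) but does not give their values.

```python
# Hypothetical weights; the notebook's actual constants are not stated in this page.
B_PLAUS, B_BUCKET, B_SELF, B_SHARP = 0.4, 0.3, 0.3, 0.1
SHARPNESS_FLOOR = 0.2  # assumed cutoff below which sharpness counts as "extremely low"


def t_ecs_range(range_plausibility: float, bucket_correctness: float,
                self_consistency: float, sharpness: float) -> float:
    """Weighted shortfall of the judge scores, clamped to [0, 1]."""
    tension = (
        B_PLAUS * (1.0 - range_plausibility)
        + B_BUCKET * (1.0 - bucket_correctness)
        + B_SELF * (1.0 - self_consistency)
    )
    if sharpness < SHARPNESS_FLOOR:
        # Small extra penalty that grows as sharpness approaches zero.
        tension += B_SHARP * (SHARPNESS_FLOOR - sharpness) / SHARPNESS_FLOOR
    return min(1.0, max(0.0, tension))
```

Because the weights are constants rather than fitted values, swapping the model or the item bank changes only the inputs to this function, never the coefficients.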
An item is counted as effective-layer coherent when:
- the sensitivity estimate falls inside the reference window;
- the bucket from numbers and the bucket from explanation both match the ground-truth bucket;
- and T_ECS_range falls below a small threshold (currently 0.5 in the reference notebook).
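Read literally, the three conditions above combine into a single boolean check. The function name, argument names, and the 0.5–7.0 default window are assumptions for illustration; the 0.5 threshold is the value quoted for the reference notebook, whose own flagging logic may include further checks not described here.

```python
def is_coherent(s_est: float, numeric_bucket: str, explanation_bucket: str,
                true_bucket: str, tension: float,
                window=(0.5, 7.0), threshold: float = 0.5) -> bool:
    """Literal transcription of the three coherence conditions listed in the text."""
    lo, hi = window
    in_window = lo <= s_est <= hi
    buckets_match = numeric_bucket == true_bucket and explanation_bucket == true_bucket
    return in_window and buckets_match and tension < threshold
```

For example, an estimate of 3.0 °C with all three buckets equal to MEDIUM and zero tension passes, while the same item fails as soon as the explanation bucket flips to HIGH.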
Before running, the design intent for T_ECS_range was:
- For items that clearly belong to a given bucket, the model should provide estimates in the correct qualitative range and explanations that match the numbers.
  These should produce low T_ECS_range and be marked coherent.
- For items near bucket boundaries or with deliberately confusing wording, the model may:
  - give wide or misplaced intervals;
  - tell stories that sound cautious but attach aggressive numbers;
  - flip buckets between estimate step and probe step.

  These should have lower plausibility and self-consistency scores and therefore higher tension.
Aggregating T_ECS_range over many items provides a scalar signal for how stable and self-consistent the climate sensitivity story is under this protocol, for a particular model and prompt.
Once Q091_A.ipynb is checked in, Experiment A can be reproduced as follows:
- Open the notebook
- Read the header comments
  The first cells document:
  - the synthetic item bank (titles, buckets, and reference ranges);
  - the global ECS window;
  - the tension coefficients used in T_ECS_range.
- Decide whether to run live OpenAI calls
  - If you only want to inspect the design, you can leave the USE_OPENAI or RUN_LIVE_CALLS flag set to False.
    In this mode, the notebook prints the item bank, the scoring functions, and empty placeholders without making any API calls.
  - If you want a full run, switch the flag to True and run the setup cell.
    In Colab this will open a secret input field where you can paste your OpenAI API key at runtime.
    The key is entered into this input box only. It is not written anywhere in the source code, and it is not printed in any cell output.
    You should never hard-code the key inside the notebook text or inside this README.
- Run the main experiment cell
  The notebook loops over all items, prints a short log line for each, builds a pandas summary table, and finally renders two simple matplotlib plots:
  - T_ECS_range per item;
  - mean T_ECS_range grouped by ground-truth bucket.
- Compare with the reference run
  If your model, prompt, and temperature settings match the reference, your qualitative pattern should be similar even if exact numbers drift.
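The runtime key entry described in the steps above can be approximated with the standard-library `getpass` module, which shows a masked prompt in a terminal and in Jupyter-style notebooks. This is a sketch under assumptions: the helper name `load_api_key` is hypothetical, and the notebook may implement its setup cell differently.

```python
import getpass
from typing import Optional

RUN_LIVE_CALLS = False  # leave False to inspect the design without any API calls


def load_api_key(run_live: bool) -> Optional[str]:
    """Ask for the key at runtime only when live calls are enabled.

    The key is kept in memory and is never printed or written to disk here.
    """
    if not run_live:
        return None
    key = getpass.getpass("OpenAI API key: ")
    return key.strip() or None


api_key = load_api_key(RUN_LIVE_CALLS)
```

With the flag left at `False`, `api_key` stays `None` and no prompt appears, which matches the inspect-only mode.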
A representative run was carried out on gpt-4o-mini with the eight-item bank described above.
The notebook printed the synthetic item bank, then processed each item with estimate and probe calls, then summarised the scores.
The summary table for that run is captured in Q091A2.png.
Key numbers from the table:
- Overall statistics
  - mean T_ECS_range ≈ 0.017
  - median T_ECS_range = 0.000
  - coherent item rate = 0.875 (7 out of 8 items)
- Per-item pattern
  - Items C01, C03, C04, C05, C06, C07, C08 all had T_ECS_range = 0.00 and were flagged coherent.
  - Item C02 (paleoclimate strong-response, high ECS) had range_plausibility ≈ 0.67, bucket_correctness = 1.0, self_consistency = 1.0, leading to T_ECS_range ≈ 0.13 and a non-coherent flag.
    In plain language: the model’s numbers and stories matched the intended bucket for all items, but in one high-sensitivity case the confidence band pushed too far toward the high end of the global window, and the tension gauge correctly treated it as mildly problematic.
Aggregating by bucket, the same run produced:
- HIGH bucket: mean T_ECS_range ≈ 0.067 (driven by C02)
- MEDIUM bucket: mean T_ECS_range = 0.0 in this sample
- LOW bucket: mean T_ECS_range = 0.0 in this sample
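The aggregation behind these figures can be re-derived from the per-item values quoted above. The notebook builds its summary with pandas; the sketch below uses only the standard library and is not the notebook's code. It takes the rounded C02 tension of 0.13 as input, so the results land close to, not exactly on, the reported 0.017 and 0.067.

```python
from collections import defaultdict
from statistics import mean, median

# (item id, ground-truth bucket, T_ECS_range, coherent flag) as quoted in the text.
rows = [
    ("C01", "MEDIUM", 0.0, True),
    ("C02", "HIGH", 0.13, False),  # rounded value from the reference run
    ("C03", "LOW", 0.0, True),
    ("C04", "MEDIUM", 0.0, True),
    ("C05", "HIGH", 0.0, True),
    ("C06", "MEDIUM", 0.0, True),
    ("C07", "MEDIUM", 0.0, True),
    ("C08", "LOW", 0.0, True),
]

tensions = [row[2] for row in rows]
overall_mean = mean(tensions)
overall_median = median(tensions)
coherent_rate = sum(1 for row in rows if row[3]) / len(rows)

# Group tensions by ground-truth bucket, then average each group.
by_bucket = defaultdict(list)
for _, bucket, tension, _ in rows:
    by_bucket[bucket].append(tension)
bucket_means = {bucket: mean(values) for bucket, values in by_bucket.items()}
```

The HIGH bucket contains only C02 and C05, so its mean is half of the C02 tension, while the MEDIUM and LOW buckets average to zero.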
The corresponding plots stored in Q091A3.png show these patterns visually:
- The top panel displays T_ECS_range for each item id C01–C08.
  Only C02 shows visible tension; the rest are at zero under this metric.
- The bottom panel shows mean T_ECS_range per ground-truth bucket, highlighting that tension appears only in the HIGH bucket for this particular run.
These screenshots are provided as visual context only.
Future runs with different models or prompts are expected to change the numbers while preserving the basic pattern
(near-zero tension for self-consistent items, visible spikes when ranges and narratives diverge).
Status: design-only at this stage. The protocol below describes the intended MVP.
A dedicated notebook Q091_B.ipynb will be added when the first implementation is ready.
Equilibrium climate sensitivity links physical constraints and policy narratives.
Given a fixed sensitivity window and a simple emissions scenario, a model should be able to tell whether a policy story is physically compatible with long-run temperature outcomes.
We ask:
If we feed a model small scenario summaries in which
- a sensitivity range is explicitly stated, and
- one or more policy narratives are proposed,
can we define a policy tension observable called T_ECS_policy that responds when the narrative contradicts basic implications of the stated sensitivity?
The B-notebook will construct a small bank of synthetic scenarios.
Each scenario contains:
- a declared equilibrium sensitivity interval [S_min, S_max];
- a very simple description of emissions or forcing trajectories over the next century (for example “peaking in 2035 then linearly declining to net-zero by 2080”);
- one or more policy narratives, such as:
  - “this pathway keeps warming well below two degrees”;
  - “this pathway stabilizes climate at current levels”;
  - “this pathway accepts high-risk warming”.
For each scenario an external ground-truth label is prepared by hand:
- NARRATIVE_REALISTIC when the narrative matches the rough implications of the sensitivity and trajectory;
- NARRATIVE_UNREALISTIC when the narrative clearly understates or overstates long-run outcomes;
- a short human justification string.
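Since Experiment B is still in design, the following is only one possible layout for a scenario record. The class names, field names, the example sensitivity interval, and the example label are all assumptions for illustration; only the two label strings and the quoted trajectory and narrative wording come from the text.

```python
from dataclasses import dataclass
from typing import List

VALID_LABELS = {"NARRATIVE_REALISTIC", "NARRATIVE_UNREALISTIC"}


@dataclass(frozen=True)
class Narrative:
    text: str
    label: str          # hand-assigned ground-truth label
    justification: str  # short human justification string


@dataclass(frozen=True)
class Scenario:
    scenario_id: str
    s_min: float        # declared sensitivity interval, °C per CO2 doubling
    s_max: float
    trajectory: str     # plain-text emissions or forcing description
    narratives: List[Narrative]


def validate(scenario: Scenario) -> bool:
    """Check the interval ordering and that every narrative carries a known label."""
    labels_ok = all(n.label in VALID_LABELS for n in scenario.narratives)
    return scenario.s_min <= scenario.s_max and bool(scenario.narratives) and labels_ok


# Illustrative record; the interval and the label are hypothetical, not project data.
EXAMPLE = Scenario(
    scenario_id="B01",
    s_min=2.5,
    s_max=4.0,
    trajectory="peaking in 2035 then linearly declining to net-zero by 2080",
    narratives=[
        Narrative(
            text="this pathway stabilizes climate at current levels",
            label="NARRATIVE_UNREALISTIC",
            justification="Emissions continue for decades, so warming keeps rising.",
        )
    ],
)
```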
For each scenario the protocol has two planned steps.
The model receives the scenario and is asked to:
- classify each narrative as realistic or unrealistic at a coarse level;
- briefly justify each classification in natural language;
- assign a qualitative risk level (LOW, MEDIUM, HIGH) to the overall pathway.
A follow-up call receives only the justifications and is asked to check for internal contradictions across narratives.
For example, the probe should detect if the model treats two mutually incompatible stories as simultaneously realistic.
A judge prompt will then assign three scores between 0 and 1 for each narrative:
- physics_consistency – agreement between the model verdict and the ground-truth label;
- sensitivity_awareness – whether the justification explicitly uses the stated sensitivity range or simply ignores it;
- cross_story_stability – how well the justifications remain logically compatible when several narratives are present.
From these, the notebook will define a policy tension observable T_ECS_policy.
In plain text:
- T_ECS_policy increases when physics_consistency is low;
- and when sensitivity_awareness or cross_story_stability are low.
Narratives where the model approves unrealistic stories or fails to engage with the sensitivity range
should therefore show higher policy tension even if the prose sounds confident.
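By analogy with the range tension, the policy tension could be sketched as a weighted sum of score shortfalls. Both the functional form and the weights `C_PHYS`, `C_AWARE`, and `C_STABLE` are hypothetical, since the B-notebook has not been implemented yet.

```python
# Hypothetical weights that sum to 1 so the result stays in [0, 1].
C_PHYS, C_AWARE, C_STABLE = 0.5, 0.25, 0.25


def t_ecs_policy(physics_consistency: float, sensitivity_awareness: float,
                 cross_story_stability: float) -> float:
    """Weighted shortfall of the three judge scores for one narrative."""
    scores = (physics_consistency, sensitivity_awareness, cross_story_stability)
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError("judge scores must lie between 0 and 1")
    return (
        C_PHYS * (1.0 - physics_consistency)
        + C_AWARE * (1.0 - sensitivity_awareness)
        + C_STABLE * (1.0 - cross_story_stability)
    )
```

Under this form, a verdict that matches the ground truth but ignores the stated range still carries nonzero tension, which is the behaviour the paragraph above asks for.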
Once implemented, we expect to see:
- For simple and internally consistent scenarios, the model should correctly reject unrealistic narratives and cite the sensitivity window in its reasoning.
  These cases should show low T_ECS_policy.
- For more subtle scenarios, such as relatively high sensitivity paired with optimistic narratives, the model may still mark the story as realistic while barely using the numeric range.
  These should reduce sensitivity_awareness and raise tension.
Aggregating over scenarios will provide a coarse scalar signal for how well policy narratives respect fixed sensitivity constraints at the effective layer.
Reproduction steps for B will mirror Experiment A: open Q091_B.ipynb, inspect the scenario definitions, decide whether to supply an API key via runtime prompt, then run and compare your tension statistics with the documented pattern.
The TU Q091 S-problem treats equilibrium climate sensitivity as one axis of physical tension that connects scientific evidence and policy narratives.
This MVP page is a first small step toward that definition at the effective layer.
- Experiment A focuses on sensitivity ranges and defines the range tension observable T_ECS_range based on internal consistency between numbers, explanations, and a fixed reference window.
- Experiment B (planned) focuses on policy narratives under fixed sensitivity constraints and defines the policy tension observable T_ECS_policy.
Both experiments are designed to fit inside single-cell notebooks with roughly 300 lines of code.
The emphasis is on transparent reasoning that others can inspect, rerun, and modify.
For broader context you can return to:
- Experiments index for the list of TU experiments;
- Event Horizon (WFGY 3.0) for the main entry point and narrative overview of the Tension Universe project.
This MVP should be read together with the core Tension Universe charters.
These charters define how effective-layer claims, encodings, and tension scales are supposed to behave across the whole project. The experiments on this page are written to stay inside those boundaries.


