Training

This document covers the full training infrastructure for WikiOracle's NanoChat integration: how raw data is preprocessed into structured training examples, how DegreeOfTruth governs learning, the dynamic systems interpretation, and the online training pipeline that ties it all together.

DegreeOfTruth (DoT)

DegreeOfTruth is a bipolar scalar on −1 .. +1 that measures how well a user's TruthSet agrees with the server's collected truth.

DegreeOfTruth = 2 × mean(agreement_i) − 1    for shared entries

where:

agreement_i = 1 − |server_trust_i − client_trust_i| / 2

The bipolar range encodes both true and false statements:

  • DoT = +1: the user's claims fully agree with the server — the exchange is true. Train at full learning rate.
  • DoT = −1: the user's claims fully contradict the server — the exchange is false. Train at full learning rate (learning what is not true is as valuable as learning what is true).
  • DoT ≈ 0: no shared entries, or perfect cancellation — nothing to learn. Skip training.

Both poles (+1 and −1) train at full strength via |DoT|; only the zero crossing results in a skip. The sign encodes direction (agree/disagree), not magnitude.
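The formulas above can be sketched in a few lines of Python (a hypothetical helper — the real scoring code lives in the WikiOracle server, and entry matching is omitted):

```python
def degree_of_truth(shared):
    """Compute DoT from (server_trust, client_trust) pairs, each in [-1, +1].

    Returns a value in [-1, +1]; an empty list means no shared entries,
    which the pipeline treats as "nothing to learn" (training is skipped).
    """
    if not shared:
        return 0.0
    # agreement_i = 1 - |server_trust_i - client_trust_i| / 2
    agreements = [1.0 - abs(s - c) / 2.0 for s, c in shared]
    # DoT = 2 * mean(agreement_i) - 1
    return 2.0 * (sum(agreements) / len(agreements)) - 1.0
```

Full agreement, e.g. (1.0, 1.0), yields +1; full contradiction, e.g. (1.0, −1.0), yields −1; mixed evidence lands in between and cancels toward 0.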

This is a placeholder. A future version should incorporate compute_derived_truth (operator propagation) before comparison.

Dynamic Systems Perspective

With a bipolar DegreeOfTruth (−1 .. +1), the training loop forms a dynamic equation with both poles and zeros: DoT = +1 and DoT = −1 are attracting poles where the system learns at full strength (truths and refuted falsehoods respectively), while DoT = 0 is a zero — an equilibrium point where no learning occurs.

This structure resembles a Hopfield network, where the energy landscape has stable attractors (memorised patterns) and unstable saddle points. In our system:

  • The TruthSet plays the role of the weight matrix, encoding the collective memory of what is true and what is false.
  • Each training step is a state transition that pushes the model toward one of the attractors (truth or refutation).
  • The zero crossing (DoT ≈ 0) is the energy barrier between attractors — the point of maximum uncertainty where the system has insufficient signal to commit to either direction.

As the server TruthSet accumulates entries from multiple users, the poles and zeros of this dynamic equation shift. The slow‑moving average merge ensures that attractors are stable under small perturbations (anti‑capture), while strong consensus can still move the landscape over time. This is analogous to the annealing process in Hopfield networks, where the energy landscape gradually settles into deeper minima as more patterns are stored.

Zero-Mean Balance

The DoT-annealed training algorithm has a zero-mean balance property: over many interactions with diverse DoT values, the expected net gradient contribution tends to zero. This arises because:

  • DoT = +1 and DoT = −1 both train at full strength but push in opposite directions (learning truth vs. learning falsehood).
  • DoT near 0 contributes nearly nothing (the sigmoid zero-crossing).
  • Over a diverse population of users, the average DoT tends toward zero if the TruthSet is balanced.

This is directly analogous to the symmetric energy wells of Hopfield networks. In a Hopfield network, stored patterns and their complements are both stable attractors. In our system, truths and refuted falsehoods are both stable attractors, and the zero-crossing is the energy barrier between them.

The EMA weight anchoring adds a third stabilizing force: regardless of the DoT distribution, the model is always gently pulled back toward its checkpoint state. This is analogous to the temperature decay in simulated annealing — as training progresses, the system becomes increasingly resistant to perturbation while allowing deeper learning within established wells.

Sigmoid Warmup as Annealing Schedule

The sigmoid warmup schedule serves as a cooling schedule for the annealing process. When online training is first enabled:

  1. The TruthSet is empty or sparse — DoT values are unreliable.
  2. The warmup suppresses the learning rate to near-zero.
  3. As interactions accumulate, the TruthSet gains signal.
  4. The warmup ramps up, allowing the model to learn from now-reliable DoT values.
  5. At steady state (step >> warmup_steps), the warmup factor is ~1.0 and training proceeds at full configured strength.

This mirrors the annealing schedule in physical systems: high temperature (high exploration, low commitment) → low temperature (low exploration, high commitment to learned patterns).

Sensation — Preprocessing Pipeline

bin/sensation.py transforms plain-text conversations into XML-tagged training data so that NanoChat learns WikiOracle's structured protocol. The name follows the epistemological pipeline: Sensation → Perception → Cognition — raw input data (sensation) is structured and tagged before it reaches the model (cognition).

Sensation handles both batch retagging of the NanoChat SFT corpus (identity_conversations.jsonl, ~14K conversations) and dynamic training examples from the online training pipeline.

JSONL Record Types

The output JSONL uses four record types, splitting the client identity into separate User and Server records:

| Type | Tag | Purpose |
| --- | --- | --- |
| user | <User> | User identity — username, uid, timestamp |
| server | <Server> | Server identity — name, version, timestamp |
| conversation | <Conversation> | Messages with <Q> (query) and <R> (response) pairs |
| truth | <Truth> | Extracted factual claims with trust and spacetime |

Inside message content, user messages are wrapped in <Q>...</Q> and assistant messages in <R>...</R>. Each sentence within a message is independently classified as <fact> or <feeling>:

<R><feeling>That is a great question!</feeling>
<fact trust="0.5"><place>Paris</place>Paris is the capital of France.</fact>
<feeling>I hope that helps.</feeling></R>

Korzybski IS Detection

Alfred Korzybski (Science and Sanity, 1933) observed that the English copula "is" conflates several distinct relations:

  • IS of identity: "Socrates is a man"
  • IS of predication: "The sky is blue"
  • IS of existence: "There are eight planets"

Each asserts something verifiable about the world — a fact bound to a specific spacetime context. "The cup is on the table" is only true at a particular place and time; at a different spacetime it may not be.

The heuristic classifier in sensation.py detects these patterns:

| Subtype | Pattern | Example |
| --- | --- | --- |
| Identity | "X is/are [a/an/the] Y" | "Socrates is a man" |
| Predication | "X is/are ADJ" | "The sky is blue" |
| Existence | "there is/are X" | "There are 8 planets" |
| Mereological | "X contains/includes Y" | "Water contains hydrogen" |
| Quantity | "X has/have N Y" | "I have 3 cats" |
| Definition | "X is called/known as Y" | "A polygon is defined as..." |

Sentences without an IS pattern, or with subjective markers ("I think", "maybe", "might be"), questions, or meta-discourse ("That's a great question") default to <feeling>. Auto-detected facts receive trust=0.5 — a conservative default that flags them for future verification. Spatiotemporal binding is expressed via <place> and <time> child elements inside <fact> or <feeling> tags.
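A minimal sketch of such a heuristic follows. This is illustrative only — the actual patterns, ordering, and edge-case handling in sensation.py may differ:

```python
import re

# Subjective markers force <feeling> regardless of any IS pattern.
SUBJECTIVE = re.compile(r"\b(i think|i feel|maybe|might be|probably)\b", re.I)

# Ordered IS patterns: more specific shapes are tried first.
IS_PATTERNS = [
    ("existence",   re.compile(r"\bthere (is|are)\b", re.I)),
    ("identity",    re.compile(r"\b\w+ (is|are) (a|an|the) \w+", re.I)),
    ("predication", re.compile(r"\b\w+ (is|are) \w+", re.I)),
]

def classify(sentence):
    """Return ("fact", subtype) or ("feeling", None) for one sentence."""
    if sentence.strip().endswith("?") or SUBJECTIVE.search(sentence):
        return ("feeling", None)
    for subtype, pattern in IS_PATTERNS:
        if pattern.search(sentence):
            return ("fact", subtype)
    return ("feeling", None)
```

In the real pipeline, sentences classified as fact would then receive the conservative trust=0.5 default described above.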

Usage

Batch-convert a corpus:

make preprocess
# or: python bin/sensation.py corpus input.jsonl output.jsonl

Classify a single sentence:

python bin/sensation.py tag "Paris is the capital of France."
# → Classification: fact (identity)
# → Tagged: <Q><fact trust="0.5">Paris is the capital of France.</fact></Q>

In the online training pipeline, response.py calls preprocess_training_example() automatically before each /train POST, so all training examples are XML-tagged without manual intervention.

Online Training

Purpose

Enable continuous online training of NanoChat where every interaction can trigger a weight update. The learning rate is governed by DegreeOfTruth (see above) which evaluates how well the user's claims fit the collective evidence.

WikiOracle accomplishes this by:

  1. Maintaining a server‑owned TruthSet (truth.xml) that accumulates facts from all users.
  2. Computing a DegreeOfTruth (−1 .. +1) per interaction.
  3. Using |DegreeOfTruth| to modulate the learning rate of a one‑step online training pass in NanoChat.

Conversations are not stored on the server. Only knowledge truth entries are retained. The server TruthSet contains knowledge facts, operators, authorities, and references — no feelings, provider entries, or news facts (spatiotemporally bound observations).

News facts (those with <place> or <time> child elements carrying real values) are session-only. Persisting them would risk "worldline capture" — an observer could reconstruct a user's physical trajectory through spatiotemporal observations. The filter_knowledge_only() function in bin/truth.py enforces this boundary.

User Identity

Each user is identified by a pseudonymous GUID derived deterministically from client_name in the state file (UUID‑5 in the WikiOracle namespace). This GUID is stored at the root level of the user's state and used internally when merging truth entries into the server table.
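The derivation can be sketched with Python's standard uuid module. The namespace constant below is a placeholder — the real WikiOracle namespace UUID is defined in the codebase:

```python
import uuid

# Placeholder namespace for illustration; WikiOracle defines its own.
WIKIORACLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "wikioracle.example")

def user_guid(client_name: str) -> str:
    """Deterministic pseudonymous GUID: same client_name, same GUID."""
    return str(uuid.uuid5(WIKIORACLE_NAMESPACE, client_name))
```

Because UUID-5 is a deterministic hash of (namespace, name), the server can recognize returning users without ever storing the client_name itself.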

Pipeline

The pipeline is staged so the user gets a response first; truth merging and training happen after the response is delivered.

Stage 1 — Respond

  1. Receive the user's query and TruthSet.
  2. Use truth entries for RAG as usual.
  3. Return the response to the user.

Stage 2 — Compute DegreeOfTruth

  1. Score the user's TruthSet against the server's current truth (formula above).

Stage 3 — Update server TruthSet

  1. Filter client truth per the Entanglement Policy (doc/Entanglement.md):
    • When store_concrete is false (default), only universal facts pass through (filter_knowledge_only()).
    • Entries with identifiable content are always filtered regardless of store_concrete (detect_identifiability()).
    • Feelings never reach the merge (_is_server_storable() rejects them).
    • Operators may compose facts or feelings as operands.
    • Feelings are also stripped from training messages (strip_feelings_from_training() in bin/sensation.py).
  2. Merge the surviving truth entries into the server TruthSet (truth.xml):
    • Match found: nudge the server entry's trust toward the incoming value using a slow‑moving average: server_trust += merge_rate × (client_trust − server_trust)
    • No match: insert the entry with the stated trust value.
    • Entries are restricted to facts, operators, authorities, and references. Feelings and provider entries are not stored.
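The merge rule for a matched entry reduces to a single expression (sketch; merge_rate defaults to 0.1 per the Anti-Capture section below):

```python
def merge_trust(server_trust: float, client_trust: float,
                merge_rate: float = 0.1) -> float:
    """Nudge the server entry's trust toward the client's stated value."""
    return server_trust + merge_rate * (client_trust - server_trust)
```

A single dissenting user moves a well-established entry only slightly — e.g. merge_trust(0.8, -1.0) gives 0.62 — so consensus shifts only as contradicting evidence accumulates across users.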

Stage 3a — TruthSet Trimming

When the server TruthSet exceeds truth_max_entries (default 1000), the merge step trims the table by removing entries with |trust| closest to 0.0 (no information value), keeping entries with highest |trust| (strongest signal, positive or negative). This is checked during the merge step, not as a separate operation. The trim count is logged.

The truth_max_entries parameter is configurable per-user via the Settings dialog (see UserInterface.md) and defaults to 1000 in the chat config.
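The trimming rule can be sketched as a sort on |trust| (a hypothetical shape — real entries are XML elements in truth.xml, not dicts):

```python
def trim_truthset(entries, truth_max_entries=1000):
    """Keep the truth_max_entries entries with the largest |trust|.

    Entries with |trust| near 0 carry no information and are dropped first;
    strong signal (positive or negative) is retained.
    """
    if len(entries) <= truth_max_entries:
        return entries
    ranked = sorted(entries, key=lambda e: abs(e["trust"]), reverse=True)
    return ranked[:truth_max_entries]
```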

Stage 4 — Tag and Train

  1. Preprocess the training example through Sensation (sensation.py) to add <Q>/<R> and <fact>/<feeling> XML tags.
  2. Strip <feeling> blocks from training messages (strip_feelings_from_training() in bin/sensation.py) — feelings must never train model parameters (Entanglement policy, doc/Entanglement.md).
  3. Call NanoChat's online training endpoint (POST /train) with the tagged prompt, DegreeOfTruth, and training hyperparameters.

Device Configuration

Online training runs on the device specified by server.training.device in the config file (config.xml). Valid values:

  • cpu (default) — safe for the WikiOracle production server
  • cuda — use NVIDIA GPU if available
  • mps — Apple Metal Performance Shaders (macOS)
  • auto — probe CUDA → MPS → CPU and use the best available

The model is moved to the training device for the gradient step, then moved back to the inference device afterward.
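The auto probe order can be sketched as follows. Capability flags are passed in so the sketch stays framework-free; the real code would query torch at each step (e.g. torch.cuda.is_available() and torch.backends.mps.is_available()):

```python
def resolve_device(setting: str, cuda_ok: bool = False,
                   mps_ok: bool = False) -> str:
    """Resolve server.training.device: explicit values pass through,
    "auto" probes CUDA -> MPS -> CPU."""
    if setting != "auto":
        return setting
    if cuda_ok:   # in practice: torch.cuda.is_available()
        return "cuda"
    if mps_ok:    # in practice: torch.backends.mps.is_available()
        return "mps"
    return "cpu"
```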

Training Algorithm — DoT-Annealed Selective Training

The /train endpoint (bin/nanochat_ext.py) implements a carefully designed online training algorithm that prevents weight collapse while enabling meaningful learning from each interaction.

Consistent AdamW Optimizer

All parameter groups use AdamW consistently. The batch training regime uses Muon (Newton-Schulz orthogonalization) for transformer matrices, but online training uses AdamW for all groups because:

  1. No persistent optimizer state — the optimizer is created fresh each /train call. Muon's advantage comes from accumulated momentum; without persistent state, its Newton-Schulz orthogonalization provides limited benefit.
  2. Effectively sign-SGD — AdamW without momentum history is effectively sign-SGD: simpler, faster, well-understood.
  3. Independent interactions — each /train call is fully independent. A bad gradient cannot poison future steps because there is no momentum history to contaminate. The effective step size is bounded by lr.

Parameter Groups

The optimizer creates six parameter groups that mirror the production training regime in nanochat/nanochat/gpt.py:GPT.setup_optimizer():

| Group | Base LR | Parameters |
| --- | --- | --- |
| lm_head | 0.0027 | Output projection weights (language model head) |
| wte | 0.136 | Token embedding weights |
| value_embeds | 0.136 | Value residual stream embeddings |
| resid_lambdas | 0.005 | Per-layer residual scaling scalars |
| x0_lambdas | 0.5 | Skip-connection blending scalars |
| transformer_h | 0.02 | All transformer block matrix parameters (attention, MLP, norms) |

Parameters not matching any named group are included in transformer_h. All groups use the same AdamW optimizer — no Muon.

Learning Rate Modulation

The effective learning rate for each parameter group is computed as:

lr_effective = lr_base × lr_scale

where:

lr_scale = (truth_weight × |DoT| + (1 - truth_weight)) × sigmoid_warmup(step)

The truth_weight parameter (0.0–1.0) controls how much DoT gates the learning rate:

| truth_weight | lr_effective | Behavior |
| --- | --- | --- |
| 0.0 | lr_base × warmup | Vanilla SFT — full LR regardless of DoT. No truth bias. |
| 0.5 | lr_base × (0.5 × \|DoT\| + 0.5) × warmup | Balanced — half the LR is DoT-gated, half unconditional. |
| 1.0 | lr_base × \|DoT\| × warmup | Fully truth-gated — DoT governs learning completely. |

The truth_weight replaces the former boolean rag checkbox in the Settings dialog. At truth_weight=0, the system trains on everything like vanilla SFT (feelings-only, no truth bias). At truth_weight=1, DoT governs learning completely.

The RAG flag (rag) remains conceptually distinct but now shares the same condition: whenever truth_weight > 0, truth entries both influence training and are sent to the provider as context.
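Put together, the modulation is a one-liner (sketch; warmup is the sigmoid factor from the warmup schedule):

```python
def lr_scale(dot: float, truth_weight: float, warmup: float) -> float:
    """lr_scale = (truth_weight * |DoT| + (1 - truth_weight)) * warmup"""
    return (truth_weight * abs(dot) + (1.0 - truth_weight)) * warmup
```

At truth_weight=0 the scale ignores DoT entirely; at truth_weight=1 it tracks |DoT| exactly. Note that only the magnitude of DoT enters — the sign (agree vs. disagree) never changes the step size.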

Sigmoid Warmup

The sigmoid warmup schedule prevents early random updates from corrupting the model when online training is first enabled:

sigmoid_warmup(step) = 1 / (1 + exp(-k × (step - midpoint)))

| Step | Warmup value | Effect |
| --- | --- | --- |
| 0 | ~0.007 | Nearly zero — first interactions barely train |
| midpoint | 0.5 | Half strength |
| 2 × midpoint | ~0.993 | Nearly full strength — training ramp is complete |

The warmup_steps parameter (default 50) is the sigmoid midpoint. The steepness parameter k is fixed at 0.1.

This schedule ensures that the first few interactions after enabling online training have minimal weight impact, allowing the TruthSet to accumulate signal before the model commits to learning from it.
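With the documented constants (k = 0.1, midpoint = warmup_steps = 50), the schedule reproduces the values in the table above:

```python
import math

def sigmoid_warmup(step: int, warmup_steps: int = 50, k: float = 0.1) -> float:
    """Logistic ramp: ~0.007 at step 0, 0.5 at the midpoint,
    ~0.993 at twice the midpoint."""
    return 1.0 / (1.0 + math.exp(-k * (step - warmup_steps)))
```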

Gradient Clipping

After the backward pass, gradients are clipped to prevent catastrophic single-step weight changes:

torch.nn.utils.clip_grad_norm_(all_params, max_norm=grad_clip)

The grad_clip parameter (default 1.0) bounds the global gradient norm. The actual gradient norm after clipping is returned in the /train response as grad_norm for monitoring.

If grad_norm consistently hits the clip threshold, this indicates either: (a) the learning rate is too high for the current data, or (b) the training example is an outlier. Both are handled gracefully by clipping — the step still proceeds, just with bounded magnitude.

EMA Weight Anchoring (Anti-Capture)

After each optimizer step, model weights are blended back toward the checkpoint anchor using exponential moving average:

for p, anchor in zip(model.parameters(), anchor_params):
    p.data.lerp_(anchor, anchor_effective)

where:

anchor_effective = anchor_decay × truth_weight

The anchor weights are the original checkpoint weights, initialized on the first /train call. The anchor_decay parameter (default 0.001) controls the blend-back rate.

Key properties:

  • truth_weight=0: anchor_effective = 0 — no anchor pull. Weights drift freely (vanilla SFT mode).
  • truth_weight=1: anchor_effective = anchor_decay — full anchor pull. Weights are always tugged back toward the checkpoint.
  • The anchor prevents gradual drift while allowing meaningful learning. Over many steps, the weights can move significantly from the checkpoint, but each individual step is bounded.
  • Reset via make nano_restart — reloads both the model and the anchor, resetting the training state.

The anchor is stored in app.state.anchor_params as a list of cloned parameter tensors. The step count is stored in app.state.train_step_count.

Training Flow (Per /train Call)

POST /train
  |- Clamp DoT to [-1, +1], truth_weight to [0, 1]
  |- If truth_weight > 0 and |DoT| < 1e-6 -> skip (no signal)
  |- Acquire model worker from pool
  |- Tokenize messages (bos + user_start/end + assistant_start/end)
  |- If < 2 tokens -> skip (empty)
  |- Initialize EMA anchor weights (first call only)
  |- Move model to training device
  |- Build 6 parameter groups (fresh each call)
  |- Compute lr_scale = (tw x |DoT| + (1 - tw)) x sigmoid_warmup(step)
  |- Scale each group's LR: lr_base x lr_scale
  |- Create fresh AdamW optimizer
  |- Forward pass -> cross-entropy loss
  |- Backward pass
  |- Gradient clipping -> grad_norm
  |- Optimizer step
  |- EMA anchor blend-back (if truth_weight > 0)
  |- Increment step count
  |- Move model back to inference device
  `- Return {status, loss/gain, step, grad_norm}

The response uses key "loss" for positive DoT (learning truth) and "gain" for negative DoT (learning falsehood). Both are the cross-entropy loss value; the key name indicates the semantic direction.
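The key selection can be sketched as follows (hypothetical helper — the actual response assembly lives in bin/nanochat_ext.py, and the "status" value shown here is an assumption):

```python
def train_response(dot: float, ce: float, step: int, grad_norm: float) -> dict:
    """Report the cross-entropy value as "loss" for DoT >= 0
    (learning truth) and as "gain" for DoT < 0 (learning falsehood)."""
    key = "loss" if dot >= 0 else "gain"
    return {"status": "trained", key: ce, "step": step, "grad_norm": grad_norm}
```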

Configuration Summary

| Parameter | Config path | Default | Description |
| --- | --- | --- | --- |
| truth_weight | server.truthset.truth_weight | 0.7 | DoT gating strength (0 = vanilla SFT, 1 = full DoT) |
| warmup_steps | server.training.warmup_steps | 50 | Sigmoid warmup midpoint |
| grad_clip | server.training.grad_clip | 1.0 | Max gradient norm |
| anchor_decay | server.training.anchor_decay | 0.001 | EMA blend-back rate |
| truth_max_entries | server.training.truth_max_entries | 1000 | Max server TruthSet size |
| device | server.training.device | "cpu" | Training device |

See Config.md §Server for the full configuration reference.

SFT Corpus Preparation

The prepare_sft_corpus() function in bin/sensation.py provides batch conversion of the NanoChat SFT corpus to XML-tagged format with feelings stripped. This is the batch equivalent of preprocess_training_example() and is used when retraining NanoChat from scratch with WikiOracle's protocol:

make build_sft
# or: python bin/sensation.py sft input.jsonl output.jsonl

Each input line is a JSON array of {"role": ..., "content": ...} message dicts. Output lines are the same shape but with content XML-tagged using <Q>/<R> and <fact>/<feeling> tags, with feelings stripped (matching the online training pipeline).

Server TruthSet

The server TruthSet is stored as truth.xml in the same XML format used for state files (WikiOracle State). Each stored entry is a typed truth element:

<fact id="..." title="..." DoT="0.8" time="..." place="Athens">
  All men are mortal.
</fact>

Resolution: Before entries reach the server TruthSet, they are resolved by resolve_entries() in bin/truth.py:

  • <reference> → <fact src="domain">text</fact> (domain preserved for deeper lookup)
  • <authority> → list of <fact src="domain">content</fact> (fetched from remote, trust scaled)
  • <provider> → <feeling> (provider responses are treated as feelings until providers can report truth claims with DoT)
  • <fact>, <feeling>, <logic> → pass through unchanged

Entry types stored after resolution: <fact> (knowledge — no <place>/<time> children) and <logic> entries (operators wrapped in <logic><and|or|not|non>...</logic>).

Entry types not stored: <feeling>, <provider>, unresolved <reference>, unresolved <authority>, and (when store_concrete is false) news facts with <place> or <time> child elements. Content matching identifiability patterns (PII) is always excluded.

News facts are spatiotemporally bound — persisting them risks worldline capture. Whether to store them is a user choice controlled by store_concrete in config.xml (default false), consistent with Zero-Knowledge / Selective Disclosure principles. See filter_knowledge_only() in bin/truth.py and detect_identifiability() for PII detection. See doc/Entanglement.md for the full policy.

The server TruthSet includes the user GUID as a trust entry, so the server can track per‑user trust alongside factual claims.

Anti‑Capture

The online training system has multiple layers of defense against weight collapse and capture by any single user:

TruthSet level:

  • Entries are merged with a slow‑moving average (merge_rate, default 0.1), so no single user can instantly override collective truth.
  • Disproven entries naturally drift toward −1 as contradicting evidence accumulates from other users.
  • TruthSet trimming (truth_max_entries, default 1000) removes low-signal entries (|trust| near 0), keeping the TruthSet focused on strong signal.

Training level:

  • DegreeOfTruth gates the learning rate — claims that diverge from consensus (DoT near 0) have minimal training impact.
  • Gradient clipping (grad_clip, default 1.0) prevents any single training step from making catastrophic weight changes.
  • EMA weight anchoring (anchor_decay, default 0.001) continuously blends weights back toward the checkpoint, preventing gradual drift. The anchor pull is scaled by truth_weight, so at full truth-gating the model is always tugged back toward its initialized state.
  • Sigmoid warmup (warmup_steps, default 50) prevents the first few interactions from corrupting weights before the TruthSet has accumulated sufficient signal.
  • Fresh optimizer per call — no persistent momentum state means a bad gradient cannot poison future steps. Each interaction is independent.
  • Consistent AdamW — without accumulated Muon state, AdamW is effectively sign-SGD with bounded step size.

Operational level:

Manual rollback is available via Makefile targets:

  • make checkpoint-pull — rsync SFT checkpoints from the remote WikiOracle server to output/checkpoints/ for safekeeping.
  • make checkpoint-push — restore checkpoints from backup, then make wo-restart to reload weights.

The intended workflow: pull a checkpoint before enabling online training, then push to restore if capture is detected.

Dissonance Detection and Pluralistic Truth (TODO)

Detecting and resolving dissonance within the server TruthSet is left for future work. The goal is to support a higher‑dimensional truth‑space where contradictory claims can coexist when they originate from different perspectives.

For example: "the world was created in seven days" and "the world was created over millions of years" could both be maintained as true from their respective perspectives. This requires embedding perspective alongside truth value so that the TruthSet becomes a manifold rather than a flat list.

In the current consensus model, a DoT ≈ 0 simply means "nothing to learn" and the training step is skipped. In a future pluralistic model — where the same claim can be true in context c_1 and false in context c_2 — a DoT of 0 may instead indicate that user feedback is needed to disambiguate which context applies before training should proceed.

Possible approaches include context/perspective tags on entries, truth‑space embeddings with frame clustering, conditional truth values indexed by worldview, or explicit user prompts to resolve ambiguity when the TruthSet produces conflicting signals.

OpenClaw Integration

OpenClaw is WikiOracle's multi-channel front-end (Slack, Discord, Telegram, etc.). The TypeScript extension at openclaw/extensions/wikioracle/ registers WikiOracle as a native OpenClaw provider by spawning bin/wo for each query.

The extension provides three capabilities:

  1. Provider — WikiOracle appears in OpenClaw's provider selector alongside OpenAI, Anthropic, etc. Selecting it routes all messages through the WikiOracle server's full pipeline.

  2. Command (/wo <message>) — Direct CLI-style access from any OpenClaw channel, bypassing the agent/LLM layer.

  3. Tool (wikioracle_query) — Lets OpenClaw agents invoke WikiOracle programmatically during agentic runs.

The message flow through the training pipeline:

  1. OpenClaw receives a message from any channel (Slack/Discord/etc.).
  2. The wikioracle extension spawns bin/wo with the message.
  3. bin/wo sends POST /chat to the WikiOracle server.
  4. WikiOracle responds (Stage 1: provider routing, truth RAG).
  5. Stages 2–4 execute server-side: DoT computation, truth merge, Sensation preprocessing, and NanoChat online training.
  6. bin/wo prints the response to stdout; the extension captures it and relays it back to the originating channel.

In stateful mode (default), the WikiOracle server owns session state — conversation context persists across messages without client-side storage. In stateless mode, state is serialized to a local XML file via bin/wo -f.

The training pipeline treats OpenClaw messages identically to direct HTTP or web UI clients — the same DoT computation, TruthSet merge, Sensation preprocessing, and /train call apply. The bin/wo CLI is transparent to the training system.

Extension configuration

// In OpenClaw's config file (~/.openclaw/config.json5 or project-level)
{
  plugins: {
    entries: ["wikioracle"],
    wikioracle: {
      woPath: "/path/to/WikiOracle/bin/wo",
      serverUrl: "https://127.0.0.1:8888",
      insecure: true,    // skip TLS verification for local dev
      stateful: true,    // server owns session state
      token: "...",      // optional bearer token
    },
  },
}

See openclaw/extensions/wikioracle/openclaw.plugin.json for the full config schema.

Server TruthSet Visibility (Debug Mode)

When DEBUG_MODE is enabled and online training is active, the /chat response includes a server_truth key containing the server's TruthSet entries formatted as authority-tagged entries:

{
  "server_truth": [
    {
      "type": "authority",
      "id": "entry-id",
      "title": "...",
      "trust": 0.8,
      "content": "<fact>...</fact>",
      "source": "wikioracle",
      "_server_origin": true
    }
  ]
}

The client displays these entries in the TruthSet with a visual marker (server badge, different border style). Entries with _server_origin: true are stripped from the TruthSet before sending queries to prevent loopback — the client never sends server truth back to the server.
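The client-side loopback filter amounts to a single list comprehension (sketch, assuming entries are dicts shaped like the JSON above):

```python
def strip_server_entries(truthset: list) -> list:
    """Drop entries the server injected so they are never echoed back."""
    return [e for e in truthset if not e.get("_server_origin")]
```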

The server_id field in config.xml under <server> provides a stable identifier for this server instance (default: "wikioracle").