Feat/stand alone demo #30

Draft

hanneshapke wants to merge 27 commits into main from feat/stand-alone-demo

Conversation

@hanneshapke
Collaborator

No description provided.

hanneshapke and others added 9 commits March 26, 2026 14:46
- Switch subject model from Gemma-3-4B to Nemotron-3-Nano-30B
- Load SAE from HuggingFace Hub (davidnet/kiji-inspector-...) via
  SAE.from_pretrained(repo_id=...) instead of local file paths
- Add build_ui_data() to transform SAE analysis into UI-ready JSON
- Add index.html: interactive explainer showing tool decisions,
  SAE feature bars, comparison chart, and contrast theme cards
- HTML loads output/ui_data.json (live data) with mock fallback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The final recommendation prompt concatenated all 12 prior analyses,
creating a sequence too long for an 80GB GPU with the 30B model.
- Per-tool generation: 300 → 150 tokens
- Final prompt context: truncated to last 4000 chars
- Final generation: 800 → 500 tokens

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
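The context cap described above can be sketched as follows; the helper name and constant are illustrative assumptions, not this PR's actual code:

```python
# Illustrative sketch of the context cap; names are assumptions,
# not the code in this PR.
MAX_CONTEXT_CHARS = 4000  # final prompt keeps only this much tail

def build_final_context(analyses: list[str]) -> str:
    context = "\n\n".join(analyses)
    # Keep only the last 4000 characters so the final recommendation
    # prompt still fits the 30B model on a single 80GB GPU.
    return context[-MAX_CONTEXT_CHARS:]
```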
The naive Mamba path (no causal-conv1d on aarch64) materializes huge
intermediate tensors, and the long Phase 4 explanation prompts OOM on
a single 80GB GPU. Phase 4 is now skipped by default; opt in with --explain.

Also restores per-tool generation to 300 tokens and adds
torch.cuda.empty_cache() between phases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Allows using a Docker vLLM server (or any OpenAI-compatible endpoint)
instead of in-process vLLM, bypassing torch/vllm ABI compatibility issues.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
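Talking to such a server can be sketched with the standard OpenAI-compatible /v1/chat/completions route; the base URL and model name below are placeholders, not values from this PR:

```python
import json
from urllib.request import Request

# Hedged sketch: build a request for any OpenAI-compatible
# /v1/chat/completions endpoint (e.g. a Docker vLLM server).
# The base URL and model name passed in are placeholders.
def build_chat_request(base_url: str, model: str, prompt: str) -> Request:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```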
Non-MoE models (like Gemma 4) fail when this flag is set.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@hanneshapke hanneshapke marked this pull request as draft April 3, 2026 21:16
hanneshapke and others added 18 commits April 6, 2026 09:48
Filters out contrastive pairs where anchor_tool == contrast_tool,
as these provide no decision signal for the SAE.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
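A minimal sketch of the filter, assuming the pipeline's pair records carry `anchor_tool` / `contrast_tool` keys (an assumption from the message above, not verified against the code):

```python
# Sketch of the degenerate-pair filter; field names are assumed from
# the commit message, not taken from the actual pipeline code.
def filter_degenerate_pairs(pairs: list[dict]) -> list[dict]:
    # Pairs naming the same tool on both sides carry no decision
    # signal for the SAE, so drop them.
    return [p for p in pairs if p["anchor_tool"] != p["contrast_tool"]]
```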
- Source vllm from nightly cu129 wheel index
- Switch pytorch indexes from cu128 to cu129
- Pin transformers==5.5.0 with uv override for gemma4 support
- Add unsafe-best-match index strategy for cross-index resolution

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The LLM sometimes returns a bare int/string instead of a JSON object.
Now catches ValueError and AttributeError alongside JSONDecodeError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
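The hardened parse can be sketched like this; the helper name is illustrative, and the real code in this PR may differ:

```python
import json

# Sketch of the hardened parse: the LLM sometimes returns a bare
# int or string instead of a JSON object, so catch ValueError and
# AttributeError alongside JSONDecodeError and fall back to empty.
def parse_tool_json(raw: str) -> dict:
    try:
        obj = json.loads(raw)
        return dict(obj.items())  # AttributeError if not a mapping
    except (json.JSONDecodeError, ValueError, AttributeError):
        return {}
```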
Allows running the demo with a locally trained SAE checkpoint
instead of downloading from HuggingFace Hub.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Load contrastive_features.json from pipeline output to annotate each
active SAE feature with its associated training themes (diy_vs_professional,
urgent_vs_planned, etc.) and compute per-step theme activation scores.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Split CrewAI into 4 single-tool agents + direct HFEngine synthesis
  to avoid Nemotron struggling with ReAct multi-step loops
- Custom prompt templates for tool agents, bypassing CrewAI's default
  ReAct scaffolding that confused the local model
- Fix step label detection: match tool names case-insensitively across
  all message content instead of fragile keyword matching
- Use completion-style synthesis prompt ("Complete this sentence:")
  to prevent model from meta-reasoning about format
- Only extract layer 20 activations (the SAE layer), not all 5 layers
- Default --sae-local-dir to output_merged/ for local SAE checkpoint
- Fix contrastive features path lookup for output_merged layout
- UI: rename "What the AI Noticed" to "What was the model thinking"
- UI: move explanation sentence above feature bars
- UI: sort features by activation strength descending
- UI: new index.html design (old saved as index_old.html)
- Generate tool-specific explanations tied to contrastive themes
- Pin vllm to v0.18.0 wheel, cap requires-python <3.14

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The model was echoing CrewAI scaffolding instead of generating a
recommendation. Reworded both synthesis prompts (CrewAI and scripted
flows) to use positive instructions, label research data as "Key
findings" so the model treats it as reference material, and remove
negation-based instructions that backfire on smaller models.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The model was echoing CrewAI ReAct patterns (Thought/Action/Observation)
instead of generating recommendations because the research context
passed to the synthesis prompt contained those scaffolding lines.
Added _strip_scaffolding() to remove them from both the CrewAI and
scripted flows before building the final synthesis prompt.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous approach stripped ReAct line patterns but left behind
CrewAI task descriptions, tool instructions, and expected_output text
embedded in the task output strings. The model was still picking up
on these as instructions. Now _strip_scaffolding() extracts only the
text after "Final Answer:" from each task output, with a fallback to
line-stripping if no Final Answer is found. Also applies stripping
per-task before joining, so each answer is isolated cleanly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
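The extraction-with-fallback described above can be sketched as follows; the real `_strip_scaffolding()` in this PR may differ in detail:

```python
# Hedged sketch of "Final Answer:" extraction with a line-stripping
# fallback; marker and prefixes follow CrewAI's ReAct conventions.
def strip_scaffolding(task_output: str) -> str:
    marker = "Final Answer:"
    if marker in task_output:
        # Keep only the text after the last Final Answer marker,
        # discarding task descriptions and tool instructions.
        return task_output.rsplit(marker, 1)[1].strip()
    # Fallback: drop lines that look like ReAct scaffolding.
    drop = ("Thought:", "Action:", "Observation:")
    lines = [line for line in task_output.splitlines()
             if not line.strip().startswith(drop)]
    return "\n".join(lines).strip()
```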
The model was echoing ReAct scaffolding because the research context
passed to synthesis contained CrewAI task output (full of Thought/Action
patterns and meta-instructions). Now both the CrewAI and scripted flows
build synthesis context directly from the raw tool data dictionaries
(_TOOL_SOURCES), which contain only clean JSON. This completely
eliminates the source of scaffolding contamination.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Nemotron starts a <think> block and meta-reasons about the instructions
instead of writing the recommendation. Fix: append "The agent recommends"
after the chat template so it appears as the start of the assistant's
response. The model must continue the sentence with actual content
instead of reasoning about what to do. Also simplified system prompt
and moved the task instruction into the user message.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
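The pre-fill trick can be sketched as below: append an assistant-side prefix after the rendered chat template so the model must continue the sentence instead of opening a `<think>` block. The function name is illustrative:

```python
# Illustrative sketch of the pre-fill described above. The input is
# assumed to be the string output of something like
# tokenizer.apply_chat_template(messages, add_generation_prompt=True,
# tokenize=False); the helper name is not from this PR.
PREFILL = "The agent recommends"

def build_synthesis_prompt(templated_chat: str) -> str:
    # The prefix appears as the start of the assistant's response,
    # forcing the model to continue it with actual content.
    return templated_chat + PREFILL
```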
Nemotron emits <think>...</think> blocks even with pre-filled assistant
response. Added _clean_model_output() to strip both closed and unclosed
think tags. Also bumped synthesis max_tokens from 120 to 256 so the
recommendation doesn't get cut off mid-sentence.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
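The cleanup can be sketched as follows; the real `_clean_model_output()` may differ, but the idea is to handle closed blocks, an unclosed trailing `<think>`, and a stray closing tag:

```python
import re

# Hedged sketch of think-tag stripping for Nemotron-style output.
def clean_model_output(text: str) -> str:
    # Remove closed <think>...</think> blocks.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # Remove an unclosed <think> and everything after it.
    text = re.sub(r"<think>.*", "", text, flags=re.DOTALL)
    # Drop a stray closing tag leaked past a pre-filled response.
    text = text.replace("</think>", "")
    return text.strip()
```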
The pre-fill bypasses the opening <think> tag but the model still emits
a closing </think>, which leaked into the recommendation text.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>