# CLAUDE.md - Technical Notes for LLM Council

This file contains technical details, architectural decisions, and important implementation notes for future development sessions.

## Project Overview

LLM Council is a 3-stage deliberation system where multiple LLMs collaboratively answer user questions. The key innovation is anonymized peer review in Stage 2, preventing models from playing favorites.

## Architecture

### Backend Structure (`backend/`)

**`config.py`** (sketch below)
- Contains `COUNCIL_MODELS` (list of OpenRouter model identifiers)
- Contains `CHAIRMAN_MODEL` (the model that synthesizes the final answer)
- Uses the environment variable `OPENROUTER_API_KEY` from `.env`
- Backend runs on **port 8001** (NOT 8000 - the user had another app on 8000)
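
A minimal sketch of the shape of `config.py`, assuming python-dotenv handles the `.env` loading; the model identifiers below are illustrative placeholders, not the actual roster:

```python
# backend/config.py -- illustrative sketch; model identifiers are placeholders
import os

from dotenv import load_dotenv

load_dotenv()  # pulls OPENROUTER_API_KEY out of .env into the environment

OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"]

# OpenRouter model identifiers for the council (hypothetical examples)
COUNCIL_MODELS = [
    "openai/gpt-5.1",
    "anthropic/claude-sonnet-4.5",
    "google/gemini-3-pro-preview",
]

# Synthesizes the final answer; may or may not also sit on the council.
# Gemini is the current default per user preference.
CHAIRMAN_MODEL = "google/gemini-3-pro-preview"
```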

**`openrouter.py`** (sketch below)
- `query_model()`: Single async model query
- `query_models_parallel()`: Parallel queries using `asyncio.gather()`
- Returns a dict with `content` and optional `reasoning_details`
- Graceful degradation: returns None on failure and continues with the successful responses
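
A sketch of the parallel-query pattern with graceful degradation; the httpx-based request code and response plumbing are assumptions, but the gather-and-drop-failures shape matches the notes above:

```python
# backend/openrouter.py -- sketch; request details assumed, shape per notes
import asyncio

import httpx

from .config import OPENROUTER_API_KEY

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

async def query_model(model: str, messages: list[dict]) -> dict | None:
    """Query one model; return None on any failure (graceful degradation)."""
    try:
        async with httpx.AsyncClient(timeout=120) as client:
            resp = await client.post(
                OPENROUTER_URL,
                headers={"Authorization": f"Bearer {OPENROUTER_API_KEY}"},
                json={"model": model, "messages": messages},
            )
            resp.raise_for_status()
            message = resp.json()["choices"][0]["message"]
            return {
                "content": message["content"],
                # present only for models that expose reasoning traces
                "reasoning_details": message.get("reasoning_details"),
            }
    except Exception:
        return None  # logged in the real code; never aborts the whole request

async def query_models_parallel(models: list[str], messages: list[dict]) -> dict:
    """Fan out to all models at once and keep only the successful responses."""
    results = await asyncio.gather(*(query_model(m, messages) for m in models))
    return {model: res for model, res in zip(models, results) if res is not None}
```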

**`council.py`** - The Core Logic (parsing and aggregation sketched below)
- `stage1_collect_responses()`: Parallel queries to all council models
- `stage2_collect_rankings()`:
  - Anonymizes responses as "Response A, B, C, etc."
  - Creates the `label_to_model` mapping for de-anonymization
  - Prompts models to evaluate and rank (with strict format requirements)
  - Returns a tuple: (rankings_list, label_to_model_dict)
  - Each ranking includes both the raw text and a `parsed_ranking` list
- `stage3_synthesize_final()`: Chairman synthesizes from all responses + rankings
- `parse_ranking_from_text()`: Extracts the "FINAL RANKING:" section; handles both numbered lists and plain format
- `calculate_aggregate_rankings()`: Computes each response's average rank position across all peer evaluations
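
A sketch of the parsing and aggregation halves; the exact regex and heuristics in the real code may differ, including the fallback scan mentioned under Common Gotchas:

```python
# backend/council.py -- sketch of parsing + aggregation (heuristics assumed)
import re
from collections import defaultdict

def parse_ranking_from_text(text: str) -> list[str]:
    """Pull the ordered labels out of the 'FINAL RANKING:' section."""
    _, _, tail = text.partition("FINAL RANKING:")
    # Matches "1. Response C" and bare "Response C" lines alike; if the header
    # is missing, fall back to scanning the whole evaluation in order.
    return re.findall(r"Response [A-Z]", tail or text)

def calculate_aggregate_rankings(
    rankings: list[dict], label_to_model: dict[str, str]
) -> list[dict]:
    """Average each response's rank position across all peer evaluations."""
    positions: dict[str, list[int]] = defaultdict(list)
    for ranking in rankings:
        for pos, label in enumerate(ranking["parsed_ranking"], start=1):
            positions[label].append(pos)
    aggregate = [
        {
            "model": label_to_model[label],
            "average_rank": sum(ps) / len(ps),
            "votes": len(ps),
        }
        for label, ps in positions.items()
    ]
    return sorted(aggregate, key=lambda row: row["average_rank"])
```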

**`storage.py`** (sample file below)
- JSON-based conversation storage in `data/conversations/`
- Each conversation: `{id, created_at, messages[]}`
- Assistant messages contain: `{role, stage1, stage2, stage3}`
- Note: metadata (`label_to_model`, `aggregate_rankings`) is NOT persisted to storage, only returned via the API
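
Roughly what one conversation file looks like on disk; only the top-level keys listed above are confirmed, and the field names inside the stage payloads (and the id format) are assumptions:

```json
{
  "id": "f3a9c2d1",
  "created_at": "2025-01-09T12:00:00Z",
  "messages": [
    {"role": "user", "content": "..."},
    {
      "role": "assistant",
      "stage1": [{"model": "openai/gpt-5.1", "response": "..."}],
      "stage2": [
        {
          "model": "openai/gpt-5.1",
          "ranking": "...raw evaluation text...",
          "parsed_ranking": ["Response B", "Response A"]
        }
      ],
      "stage3": {"model": "google/gemini-3-pro-preview", "response": "..."}
    }
  ]
}
```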

**`main.py`** (CORS sketch below)
- FastAPI app with CORS enabled for localhost:5173 and localhost:3000
- POST `/api/conversations/{id}/message` returns metadata in addition to the stages
- Metadata includes the `label_to_model` mapping and `aggregate_rankings`
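
A sketch of the CORS setup using FastAPI's standard middleware; the permissive method/header settings are assumptions:

```python
# backend/main.py -- sketch of the CORS setup (methods/headers assumed)
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:5173", "http://localhost:3000"],
    allow_methods=["*"],
    allow_headers=["*"],
)
```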

### Frontend Structure (`frontend/src/`)

**`App.jsx`**
- Main orchestration: manages the conversations list and the current conversation
- Handles message sending and metadata storage
- Important: metadata is stored in UI state for display but not persisted to the backend JSON

**`components/ChatInterface.jsx`**
- Multiline textarea (3 rows, resizable)
- Enter to send, Shift+Enter for a new line
- User messages are wrapped in the `markdown-content` class for padding

**`components/Stage1.jsx`**
- Tab view of individual model responses
- ReactMarkdown rendering with the `markdown-content` wrapper

**`components/Stage2.jsx`**
- **Critical feature**: Tab view showing the RAW evaluation text from each model
- De-anonymization happens CLIENT-SIDE for display (models receive anonymous labels)
- Shows an "Extracted Ranking" below each evaluation so users can validate the parsing
- Aggregate rankings are shown with average position and vote count
- Explanatory text clarifies that boldface model names are for readability only

**`components/Stage3.jsx`**
- Final synthesized answer from the chairman
- Green-tinted background (#f0fff0) to highlight the conclusion

**Styling (`*.css`)**
- Light mode theme (not dark mode)
- Primary color: #4a90e2 (blue)
- Global markdown styling in `index.css` via the `.markdown-content` class
- 12px padding on all markdown content to prevent a cluttered appearance

## Key Design Decisions

### Stage 2 Prompt Format
The Stage 2 prompt is very specific to ensure parseable output:
```
1. Evaluate each response individually first
2. Provide a "FINAL RANKING:" header
3. Numbered-list format: "1. Response C", "2. Response A", etc.
4. No additional text after the ranking section
```

This strict format allows reliable parsing while still eliciting thoughtful evaluations.
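
A sketch of how such a prompt might be assembled; the actual wording lives in `council.py` and almost certainly differs:

```python
# Sketch of Stage 2 prompt assembly -- actual wording in council.py differs
def build_stage2_prompt(question: str, labeled_responses: dict[str, str]) -> str:
    """labeled_responses maps 'Response A', 'Response B', ... to response text."""
    blocks = "\n\n".join(
        f"## {label}\n{text}" for label, text in labeled_responses.items()
    )
    return (
        f"Question: {question}\n\n"
        f"{blocks}\n\n"
        "Evaluate each response individually first.\n"
        "Then end with a 'FINAL RANKING:' header followed by a numbered list\n"
        "('1. Response C', '2. Response A', ...), best to worst, with no text "
        "after the ranking section."
    )
```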

### De-anonymization Strategy
- Models receive: "Response A", "Response B", etc.
- Backend creates the mapping: `{"Response A": "openai/gpt-5.1", ...}` (sketch below)
- Frontend displays model names in **bold** for readability
- Users see an explanation that the original evaluation used anonymous labels
- This prevents bias while maintaining transparency
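
A sketch of the label assignment inside `stage2_collect_rankings()`; the ordering and data shapes are assumptions:

```python
# Sketch of the anonymization step (ordering and shapes assumed)
import string

def anonymize(stage1_results: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Relabel each model's response as 'Response A/B/...' and keep the reverse map."""
    labeled, label_to_model = {}, {}
    for letter, (model, response) in zip(string.ascii_uppercase, stage1_results.items()):
        label = f"Response {letter}"
        labeled[label] = response          # what the ranking models see
        label_to_model[label] = model      # used client-side for display only
    return labeled, label_to_model
```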

### Error Handling Philosophy
- Continue with the successful responses if some models fail (graceful degradation)
- Never fail the entire request due to a single model failure
- Log errors but don't expose them to the user unless all models fail

### UI/UX Transparency
- All raw outputs are inspectable via tabs
- Parsed rankings are shown below the raw text for validation
- Users can verify the system's interpretation of model outputs
- This builds trust and allows debugging of edge cases

## Important Implementation Details

### Relative Imports
All backend modules use relative imports (e.g., `from .config import ...`), not absolute imports. This is critical for Python's module system to work correctly when running as `python -m backend.main`.
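
A representative pair of imports (which names `council.py` actually pulls in is an assumption):

```python
# backend/council.py -- resolves only when run as `python -m backend.main`
from .config import COUNCIL_MODELS, CHAIRMAN_MODEL
from .openrouter import query_model, query_models_parallel
```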

### Port Configuration
- Backend: 8001 (changed from 8000 to avoid a conflict)
- Frontend: 5173 (Vite default)
- Update both `backend/main.py` and `frontend/src/api.js` if changing either

### Markdown Rendering
All ReactMarkdown components must be wrapped in `<div className="markdown-content">` for proper spacing. This class is defined globally in `index.css`.

### Model Configuration
Models are hardcoded in `backend/config.py`. The chairman can be the same as or different from the council members. The current default is Gemini as chairman, per user preference.

## Common Gotchas

1. **Module import errors**: Always run the backend as `python -m backend.main` from the project root, not from the `backend/` directory
2. **CORS issues**: The frontend origin must match the allowed origins in the `main.py` CORS middleware
3. **Ranking parse failures**: If a model doesn't follow the format, a fallback regex extracts any "Response X" patterns in order
4. **Missing metadata**: Metadata is ephemeral (not persisted); it is only available in API responses

## Future Enhancement Ideas

- Configurable council/chairman via the UI instead of the config file
- Streaming responses instead of batch loading
- Export conversations to markdown/PDF
- Model performance analytics over time
- Custom ranking criteria (not just accuracy/insight)
- Support for reasoning models (o1, etc.) with special handling

## Testing Notes

Use `test_openrouter.py` to verify API connectivity and to try out new model identifiers before adding them to the council. The script tests both streaming and non-streaming modes.

## Data Flow Summary

```
User Query
  ↓
Stage 1: Parallel queries → [individual responses]
  ↓
Stage 2: Anonymize → Parallel ranking queries → [evaluations + parsed rankings]
  ↓
Aggregate Rankings Calculation → [sorted by avg position]
  ↓
Stage 3: Chairman synthesis with full context
  ↓
Return: {stage1, stage2, stage3, metadata}
  ↓
Frontend: Display with tabs + validation UI
```

The entire flow is async/parallel where possible to minimize latency.