# Debug Mode Guide

Debug mode provides detailed logging of all model interactions during benchmarking, helping you diagnose issues with model performance, prompt formatting, and answer evaluation.

## Overview

When debug mode is enabled, alex-treBENCH logs:
- **Exact prompts** sent to models
- **Raw responses** from models
- **Parsed answers** extracted from responses
- **Grading details** including fuzzy match scores
- **Performance metrics** like response times and costs
- **Error details** when requests fail

## Enabling Debug Mode

### Command Line Flags

```bash
# Enable full debug logging
alex benchmark run --debug --model openai/gpt-4 --size quick

# Log only incorrect answers and errors (recommended for analysis)
alex benchmark run --debug --debug-errors-only --model anthropic/claude-3-haiku --size standard

# Debug with custom benchmark settings
alex benchmark run --debug --model openai/gpt-3.5-turbo --size comprehensive --grading-mode strict
```

### Configuration File

You can also enable debug mode permanently in `config/default.yaml`:

```yaml
logging:
  debug:
    enabled: true           # Enable debug logging
    log_dir: "logs/debug"   # Output directory
    log_prompts: true       # Log formatted prompts
    log_responses: true     # Log model responses
    log_grading: true       # Log grading details
    log_errors_only: false  # If true, only log incorrect answers
    include_tokens: true    # Include token counts
    include_costs: true     # Include cost information
```

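To confirm what the file actually contains before a long run, you can read the settings back with PyYAML. This is only a quick sanity-check sketch based on the layout shown above; it does not go through any internal alex-treBENCH configuration loader:

```python
import yaml  # pip install pyyaml

# Read the benchmark configuration and print the debug settings.
with open("config/default.yaml") as f:
    config = yaml.safe_load(f)

debug_cfg = config.get("logging", {}).get("debug", {})
print("Debug enabled:", debug_cfg.get("enabled", False))
print("Log directory:", debug_cfg.get("log_dir", "logs/debug"))
print("Errors only:  ", debug_cfg.get("log_errors_only", False))
```
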
## Output Files

Debug mode creates two types of files in `logs/debug/`:

### 1. JSON Lines Files (`model_interactions_*.jsonl`)

Structured data for programmatic analysis:

```json
{
  "timestamp": "2025-09-04T14:28:08.992383",
  "benchmark_id": null,
  "question_id": "q_35309_-2597327464560468404",
  "model_name": "openai/gpt-3.5-turbo",
  "category": "THAT'S MY DRINK",
  "value": 600,
  "question_text": "'Gin, lime & club soda:Gin this guy'",
  "correct_answer": "Rickey",
  "formatted_prompt": "You are a Jeopardy! contestant...",
  "raw_response": "What is a Gin Rickey?",
  "parsed_answer": "What is a gin rickey?",
  "is_correct": true,
  "match_score": 1.0,
  "match_type": "fuzzy",
  "confidence_score": 1.0,
  "response_time_ms": 751.0,
  "cost_usd": 4.95e-05,
  "tokens_input": 78,
  "tokens_output": 7,
  "grading_details": {
    "fuzzy_threshold": 0.8,
    "semantic_threshold": 0.7,
    "mode": "JEOPARDY",
    "match_details": {
      "ratio": 0.46,
      "partial_ratio": 1.0,
      "token_sort_ratio": 0.46,
      "token_set_ratio": 1.0,
      "best_score": 1.0
    }
  },
  "error": null
}
```

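Because each line is a standalone JSON object, standard JSONL tooling works on these files. For example, if you have `jq` installed, you can list every incorrect answer straight from the shell (the field names are taken from the record shown above):

```bash
# Print question, parsed answer, and match score for each incorrect answer
jq -r 'select(.is_correct == false) | "\(.question_text) -> \(.parsed_answer) (score \(.match_score))"' logs/debug/model_interactions_*.jsonl
```
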
### 2. Summary Log Files (`debug_summary_*.log`)

Human-readable format for quick analysis:

```
2025-09-04 14:28:08,233 - DEBUG - PROMPT Qq_205799 [openai/gpt-3.5-turbo]:
  Question: 'Classic French dressing also has this 1-word name, from a key ingredient'
  Prompt:
  You are a Jeopardy! contestant. Respond to each clue in the form of a question...

  Category: SALAD DRESSING
  Value: $500

  Clue: 'Classic French dressing also has this 1-word name, from a key ingredient'

  Response:

2025-09-04 14:28:10,519 - DEBUG - ✗ INCORRECT Qq_114081 [openai/gpt-3.5-turbo]:
  Answer: What is bleach?
  Expected: bleach (or chlorine)
  Score: 0.600 (fuzzy)
  Question: 'Lethal gases are released when you combine some toilet bowl cleansers with this common stain remover'
  Response: What is bleach?
  Parsed: What is bleach?
```

## Common Analysis Tasks

### Finding Models Getting 0% Accuracy

```bash
# Run debug mode
alex benchmark run --debug --debug-errors-only --model suspicious-model --size quick

# Check for systematic issues
grep "✗ INCORRECT" logs/debug/debug_summary_*.log | head -10

# Look for parsing failures
grep "ERROR" logs/debug/debug_summary_*.log

# Analyze JSON data programmatically (open() does not expand globs,
# so pick the most recent interactions file explicitly)
python -c "
import glob, json
path = sorted(glob.glob('logs/debug/model_interactions_*.jsonl'))[-1]
with open(path) as f:
    for line in f:
        data = json.loads(line)
        if not data['is_correct']:
            print(f'Q: {data[\"question_text\"]}')
            print(f'Expected: {data[\"correct_answer\"]}')
            print(f'Got: {data[\"raw_response\"]}')
            print(f'Score: {data[\"match_score\"]}')
            print('---')
"
```

### Analyzing Prompt Issues

```bash
# Look for questions where the model consistently fails
grep -B5 -A5 "Score: 0.000" logs/debug/debug_summary_*.log

# Check if the prompt format is causing confusion in a given category
# (replace "specific-category" with the category you are investigating)
grep -A15 "PROMPT" logs/debug/debug_summary_*.log | grep -A15 "specific-category"
```
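
To see whether failures cluster in particular categories (which usually points at a prompt or parsing issue rather than a knowledge gap), you can tally accuracy per category from the JSONL output. This is a sketch that assumes the field names from the JSON example above:

```python
import glob
import json
from collections import defaultdict

# Tally correct/total per category from the most recent debug file.
path = sorted(glob.glob("logs/debug/model_interactions_*.jsonl"))[-1]
stats = defaultdict(lambda: [0, 0])  # category -> [correct, total]

with open(path) as f:
    for line in f:
        record = json.loads(line)
        stats[record["category"]][0] += record["is_correct"]
        stats[record["category"]][1] += 1

for category, (correct, total) in sorted(stats.items(), key=lambda kv: kv[1][0] / kv[1][1]):
    print(f"{correct}/{total}  {category}")
```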

### Performance Analysis

```bash
# Find slow responses
grep "Time: [5-9][0-9][0-9][0-9]ms\|Time: [0-9][0-9][0-9][0-9][0-9]ms" logs/debug/debug_summary_*.log

# Cost analysis (open() does not expand globs, so pick the latest file;
# the dollar signs are escaped so the shell does not expand them)
python -c "
import glob, json
total_cost = 0
count = 0
path = sorted(glob.glob('logs/debug/model_interactions_*.jsonl'))[-1]
with open(path) as f:
    for line in f:
        data = json.loads(line)
        total_cost += data['cost_usd']
        count += 1
print(f'Average cost per question: \${total_cost/count:.6f}')
print(f'Total cost: \${total_cost:.6f}')
"
```

## Debug Mode Options

| Flag | Description | Use Case |
|------|-------------|----------|
| `--debug` | Enable full debug logging | Comprehensive analysis of all interactions |
| `--debug-errors-only` | Log only incorrect answers | Focus on problematic questions and responses |
| No flags | Standard logging only | Normal benchmarking without debug overhead |

## Performance Impact

Debug mode has minimal performance impact:
- **File I/O**: Small overhead for writing logs (~1-2% slower)
- **Memory**: Negligible increase
- **Network**: No impact on API calls
- **Storage**: ~2MB per 1000 questions (varies by response length)
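
To check how much space the debug logs are actually taking on disk:

```bash
# Total size of the debug log directory
du -sh logs/debug/

# Largest individual debug files
du -h logs/debug/* | sort -rh | head -5
```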

## Troubleshooting Common Issues

### Issue: Model Getting 0% Score

**Symptoms**: All questions marked incorrect despite seemingly correct answers

**Debug Steps**:
1. Run with `--debug --debug-errors-only`
2. Check if responses are in the correct Jeopardy format
3. Look at the fuzzy match scores: answers scoring above 0.6 that are still marked incorrect usually point to a formatting mismatch rather than a wrong answer (see the near-miss snippet below)
4. Verify the grading mode matches your expectations

**Example Analysis**:
```bash
alex benchmark run --debug --debug-errors-only --model problematic-model --size quick
grep -A3 "Expected:" logs/debug/debug_summary_*.log | head -20
```
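
To pull out the near-misses mentioned in step 3 (answers that score reasonably well but fall below the 0.8 fuzzy threshold), a short script over the JSONL output works well. This is a sketch assuming the field names from the JSON example earlier:

```python
import glob
import json

# Show incorrect answers whose match score is close to the fuzzy threshold;
# these are usually formatting problems, not knowledge problems.
path = sorted(glob.glob("logs/debug/model_interactions_*.jsonl"))[-1]

with open(path) as f:
    for line in f:
        record = json.loads(line)
        if not record["is_correct"] and record["match_score"] >= 0.6:
            print(f"score={record['match_score']:.2f}  "
                  f"expected={record['correct_answer']!r}  "
                  f"got={record['parsed_answer']!r}")
```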

### Issue: Inconsistent Performance

**Symptoms**: Same model performs differently across runs

**Debug Steps**:
1. Compare prompts between runs to ensure consistency (see the sketch below)
2. Check response times for API issues
3. Look for error patterns in specific categories
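
One way to carry out step 1 is to extract the `formatted_prompt` field from two runs' JSONL files and diff them. The file names below are placeholders for your own debug files:

```bash
# Extract prompts keyed by question_id from two runs and compare them
jq -r '[.question_id, .formatted_prompt] | @tsv' logs/debug/model_interactions_RUN_A.jsonl | sort > /tmp/run_a_prompts.tsv
jq -r '[.question_id, .formatted_prompt] | @tsv' logs/debug/model_interactions_RUN_B.jsonl | sort > /tmp/run_b_prompts.tsv
diff /tmp/run_a_prompts.tsv /tmp/run_b_prompts.tsv && echo "Prompts are identical"
```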

### Issue: Unexpected Costs

**Symptoms**: Costs higher than expected

**Debug Steps**:
1. Check token counts in JSON logs
2. Look for unusually long responses (see the sketch below)
3. Verify model pricing in debug output
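
A quick way to work through steps 1 and 2 is to sort the logged interactions by output tokens or cost. A sketch using the fields from the JSON example:

```python
import glob
import json

# List the interactions with the most output tokens (usually the most expensive).
path = sorted(glob.glob("logs/debug/model_interactions_*.jsonl"))[-1]
with open(path) as f:
    records = [json.loads(line) for line in f]

for record in sorted(records, key=lambda r: r["tokens_output"], reverse=True)[:10]:
    print(f"{record['tokens_output']:>5} output tokens  "
          f"${record['cost_usd']:.6f}  {record['question_id']}")
```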

## File Management

Debug logs can accumulate quickly. Consider:

```bash
# Clean old debug logs (keep last 10 files)
ls -t logs/debug/*.log | tail -n +11 | xargs rm -f

# Archive debug logs by date
mkdir -p logs/archive/$(date +%Y-%m-%d)
mv logs/debug/* logs/archive/$(date +%Y-%m-%d)/

# Compress large debug files
gzip logs/debug/*.jsonl
```
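
If you prefer age-based cleanup over keeping a fixed number of files, `find` can delete anything older than a cutoff (adjust the 30-day threshold to taste):

```bash
# Remove debug files not modified in the last 30 days
find logs/debug -type f -mtime +30 -delete
```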

## Integration with External Tools

### Python Analysis

```python
import json
import pandas as pd

# Load debug data into pandas
data = []
with open('logs/debug/model_interactions_20250904_142805.jsonl') as f:
    for line in f:
        data.append(json.loads(line))

df = pd.DataFrame(data)

# Analysis examples
print("Accuracy by category:")
print(df.groupby('category')['is_correct'].mean().sort_values())

print(f"Average response time: {df['response_time_ms'].mean():.1f}ms")
print("Questions with low match scores:")
print(df[df['match_score'] < 0.5][['question_text', 'correct_answer', 'raw_response', 'match_score']])
```

### Jupyter Notebook Integration

Debug JSON files work excellently with Jupyter notebooks for interactive analysis, visualization, and model comparison.
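
For instance, building on the pandas snippet above, a notebook cell might chart per-category accuracy. This is a sketch; it assumes pandas and matplotlib are installed and that a `model_interactions_*.jsonl` file exists:

```python
import glob
import json

import matplotlib.pyplot as plt
import pandas as pd

# Load the most recent debug run and plot accuracy by category.
path = sorted(glob.glob("logs/debug/model_interactions_*.jsonl"))[-1]
with open(path) as f:
    df = pd.DataFrame([json.loads(line) for line in f])

accuracy = df.groupby("category")["is_correct"].mean().sort_values()
accuracy.plot(kind="barh", figsize=(8, 6), title="Accuracy by category")
plt.xlabel("Fraction correct")
plt.tight_layout()
plt.show()
```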

## Best Practices

1. **Use `--debug-errors-only` first** - focuses on problems without overwhelming detail
2. **Run small samples initially** - debug with `--size quick` before full benchmarks
3. **Compare models systematically** - use same debug settings across model comparisons
4. **Archive important debug sessions** - keep debug logs for models that will be used in production
5. **Monitor disk space** - debug logs for large benchmarks can consume significant storage

## Security Considerations

Debug logs contain:
- Full question text (public Jeopardy data)
- Model responses
- API timing and cost data

**No sensitive data is logged**, but consider access controls for debug directories in shared environments.
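
In a shared environment, one simple control is to make the log directory readable only by its owner (a standard POSIX permissions example, not an alex-treBENCH feature):

```bash
# Restrict the debug log directory to the owning user
chmod -R u+rwX,go-rwx logs/debug
```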