
Commit 8544f3b

Merge pull request #2 from Kilo-Org/add-full-prompt-logging
2 parents d14bc6e + 73bf537

File tree: 6 files changed, +813 -3 lines changed

config/default.yaml

Lines changed: 9 additions & 0 deletions

```diff
@@ -175,6 +175,15 @@ logging:
   file: "logs/benchmark.log"
   max_size: "10MB"
   backup_count: 5
+  debug:
+    enabled: false          # Set to true to enable detailed debug logging
+    log_dir: "logs/debug"
+    log_prompts: true
+    log_responses: true
+    log_grading: true
+    log_errors_only: false  # If true, only logs incorrect answers and errors
+    include_tokens: true
+    include_costs: true
 
 kaggle:
   dataset: "aravindram11/jeopardy-dataset-updated"
```

docs/DEBUG_MODE.md

Lines changed: 290 additions & 0 deletions

# Debug Mode Guide

Debug mode provides detailed logging of all model interactions during benchmarking, helping you diagnose issues with model performance, prompt formatting, and answer evaluation.

## Overview

When debug mode is enabled, alex-treBENCH logs:

- **Exact prompts** sent to models
- **Raw responses** from models
- **Parsed answers** extracted from responses
- **Grading details** including fuzzy match scores
- **Performance metrics** like response times and costs
- **Error details** when requests fail

## Enabling Debug Mode

### Command Line Flags

```bash
# Enable full debug logging
alex benchmark run --debug --model openai/gpt-4 --size quick

# Log only incorrect answers and errors (recommended for analysis)
alex benchmark run --debug --debug-errors-only --model anthropic/claude-3-haiku --size standard

# Debug with custom benchmark settings
alex benchmark run --debug --model openai/gpt-3.5-turbo --size comprehensive --grading-mode strict
```

### Configuration File

You can also enable debug mode permanently in `config/default.yaml`:

```yaml
logging:
  debug:
    enabled: true           # Enable debug logging
    log_dir: "logs/debug"   # Output directory
    log_prompts: true       # Log formatted prompts
    log_responses: true     # Log model responses
    log_grading: true       # Log grading details
    log_errors_only: false  # If true, only log incorrect answers
    include_tokens: true    # Include token counts
    include_costs: true     # Include cost information
```

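As a quick sanity check that the settings are being picked up, here is a minimal sketch of reading them back, assuming the file is plain YAML parsed with the `PyYAML` package (the project's own config loader may wrap this differently):

```python
import yaml  # PyYAML; an assumption here -- alex-treBENCH may use its own config loader

with open("config/default.yaml") as f:
    config = yaml.safe_load(f)

debug_cfg = config["logging"]["debug"]
if debug_cfg["enabled"]:
    print(f"Debug logs will be written to {debug_cfg['log_dir']}")
```
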
## Output Files

Debug mode creates two types of files in `logs/debug/`:

### 1. JSON Lines Files (`model_interactions_*.jsonl`)

Structured data for programmatic analysis:

```json
{
  "timestamp": "2025-09-04T14:28:08.992383",
  "benchmark_id": null,
  "question_id": "q_35309_-2597327464560468404",
  "model_name": "openai/gpt-3.5-turbo",
  "category": "THAT'S MY DRINK",
  "value": 600,
  "question_text": "'Gin, lime & club soda:Gin this guy'",
  "correct_answer": "Rickey",
  "formatted_prompt": "You are a Jeopardy! contestant...",
  "raw_response": "What is a Gin Rickey?",
  "parsed_answer": "What is a gin rickey?",
  "is_correct": true,
  "match_score": 1.0,
  "match_type": "fuzzy",
  "confidence_score": 1.0,
  "response_time_ms": 751.0,
  "cost_usd": 4.95e-05,
  "tokens_input": 78,
  "tokens_output": 7,
  "grading_details": {
    "fuzzy_threshold": 0.8,
    "semantic_threshold": 0.7,
    "mode": "JEOPARDY",
    "match_details": {
      "ratio": 0.46,
      "partial_ratio": 1.0,
      "token_sort_ratio": 0.46,
      "token_set_ratio": 1.0,
      "best_score": 1.0
    }
  },
  "error": null
}
```

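Each line is a self-contained JSON object, so the file can be streamed record by record instead of being loaded whole. A minimal sketch (the filename is illustrative):

```python
import json

# Stream one interaction record at a time; each JSONL line is a complete object.
with open("logs/debug/model_interactions_20250904_142805.jsonl") as f:
    for line in f:
        record = json.loads(line)
        details = record.get("grading_details") or {}
        best = details.get("match_details", {}).get("best_score")
        print(record["question_id"], record["is_correct"], best)
```
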
### 2. Summary Log Files (`debug_summary_*.log`)

Human-readable format for quick analysis:

```
2025-09-04 14:28:08,233 - DEBUG - PROMPT Qq_205799 [openai/gpt-3.5-turbo]:
Question: 'Classic French dressing also has this 1-word name, from a key ingredient'
Prompt:
You are a Jeopardy! contestant. Respond to each clue in the form of a question...

Category: SALAD DRESSING
Value: $500

Clue: 'Classic French dressing also has this 1-word name, from a key ingredient'

Response:

2025-09-04 14:28:10,519 - DEBUG - ✗ INCORRECT Qq_114081 [openai/gpt-3.5-turbo]:
Answer: What is bleach?
Expected: bleach (or chlorine)
Score: 0.600 (fuzzy)
Question: 'Lethal gases are released when you combine some toilet bowl cleansers with this common stain remover'
Response: What is bleach?
Parsed: What is bleach?
```

## Common Analysis Tasks

### Finding Models Getting 0% Accuracy

```bash
# Run debug mode
alex benchmark run --debug --debug-errors-only --model suspicious-model --size quick

# Check for systematic issues
grep "✗ INCORRECT" logs/debug/debug_summary_*.log | head -10

# Look for parsing failures
grep "ERROR" logs/debug/debug_summary_*.log

# Analyze JSON data programmatically (open() does not expand globs,
# so iterate over the matching files explicitly)
python -c "
import glob, json
for path in glob.glob('logs/debug/model_interactions_*.jsonl'):
    with open(path) as f:
        for line in f:
            data = json.loads(line)
            if not data['is_correct']:
                print(f'Q: {data[\"question_text\"]}')
                print(f'Expected: {data[\"correct_answer\"]}')
                print(f'Got: {data[\"raw_response\"]}')
                print(f'Score: {data[\"match_score\"]}')
                print('---')
"
```

### Analyzing Prompt Issues

```bash
# Look for questions where the model consistently fails
grep -B5 -A5 "Score: 0.000" logs/debug/debug_summary_*.log

# Check if the prompt format is causing confusion
grep -A15 "PROMPT" logs/debug/debug_summary_*.log | grep -A15 "specific-category"
```

### Performance Analysis

```bash
# Find slow responses
grep "Time: [5-9][0-9][0-9][0-9]ms\|Time: [0-9][0-9][0-9][0-9][0-9]ms" logs/debug/debug_summary_*.log

# Cost analysis (the \$ is escaped so the shell does not expand it,
# and glob matches are iterated explicitly because open() does not expand globs)
python -c "
import glob, json
total_cost = 0
count = 0
for path in glob.glob('logs/debug/model_interactions_*.jsonl'):
    with open(path) as f:
        for line in f:
            data = json.loads(line)
            total_cost += data['cost_usd']
            count += 1
print(f'Average cost per question: \${total_cost/count:.6f}')
print(f'Total cost: \${total_cost:.6f}')
"
```

## Debug Mode Options

| Flag | Description | Use Case |
|------|-------------|----------|
| `--debug` | Enable full debug logging | Comprehensive analysis of all interactions |
| `--debug-errors-only` | Log only incorrect answers | Focus on problematic questions and responses |
| No flags | Standard logging only | Normal benchmarking without debug overhead |

## Performance Impact

Debug mode has minimal performance impact:

- **File I/O**: Small overhead for writing logs (~1-2% slower)
- **Memory**: Negligible increase
- **Network**: No impact on API calls
- **Storage**: ~2MB per 1000 questions, so roughly 20MB for a 10,000-question run (varies by response length)

## Troubleshooting Common Issues

### Issue: Model Getting 0% Score

**Symptoms**: All questions marked incorrect despite seemingly correct answers

**Debug Steps**:
1. Run with `--debug --debug-errors-only`
2. Check whether responses are in the correct Jeopardy format
3. Look at the fuzzy match scores: incorrect answers that still score above 0.6 usually point to a format mismatch rather than a wrong answer (quantified in the sketch below)
4. Verify the grading mode matches your expectations

**Example Analysis**:
```bash
alex benchmark run --debug --debug-errors-only --model problematic-model --size quick
grep -A3 "Expected:" logs/debug/debug_summary_*.log | head -20
```

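To quantify step 3, a short sketch that counts incorrect answers whose fuzzy score still exceeded 0.6; a large cluster here usually means the model knew the answer but formatted it wrong (the glob pattern matches the files debug mode writes):

```python
import glob
import json

near_misses = 0
incorrect = 0
for path in glob.glob("logs/debug/model_interactions_*.jsonl"):
    with open(path) as f:
        for line in f:
            data = json.loads(line)
            if data["is_correct"]:
                continue
            incorrect += 1
            # A high fuzzy score on a wrong answer points at a format mismatch.
            if (data["match_score"] or 0) > 0.6:
                near_misses += 1

print(f"{near_misses} of {incorrect} incorrect answers scored above 0.6")
```
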
### Issue: Inconsistent Performance

**Symptoms**: Same model performs differently across runs

**Debug Steps**:
1. Compare prompts between runs to ensure consistency
2. Check response times for API issues (the sketch below compares two runs side by side)
3. Look for error patterns in specific categories

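To make that comparison concrete, a sketch that summarizes two runs' JSONL files side by side (the filenames are illustrative; use the timestamped files from your own runs):

```python
import json

def run_stats(path):
    """Return (accuracy, mean response time in ms) for one debug JSONL file."""
    with open(path) as f:
        records = [json.loads(line) for line in f]
    accuracy = sum(r["is_correct"] for r in records) / len(records)
    mean_ms = sum(r["response_time_ms"] or 0 for r in records) / len(records)
    return accuracy, mean_ms

# Illustrative filenames -- substitute the files from the two runs being compared.
for path in ["logs/debug/model_interactions_run1.jsonl",
             "logs/debug/model_interactions_run2.jsonl"]:
    acc, ms = run_stats(path)
    print(f"{path}: accuracy={acc:.1%}, mean response time={ms:.0f}ms")
```
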
### Issue: Unexpected Costs

**Symptoms**: Costs higher than expected

**Debug Steps**:
1. Check token counts in the JSON logs (the sketch below ranks responses by output tokens)
2. Look for unusually long responses
3. Verify model pricing in the debug output

230+
## File Management
231+
232+
Debug logs can accumulate quickly. Consider:
233+
234+
```bash
235+
# Clean old debug logs (keep last 10 files)
236+
ls -t logs/debug/*.log | tail -n +11 | xargs rm -f
237+
238+
# Archive debug logs by date
239+
mkdir -p logs/archive/$(date +%Y-%m-%d)
240+
mv logs/debug/* logs/archive/$(date +%Y-%m-%d)/
241+
242+
# Compress large debug files
243+
gzip logs/debug/*.jsonl
244+
```
245+
246+
## Integration with External Tools

### Python Analysis

```python
import json
import pandas as pd

# Load debug data into pandas
data = []
with open('logs/debug/model_interactions_20250904_142805.jsonl') as f:
    for line in f:
        data.append(json.loads(line))

df = pd.DataFrame(data)

# Analysis examples
print("Accuracy by category:")
print(df.groupby('category')['is_correct'].mean().sort_values())

print(f"Average response time: {df['response_time_ms'].mean():.1f}ms")
print("Questions with low match scores:")
print(df[df['match_score'] < 0.5][['question_text', 'correct_answer', 'raw_response', 'match_score']])
```

### Jupyter Notebook Integration

The debug JSON files work well in Jupyter notebooks for interactive analysis, visualization, and model comparison.

## Best Practices

1. **Use `--debug-errors-only` first** - focuses on problems without overwhelming detail
2. **Run small samples initially** - debug with `--size quick` before full benchmarks
3. **Compare models systematically** - use the same debug settings across model comparisons (see the sketch after this list)
4. **Archive important debug sessions** - keep debug logs for models that will be used in production
5. **Monitor disk space** - debug logs for large benchmarks can consume significant storage

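For point 3, a sketch that pools every debug file into one DataFrame and compares models on the metrics the logs already carry:

```python
import glob
import json

import pandas as pd

# Pool all debug JSONL files, then compare models on logged metrics.
rows = []
for path in glob.glob("logs/debug/model_interactions_*.jsonl"):
    with open(path) as f:
        rows.extend(json.loads(line) for line in f)

df = pd.DataFrame(rows)
print(df.groupby("model_name")[["is_correct", "response_time_ms", "cost_usd"]].mean())
```
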
283+
## Security Considerations
284+
285+
Debug logs contain:
286+
- Full question text (public Jeopardy data)
287+
- Model responses
288+
- API timing and cost data
289+
290+
**No sensitive data is logged**, but consider access controls for debug directories in shared environments.
