
Commit 9d1616f

Unified analysis engine with comprehensive reports
- Consolidated analysis into single analysis.py module
- Added 9 visualization types: accuracy curves, heatmaps, bar charts, scatter plots
- Enhanced EXECUTIVE_REPORT.md with full breakdown by subject, μ level, and statistics
- Added McNemar significance tests with visualization
- Added question difficulty tiers (Easy/Medium/Hard/Chameleon Breakers)
- Added TF-IDF error clustering and taxonomy
- Added docker-compose.yml for easier deployment
- Updated README with complete analysis output documentation
- Fixed pyproject.toml dependencies (added mistralai, scikit-learn)
- Projects folder kept for user data (gitignored except .gitkeep)
1 parent d3830d5 commit 9d1616f

11 files changed

Lines changed: 1489 additions & 1642 deletions

File tree

.gitignore

37 Bytes
Binary file not shown.

Projects/.gitkeep

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
# This folder is for user projects
2+
# Create new projects with: python cli.py init
3+

README.md

Lines changed: 40 additions & 11 deletions
@@ -166,17 +166,23 @@ Chameleon/
166166

167167
## 📈 Analysis Output
168168

169-
After running `python cli.py analyze --project YourProject`, all outputs are saved to `Projects/YourProject/results/analysis/`:
169+
After running `python cli.py analyze --project YourProject`, all outputs are saved to `Projects/YourProject/results/analysis/` (~23 files):
170170

171-
### 📊 Core Metrics
171+
### 📊 Core Metrics (Data + Charts)
172172

173173
| File | Description | Key Insight |
174174
|------|-------------|-------------|
175-
| `01_accuracy_by_miu.csv/png` | Accuracy curve across μ levels | How quickly does accuracy degrade? |
176-
| `02_accuracy_by_subject_miu.csv` | Per-subject breakdown | Which subjects are most vulnerable? |
175+
| `01_accuracy_by_miu.csv` | Accuracy data by μ level | Raw numbers for each distortion level |
176+
| `01_accuracy_by_miu.png` | 📈 **Line chart**: accuracy vs distortion | Visualize degradation curve |
177+
| `02_accuracy_by_subject_miu.csv` | Per-subject accuracy data | Which subjects are most vulnerable? |
178+
| `02_subject_ranking.png` | 📊 **Bar chart**: subject performance | Rank subjects by baseline accuracy |
179+
| `02_subject_miu_heatmap.png` | 🔥 **Heatmap**: absolute accuracy (Subject × μ) | See accuracy patterns |
180+
| `02_degradation_heatmap.png` | 🔥 **Heatmap**: % degradation from baseline | Identify vulnerable subjects |
177181
| `03_chameleon_robustness_index.csv` | CRI scores (global + per-subject) | Single metric for model ranking |
178-
| `04_elasticity.csv/png` | Linear regression of degradation | Quantify fragility with slope |
179-
| `05_model_comparison.csv/png` | Head-to-head comparison table | Compare all metrics in one view |
182+
| `04_elasticity.csv` | Degradation slope data | Quantify fragility numerically |
183+
| `04_elasticity.png` | 📈 **Scatter + regression**: degradation rate | Visualize slope |
184+
| `05_model_comparison.csv` | Head-to-head comparison table | Compare all metrics |
185+
| `05_model_comparison.png` | 📊 **Scatter plot**: CRI vs accuracy | Compare models visually |
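The `04_elasticity` rows describe a degradation slope obtained by regressing accuracy on μ. The sketch below shows one way such a slope could be computed with ordinary least squares. The function name `elasticity` and the sample numbers are hypothetical and are not taken from `analysis.py`, whose implementation is not shown in this commit view.

```python
def elasticity(mu_levels, accuracies):
    """Return (slope, intercept) of a least-squares line through accuracy vs mu.

    Hypothetical helper for illustration; not the project's actual code.
    A more negative slope means accuracy falls faster as distortion grows.
    """
    n = len(mu_levels)
    if n < 2 or n != len(accuracies):
        raise ValueError("need at least two (mu, accuracy) pairs of equal length")

    mean_mu = sum(mu_levels) / n
    mean_acc = sum(accuracies) / n

    # Sum of squared deviations of mu, and cross-deviations with accuracy
    sxx = sum((m - mean_mu) ** 2 for m in mu_levels)
    if sxx == 0:
        raise ValueError("mu levels must not all be identical")
    sxy = sum((m - mean_mu) * (a - mean_acc) for m, a in zip(mu_levels, accuracies))

    slope = sxy / sxx
    intercept = mean_acc - slope * mean_mu
    return slope, intercept


if __name__ == "__main__":
    # Made-up accuracies at three distortion levels
    slope, intercept = elasticity([0.0, 0.5, 1.0], [0.9, 0.8, 0.7])
    print(f"slope={slope:.3f} intercept={intercept:.3f}")
```

The intercept estimates accuracy at μ=0, so it can be compared against the measured baseline as a rough check on how linear the degradation is.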
180186

181187
### 🔬 Error Analysis
182188

@@ -190,17 +196,20 @@ After running `python cli.py analyze --project YourProject`, all outputs are sav
190196
| File | Description | Key Insight |
191197
|------|-------------|-------------|
192198
| `08_bootstrap_intervals.csv` | 95% confidence intervals (500 samples) | Are differences statistically significant? |
193-
| `mcnemar_distortion_results.csv` | McNemar's test: μ=0 vs μ>0 | Paired significance testing |
194-
| `mcnemar_subject_results.csv` | Per-subject McNemar tests | Subject-specific significance |
195-
| `mcnemar_pairwise_results.csv` | Adjacent μ level comparisons | Which μ jumps matter most? |
199+
| `11_mcnemar_distortion.csv` | McNemar's test: μ=0 vs each μ>0 | Paired significance testing |
200+
| `11_mcnemar_distortion.png` | 📊 **Bar chart**: baseline vs distorted (* = p<0.05) | Visualize significant differences |
201+
| `12_mcnemar_subject.csv` | Per-subject McNemar tests | Subject-specific significance |
202+
| `12_mcnemar_subject.png` | 📊 **Bar chart**: per-subject significance | Which subjects show real degradation? |
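McNemar's test compares paired outcomes, here the same questions answered at μ=0 and at some μ>0, and uses only the discordant pairs. The following is a minimal sketch of the exact (binomial) two-sided variant. The name `mcnemar_exact` and the input format are assumptions for illustration; the commit does not show whether the project uses the exact or the chi-square form.

```python
from math import comb


def mcnemar_exact(baseline_correct, distorted_correct):
    """Exact two-sided McNemar test on paired boolean outcomes.

    Hypothetical helper for illustration; not the project's actual code.
    Returns (b, c, p_value), where
      b = questions correct at baseline but wrong when distorted,
      c = questions wrong at baseline but correct when distorted.
    """
    if len(baseline_correct) != len(distorted_correct):
        raise ValueError("outcome lists must be paired and of equal length")

    pairs = list(zip(baseline_correct, distorted_correct))
    b = sum(1 for base, dist in pairs if base and not dist)
    c = sum(1 for base, dist in pairs if not base and dist)
    n = b + c
    if n == 0:
        # No discordant pairs: no evidence of any difference
        return b, c, 1.0

    # Under H0 each discordant pair is equally likely to fall in b or c,
    # so min(b, c) follows Binomial(n, 0.5); double the lower tail.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


if __name__ == "__main__":
    # Made-up data: 10 questions, 8 of which flip from correct to wrong
    base = [True] * 10
    dist = [False] * 8 + [True] * 2
    print(mcnemar_exact(base, dist))
```

Questions answered identically under both conditions do not affect the p-value, which is why a paired test can detect degradation that a comparison of overall accuracies might miss.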
196203

197204
### 🎯 Advanced Analysis
198205

199206
| File | Description | Key Insight |
200207
|------|-------------|-------------|
201-
| `09_delta_accuracy_heatmap.csv/png` | Subject × μ degradation matrix | Visual: Red = high degradation |
208+
| `09_delta_accuracy_heatmap.csv` | Subject × μ degradation matrix (data) | Raw delta values |
209+
| `09_delta_accuracy_heatmap.png` | 🔥 **Heatmap**: change from baseline | Visual: Red = high degradation |
202210
| `10_question_difficulty_tiers.json` | Easy/Medium/Hard/Chameleon Breakers | Find pattern-matching evidence |
203-
| `11_executive_summary.md` | **START HERE** - Full findings report | Comprehensive interpretation |
211+
| `13_key_insights.png` | 📊 **4-panel summary**: curve + bars + pie + stats | Quick visual overview |
212+
| `EXECUTIVE_REPORT.md` | 📄 **START HERE** - Full findings report | Comprehensive interpretation |
204213

205214
---
206215

@@ -233,6 +242,26 @@ Linear regression of accuracy vs μ:
233242
234243
## 🐳 Docker Usage
235244

245+
### Option 1: Docker Compose (Recommended)
246+
247+
```bash
248+
# Set your API keys in .env or export them
249+
export MISTRAL_API_KEY="your-mistral-key"
250+
export OPENAI_API_KEY="your-openai-key"
251+
252+
# Build and run
253+
docker-compose build
254+
docker-compose run chameleon python cli.py init
255+
docker-compose run chameleon python cli.py distort -p MyProject
256+
docker-compose run chameleon python cli.py evaluate -p MyProject
257+
docker-compose run chameleon python cli.py analyze -p MyProject
258+
259+
# Or run analysis only (no API keys needed)
260+
PROJECT=MyProject docker-compose run analyze
261+
```
262+
263+
### Option 2: Docker Direct
264+
236265
```bash
237266
# Build
238267
docker build -t chameleon .

chameleon/analysis/__init__.py

Lines changed: 9 additions & 31 deletions
@@ -1,35 +1,13 @@
1-
"""Analysis module - Metrics, statistical tests, and visualizations."""
1+
"""
2+
Chameleon Analysis Module
3+
=========================
4+
Analysis engine for LLM robustness evaluation.
25
3-
from chameleon.analysis.metrics import (
4-
calculate_accuracy,
5-
calculate_accuracy_by_group,
6-
calculate_degradation,
7-
)
8-
from chameleon.analysis.mcnemar import (
9-
mcnemar_test,
10-
analyze_distortion_significance,
11-
analyze_subject_significance,
12-
)
13-
from chameleon.analysis.visualizations import (
14-
create_degradation_heatmap,
15-
create_accuracy_plots,
16-
create_key_insights_summary,
17-
)
18-
from chameleon.analysis.run_analysis import run_full_analysis
19-
from chameleon.analysis.synergy_engine import run_synergy_analysis
6+
Main entry point: run_analysis()
7+
"""
8+
9+
from chameleon.analysis.analysis import run_analysis
2010

2111
__all__ = [
22-
"calculate_accuracy",
23-
"calculate_accuracy_by_group",
24-
"calculate_degradation",
25-
"mcnemar_test",
26-
"analyze_distortion_significance",
27-
"analyze_subject_significance",
28-
"create_degradation_heatmap",
29-
"create_accuracy_plots",
30-
"create_key_insights_summary",
31-
"run_full_analysis",
32-
"run_synergy_analysis",
12+
"run_analysis",
3313
]
34-
35-
